Benchmark Results

Generated: 2026-04-17T14:54:21+00:00
Seed: 20260417
Corpus size per row: 500 positive + 500 negative conversations

Detector / variant	Precision	Recall	F1	TP	TN	FN
death_loop / exact paraphrase	1.000	1.000	1.000	500	500	0
death_loop / low paraphrase	1.000	1.000	1.000	500	500	0
death_loop / high paraphrase	0.000	0.000	0.000	0	500	500
silent_churn	1.000	1.000	1.000	500	500	0
escalation_burial	1.000	1.000	1.000	500	500	0
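
The table lists TP, TN, and FN but has no FP column. The sketch below shows one way the three scores could be derived from those counts. It rests on two assumptions that the report does not confirm: that FP equals the number of negative conversations minus TN, and that an undefined ratio (0/0) is reported as 0.000. The names `Confusion`, `from_row`, and `scores` are hypothetical and do not come from the benchmark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int
    tn: int
    fn: int
    fp: int


def from_row(tp: int, tn: int, fn: int, negatives: int = 500) -> Confusion:
    # Assumption: every negative conversation is either a TN or an FP.
    return Confusion(tp=tp, tn=tn, fn=fn, fp=negatives - tn)


def safe_div(num: float, den: float) -> float:
    # Assumption: undefined ratios (0/0) are reported as 0.0.
    return num / den if den else 0.0


def scores(c: Confusion) -> tuple[float, float, float]:
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1


# Counts copied from the "high paraphrase" row: TP=0, TN=500, FN=500.
high = from_row(tp=0, tn=500, fn=500)
print(high.fp)       # 0
print(scores(high))  # (0.0, 0.0, 0.0)
```

Under these assumptions the high-paraphrase row makes no positive predictions at all, so its precision of 0.000 reflects a 0/0 ratio rather than false positives.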

Notes

death_loop / exact paraphrase: the bot repeats identical responses. Expected to be trivially detected.
death_loop / low paraphrase: the bot repeats itself with small conversational prefixes (e.g. 'Again, ...'). Lexical similarity is preserved, so detection should be reliable.
death_loop / high paraphrase: the bot rotates through lexically distinct paraphrases. This is a known ceiling for the default lexical backend; a semantic similarity backend (sentence-transformers) would raise recall here.
silent_churn: the positive set is conversations with multi-turn engagement and no confirmation keyword from the user. The negative set is healthy conversations, which always contain a 'thanks' or similar resolution phrase.
escalation_burial: the positive set is conversations containing explicit human-agent requests followed by bot deflection. The negative set is healthy conversations without any escalation language.
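
To illustrate why the three death_loop variants behave so differently, here is a minimal sketch of a purely lexical repeat check based on token-set Jaccard overlap between consecutive bot turns. It is not the project's detector: the function names, the 0.6 threshold, the `min_repeats` parameter, and the sample utterances are all assumptions chosen for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    # Lowercase word tokens; punctuation and hyphens act as separators.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def looks_like_loop(bot_turns: list[str], threshold: float = 0.6,
                    min_repeats: int = 2) -> bool:
    # Count consecutive bot turns whose token overlap reaches the threshold.
    repeats = sum(
        1 for prev, cur in zip(bot_turns, bot_turns[1:])
        if jaccard(prev, cur) >= threshold
    )
    return repeats >= min_repeats


exact = ["Please restart your router and try again."] * 3
low = [
    "Please restart your router and try again.",
    "Again, please restart your router and try again.",
    "As I said, please restart your router and try again.",
]
high = [
    "Please restart your router and try again.",
    "Could you power-cycle the modem for me?",
    "Turning the device off and on usually helps.",
]

print(looks_like_loop(exact))  # True
print(looks_like_loop(low))    # True
print(looks_like_loop(high))   # False
```

The high-paraphrase turns share almost no tokens, so any measure built on surface overlap scores them as unrelated. An embedding-based comparison would be needed to recognise that they say the same thing.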