Benchmark Results
Generated: 2026-04-17T14:54:21+00:00; Seed: 20260417; Corpus size per row: 500 positive + 500 negative conversations
| Detector / variant | Precision | Recall | F1 | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|
| death_loop / exact paraphrase | 1.000 | 1.000 | 1.000 | 500 | 0 | 500 | 0 |
| death_loop / low paraphrase | 1.000 | 1.000 | 1.000 | 500 | 0 | 500 | 0 |
| death_loop / high paraphrase | 0.000 | 0.000 | 0.000 | 0 | 0 | 500 | 500 |
| silent_churn | 1.000 | 1.000 | 1.000 | 500 | 0 | 500 | 0 |
| escalation_burial | 1.000 | 1.000 | 1.000 | 500 | 0 | 500 | 0 |
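
The Precision, Recall, and F1 columns follow from the TP/FP/TN/FN counts in each row. The sketch below shows one way to derive them. The `Confusion` class, the `metrics` helper, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not the benchmark's actual code, although that convention is consistent with the 0.000 values in the high-paraphrase row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one benchmark row (hypothetical helper)."""
    tp: int
    fp: int
    tn: int
    fn: int


def safe_div(num: float, den: float) -> float:
    # Assumed convention: report 0.0 when the ratio is undefined (den == 0).
    return num / den if den else 0.0


def metrics(c: Confusion) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for the given counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1


# Counts copied from two rows of the table above.
rows = {
    "death_loop / exact paraphrase": Confusion(tp=500, fp=0, tn=500, fn=0),
    "death_loop / high paraphrase": Confusion(tp=0, fp=0, tn=500, fn=500),
}

for name, c in rows.items():
    p, r, f1 = metrics(c)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In the high-paraphrase row the detector makes no positive predictions at all (TP + FP = 0), so precision is 0/0 and the 0.000 shown is a reporting convention rather than a measured value.
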
Notes
- death_loop / exact paraphrase: the bot repeats identical responses verbatim. Expected to be trivially detected.
- death_loop / low paraphrase: the bot repeats the same response with a short conversational prefix (e.g. 'Again, ...'). Lexical similarity is preserved, so detection should be reliable.
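- death_loop / high paraphrase: the bot rotates through lexically distinct paraphrases. This is a known ceiling for the default lexical backend: it flagged none of the 500 positive conversations (TP = 0, FN = 500). Because it made no positive predictions at all (TP + FP = 0), precision is formally undefined and is reported as 0.000. A semantic similarity backend (sentence-transformers) should raise recall here.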
- silent_churn: conversations with multi-turn engagement and no confirmation keyword from the user. The negative set consists of healthy conversations, each of which always contains a 'thanks' or similar resolution phrase.
- escalation_burial: conversations containing an explicit request for a human agent, followed by bot deflection. The negative set consists of healthy conversations without any escalation language.
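
The gap between the low- and high-paraphrase death_loop variants can be illustrated with a small lexical similarity check. This is a minimal sketch under stated assumptions: the Jaccard token-overlap measure, the `looks_like_loop` helper, the 0.6 threshold, and the sample bot responses are all hypothetical and are not taken from the benchmarked detector, whose similarity function is not described here.

```python
import re

# Hypothetical threshold; the real detector's setting is not given in this report.
THRESHOLD = 0.6


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A & B| / |A | B|, with 1.0 for two empty strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def looks_like_loop(responses: list[str], threshold: float = THRESHOLD) -> bool:
    """Flag a conversation if any two consecutive bot responses are lexically similar."""
    return any(jaccard(x, y) >= threshold for x, y in zip(responses, responses[1:]))


# Made-up bot responses for each variant.
exact = ["Please restart your router and try again."] * 2
low = [
    "Please restart your router and try again.",
    "Again, please restart your router and try again.",
]
high = [
    "Please restart your router and try again.",
    "Power-cycling the modem usually fixes this issue.",
]

print(looks_like_loop(exact))  # True: identical token sets
print(looks_like_loop(low))    # True: the prefix adds no new tokens
print(looks_like_loop(high))   # False: no shared tokens
```

The two responses in `high` mean roughly the same thing but share no words, so any purely lexical measure scores them as unrelated. An embedding-based similarity is the usual way to close that gap.
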