Benchmark Results

Generated: 2026-04-17T14:54:21+00:00
Seed: 20260417
Corpus size per row: 500 positive + 500 negative conversations

Detector / variant               Precision  Recall  F1     TP   FP  TN   FN
death_loop / exact paraphrase    1.000      1.000   1.000  500  0   500  0
death_loop / low paraphrase      1.000      1.000   1.000  500  0   500  0
death_loop / high paraphrase     0.000      0.000   0.000  0    0   500  500
silent_churn                     1.000      1.000   1.000  500  0   500  0
escalation_burial                1.000      1.000   1.000  500  0   500  0
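
The score columns follow from the count columns. The sketch below shows one way to derive them. It assumes a zero denominator maps to 0.0, which is consistent with the high-paraphrase row (TP + FP = 0) but is not confirmed by this report. The function name `prf` is hypothetical.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts copied from two rows of the table above.
exact_row = prf(tp=500, fp=0, fn=0)
high_paraphrase_row = prf(tp=0, fp=0, fn=500)
print(exact_row, high_paraphrase_row)
```

TN does not enter any of the three scores, so the 500 true negatives in the high-paraphrase row do not lift its reported numbers.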

Notes

  • death_loop / exact paraphrase: bot repeats identical responses. Expected to be trivially detected.
  • death_loop / low paraphrase: bot repeats with small conversational prefixes (e.g. 'Again, ...'). Lexical similarity preserved; should be reliably detected.
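  • death_loop / high paraphrase: bot rotates through lexically distinct paraphrases. This is a known ceiling for the default lexical backend: recall is 0.000, and because no conversation is flagged (TP + FP = 0), precision is undefined and reported as 0.000. A semantic similarity backend (sentence-transformers) would raise recall here.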
  • silent_churn: conversations with multi-turn engagement and no confirmation keyword from the user. The negative set consists of healthy conversations, which always contain a 'thanks' or similar resolution phrase.
  • escalation_burial: conversations containing explicit human-agent requests followed by bot deflection. The negative set consists of healthy conversations without any escalation language.
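
The gap between the low- and high-paraphrase variants of death_loop can be illustrated with a token-overlap score. This is only a sketch: the report does not say which lexical measure the detector uses, so Jaccard similarity, the 0.5 threshold, and the sample bot replies are all assumptions chosen for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


THRESHOLD = 0.5  # hypothetical cut-off, not taken from the benchmark

# Made-up bot replies for illustration only.
base = "Please check your order status in the app."
low = "Again, please check your order status in the app."
high = "You can look up where your package is from our mobile application."

for label, reply in [("exact", base), ("low", low), ("high", high)]:
    score = jaccard(base, reply)
    print(f"{label}: {score:.3f} repeated={score >= THRESHOLD}")
```

Under these assumptions the prefixed reply shares almost all of its tokens with the original, while the reworded reply shares almost none, so any purely lexical threshold misses it. That matches the pattern in the table without claiming to reproduce the detector itself.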