LLM & embedding backends

The built-in detectors use pure-stdlib backends by default: lexical similarity, keyword sentiment, regex-based safety checks. These are fast, dependency-free, and catch most real-world failures — but they have known ceilings. A paraphrased death loop slips past SequenceMatcher. Sarcasm slips past the keyword sentiment scorer. An off-brand joke dressed in neutral language slips past the pattern safety checker.

Each of these interfaces is pluggable. This tutorial shows how to swap in richer backends when you need them.

The similarity backend

DeathLoopDetector accepts a similarity_fn: Callable[[str, str], float]. The library ships with a semantic-similarity implementation based on sentence-transformers.
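Because the interface is just a callable, you can plug in anything before reaching for embeddings. A toy sketch, purely illustrative (token-level Jaccard overlap, not a backend the library ships):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap -- any Callable[[str, str], float]
    returning a value in [0.0, 1.0] satisfies the interface."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # treat two empty strings as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

`DeathLoopDetector(similarity_fn=jaccard_similarity)` would then use it directly. The shipped semantic backend is usually the better choice; this just shows how small the contract is.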

Install

pip install "chatbot-auditor[llm]"

This adds sentence-transformers and scikit-learn.

Use it

from chatbot_auditor import DeathLoopDetector, audit, default_registry
from chatbot_auditor.backends.embeddings import EmbeddingsSimilarity

backend = EmbeddingsSimilarity()  # downloads ~90 MB model on first use
semantic_loop = DeathLoopDetector(
    similarity_fn=backend,
    similarity_threshold=0.75,  # cosine similarity — lower than the 0.85 lexical default
)

registry = default_registry()
registry.unregister("death_loop")
registry.register(semantic_loop)

detections = audit(conversations, detectors=registry)

The embeddings backend caches results by text, so the same bot response scored against 10 other responses only runs the model once.
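The caching pattern is worth copying into any custom backend. A minimal sketch of the idea (hypothetical, not the library's actual implementation; `encode_fn` stands in for whatever model call you wrap):

```python
import math

class CachedSimilaritySketch:
    """Text-keyed embedding cache: each unique text is encoded at most once,
    no matter how many pairs it appears in. Illustrative only."""

    def __init__(self, encode_fn):
        self._encode = encode_fn                    # text -> list[float]
        self._cache: dict[str, list[float]] = {}

    def _embed(self, text: str) -> list[float]:
        if text not in self._cache:
            self._cache[text] = self._encode(text)  # cache miss: encode once
        return self._cache[text]

    def __call__(self, a: str, b: str) -> float:
        # Cosine similarity over the (cached) embeddings.
        va, vb = self._embed(a), self._embed(b)
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
        return dot / norm if norm else 0.0
```

Because `__call__` has the `(str, str) -> float` shape, an instance can be passed straight to `similarity_fn=`.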

Why it matters — "high paraphrase" recall

In the benchmarks, the lexical default scores F1 = 0.00 on the "high paraphrase" synthetic corpus (bots rotating through semantically equivalent but lexically distinct responses). That's a known ceiling — solved by swapping in embeddings:

# Run the benchmark with the semantic backend.
from chatbot_auditor.backends.embeddings import EmbeddingsSimilarity

backend = EmbeddingsSimilarity()
detector = DeathLoopDetector(similarity_fn=backend, similarity_threshold=0.75)
# Feed detector into the benchmark harness to see recall climb.

Users report ≥ 0.85 F1 on the same high-paraphrase corpus with the MiniLM model.

Model selection

Any sentence-transformers model works. The default (all-MiniLM-L6-v2) is the best CPU tradeoff, but you can switch:

# Larger, more accurate, needs more RAM:
backend = EmbeddingsSimilarity(model_name="sentence-transformers/all-mpnet-base-v2")

# GPU-backed for batch workloads:
backend = EmbeddingsSimilarity(device="cuda")

Injecting a custom encoder

For tests or custom models, pass model=... directly — anything with an encode(sentences, **kwargs) -> np.ndarray method works:

class MyEncoder:
    def encode(self, sentences, **kwargs):
        return my_embedding_service(sentences)

backend = EmbeddingsSimilarity(model=MyEncoder())

This is how the library's own tests run without downloading a real model.


The sentiment backend

SentimentCollapseDetector accepts a scorer: SentimentScorer — any object with a score(text: str) -> float method returning [-1.0, 1.0].
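The simplest possible instance of that interface is a hand-rolled word-list scorer. A toy sketch (the word sets are illustrative placeholders, and this is not the library's stdlib default):

```python
POSITIVE = {"great", "good", "thanks", "love"}      # example word lists --
NEGATIVE = {"bad", "hate", "terrible", "useless"}   # substitute your own

class WordListScorer:
    """Minimal scorer satisfying score(text) -> float in [-1.0, 1.0]."""

    def score(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        # +1 per positive word, -1 per negative word, normalized by length.
        signed = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
        return max(-1.0, min(1.0, signed / len(words)))
```

The richer backends below follow the exact same shape.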

VADER (lightweight, no torch)

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from chatbot_auditor import SentimentCollapseDetector

class VaderScorer:
    def __init__(self):
        self._v = SentimentIntensityAnalyzer()
    def score(self, text: str) -> float:
        return self._v.polarity_scores(text)["compound"]

detector = SentimentCollapseDetector(scorer=VaderScorer())

Hugging Face model

from transformers import pipeline
from chatbot_auditor import SentimentCollapseDetector

class TransformerScorer:
    def __init__(self, model: str = "cardiffnlp/twitter-roberta-base-sentiment-latest"):
        self._pipe = pipeline("sentiment-analysis", model=model)
    def score(self, text: str) -> float:
        # Truncate by characters as a cheap guard against the model's token limit.
        result = self._pipe(text[:512])[0]
        # Map positive/negative/neutral labels onto a signed [-1, 1] score.
        signed = 1.0 if result["label"].lower() == "positive" else (
            -1.0 if result["label"].lower() == "negative" else 0.0
        )
        return signed * result["score"]

detector = SentimentCollapseDetector(scorer=TransformerScorer())

LLM API

from chatbot_auditor import SentimentCollapseDetector

class LLMScorer:
    def __init__(self, client):
        self._client = client  # your OpenAI / Anthropic / Ollama client

    def score(self, text: str) -> float:
        response = self._client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=4,
            messages=[{
                "role": "user",
                "content": f"Score sentiment from -1 to 1: {text}\nJust the number:",
            }],
        )
        return float(response.content[0].text.strip())

# Cache aggressively — don't call the LLM for every message.

Watch latency: one LLM call per message will dominate audit time on large batches.
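One way to follow the caching advice, as a sketch: wrap any scorer in a text-keyed cache so repeated messages never hit the backend twice. `CachedScorer` is a hypothetical name, not part of the library:

```python
import functools

class CachedScorer:
    """Wrap any object with score(text) -> float so each unique text
    is scored at most once. Illustrative sketch."""

    def __init__(self, inner, maxsize: int = 50_000):
        # Bind the cache per instance so separate scorers don't share state.
        self._cached = functools.lru_cache(maxsize=maxsize)(inner.score)

    def score(self, text: str) -> float:
        return self._cached(text)
```

For example, `SentimentCollapseDetector(scorer=CachedScorer(LLMScorer(client)))` keeps the detector code unchanged while deduplicating LLM calls.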


The content safety backend

BrandDamageDetector accepts a checker: ContentSafetyChecker — any object with a check(text: str) -> list[str] method returning violation labels.

OpenAI Moderation

from openai import OpenAI
from chatbot_auditor import BrandDamageDetector

class OpenAIModerationChecker:
    def __init__(self):
        self._client = OpenAI()

    def check(self, text: str) -> list[str]:
        result = self._client.moderations.create(model="omni-moderation-latest", input=text)
        categories = result.results[0].categories
        return [k for k, v in categories.model_dump().items() if v]

detector = BrandDamageDetector(checker=OpenAIModerationChecker())

Policy-specific classifier

Your safety bar might include things beyond toxicity — mentions of a competitor, off-brand humor, inappropriate topics. Implement them against the same interface:

class CompanyPolicyChecker:
    def check(self, text: str) -> list[str]:
        violations = []
        if competitor_mentioned(text):
            violations.append("competitor_mention")
        if off_brand_tone(text):
            violations.append("off_brand")
        if llm_safety_classifier_says_unsafe(text):
            violations.append("llm_flagged")
        return violations
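The predicates above are placeholders for your own logic. A minimal regex-based sketch of the first two (the competitor names and tone markers here are made-up examples; substitute your own):

```python
import re

COMPETITORS = re.compile(r"\b(acme|globex)\b", re.IGNORECASE)   # example names
OFF_BRAND = re.compile(r"\b(lol|whatever|meh)\b", re.IGNORECASE)  # example tone markers

class SimplePolicyChecker:
    """check(text) -> list[str] of violation labels, per the interface."""

    def check(self, text: str) -> list[str]:
        violations = []
        if COMPETITORS.search(text):
            violations.append("competitor_mention")
        if OFF_BRAND.search(text):
            violations.append("off_brand")
        return violations
```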

Combine any of these with the default PatternSafetyChecker by composing:

from chatbot_auditor import PatternSafetyChecker

class CombinedChecker:
    def __init__(self):
        self._pattern = PatternSafetyChecker(competitor_names=("Acme",))
        self._llm = OpenAIModerationChecker()

    def check(self, text: str) -> list[str]:
        return self._pattern.check(text) + self._llm.check(text)

Principles

  • Start with the stdlib defaults. They catch the majority of real failures and require zero configuration. Swap in richer backends when you see a class of failure they can't detect.
  • Cache at the backend. Embeddings, transformer inference, LLM calls — all expensive. Cache by input text. The EmbeddingsSimilarity backend shows the pattern.
  • Measure before you switch. Run the benchmark harness with both backends on your own data before committing. The default isn't always wrong.
  • Keep backends independent from detectors. Your backend code shouldn't import from the detector module. The detector calls your function; your function doesn't know which detector is calling.