Monitoring LLM behavior: Drift, retries, and refusal patterns (11 minute read)
A Microsoft engineer outlines a two-layer evaluation framework for monitoring LLM systems in production, combining deterministic checks with model-based semantic assessments to catch failures both before deployment and in live traffic.
What: A framework called the "AI Evaluation Stack" that separates LLM testing into deterministic assertions (checking syntax, schema, and routing) and model-based evaluations (semantic quality scored by an "LLM-as-a-Judge"). It pairs offline pre-deployment pipelines built on curated test datasets with online production monitoring that feeds back into continuous improvement.
Why it matters: Traditional unit testing breaks down for LLMs because the same prompt produces different outputs each time, making it impossible to rely on deterministic pass/fail checks alone. Enterprise AI systems face compliance risks from hallucinations and failures, requiring structured evaluation infrastructure instead of informal "vibe checks" that pass in development but fail when customers use the product.
Takeaway: Implement a two-pipeline evaluation system: build an offline regression suite with 200-500 "golden" test cases requiring 95%+ pass rates before deployment, then monitor production with both explicit feedback (thumbs up/down) and implicit signals (retry rates, refusal patterns) to continuously update your test dataset.
Deep dive
- Layer 1 deterministic assertions act as fail-fast gates: traditional code and regex validate structural integrity before expensive semantic checks run, catching issues like malformed JSON, incorrect tool calls, or missing required arguments with instant binary pass/fail results (see the first sketch after this list)
- Layer 2 model-based assertions use an "LLM-as-a-Judge" architecture to evaluate semantic qualities like helpfulness or tone, requiring three critical inputs: a frontier reasoning model stronger than the production model, a strict scoring rubric with explicitly defined gradients (not vague "rate this" prompts), and human-vetted golden outputs as ground truth (see the second sketch below)
- Offline pipelines gate pre-deployment with golden datasets of 200-500 test cases representing real-world traffic distributions including edge cases and adversarial inputs, integrated as blocking CI/CD steps with 95%+ pass rates required for enterprise (99%+ for high-risk domains)
- Composite scoring systems weight deterministic and semantic checks differently, for example allocating 6 points to structural validity (correct tool, valid JSON, schema compliance) and 4 points to semantic quality (subject-line accuracy, hallucination-free content), with short-circuit logic that fails the entire test instantly if any deterministic check fails (the first sketch below combines both layers this way)
- Any system modification requires full regression testing because LLM non-determinism means fixes for one edge case can cause unforeseen degradations elsewhere, making continuous re-evaluation against the entire golden dataset mandatory
- Online pipelines monitor four telemetry categories post-deployment: explicit user signals (thumbs up/down, written feedback), implicit behavioral signals (regeneration/retry rates, apology detection, refusal rates), synchronous deterministic asserts on 100% of traffic, and asynchronous LLM-Judge sampling of ~5% of sessions (see the final sketch after this list)
- Production LLM-Judges must run asynchronously rather than on the critical path to avoid doubling latency and compute costs, sampling a small fraction of daily sessions to generate continuous quality dashboards while respecting data privacy agreements
- The feedback flywheel prevents dataset rot by capturing production failures (negative signals or behavioral flags), triaging them for human review, conducting root-cause analysis, appending corrected cases to the golden dataset with synthetic variations, and continuously re-evaluating the model against newly discovered edge cases
- Synthetic data generation accelerates dataset curation but introduces contamination and bias risks, requiring mandatory human-in-the-loop review where domain experts validate AI-generated test cases before committing them to the repository
- Static golden datasets suffer from concept drift as user behavior evolves and customers discover novel use cases not covered in original evaluations, creating a dangerous illusion of high offline pass rates masking degrading real-world experiences
- Apology and refusal rates reveal silent failures: programmatically scanning for phrases like "I'm sorry" detects degraded capabilities or broken tool routing, while abnormally high refusal rates indicate overly strict safety filters rejecting benign queries
- The architecture redefines "done" for AI features as requiring not just coherent responses but rigorous automated evaluation pipelines that pass against both curated golden datasets and continuously discovered production edge cases
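
To make the two layers concrete, here is a minimal Python sketch of a composite scorer in the spirit described above. The 6/4 point split mirrors the example weights, but the function names, the `tool`/`arguments` JSON fields, and the scoring interface are illustrative assumptions, not the author's implementation:

```python
import json

# Illustrative weights mirroring the article's example: 6 points structural,
# 4 points semantic.
STRUCTURAL_POINTS = 6
SEMANTIC_POINTS = 4

def deterministic_gate(raw_output: str, expected_tool: str,
                       required_args: set[str]) -> bool:
    """Layer 1: cheap, binary structural checks (fail-fast)."""
    try:
        call = json.loads(raw_output)          # malformed JSON -> fail
    except json.JSONDecodeError:
        return False
    if call.get("tool") != expected_tool:      # wrong tool routed -> fail
        return False
    args = call.get("arguments", {})
    return required_args <= set(args)          # missing required args -> fail

def composite_score(raw_output: str, expected_tool: str,
                    required_args: set[str], judge_points: int) -> int:
    """Short-circuits to 0 if any deterministic check fails; otherwise adds
    the judge-awarded semantic points (capped at 4) to the structural 6."""
    if not deterministic_gate(raw_output, expected_tool, required_args):
        return 0
    return STRUCTURAL_POINTS + min(judge_points, SEMANTIC_POINTS)
```

For example, a well-formed `send_email` tool call with all required arguments and a judge score of 3 would earn 9/10, while any structural defect returns 0 without ever invoking the judge.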
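The semantic side might look like the sketch below: a judge prompt with an explicit scoring gradient rather than a vague "rate this" instruction. `call_frontier_model` is a hypothetical stand-in for whatever client invokes the judge model, and the rubric text is illustrative:

```python
JUDGE_RUBRIC = """You are grading a customer-support reply against a golden output.
Score helpfulness on this exact scale:
5 = fully resolves the request with accurate, grounded detail
3 = partially helpful; omits steps or hedges unnecessarily
1 = unhelpful, off-topic, or contradicts the golden output
Reply with a single integer."""

def judge_semantic_quality(candidate: str, golden_output: str,
                           call_frontier_model) -> int:
    """call_frontier_model(prompt) -> str is a placeholder for the client
    that invokes the frontier judge model (stronger than the production
    model, per the article)."""
    prompt = (f"{JUDGE_RUBRIC}\n\nGolden output (ground truth):\n{golden_output}"
              f"\n\nCandidate output:\n{candidate}")
    verdict = call_frontier_model(prompt)
    try:
        return int(verdict.strip())
    except ValueError:
        return 0  # treat an unparseable verdict as a failed evaluation
```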
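And on the online side, a rough sketch of the implicit-signal scan and the ~5% asynchronous judge sampling. The phrase lists and the `enqueue` hook are assumptions; a real deployment would tune the patterns per product and wire the queue to its own infrastructure:

```python
import random
import re

# Illustrative phrase lists; tune these per product and language.
APOLOGY_PATTERN = re.compile(r"\b(?:i'm sorry|i apologize)\b", re.IGNORECASE)
REFUSAL_PATTERN = re.compile(r"\b(?:i can't help with|i cannot assist)\b",
                             re.IGNORECASE)

JUDGE_SAMPLE_RATE = 0.05  # ~5% of sessions go to the asynchronous judge

def scan_session(responses: list[str]) -> dict[str, int]:
    """Synchronous scan on 100% of traffic: counts apology and refusal
    phrasing that can signal degraded capabilities or over-strict filters."""
    return {
        "apologies": sum(bool(APOLOGY_PATTERN.search(r)) for r in responses),
        "refusals": sum(bool(REFUSAL_PATTERN.search(r)) for r in responses),
    }

def maybe_enqueue_for_judge(session_id: str, enqueue) -> None:
    """Asynchronous path: sample a small fraction of sessions for LLM-Judge
    review so evaluation never sits on the request's critical path.
    `enqueue` is a hypothetical handle to a task queue."""
    if random.random() < JUDGE_SAMPLE_RATE:
        enqueue(session_id)
```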
Decoder
- LLM-as-a-Judge: Using a large language model to evaluate the output quality of another LLM, serving as a scalable proxy for human judgment when assessing semantic qualities like helpfulness or tone that can't be captured with traditional code assertions
- Golden Dataset: A version-controlled repository of 200-500 human-reviewed test cases pairing exact input prompts with expected "golden outputs" (ground truth), representing the AI system's full operational envelope including edge cases and adversarial inputs
- Stochastic: Non-deterministic behavior where the same input produces different outputs, breaking traditional unit testing assumptions that Input A plus Function B always equals Output C
- Concept drift: The degradation of model performance over time as real-world user behavior and use cases evolve beyond what was covered in static training or evaluation datasets
- Short-circuit evaluation: Fail-fast logic that immediately terminates testing and returns a failure result when a critical condition isn't met, preventing wasteful execution of expensive downstream checks
- Tool call: When an LLM invokes a specific function or API with structured arguments rather than generating conversational text, typically requiring exact JSON schema compliance
- HITL (Human-in-the-Loop): Architecture requiring human review and validation at critical stages, such as verifying AI-generated test cases before adding them to the evaluation dataset
Original article
Monitoring LLM behavior necessitates adopting the AI Evaluation Stack, separating tests into deterministic assertions (syntax and routing integrity) and model-based evaluations (semantic quality). Engineers use offline pipelines for pre-deployment regression testing with human-reviewed "Golden Datasets" while online pipelines monitor real-world performance for drift and failures. A continuous feedback loop from production telemetry ensures AI systems adapt, maintaining high performance as user behavior evolves.