Silent Quality Regression: The LLM Bug You Never Notice
Here's a story. A team switches their support agent from Sonnet to Haiku to save on costs. The cost chart drops 65%. The engineering channel gets a trophy emoji. Three weeks later, the head of support notices resolution rate is down 12%, CSAT is down 0.3 points, and the number of escalations to human agents has crept up. Nobody connected the dots for 21 days because the dashboard that would have caught it didn't exist.
This is silent quality regression. It happens when model outputs get worse in ways your dashboards don't immediately capture — small enough that no individual user complains, large enough that aggregate metrics degrade. If you're not watching for it, you'll catch it eventually, but "eventually" is weeks, and the cost of those weeks is material.
Why silent regressions are hard
Most monitoring stacks watch for hard failures: 5xx responses, exceptions, dropped requests. LLM regressions are softer. The model returns a 200. The JSON parses. The tool call runs. It just… produces a slightly worse answer. The failure mode is semantic, not structural.
The classic failure surfaces:
- Completion length drifts. Answers get 15% shorter. Looks fine in any individual response. Aggregated, it means the model is dropping the clarifying examples and reasoning steps that made the output genuinely helpful.
- Tool-call success rate drops. The agent calls the wrong tool, or calls the right tool with slightly wrong arguments. The first retry usually succeeds. You see the retries in logs if you're looking; nobody's looking.
- Ambiguous-question handling degrades. Easy tickets still resolve. Hard tickets, which require the model to ask a clarifying question rather than assume, get answered with confident-but-wrong assumptions. Resolution rate on the hard 10% of tickets drops from 80% to 55%. Overall resolution rate moves from 92% to 89% — "within noise."
- Tone shifts. The new model writes more clipped, less warm responses. That alone can explain a full third-of-a-point CSAT drop.
- Over-refusal. The new model is more conservative about edge cases and starts refusing requests the old model handled fine. Ambiguity → refusal → user retries → user retries again → user gives up.
The four signals that catch it
Building a quality-regression detection system isn't magic. It is four signals, aggregated continuously, with thresholds and alerts.
Signal 1: Completion length (the cheap tripwire)
For each (feature_id × task_type) bucket, track the median output token count over a rolling window. If today's median drops more than 15% vs. the prior 7-day median, trigger an alert.
This is the cheapest possible regression detector. It's surprisingly useful. Long answer → short answer is almost always a signal something changed — model swap, prompt change, truncation. You'll get false positives (a prompt change that legitimately shortened outputs), but those are cheap to investigate and discard.
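As a sketch, the whole tripwire fits in a few lines, assuming you already have per-bucket lists of output token counts (the function name `length_alert` and the 15% default are illustrative, not a prescribed API):

```python
from statistics import median

def length_alert(todays_lengths, prior_week_lengths, drop_threshold=0.15):
    """Fire when today's median output length drops more than 15% vs. the
    prior 7-day median, for one (feature_id x task_type) bucket.

    `todays_lengths` / `prior_week_lengths` are lists of output token counts.
    """
    today = median(todays_lengths)
    baseline = median(prior_week_lengths)
    drop = (baseline - today) / baseline
    return drop > drop_threshold
```

Run it once per bucket per day; anything it flags goes into a cheap triage queue.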
Signal 2: Tool-call outcomes (for agents)
If your product uses tools, log every tool call with: tool name, argument validity (did the arguments parse?), tool return status (2xx/4xx/5xx from your tool), and whether the agent used the tool result vs. ignoring it.
The two aggregates to alert on: tool-call validity rate (did the model call a real tool with valid args?) and tool-call utility rate (did the agent actually integrate the tool's result into its next step?). Both drop when model quality degrades.
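A minimal logging-and-aggregation sketch might look like this; the `ToolCallRecord` fields mirror the list above, and the names are assumptions rather than a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class ToolCallRecord:
    tool_name: str
    args_valid: bool   # did the arguments parse / validate?
    status: int        # HTTP-style status returned by the tool itself
    result_used: bool  # did the agent integrate the result into its next step?

def tool_rates(records):
    """Compute the two alertable aggregates over a window of tool calls:
    validity rate (real tool, valid args, non-error status) and utility
    rate (the agent actually used the result)."""
    total = len(records)
    validity = sum(r.args_valid and r.status < 400 for r in records) / total
    utility = sum(r.result_used for r in records) / total
    return validity, utility
```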
Signal 3: Retry / regenerate / abandon
On the user side: did the user click regenerate, retry the same prompt, or abandon the session without a response marked as "useful"? These implicit behavioral signals are noisy at the individual level and strong at the aggregate level. The regenerate-rate per feature per day is a leading indicator.
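Computed naively, that per-feature-per-day regenerate rate is a few lines over your event log (the tuple shape here is an assumed, illustrative schema):

```python
from collections import defaultdict

def regen_rate_by_feature_day(events):
    """events: iterable of (feature_id, day, was_regenerate) tuples.
    Returns {(feature_id, day): regenerate rate} -- the leading indicator
    to chart and alert on."""
    counts = defaultdict(lambda: [0, 0])  # (feature, day) -> [regens, total]
    for feature, day, was_regen in events:
        bucket = counts[(feature, day)]
        bucket[0] += was_regen
        bucket[1] += 1
    return {key: regens / total for key, (regens, total) in counts.items()}
```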
Signal 4: Downstream frontier retry
This one is specific to AI-native products: did the user's agent — immediately after your model returned — call a frontier model like Opus or GPT-5.4 with the same input? That's the agent's author saying "the cheaper model I routed to didn't do it, let me retry with the big one." We see this in the wild; it's a very clean regression signal if your product supports agent chains.
Aggregating signals into a regression score
Any single signal will produce false positives. The trick is aggregation. We compute a per-workload regression score daily:
```python
import statistics

def z_score(value, baseline_values):
    """Standard deviations between today's value and the baseline window."""
    stdev = statistics.pstdev(baseline_values) or 1e-9  # avoid divide-by-zero
    return (value - statistics.mean(baseline_values)) / stdev

# Per-signal weights, tuned per workload.
WEIGHTS = {"length_drop": 1.0, "tool_validity": 1.5,
           "regen_rate": 1.0, "frontier_retry": 2.0}

def regression_score(window, baseline):
    # Length and validity *drops* are negated so that every signal pushes
    # the score up when quality degrades.
    signals = {
        "length_drop": -z_score(window.median_output, baseline.median_output),
        "tool_validity": -z_score(window.tool_validity_rate, baseline.tool_validity_rate),
        "regen_rate": z_score(window.regen_rate, baseline.regen_rate),
        "frontier_retry": z_score(window.frontier_retry_rate, baseline.frontier_retry_rate),
    }
    return sum(WEIGHTS[name] * s for name, s in signals.items())

# Trigger: score > threshold (2.0 is a reasonable default).
```

When the score crosses the threshold, you have high-signal evidence that something shifted. The dashboard lights up; the alerter pages whoever owns that workload; the routing table auto-rolls-back the last change if the signal is strong enough.
The auto-rollback pattern
The most valuable thing quality-regression detection enables is automated rollback. Human-in-the-loop investigation is great; a faster loop is "we pushed a routing change, the regression score spiked, we reverted it within 4 hours with zero human intervention."
KairosRoute does this automatically on our managed routing table. If we route more of your summarization traffic from Sonnet to Haiku and your regression score rises, we rebalance back toward Sonnet until the score normalizes. The customer sees a slight cost uptick in the dashboard with a note explaining why. That's the feedback loop that keeps routing changes safe.
You can build the same pattern yourself. The ingredients: a quality signal pipeline, a versioned routing config, and the discipline to trust the rollback when it triggers.
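A sketch of those ingredients, assuming a simple list-of-versions config store (`RoutingTable` and its method names are hypothetical, not the KairosRoute API):

```python
class RoutingTable:
    """Minimal versioned routing config with automatic rollback."""

    def __init__(self, initial_config):
        self.history = [initial_config]  # append-only version history

    @property
    def current(self):
        return self.history[-1]

    def apply(self, new_config):
        """Push a routing change as a new version."""
        self.history.append(new_config)

    def check_and_rollback(self, score, threshold=2.0):
        """Revert the most recent change if the regression score spiked."""
        if score > threshold and len(self.history) > 1:
            self.history.pop()
            return True   # rolled back, no human in the loop
        return False
```

Wire `check_and_rollback` to the daily regression score and the "reverted within 4 hours" loop falls out for free.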
The test set nobody maintains
Alongside continuous signals, you want a small, curated golden test set — 200–500 prompts that are diverse enough to represent your workload, with "what a good response looks like" rubrics attached. Run this set against any candidate model change before rolling it to production.
Scoring can be done with a stronger model ("use GPT-5.4 to rate these Haiku outputs vs. these Sonnet outputs") or with structural rubrics (length, JSON validity, tool-call correctness). The point isn't perfection — it's catching the "Haiku hallucinates in 8% of these specific prompts" signal before it hits a real user.
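The structural-rubric variant is trivially automatable. Here's a sketch where each golden prompt carries a rubric dict (keys like `min_tokens` and `must_be_json`, and both function names, are illustrative assumptions):

```python
import json

def structural_score(output, rubric):
    """Fraction of structural rubric checks a candidate output passes."""
    checks = []
    if "min_tokens" in rubric:
        checks.append(len(output.split()) >= rubric["min_tokens"])
    if rubric.get("must_be_json"):
        try:
            json.loads(output)
            checks.append(True)
        except ValueError:
            checks.append(False)
    return sum(checks) / len(checks) if checks else 0.0

def run_golden_set(candidate_fn, golden_set):
    """golden_set: list of (prompt, rubric) pairs. Returns the mean score --
    gate any candidate model change on this before production rollout."""
    scores = [structural_score(candidate_fn(p), r) for p, r in golden_set]
    return sum(scores) / len(scores)
```

`candidate_fn` is whatever calls the model under evaluation; swap `structural_score` for an LLM-judge call when you need semantic grading.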
Most teams never build this and regret it. It's a one-week project for a durable win.
What we do (the unsolicited plug)
The quality regression pipeline is one of the KairosRoute product's core differentiators. Every routed request passes through:
- Completion length tracking per workload.
- Tool-call validity & utility tracking (when your request includes tools).
- Optional user feedback API (POST /v1/feedback) to label regenerates, abandons, and explicit up/down votes.
- Aggregate regression score with per-workload thresholds.
- Auto-rollback if the score exceeds threshold after a routing change.
We surface the score in the dashboard alongside cost so you can see both axes at once. A routing change that cuts cost 30% is only a win if the regression score stays flat.
Takeaways
- LLM regressions fail softly. Your normal alerting won't catch them.
- Four signals are enough: length, tool validity, regenerate rate, frontier retry.
- Aggregate into a score; alert on threshold crossings.
- Auto-rollback turns detection from "ticket in someone's queue" into "noticeable hiccup then recovery."
- Maintain a golden test set. 200 prompts is enough to start.
If you'd rather get this bundled: sign up. Routing through us gets you the signal pipeline and the dashboard out of the box. If you're already routing through someone else, we're happy to help you instrument your own pipeline — here's the telemetry stack guide.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
Related Reading
You want to test GPT-5.4 vs Claude Sonnet on your real traffic. Here's how to run that A/B — sample sizing, the metrics that matter, guardrails that prevent user harm, and the statistics — without a PhD in experimentation.
The OpenAI invoice tells you what you spent. It does not tell you what it was spent on. Here is the observability gap that costs AI teams 30–50% of their margin, and the minimum stack to close it.
kr-auto picks the right model for every request, gets smarter from your own traffic, and gives you a receipt for the decision. Here is what that actually buys you — and why teams who try to roll their own spend six months getting it wrong.