
Scaling an AI Support Agent from 5K to 100K Tickets a Month

There is a specific, painful moment that every support-ops lead running an AI deflection agent hits. The product starts working. Deflection creeps from 28% to 41%. Tickets-per-week climbs from 5,000 to 20,000 to 60,000. And then someone pulls the invoice and the conversation becomes: we are paying $1.40 per ticket to deflect a ticket that a human could have answered for $3.20. That is a real number from a real customer. On 100K tickets a month, that is $140K in COGS against the $25K-ish you would pay a Tier 1 BPO.

AI deflection only works economically if your blended cost-per-ticket is well under your human cost-per-ticket. At small volume, nobody notices the gap. At scale, it is the whole business. This post is the playbook I wish I had read when I was in the seat.

The cost-per-ticket decomposition

Before you can optimize, you need to know what you are optimizing. A support agent ticket is not one LLM call. It is a loop of them. Here is a representative trace for a Tier 2 SaaS product:

| Stage | Calls per ticket | Model (before) | Avg in tokens | Avg out tokens | Cost/stage |
| --- | --- | --- | --- | --- | --- |
| Intent classification | 1 | gpt-4o | 180 | 40 | $0.003 |
| Context retrieval + re-rank | 3 | claude-sonnet-4-5 | 1,800 | 80 | $0.041 |
| Solution drafting | 1 | claude-opus-4-1 | 4,200 | 900 | $0.139 |
| Tone + policy check | 1 | claude-opus-4-1 | 1,100 | 120 | $0.033 |
| Final response generation | 1 | claude-opus-4-1 | 4,800 | 450 | $0.117 |
| Summarization + handoff prep | 1 | gpt-4o | 2,000 | 300 | $0.011 |
| Per-ticket total | 8 | | | | $0.344 |

At 100K tickets that is $34,400 a month, which sounds great — until you remember that your p95 ticket runs twice the loop (clarification follow-ups, tool calls) and your cost-per-ticket at the tail is closer to $0.90. Blend in a 12% rate of escalation to human where the AI cost is wasted, and your true all-in number is around $0.62/ticket, or $62K/month at 100K.
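The decomposition is worth computing rather than hand-maintaining, so it stays honest as prices and stages change. A minimal sketch, using the illustrative stage costs from the table above (your real numbers come from your gateway's receipts, not hardcoded constants):

```python
# Illustrative per-stage costs from the trace above (USD per ticket).
# In production these would be derived from live token counts and pricing.
STAGE_COSTS = {
    "intent_classification": 0.003,
    "context_retrieval_rerank": 0.041,
    "solution_drafting": 0.139,
    "tone_policy_check": 0.033,
    "final_response": 0.117,
    "summarization_handoff": 0.011,
}

def base_cost_per_ticket(stage_costs: dict) -> float:
    """Cost of one clean pass through the agent loop."""
    return sum(stage_costs.values())

def monthly_base_cost(stage_costs: dict, tickets: int) -> float:
    """Base monthly spend, before tail tickets and escalation waste."""
    return base_cost_per_ticket(stage_costs) * tickets
```

Note this is the base number only; the tail multiple and escalation waste described above sit on top of it.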

The quality-stratified cost target

The move is to stratify your loop by quality requirement. You are not picking one model; you are picking five. Here is the target architecture for the same agent at 100K/month:

| Stage | Target tier | Why | Cost/stage (after) |
| --- | --- | --- | --- |
| Intent classification | Fast/cheap (Haiku-class) | Deterministic 12-class problem | $0.0003 |
| Context retrieval + re-rank | Fast/cheap (Haiku-class) | Scoring, not generation | $0.004 |
| Solution drafting | Frontier (Opus/4.1) | Core quality moat | $0.139 |
| Tone + policy check | Mid-tier (Sonnet/4o) | Rule-following with some nuance | $0.008 |
| Final response generation | Frontier (Opus/4.1) | Customer-facing, CSAT-driven | $0.117 |
| Summarization + handoff prep | Fast/cheap | Summaries are easy | $0.001 |
| Per-ticket total | | | $0.269 |

That is a 22% blended reduction from stratification alone. Adding kr-auto on the mid-tier and frontier stages — where the router picks the cheapest model that passes the quality threshold — drops the total to about $0.18/ticket, a 47% reduction. At 100K/month, that is $18K instead of $34K in base cost and closer to $32K instead of $62K all-in.
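The routing rule described here, pick the cheapest model that clears the stage's quality threshold, fits in a few lines. The model names, costs, and scores below are illustrative stand-ins (the real router works off live eval data, not a static table):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost_per_call: float   # illustrative USD per call
    quality_score: float   # offline eval score for this stage, 0..1

def pick_cheapest_passing(candidates: list[Candidate],
                          threshold: float) -> Candidate:
    """Cheapest model whose measured quality on this stage clears the
    threshold; fall back to the best model if nothing passes."""
    passing = [c for c in candidates if c.quality_score >= threshold]
    if not passing:
        return max(candidates, key=lambda c: c.quality_score)
    return min(passing, key=lambda c: c.cost_per_call)
```

The fallback branch matters: a stage whose threshold nothing cheap can pass should silently stay on the frontier model, not fail open to the cheapest one.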

The part everyone skips: shadow evaluation

The failure mode of the architecture above is "we moved classification to Haiku and 3 weeks later our deflection rate dropped 6% and nobody knew why." That is the silent regression problem. The fix is boring and non-negotiable: shadow-run the new model in parallel with the old, on live traffic, before you cut over.

The mechanic is simple. For every production ticket:

  1. The primary stage runs on the current model, as it always has. Response goes to the user.
  2. The shadow stage runs the same prompt on the candidate cheaper model, in parallel. Response is logged but never served.
  3. A scoring job compares the two outputs on task-specific criteria — for classification, did they agree; for drafting, an LLM-as-judge score on factual accuracy, tone, and policy adherence.
  4. After 10-20K paired samples, you have a confident regression delta. If it is within your SLO, you cut over. If not, you stay.
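The four steps above fit in a small wrapper. A sketch with the model client stubbed out (replace call_model with your real client; shadow_log is a list here but would be the stream your scoring job reads):

```python
import concurrent.futures
import time

executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
shadow_log: list[dict] = []   # stand-in for your real paired-sample store

def call_model(model: str, prompt: str) -> str:
    # Stand-in for your actual LLM client call.
    return f"{model} answer to: {prompt}"

def run_stage(prompt: str, primary: str, shadow: str) -> str:
    """Serve the primary model's output; run the shadow candidate in
    parallel and log the pair for the offline scoring job."""
    shadow_future = executor.submit(call_model, shadow, prompt)
    answer = call_model(primary, prompt)   # this is what the user sees

    def _log(fut: concurrent.futures.Future) -> None:
        shadow_log.append({
            "ts": time.time(),
            "prompt": prompt,
            "primary_model": primary, "primary_out": answer,
            "shadow_model": shadow, "shadow_out": fut.result(),
        })

    shadow_future.add_done_callback(_log)  # never blocks the response
    return answer
```

The important property is that the shadow path can be slow, flaky, or wrong without any user-visible effect; only the log sees it.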

This is the engineering discipline that separates teams who scale AI support from teams who blow up. See Silent Quality Regression for the full methodology and LLM A/B Testing in Production for how to run these experiments without capsizing your CSAT score.

The guardrail layer

Even with shadow evals, you want runtime guardrails. Three are worth building:

  • Confidence gate on final response. If the drafter's confidence score comes back below a threshold, skip the policy check and escalate straight to a human. That is cheaper than publishing a bad answer.
  • CSAT-weighted sampling. 2% of tickets get a second opinion from the next tier up, logged and scored. This catches distribution drift that shadow evals miss because training data shifted.
  • Cost ceiling per ticket. If the loop exceeds your budget multiple (say 3x expected cost), break out and hand off. Runaway tickets are where the worst cost spikes come from.

All three are a few dozen lines of middleware if you own the orchestration. If you are on a gateway, they are typically config.
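A sketch of what that middleware looks like. The run_stage interface (returning output, confidence, and cost per stage) is an assumption for illustration, not a specific library's API:

```python
import random

class Escalate(Exception):
    """Raised to break out of the loop and hand the ticket to a human."""

def guarded_loop(stages, run_stage, budget_usd: float,
                 confidence_floor: float = 0.7, sample_rate: float = 0.02):
    """Agent loop wrapped in the three guardrails above. run_stage is
    assumed to return (output, confidence, cost_usd) for each stage."""
    spent = 0.0
    outputs = {}
    for stage in stages:
        output, confidence, cost = run_stage(stage)
        spent += cost
        if spent > budget_usd:               # cost ceiling per ticket
            raise Escalate(f"budget exceeded at {stage}: ${spent:.2f}")
        if confidence < confidence_floor:    # confidence gate
            raise Escalate(f"low confidence at {stage}: {confidence:.2f}")
        outputs[stage] = output
    if random.random() < sample_rate:        # CSAT-weighted sampling hook
        pass  # enqueue this ticket for a second opinion from the next tier up
    return outputs
```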

What changes at 100K tickets

A lot of the small stuff you ignore at 5K matters at 100K. Specifically:

Caching is not optional

A prompt cache on the retrieval + re-rank stage can drop up to 40% of your per-ticket token bill, because support questions cluster hard — password resets, billing questions, top 20 product features. Semantic caching needs tuning (false positives are bad), but exact-match caching on system prompts and tool schemas is pure upside.
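One concrete flavor of this, a local exact-match response cache, is a complement to provider-side prefix caching rather than a replacement for it. A minimal sketch (the in-memory dict stands in for whatever shared store you would actually use):

```python
import hashlib

_response_cache: dict[str, str] = {}

def cache_key(system_prompt: str, tool_schema: str, user_msg: str) -> str:
    """Exact-match key over the full request. Any byte difference is a
    miss, which is what makes this flavor safe: no false positives,
    unlike semantic caching."""
    h = hashlib.sha256()
    for part in (system_prompt, tool_schema, user_msg):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")   # delimiter so fields cannot bleed together
    return h.hexdigest()

def cached_call(system_prompt, tool_schema, user_msg, call_fn):
    """call_fn is your real model client; it only runs on a miss."""
    key = cache_key(system_prompt, tool_schema, user_msg)
    if key not in _response_cache:
        _response_cache[key] = call_fn(system_prompt, tool_schema, user_msg)
    return _response_cache[key]
```

In practice you would add a TTL and an invalidation hook for KB updates; a stale cached answer about a changed policy is its own incident.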

Provider redundancy becomes load-bearing

At 5K tickets/month, a 45-minute provider outage is annoying. At 100K, it is a CSAT-damaging incident. Multi-provider routing with automatic failover is worth the 5% latency overhead it adds. A single-provider support stack is a single point of failure for your whole SLA.
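The failover itself is not complicated; the discipline is keeping the provider list ordered and the prompts portable. A sketch, with call_fn standing in for your real per-provider client:

```python
def call_with_failover(prompt: str, providers: list[str], call_fn,
                       attempts_per_provider: int = 2):
    """Try providers in priority order; any exception falls through to
    the next provider. call_fn(provider, prompt) is an assumed client
    interface for illustration."""
    last_err = None
    for provider in providers:
        for _ in range(attempts_per_provider):
            try:
                return call_fn(provider, prompt)
            except Exception as err:
                last_err = err
    raise RuntimeError("all providers failed") from last_err
```

A real implementation would add per-provider timeouts and a circuit breaker so a dead provider is skipped outright instead of paying the retry penalty on every ticket.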

Per-customer attribution

Your top decile of customers by ticket volume is almost certainly driving 60%+ of your AI spend. If you do not attribute cost per customer, you cannot price them correctly, and you cannot have the "we need to move you to the enterprise tier" conversation backed by numbers. The receipts stream from the gateway lands in your warehouse and solves this for free.
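The aggregation itself is trivial once receipts carry a customer identifier. A sketch over an assumed receipt shape (check your gateway's actual schema for the real field names):

```python
from collections import defaultdict

def spend_by_customer(receipts):
    """receipts: iterable of dicts with at least 'customer_id' and
    'cost_usd' fields (an assumed shape for illustration). Returns
    customers sorted by descending spend."""
    totals: dict[str, float] = defaultdict(float)
    for r in receipts:
        totals[r["customer_id"]] += r["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

In practice this runs as a warehouse query rather than application code, but the shape of the rollup is the same.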

The 30-day migration plan

  1. Week 1: Install the gateway. Passthrough mode only — no routing changes. Collect a week of receipts.
  2. Week 2: Stratify the loop on paper. Pick your three cheap-tier candidate stages. Set up shadow evals on each.
  3. Week 3: Review shadow eval deltas. Cut over the stages that pass SLO. Leave the rest.
  4. Week 4: Turn on kr-auto with quality thresholds on the frontier stages. Add the three runtime guardrails. Wire the Slack digest.

Four weeks, measurable every day, reversible at any point. The worst case if you stop halfway is a 20-25% cost reduction from stratification alone. The expected case is 45-60%.

The bigger point

Routing is the wedge. You install it to cut the bill, and you keep it because it is the only place you can actually see what your agent is doing, where your quality is degrading, and which customers are unprofitable. The cost reduction pays for itself in month one; the analytics make you a credible operator every month after.

If you want to see what your own support agent looks like through that lens, the easiest entry is the playground — send a representative ticket prompt through kr-auto, watch the routing decision and per-stage breakdown, and extrapolate to your ticket volume. Ten minutes, no integration required.

Ready to route smarter?

KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.

Related Reading

Silent Quality Regression: The LLM Bug You Never Notice

Your model bill went down 20%. Nobody complained. Three weeks later, your agent's resolution rate has quietly dropped 12%. This is silent quality regression — and it is the single most dangerous failure mode in LLM ops.

A/B Testing LLMs in Production Without Shipping a Regression

You want to test GPT-5.4 vs Claude Sonnet on your real traffic. Here's how to run that A/B — sample sizing, the metrics that matter, guardrails that prevent user harm, and the statistics — without a PhD in experimentation.

The Cheapest-Model-Per-Stage Pattern for Production RAG

Most RAG pipelines run every stage on the same frontier model. That is the single biggest cost leak in production AI. Here is the stage-by-stage model selection pattern, with a concrete per-query cost breakdown.