Semantic Routing vs. Classifier Routing: What Actually Works in Production
Every LLM router faces the same question on every incoming request: what is this request actually asking the model to do? The answer determines which candidate models are fit to serve it. Get the classification wrong and you either overspend (route an easy task to Opus) or ship a quality regression (route a hard task to Haiku).
There are three mainstream approaches. They are not equivalent. This post lays out the tradeoffs with enough detail that you can pick one for your own stack.
The three approaches
- Semantic / embedding similarity. Embed the incoming prompt. Compare it to a set of labeled anchor examples. Pick the nearest anchor's class.
- Zero-shot LLM classifier. Ask a small fast LLM (e.g. GPT-5.4-mini) to classify the request into one of N categories. Use the answer.
- Trained classifier. Fine-tune a small transformer on labeled production traffic. Use its prediction.
All three work. They work differently on cost, latency, accuracy, and maintenance burden. The best choice depends on where you are in your product's lifecycle.
Semantic similarity
How it works
You pick a set of anchor prompts for each category (typically 5–20 per category). You embed each anchor ahead of time and store the vectors. At request time, embed the incoming prompt, compute cosine similarity against every anchor, and return the category of the highest-scoring anchor.
```python
# Hypothetical helpers: embed_batch/embed call your embedding model;
# cosine_sim compares one vector against the stored anchor matrix.
anchor_vectors = embed_batch([
    # extraction
    "What is the customer's stated intent in this message?",
    "Extract all dates mentioned in the following text.",
    # summarization
    "Summarize this email thread in 3 bullets.",
    ...
])

def classify(prompt: str) -> str:
    v = embed(prompt)
    sims = cosine_sim(v, anchor_vectors)
    return anchors[sims.argmax()].category
```

Pros
- Cheap at inference. One small embedding call plus a vector comparison.
- Fast. Sub-20ms including network.
- Easy to set up. No training pipeline required.
- Easy to iterate. Adding a new category is "add 10 more anchors."
Cons
- Hits a ceiling around 85% accuracy on realistic production traffic. Careful anchor selection can push that to 88%, but 92%+ is hard to reach.
- Breaks when prompts are long. Embedding a 2K-token prompt loses the signal in the average; important clues ("please write Python code that...") get diluted.
- Brittle across domains. Similarity is largely lexical: an anchor that says "Classify intent" won't match "Is this email spammy?", even though both are intent-classification tasks.
- Blind to structured features. If the request has function definitions, a long conversation history, or a vision input, similarity-by-text misses it entirely.
When to use it
Semantic similarity is a fine choice for a v0. If you're at a product stage where the difference between 85% and 95% routing accuracy doesn't show up on your P&L, ship this and move on. You'll know when you've outgrown it because your cost-per-task-type chart will stop making sense.
Zero-shot LLM classifier
How it works
On every request, you call a small fast LLM with a prompt like "Classify the following user request into one of: extraction, summarization, code, reasoning, analysis, frontier." You parse the output.
```python
import openai  # assumes the OpenAI SDK v1+, API key set in the environment

def classify(prompt: str) -> str:
    resp = openai.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[
            {"role": "system", "content": CLASSIFY_PROMPT},
            {"role": "user", "content": prompt},
        ],
        max_tokens=5,  # category name only; keep the call cheap
    )
    return resp.choices[0].message.content.strip()
```

Pros
- Better accuracy than semantic similarity out of the box — ~90–93% on realistic traffic with a decent classifier prompt.
- No training pipeline. You iterate by editing the classifier prompt.
- Handles long prompts and unusual structure reasonably well because the classifier actually reads the input.
Cons
- Expensive per request. Even a cheap classifier model costs money. If your base request is $0.0005 and the classifier costs $0.0003, you've just added 60% to your per-request fixed cost. That's before any provider outage penalties.
- Latency hit. 100–300ms added to every request. On a 500ms base request, that's a 20–60% latency regression.
- You're now dependent on another provider. If the classifier's provider has an outage, your routing fails too. Ironic.
- Accuracy varies with classifier model. If the small model you're using gets deprecated, you re-validate.
When to use it
Zero-shot LLM classification is great for prototyping and for workloads where the math works out ("I'm routing to $3/M tokens Sonnet anyway, so a $0.15/M mini classifier is rounding error"). It's not a great choice for high-volume, cost-sensitive routing — you end up burning most of your savings on the classifier.
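That "rounding error" intuition is easy to sanity-check. A minimal sketch, with illustrative per-request prices rather than real quotes:

```python
def classifier_overhead(base_cost_per_req: float, classifier_cost_per_req: float) -> float:
    """Fraction the classifier call adds to the per-request fixed cost."""
    return classifier_cost_per_req / base_cost_per_req

# Cheap base request: a $0.0003 classifier on a $0.0005 request adds 60%.
cheap = classifier_overhead(0.0005, 0.0003)

# Pricier base request: the same classifier on a $0.03 request adds 1% --
# the "rounding error" case that makes zero-shot viable.
pricey = classifier_overhead(0.03, 0.0003)
```

The same ratio tells you when to switch away: once the overhead fraction rivals your routing savings, the classifier is eating its own lunch.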
Trained classifier
How it works
You fine-tune a small transformer (a few million parameters) on labeled prompts. At request time, you run the transformer locally (or on a private edge node) and take its prediction.
```python
model = load_classifier("kr-classifier-v4.2.pt")

def classify(prompt: str, features: dict) -> tuple[str, float]:
    x = tokenize_and_featurize(prompt, features)  # text plus structured features
    logits = model.forward(x)
    probs = softmax(logits)
    # Return the predicted category and its confidence.
    return labels[probs.argmax()], probs.max()
```

Pros
- Highest accuracy. We hit ~96% on our held-out eval set: roughly 9 points over semantic similarity and 5 points over zero-shot on the same traffic.
- Fast. 15–40ms on small hardware.
- Cheap per inference. No per-call API costs.
- Can ingest arbitrary features: prompt embedding, token count, code markers, tool schemas, response format, conversation depth. More features → more accuracy.
- Retrainable. Every 24 hours, we collect new production signals and update the weights. The classifier gets better as your workload evolves.
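As a concrete sketch of what "arbitrary features" can look like in practice (the helper and field names here are hypothetical, not KairosRoute's actual pipeline):

```python
import re

def featurize(prompt: str, request: dict) -> dict:
    """Structured signals extracted alongside the prompt text (illustrative set)."""
    return {
        "token_estimate": len(prompt) // 4,  # rough chars-per-token heuristic
        "has_code_keywords": bool(re.search(r"\b(def|class|import|function)\b", prompt)),
        "tool_count": len(request.get("tools", [])),     # attached tool schemas
        "turn_depth": len(request.get("messages", [])),  # conversation history length
        "wants_json": request.get("response_format", {}).get("type") == "json_object",
    }
```

These features get concatenated with the prompt embedding before the classifier head; each one is a signal that pure text similarity would miss.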
Cons
- Requires a training pipeline. You need labeled data and a retraining cadence, or the classifier goes stale.
- Requires MLOps. You need eval sets, deployment infra, rollback, model versioning.
- Only as good as its training data. Start with bad labels, ship bad routing.
- A months-long project from scratch, not a weekend one.
When to use it
Trained classifiers are the right call when routing quality affects your bottom line and the value of a 3-point accuracy gain justifies the engineering burden. If you're in the early exploration phase, this is overkill. If you're processing a million requests a day and every routing error has a compounding cost impact, this is the only approach that makes sense.
Head-to-head on one workload
We ran the three approaches on the same 5,000-prompt evaluation set derived from anonymized customer traffic. Categories are our standard six (extraction, summarization, code, reasoning, analysis, frontier).
| Approach | Accuracy | p50 Latency | $/1M classifications | Implementation days |
|---|---|---|---|---|
| Semantic similarity (MiniLM) | 86.4% | 18ms | $2.10 | ~3 days |
| Zero-shot LLM (gpt-5.4-mini) | 91.2% | 180ms | $140 | ~1 day |
| Trained classifier (ours) | 95.8% | 22ms | $0.08 | ~90 days |
Translation: if you're building from scratch, semantic similarity buys you most of the accuracy for about 3% of the effort. Zero-shot is the easiest "quick and better" upgrade. Trained classifiers dominate on cost and latency once they exist — but the ~90-day build is a real tax.
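One hedged way to read the cost column: ask how many classifications the trained classifier must serve before its build cost pays back against zero-shot's per-call price. The build-cost figure below is an assumption for illustration, not a real budget:

```python
ZERO_SHOT_PER_M = 140.00  # $/1M classifications, from the table above
TRAINED_PER_M = 0.08
BUILD_COST = 150_000.00   # hypothetical: ~90 engineering days, fully loaded

def breakeven_millions(build_cost: float = BUILD_COST) -> float:
    """Millions of classifications needed before the build pays for itself
    on per-call savings alone (ignoring accuracy-driven routing savings)."""
    return build_cost / (ZERO_SHOT_PER_M - TRAINED_PER_M)

# Roughly 1,072M classifications on per-call savings alone -- which is why
# the real justification is the accuracy gain, not the classification cost.
```

In other words, at modest volume the per-call savings never pay for the build; what does is the compounding cost of mis-routed requests.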
The hybrid everyone eventually builds
What usually happens at a real company: they start with semantic similarity because it's fastest to ship. Six months later, they realize 15% of traffic is getting mis-routed, so they layer a zero-shot LLM classifier on top to handle the hard cases. A year in, the zero-shot classifier costs more than the routing saves, so they either fine-tune their own model or buy one. Done correctly, the whole arc adds up to roughly three engineer-years.
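That cascade (cheap classifier first, LLM fallback only when the cheap one is unsure) is simple to sketch; the threshold and classifier callables here are illustrative stand-ins:

```python
from typing import Callable

def classify_hybrid(
    prompt: str,
    semantic_classify: Callable[[str], tuple[str, float]],
    llm_classify: Callable[[str], str],
    threshold: float = 0.75,  # tuned on a held-out set in practice
) -> str:
    """Embedding-based guess first; escalate to the LLM only on low confidence."""
    category, confidence = semantic_classify(prompt)  # confidence ~ max cosine sim
    if confidence >= threshold:
        return category
    return llm_classify(prompt)  # slower and pricier, but reads the whole input
```

The threshold is the cost/accuracy dial: raise it and more traffic pays the LLM tax; lower it and more borderline prompts get the cheap guess.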
This is one of the reasons KairosRoute exists. We built the trained classifier so you don't have to staff a router team. If you're optimizing for speed-to-market, piggyback on ours (model="auto") and focus on your product.
Key takeaways
- Routing accuracy above ~90% requires either a zero-shot LLM or a trained classifier.
- Zero-shot LLM classifiers are good until the classifier's per-call cost erodes routing savings — which happens faster than people expect.
- Trained classifiers win on cost, latency, and accuracy once they exist, but building one from scratch is a project measured in quarters.
- Features matter. Adding token count, code markers, and tool-schema awareness is worth 3–5 points of accuracy on top of prompt embedding alone.
- Retraining cadence matters. A classifier trained once and deployed forever degrades. Monthly retraining is a floor, not a ceiling.
Want to skip the build? What kr-auto Does covers what you get out of the box. The playground runs a prompt of your choosing through the router live, and the public benchmarks show how it stacks up against the popular cheap and frontier models on the same eval suite.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
Related Reading
kr-auto picks the right model for every request, gets smarter from your own traffic, and gives you a receipt for the decision. Here is what that actually buys you — and why teams who try to roll their own spend six months getting it wrong.
Everything you need to know about LLM routers — what they are, how they work, why 70% of your model calls are routed wrong, and how to pick one without regretting it six months in.
Your model bill went down 20%. Nobody complained. Three weeks later, your agent's resolution rate has quietly dropped 12%. This is silent quality regression — and it is the single most dangerous failure mode in LLM ops.