Semantic Routing vs. Classifier Routing: What Actually Works in Production
Every LLM router faces the same question on every incoming request: what is this request actually asking the model to do? The answer determines which candidate models are fit to serve it. Get the classification wrong and you either overspend (route an easy task to Opus) or ship a quality regression (route a hard task to Haiku).
There are three mainstream approaches. They are not equivalent. This post lays out the tradeoffs with enough detail that you can pick one for your own stack.
The three approaches
- Semantic / embedding similarity. Embed the incoming prompt. Compare it to a set of labeled anchor examples. Pick the nearest anchor's class.
- Zero-shot LLM classifier. Ask a small fast LLM (e.g. GPT-5.4-mini) to classify the request into one of N categories. Use the answer.
- Trained classifier. Fine-tune a small transformer on labeled production traffic. Use its prediction.
All three work. They work differently on cost, latency, accuracy, and maintenance burden. The best choice depends on where you are in your product's lifecycle.
Semantic similarity
How it works
You pick a set of anchor prompts for each category (typically 5–20 per category). You embed each anchor ahead of time and store the vectors. At request time, embed the incoming prompt, compute cosine similarity against every anchor, and return the category of the highest-scoring anchor.
```python
# Hypothetical helpers: embed_batch/embed call your embedding model;
# cosine_sim compares one vector against the stored anchor matrix.
anchor_vectors = embed_batch([
    # extraction
    "What is the customer's stated intent in this message?",
    "Extract all dates mentioned in the following text.",
    # summarization
    "Summarize this email thread in 3 bullets.",
    ...
])

def classify(prompt: str) -> str:
    v = embed(prompt)
    sims = cosine_sim(v, anchor_vectors)
    return anchors[sims.argmax()].category
```

Pros
- Cheap at inference. One small embedding call plus a vector comparison.
- Fast. Sub-20ms including network.
- Easy to set up. No training pipeline required.
- Easy to iterate. Adding a new category is "add 10 more anchors."
Cons
- Hits a ceiling around 85% accuracy on realistic production traffic. Careful anchor selection can push that to 88%, but 92%+ is hard to reach.
- Breaks when prompts are long. Embedding a 2K-token prompt loses the signal in the average; important clues ("please write Python code that...") get diluted.
- Brittle across domains. Similarity is largely lexical: an anchor that says "Classify intent" won't match "Is this email spammy?", even though both are intent-classification tasks.
- Blind to structured features. If the request has function definitions, a long conversation history, or a vision input, similarity-by-text misses it entirely.
When to use it
Semantic similarity is a fine choice for a v0. If you're at a product stage where the difference between 85% and 95% routing accuracy doesn't show up on your P&L, ship this and move on. You'll know when you've outgrown it because your cost-per-task-type chart will stop making sense.
Zero-shot LLM classifier
How it works
On every request, you call a small fast LLM with a prompt like "Classify the following user request into one of: extraction, summarization, code, reasoning, analysis, frontier." You parse the output.
```python
import openai  # assumes the OpenAI SDK v1+, API key set in the environment

def classify(prompt: str) -> str:
    resp = openai.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[
            {"role": "system", "content": CLASSIFY_PROMPT},
            {"role": "user", "content": prompt},
        ],
        max_tokens=5,  # category name only; keep the call cheap
    )
    return resp.choices[0].message.content.strip()
```

Pros
- Better accuracy than semantic similarity out of the box — ~90–93% on realistic traffic with a decent classifier prompt.
- No training pipeline. You iterate by editing the classifier prompt.
- Handles long prompts and unusual structure reasonably well because the classifier actually reads the input.
Cons
- Expensive per request. Even a cheap classifier model costs money. If your base request is $0.0005 and the classifier costs $0.0003, you've just added 60% to your per-request fixed cost. That's before any provider outage penalties.
- Latency hit. 100–300ms added to every request. On a 500ms base request, that's a 20–60% latency regression.
- You're now dependent on another provider. If the classifier's provider has an outage, your routing fails too. Ironic.
- Accuracy varies with classifier model. If the small model you're using gets deprecated, you re-validate.
When to use it
Zero-shot LLM classification is great for prototyping and for workloads where the math works out ("I'm routing to $3/M tokens Sonnet anyway, so a $0.15/M mini classifier is rounding error"). It's not a great choice for high-volume, cost-sensitive routing — you end up burning most of your savings on the classifier.
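That "rounding error" intuition is easy to sanity-check. A minimal sketch, with illustrative per-request prices rather than real quotes:

```python
def classifier_overhead(base_cost_per_req: float, classifier_cost_per_req: float) -> float:
    """Fraction the classifier call adds to the per-request fixed cost."""
    return classifier_cost_per_req / base_cost_per_req

# Cheap base request: a $0.0003 classifier on a $0.0005 request adds 60%.
cheap = classifier_overhead(0.0005, 0.0003)

# Pricier base request: the same classifier on a $0.03 request adds 1% --
# the "rounding error" case that makes zero-shot viable.
pricey = classifier_overhead(0.03, 0.0003)
```

The same ratio tells you when to switch away: once the overhead fraction rivals your routing savings, the classifier is eating its own lunch.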
Trained classifier
How it works
You fine-tune a small transformer (a few million parameters) on labeled prompts. At request time, you run the transformer locally (or on a private edge node) and take its prediction.
```python
model = load_classifier("kr-classifier-v4.2.pt")

def classify(prompt: str, features: dict) -> tuple[str, float]:
    x = tokenize_and_featurize(prompt, features)  # text plus structured features
    logits = model.forward(x)
    probs = softmax(logits)
    # Return the predicted category and its confidence.
    return labels[probs.argmax()], probs.max()
```

Pros
- Highest accuracy. We hit ~96% on our held-out eval set: roughly 9 points over semantic similarity and 5 points over zero-shot on the same traffic.
- Fast. 15–40ms on small hardware.
- Cheap per inference. No per-call API costs.
- Can ingest arbitrary features: prompt embedding, token count, code markers, tool schemas, response format, conversation depth. More features → more accuracy.
- Retrainable. Every 24 hours, we collect new production signals and update the weights. The classifier gets better as your workload evolves.
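As a concrete sketch of what "arbitrary features" can look like in practice (the helper and field names here are hypothetical, not KairosRoute's actual pipeline):

```python
import re

def featurize(prompt: str, request: dict) -> dict:
    """Structured signals extracted alongside the prompt text (illustrative set)."""
    return {
        "token_estimate": len(prompt) // 4,  # rough chars-per-token heuristic
        "has_code_keywords": bool(re.search(r"\b(def|class|import|function)\b", prompt)),
        "tool_count": len(request.get("tools", [])),     # attached tool schemas
        "turn_depth": len(request.get("messages", [])),  # conversation history length
        "wants_json": request.get("response_format", {}).get("type") == "json_object",
    }
```

These features get concatenated with the prompt embedding before the classifier head; each one is a signal that pure text similarity would miss.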
Cons
- Requires a training pipeline. You need labeled data and a retraining cadence, or the classifier goes stale.
- Requires MLOps. You need eval sets, deployment infra, rollback, model versioning.
- Only as good as its training data. Start with bad labels, ship bad routing.
- A months-long project from scratch, not a weekend one.
When to use it
Trained classifiers are the right call when routing quality affects your bottom line and the value of a 3-point accuracy gain justifies the engineering burden. If you're in the early exploration phase, this is overkill. If you're processing a million requests a day and every routing error has a compounding cost impact, this is the only approach that makes sense.
Head-to-head on one workload
We ran the three approaches on the same 5,000-prompt evaluation set derived from anonymized customer traffic. Categories are our standard six (extraction, summarization, code, reasoning, analysis, frontier).
| Approach | Accuracy | p50 Latency | $/1M classifications | Implementation days |
|---|---|---|---|---|
| Semantic similarity (MiniLM) | 86.4% | 18ms | $2.10 | ~3 days |
| Zero-shot LLM (gpt-5.4-mini) | 91.2% | 180ms | $140 | ~1 day |
| Trained classifier (ours) | 95.8% | 22ms | $0.08 | ~90 days |
Translation: if you're building from scratch, semantic similarity buys you most of the accuracy for about 3% of the effort. Zero-shot is the easiest "quick and better" upgrade. Trained classifiers dominate on cost and latency once they exist — but the ~90-day build is a real tax.
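One hedged way to read the cost column: ask how many classifications the trained classifier must serve before its build cost pays back against zero-shot's per-call price. The build-cost figure below is an assumption for illustration, not a real budget:

```python
ZERO_SHOT_PER_M = 140.00  # $/1M classifications, from the table above
TRAINED_PER_M = 0.08
BUILD_COST = 150_000.00   # hypothetical: ~90 engineering days, fully loaded

def breakeven_millions(build_cost: float = BUILD_COST) -> float:
    """Millions of classifications needed before the build pays for itself
    on per-call savings alone (ignoring accuracy-driven routing savings)."""
    return build_cost / (ZERO_SHOT_PER_M - TRAINED_PER_M)

# Roughly 1,072M classifications on per-call savings alone -- which is why
# the real justification is the accuracy gain, not the classification cost.
```

In other words, at modest volume the per-call savings never pay for the build; what does is the compounding cost of mis-routed requests.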
The hybrid everyone eventually builds
What usually happens at a real company: they start with semantic similarity because it's fastest to ship. Six months later, they realize 15% of traffic is getting mis-routed, so they layer a zero-shot LLM classifier on top to handle the hard cases. A year in, the zero-shot classifier costs more than the routing saves, so they either fine-tune their own model or buy one. Done correctly, the whole arc adds up to roughly three engineer-years.
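That cascade (cheap classifier first, LLM fallback only when the cheap one is unsure) is simple to sketch; the threshold and classifier callables here are illustrative stand-ins:

```python
from typing import Callable

def classify_hybrid(
    prompt: str,
    semantic_classify: Callable[[str], tuple[str, float]],
    llm_classify: Callable[[str], str],
    threshold: float = 0.75,  # tuned on a held-out set in practice
) -> str:
    """Embedding-based guess first; escalate to the LLM only on low confidence."""
    category, confidence = semantic_classify(prompt)  # confidence ~ max cosine sim
    if confidence >= threshold:
        return category
    return llm_classify(prompt)  # slower and pricier, but reads the whole input
```

The threshold is the cost/accuracy dial: raise it and more traffic pays the LLM tax; lower it and more borderline prompts get the cheap guess.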
This is one of the reasons KairosRoute exists. We built the trained classifier so you don't have to staff a router team. If you're optimizing for speed-to-market, piggyback on ours (model="auto") and focus on your product.
Key takeaways
- Routing accuracy above ~90% requires either a zero-shot LLM or a trained classifier.
- Zero-shot LLM classifiers are good until the classifier's per-call cost erodes routing savings — which happens faster than people expect.
- Trained classifiers win on cost, latency, and accuracy once they exist, but building one from scratch is a project measured in quarters.
- Features matter. Adding token count, code markers, and tool-schema awareness is worth 3–5 points of accuracy on top of prompt embedding alone.
- Retraining cadence matters. A classifier trained once and deployed forever degrades. Monthly retraining is a floor, not a ceiling.
Want to skip the build? What kr-auto Does covers what you get out of the box. The playground runs a prompt of your choosing through the router live, and the public benchmarks show how it stacks up against the popular cheap and frontier models on the same eval suite.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
Related Reading
kr-auto picks the right model for every request, gets smarter from your own traffic, and gives you a receipt for the decision. Here is what that actually buys you — and why teams who try to roll their own spend six months getting it wrong.
Everything you need to know about LLM routers — what they are, how they work, why 70% of your model calls are routed wrong, and how to pick one without regretting it six months in.
Your model bill went down 20%. Nobody complained. Three weeks later, your agent's resolution rate has quietly dropped 12%. This is silent quality regression — and it is the single most dangerous failure mode in LLM ops.