KairosRoute Blog · Engineering · 10 min read

The Cheapest-Model-Per-Stage Pattern for Production RAG

The single biggest mistake I see in production RAG systems is that every stage of the pipeline runs on the same model. The team picked Claude Sonnet (or GPT-4o, or Opus) for synthesis because that is what matters for final answer quality, and then reflexively used the same model for query expansion, re-ranking, and verification. It is organizationally clean and economically catastrophic.

A well-designed RAG pipeline has four to six stages, each with different quality requirements. Treating them uniformly is overpaying by 4-8x. This post is the stage-by-stage pattern I use, with a concrete per-query cost breakdown for a production system at scale.

The stages of a real RAG pipeline

Forget the "retrieve, then generate" diagrams. A real production RAG pipeline for a reasonable document volume looks like:

  1. Query understanding / expansion. Rewrite the user's question into 2-4 retrieval queries that cover the semantic surface area.
  2. Retrieval. Hit your vector index (and probably a BM25 index). Return top-K candidates. No LLM here.
  3. Re-ranking. Score each candidate chunk for relevance against the query. Cut from top-K to top-N.
  4. Synthesis. Generate the final answer given the top-N chunks and the original query.
  5. Answer verification (optional but load-bearing). Check the generated answer for faithfulness to the source chunks. Flag or regenerate on failure.
  6. Citation formatting. Rewrite the answer with inline citations. Easy, deterministic.

Each of these has a different cost-quality tradeoff. The mistake is treating them as one.

The quality requirement for each stage

| Stage | Quality floor | Why | Target tier |
|---|---|---|---|
| Query expansion | Medium | Needs to generate plausible variants, not genius prose | Fast (Haiku-class) |
| Retrieval | N/A (no LLM) | Vector math, not inference | N/A |
| Re-ranking | Medium | Scoring task; classifier-like | Fast (Haiku-class) |
| Synthesis | High | The user-facing answer. Frontier matters. | Mid-to-frontier |
| Answer verification | High | Wrong verification is worse than no verification | Frontier |
| Citation formatting | Low | Rule-based rewriting | Fast (Haiku-class) |

Three of the five LLM stages (retrieval uses no LLM at all) can run on a fast/cheap model with no measurable quality loss. One requires frontier. One is in between.
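The tier split above can be written down as data, which makes it easy to assert in CI that cheap stages stay cheap. A sketch; stage names and tier labels are mine, mirroring the table, and retrieval is omitted because it makes no LLM call:

```python
# The quality-requirement table as data: each LLM stage mapped to its
# (quality_floor, target_tier). Labels mirror the table above.
STAGE_TIERS = {
    "query_expansion":     ("medium", "fast"),
    "re_ranking":          ("medium", "fast"),
    "synthesis":           ("high",   "mid-to-frontier"),
    "verification":        ("high",   "frontier"),
    "citation_formatting": ("low",    "fast"),
}

# Three of the five LLM stages run on the fast tier.
fast_stages = [s for s, (_, tier) in STAGE_TIERS.items() if tier == "fast"]
assert fast_stages == ["query_expansion", "re_ranking", "citation_formatting"]
```

Encoding the mapping as data (rather than scattering model names through the pipeline) also gives you one place to audit when pricing or model availability changes.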

The concrete per-query cost breakdown

Let us do the full math for a representative production RAG system: knowledge-base assistant, 50K documents indexed, average user query runs through the full pipeline. Token counts are for a moderately complex query with 8 retrieved chunks of 1,200 tokens each.

Before: all stages on Claude Sonnet

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Sonnet | 180 | 120 | $0.0023 |
| Re-ranking (8 chunks) | Sonnet | 9,600 | 200 | $0.0319 |
| Synthesis | Sonnet | 10,500 | 650 | $0.0413 |
| Answer verification | Sonnet | 11,200 | 150 | $0.0358 |
| Citation formatting | Sonnet | 900 | 700 | $0.0132 |
| **Per-query total** | | | | **$0.1245** |

$0.1245 per query. At 250K queries/month (typical for a mid-scale production system), that is $31,125/month in model costs.
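The arithmetic is worth pinning down in code, if only because it drifts as pricing changes. A quick sanity check, assuming Sonnet at roughly $3 per million input tokens and $15 per million output tokens (illustrative rates; check current pricing):

```python
# Sanity-check the all-Sonnet baseline from per-token rates.
# Rates are illustrative: ~$3/M input, ~$15/M output for Sonnet.
IN_RATE, OUT_RATE = 3 / 1_000_000, 15 / 1_000_000

def stage_cost(tokens_in: int, tokens_out: int) -> float:
    """Per-stage cost from token counts."""
    return tokens_in * IN_RATE + tokens_out * OUT_RATE

# e.g. the synthesis row: 10,500 in / 650 out -> ~$0.0413
assert abs(stage_cost(10_500, 650) - 0.0413) < 1e-4

# Sum the rounded per-stage costs from the table, then scale to volume.
per_query = 0.0023 + 0.0319 + 0.0413 + 0.0358 + 0.0132
monthly = per_query * 250_000
print(f"${per_query:.4f}/query, ${monthly:,.0f}/month")  # $0.1245/query, $31,125/month
```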

After: stage-stratified with kr-auto

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Haiku | 180 | 120 | $0.0002 |
| Re-ranking (8 chunks) | Haiku | 9,600 | 200 | $0.0028 |
| Synthesis | Sonnet (kr-auto) | 10,500 | 650 | $0.0413 |
| Answer verification | Opus (high-stakes) | 11,200 | 150 | $0.1792 |
| Citation formatting | Haiku | 900 | 700 | $0.0010 |
| **Per-query total** | | | | **$0.2245** |

Wait — that went up. Because I moved verification from Sonnet to Opus, which is more expensive per token. Why would you ever do that?

The verification insight

This is the nuance most RAG cost guides miss. Verification is the highest-leverage stage you have. Done right, it lets you downgrade synthesis, because the verifier catches bad answers before they reach the user.

The actual pattern is:

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Haiku | 180 | 120 | $0.0002 |
| Re-ranking (8 chunks) | Haiku | 9,600 | 200 | $0.0028 |
| Synthesis (first pass) | Haiku | 10,500 | 650 | $0.0068 |
| Answer verification | Opus | 11,200 | 150 | $0.1792 |
| Regenerate on verification fail (10% of queries) | Sonnet | 10,500 | 650 | $0.0041 amortized |
| Citation formatting | Haiku | 900 | 700 | $0.0010 |
| **Per-query total** | | | | **$0.1941** |

Still worse than the baseline: the Opus verification is eating all the savings. The trick is not a frontier verifier at all. Keep Sonnet for synthesis, treat verification as a classifier problem, and run the pass/fail check on a cheap model with a strict faithfulness rubric, escalating only the uncertain cases to frontier.

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Haiku | 180 | 120 | $0.0002 |
| Re-ranking (8 chunks) | Haiku | 9,600 | 200 | $0.0028 |
| Synthesis | Sonnet (kr-auto) | 10,500 | 650 | $0.0413 |
| Answer verification (pass/fail only) | Haiku | 11,200 | 30 | $0.0093 |
| Frontier second-pass on uncertain (8% of queries) | Opus | 11,350 | 650 | $0.0177 amortized |
| Citation formatting | Haiku | 900 | 700 | $0.0010 |
| **Per-query total** | | | | **$0.0723** |

$0.0723 per query — a 42% reduction from the all-Sonnet baseline. At 250K queries/month: $18,075/month, saving $13,050. And the verification step actively improves faithfulness vs. the baseline, because we now have a second model checking work.
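The same sanity check for the final pattern, summing the per-stage figures from the table above:

```python
# Verify the headline savings of the final stage-stratified pattern.
baseline = 0.1245  # all-Sonnet per-query cost
stratified = 0.0002 + 0.0028 + 0.0413 + 0.0093 + 0.0177 + 0.0010

reduction = 1 - stratified / baseline
monthly_savings = (baseline - stratified) * 250_000
print(f"${stratified:.4f}/query, {reduction:.0%} cheaper")  # $0.0723/query, 42% cheaper
print(f"${monthly_savings:,.0f}/month saved")               # $13,050/month saved
```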

Why Haiku-class models are good enough for re-ranking

Re-ranking is the stage where engineers resist cheap models hardest, usually because they do not trust a small model to understand semantic relevance. Here is the empirical take from running this in production across a few domains:

  • Re-ranking is a scoring task, not a generation task. You are asking the model "on a scale of 1-10, how relevant is this chunk to this query." Small models are surprisingly good at this.
  • The input is structured. You control the prompt entirely. There is no creative latitude.
  • Top-N is forgiving. If your re-ranker gets chunk 3 and chunk 4 in the wrong order, synthesis usually recovers.
  • You can ensemble cheaply. Run two Haiku re-rankers with different prompts, average the scores. Still 1/10th the cost of a Sonnet re-ranker and often more stable.

For the specific case of re-ranking, I have seen Haiku-class models score within 2 points of Sonnet on NDCG@5 across multiple internal evals. That is noise-level.
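The two-prompt ensemble from the list above reduces to averaging score vectors. A minimal sketch; `scores_a` and `scores_b` stand in for the parsed 1-10 relevance scores from two differently-prompted Haiku calls:

```python
# Two-prompt ensemble re-ranking sketch: average the per-chunk score
# lists from two cheap re-rankers, then keep the top-n chunk indices.
def ensemble_rank(scores_a: list[float], scores_b: list[float], n: int) -> list[int]:
    """Average two per-chunk score lists; return indices of the top-n chunks."""
    avg = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    return sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)[:n]

# The two prompts disagree on chunk order; the average is more stable.
print(ensemble_rank([9, 4, 7, 2], [8, 6, 5, 3], n=2))  # [0, 2]
```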

Where frontier actually earns its cost

Synthesis is where you do not want to cheap out. The final answer is the product. That said, even synthesis has nuance:

  • Short-context synthesis (fewer than 5 chunks, clear query) — kr-auto with a medium quality threshold picks Sonnet or similar. Frontier is overkill.
  • Long-context synthesis (15+ chunks, nuanced query) — kr-auto with a higher threshold picks Opus or Gemini Pro. The extra cost is justified.
  • High-stakes synthesis (legal, medical, financial) — always frontier, and always with verification. No kr-auto, just explicit selection.

The companion post on how kr-auto works covers how quality thresholds get turned into model-selection decisions. The short version: you declare the quality floor, and the router picks the cheapest model that clears it.

The implementation pattern

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["KAIROSROUTE_API_KEY"],
    base_url="https://api.kairosroute.com/v1",
)

def rag_query(user_query: str, docs) -> dict:
    # Stage 1: Query expansion (fast/cheap)
    expanded = client.chat.completions.create(
        model="claude-haiku-4-5",  # explicit pin for predictability
        messages=[{"role": "user", "content": expand_prompt(user_query)}],
    )
    queries = parse_queries(expanded.choices[0].message.content)

    # Stage 2 + 3: Retrieve (no LLM) and re-rank (fast/cheap)
    candidates = vector_search(queries, docs, k=20)
    reranked = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[{"role": "user", "content": rerank_prompt(user_query, candidates)}],
    )
    top_n = parse_ranking(reranked.choices[0].message.content)[:5]

    # Stage 4: Synthesis (kr-auto picks the cheapest model that meets quality)
    answer = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": synthesize_prompt(user_query, top_n)}],
        extra_body={"kr_quality": "high"},
    )
    answer_text = answer.choices[0].message.content

    # Stage 5: Verification (cheap pass/fail classifier)
    verdict = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[{"role": "user", "content": verify_prompt(answer_text, top_n)}],
    )

    if is_uncertain(verdict.choices[0].message.content):
        # Stage 5b: Frontier second pass (rare, ~8% of queries)
        answer = client.chat.completions.create(
            model="claude-opus-4-1",
            messages=[{"role": "user", "content": synthesize_prompt(user_query, top_n)}],
        )
        answer_text = answer.choices[0].message.content

    # Stage 6: Citation formatting (fast/cheap, inside the helper)
    return format_with_citations(answer_text, top_n)
```

Up to six LLM calls (the frontier second pass fires rarely), three different model tiers, cheapest-where-possible. The same pipeline on all-Sonnet costs 72% more; equivalently, the stratified version is 42% cheaper.
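The `verify_prompt` and `is_uncertain` helpers are where the classifier framing lives. A minimal sketch of one way to write them; the PASS/FAIL/UNCERTAIN protocol is my convention, not a gateway feature, and these take plain strings (extract `.choices[0].message.content` from the completion before calling them):

```python
def verify_prompt(answer_text: str, chunks: list[str]) -> str:
    """Ask the verifier for a one-token verdict, not an essay."""
    sources = "\n---\n".join(chunks)
    return (
        "You are a faithfulness checker. Given the SOURCES and the ANSWER, "
        "reply with exactly one word: PASS if every claim in the answer is "
        "supported by the sources, FAIL if any claim contradicts them, "
        "UNCERTAIN if you cannot tell.\n\n"
        f"SOURCES:\n{sources}\n\nANSWER:\n{answer_text}"
    )

def is_uncertain(verdict_text: str) -> bool:
    """Escalate to the frontier second pass unless the verdict is a clean PASS."""
    return verdict_text.strip().upper() != "PASS"
```

Capping the verifier at a one-word verdict is what makes the stage cheap: 30 output tokens instead of a paragraph of hedging.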

What to measure

You cannot optimize what you cannot measure. The minimum RAG cost dashboard has:

  • Cost per query, broken down by stage.
  • Quality score per stage (synthesis + verification agreement rates, re-ranking NDCG@5 vs. a held-out set).
  • Verification-failure rate and the distribution of which queries trigger it.
  • Model mix per stage over time (so you notice when kr-auto shifts because a new model landed).

The receipts stream from the gateway gets you most of this for free. The rest is 50 lines of analytics code.
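Most of that analytics code is a group-by over receipts. A sketch, assuming each receipt is a dict with `stage`, `model`, and `cost_usd` fields (field names hypothetical; use whatever your gateway actually emits):

```python
from collections import defaultdict

def cost_by_stage(receipts: list[dict]) -> dict[str, float]:
    """Aggregate per-request receipts into per-stage spend."""
    totals: dict[str, float] = defaultdict(float)
    for r in receipts:
        totals[r["stage"]] += r["cost_usd"]
    return dict(totals)

receipts = [
    {"stage": "synthesis",  "model": "claude-sonnet", "cost_usd": 0.0413},
    {"stage": "re_ranking", "model": "claude-haiku",  "cost_usd": 0.0028},
    {"stage": "synthesis",  "model": "claude-sonnet", "cost_usd": 0.0413},
]
print(cost_by_stage(receipts))  # per-stage spend totals
```

The same shape works for the model-mix metric: group by `(stage, model)` instead and watch the distribution over time.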

The throughline

RAG is not a single LLM problem. It is a pipeline of four to six LLM problems, each with different quality and cost profiles. The single-model reflex costs you 30-60% of your bill and adds zero quality. The stage-stratified pattern is the default architecture for production RAG in 2026.

If you want to prototype the pipeline, the playground lets you send the same prompt through kr-auto with different quality levels and see the model selection, cost, and latency change. Run your synthesis prompt through it and see where the default tier lands.

For the broader cost discipline that routing unlocks, The Unit Economics of AI Agents is the companion read.

Ready to route smarter?

KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.

Related Reading

The Unit Economics of AI Agents: A Cost Model That Actually Works

AI agents scale 10–100x model calls per user action. If you don't have a per-ticket, per-task, or per-conversation cost model, you are running a business on vibes. Here's how to build one — and what it reveals.

What kr-auto Does (and Why It Beats Hand-Rolled Routing)

kr-auto picks the right model for every request, gets smarter from your own traffic, and gives you a receipt for the decision. Here is what that actually buys you — and why teams who try to roll their own spend six months getting it wrong.

Why a Dedicated LLM Gateway Is Inevitable in 2026

Every org that crosses ten LLM-using teams builds the same thing: a gateway. Rate limits, key rotation, audit logs, cost attribution, compliance. The question is not whether you need one. It is whether you build it or buy it. Here is the calc.