KairosRoute Blog · Engineering · 10 min read

The Cheapest-Model-Per-Stage Pattern for Production RAG

The single biggest mistake I see in production RAG systems is that every stage of the pipeline runs on the same model. The team picked Claude Sonnet (or GPT-4o, or Opus) for synthesis because that is what matters for final answer quality, and then reflexively used the same model for query expansion, re-ranking, and verification. It is organizationally clean and economically catastrophic.

A well-designed RAG pipeline has four to six stages, each with different quality requirements. Treating them uniformly is overpaying by 4-8x. This post is the stage-by-stage pattern I use, with a concrete per-query cost breakdown for a production system at scale.

The stages of a real RAG pipeline

Forget the "retrieve, then generate" diagrams. A real production RAG pipeline for a reasonable document volume looks like:

  1. Query understanding / expansion. Rewrite the user's question into 2-4 retrieval queries that cover the semantic surface area.
  2. Retrieval. Hit your vector index (and probably a BM25 index). Return top-K candidates. No LLM here.
  3. Re-ranking. Score each candidate chunk for relevance against the query. Cut from top-K to top-N.
  4. Synthesis. Generate the final answer given the top-N chunks and the original query.
  5. Answer verification (optional but load-bearing). Check the generated answer for faithfulness to the source chunks. Flag or regenerate on failure.
  6. Citation formatting. Rewrite the answer with inline citations. Easy, deterministic.

Each of these has a different cost-quality tradeoff. The mistake is treating them as one.

The quality requirement for each stage

| Stage | Quality floor | Why | Target tier |
|---|---|---|---|
| Query expansion | Medium | Needs to generate plausible variants, not genius prose | Fast (Haiku-class) |
| Retrieval | N/A (no LLM) | Vector math, not inference | N/A |
| Re-ranking | Medium | Scoring task; classifier-like | Fast (Haiku-class) |
| Synthesis | High | The user-facing answer. Frontier matters. | Mid-to-frontier |
| Answer verification | High | Wrong verification is worse than no verification | Frontier |
| Citation formatting | Low | Rule-based rewriting | Fast (Haiku-class) |

Three of the five LLM stages (retrieval uses no LLM at all) can run on a fast/cheap model with no measurable quality loss. One requires frontier. One is in between.
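The tier split above can be written down as data, which makes it easy to assert in CI that cheap stages stay cheap. A sketch; stage names and tier labels are mine, mirroring the table, and retrieval is omitted because it makes no LLM call:

```python
# The quality-requirement table as data: each LLM stage mapped to its
# (quality_floor, target_tier). Labels mirror the table above.
STAGE_TIERS = {
    "query_expansion":     ("medium", "fast"),
    "re_ranking":          ("medium", "fast"),
    "synthesis":           ("high",   "mid-to-frontier"),
    "verification":        ("high",   "frontier"),
    "citation_formatting": ("low",    "fast"),
}

# Three of the five LLM stages run on the fast tier.
fast_stages = [s for s, (_, tier) in STAGE_TIERS.items() if tier == "fast"]
assert fast_stages == ["query_expansion", "re_ranking", "citation_formatting"]
```

Encoding the mapping as data (rather than scattering model names through the pipeline) also gives you one place to audit when pricing or model availability changes.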

The concrete per-query cost breakdown

Let us do the full math for a representative production RAG system: knowledge-base assistant, 50K documents indexed, average user query runs through the full pipeline. Token counts are for a moderately complex query with 8 retrieved chunks of 1,200 tokens each.

Before: all stages on Claude Sonnet

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Sonnet | 180 | 120 | $0.0023 |
| Re-ranking (8 chunks) | Sonnet | 9,600 | 200 | $0.0319 |
| Synthesis | Sonnet | 10,500 | 650 | $0.0413 |
| Answer verification | Sonnet | 11,200 | 150 | $0.0358 |
| Citation formatting | Sonnet | 900 | 700 | $0.0132 |
| **Per-query total** | | | | **$0.1245** |

$0.1245 per query. At 250K queries/month (typical for a mid-scale production system), that is $31,125/month in model costs.
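The arithmetic is worth pinning down in code, if only because it drifts as pricing changes. A quick sanity check, assuming Sonnet at roughly $3 per million input tokens and $15 per million output tokens (illustrative rates; check current pricing):

```python
# Sanity-check the all-Sonnet baseline from per-token rates.
# Rates are illustrative: ~$3/M input, ~$15/M output for Sonnet.
IN_RATE, OUT_RATE = 3 / 1_000_000, 15 / 1_000_000

def stage_cost(tokens_in: int, tokens_out: int) -> float:
    """Per-stage cost from token counts."""
    return tokens_in * IN_RATE + tokens_out * OUT_RATE

# e.g. the synthesis row: 10,500 in / 650 out -> ~$0.0413
assert abs(stage_cost(10_500, 650) - 0.0413) < 1e-4

# Sum the rounded per-stage costs from the table, then scale to volume.
per_query = 0.0023 + 0.0319 + 0.0413 + 0.0358 + 0.0132
monthly = per_query * 250_000
print(f"${per_query:.4f}/query, ${monthly:,.0f}/month")  # $0.1245/query, $31,125/month
```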

After: stage-stratified with kr-auto

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Haiku | 180 | 120 | $0.0002 |
| Re-ranking (8 chunks) | Haiku | 9,600 | 200 | $0.0028 |
| Synthesis | Sonnet (kr-auto) | 10,500 | 650 | $0.0413 |
| Answer verification | Opus (high-stakes) | 11,200 | 150 | $0.1792 |
| Citation formatting | Haiku | 900 | 700 | $0.0010 |
| **Per-query total** | | | | **$0.2245** |

Wait — that went up. Because I moved verification from Sonnet to Opus, which is more expensive per token. Why would you ever do that?

The verification insight

This is the nuance most RAG cost guides miss. Verification is the highest-leverage stage you have. Done right, it lets you downgrade synthesis, because the verifier catches bad answers before they reach the user.

The actual pattern is:

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Haiku | 180 | 120 | $0.0002 |
| Re-ranking (8 chunks) | Haiku | 9,600 | 200 | $0.0028 |
| Synthesis (first pass) | Haiku | 10,500 | 650 | $0.0068 |
| Answer verification | Opus | 11,200 | 150 | $0.1792 |
| Regenerate on verification fail (10% of queries) | Sonnet | 10,500 | 650 | $0.0041 amortized |
| Citation formatting | Haiku | 900 | 700 | $0.0010 |
| **Per-query total** | | | | **$0.1941** |

Still worse than the baseline: the Opus verification is eating all the savings. The trick is not a frontier verifier at all. Keep Sonnet for synthesis, treat verification as a classifier problem, and run the pass/fail check on a cheap model with a strict faithfulness rubric, escalating only the uncertain cases to frontier.

| Stage | Model | In tokens | Out tokens | Cost per query |
|---|---|---|---|---|
| Query expansion | Haiku | 180 | 120 | $0.0002 |
| Re-ranking (8 chunks) | Haiku | 9,600 | 200 | $0.0028 |
| Synthesis | Sonnet (kr-auto) | 10,500 | 650 | $0.0413 |
| Answer verification (pass/fail only) | Haiku | 11,200 | 30 | $0.0093 |
| Frontier second-pass on uncertain (8% of queries) | Opus | 11,350 | 650 | $0.0177 amortized |
| Citation formatting | Haiku | 900 | 700 | $0.0010 |
| **Per-query total** | | | | **$0.0723** |

$0.0723 per query — a 42% reduction from the all-Sonnet baseline. At 250K queries/month: $18,075/month, saving $13,050. And the verification step actively improves faithfulness vs. the baseline, because we now have a second model checking work.
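The same sanity check for the final pattern, summing the per-stage figures from the table above:

```python
# Verify the headline savings of the final stage-stratified pattern.
baseline = 0.1245  # all-Sonnet per-query cost
stratified = 0.0002 + 0.0028 + 0.0413 + 0.0093 + 0.0177 + 0.0010

reduction = 1 - stratified / baseline
monthly_savings = (baseline - stratified) * 250_000
print(f"${stratified:.4f}/query, {reduction:.0%} cheaper")  # $0.0723/query, 42% cheaper
print(f"${monthly_savings:,.0f}/month saved")               # $13,050/month saved
```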

Why Haiku-class models are good enough for re-ranking

Re-ranking is the stage where engineers resist cheap models hardest, usually because they do not trust a small model to understand semantic relevance. Here is the empirical take from running this in production across a few domains:

  • Re-ranking is a scoring task, not a generation task. You are asking the model "on a scale of 1-10, how relevant is this chunk to this query." Small models are surprisingly good at this.
  • The input is structured. You control the prompt entirely. There is no creative latitude.
  • Top-N is forgiving. If your re-ranker gets chunk 3 and chunk 4 in the wrong order, synthesis usually recovers.
  • You can ensemble cheaply. Run two Haiku re-rankers with different prompts, average the scores. Still 1/10th the cost of a Sonnet re-ranker and often more stable.

For the specific case of re-ranking, I have seen Haiku-class models score within 2 points of Sonnet on NDCG@5 across multiple internal evals. That is noise-level.
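The two-prompt ensemble from the list above reduces to averaging score vectors. A minimal sketch; `scores_a` and `scores_b` stand in for the parsed 1-10 relevance scores from two differently-prompted Haiku calls:

```python
# Two-prompt ensemble re-ranking sketch: average the per-chunk score
# lists from two cheap re-rankers, then keep the top-n chunk indices.
def ensemble_rank(scores_a: list[float], scores_b: list[float], n: int) -> list[int]:
    """Average two per-chunk score lists; return indices of the top-n chunks."""
    avg = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    return sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)[:n]

# The two prompts disagree on chunk order; the average is more stable.
print(ensemble_rank([9, 4, 7, 2], [8, 6, 5, 3], n=2))  # [0, 2]
```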

Where frontier actually earns its cost

Synthesis is where you do not want to cheap out. The final answer is the product. That said, even synthesis has nuance:

  • Short-context synthesis (fewer than 5 chunks, clear query) — kr-auto with a medium quality threshold picks Sonnet or similar. Frontier is overkill.
  • Long-context synthesis (15+ chunks, nuanced query) — kr-auto with a higher threshold picks Opus or Gemini Pro. The extra cost is justified.
  • High-stakes synthesis (legal, medical, financial) — always frontier, and always with verification. No kr-auto, just explicit selection.

The companion post on how kr-auto works covers how quality thresholds get turned into model-selection decisions. The short version: you declare the quality floor, and the router picks the cheapest model that clears it.

The implementation pattern

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["KAIROSROUTE_API_KEY"],
    base_url="https://api.kairosroute.com/v1",
)

def rag_query(user_query: str, docs) -> dict:
    # Stage 1: Query expansion (fast/cheap)
    expanded = client.chat.completions.create(
        model="claude-haiku-4-5",  # explicit pin for predictability
        messages=[{"role": "user", "content": expand_prompt(user_query)}],
    )
    queries = parse_queries(expanded.choices[0].message.content)

    # Stage 2 + 3: Retrieve (no LLM) and re-rank (fast/cheap)
    candidates = vector_search(queries, docs, k=20)
    reranked = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[{"role": "user", "content": rerank_prompt(user_query, candidates)}],
    )
    top_n = parse_ranking(reranked.choices[0].message.content)[:5]

    # Stage 4: Synthesis (kr-auto picks the cheapest model that meets quality)
    answer = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": synthesize_prompt(user_query, top_n)}],
        extra_body={"kr_quality": "high"},
    )
    answer_text = answer.choices[0].message.content

    # Stage 5: Verification (cheap pass/fail classifier)
    verdict = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[{"role": "user", "content": verify_prompt(answer_text, top_n)}],
    )

    if is_uncertain(verdict.choices[0].message.content):
        # Stage 5b: Frontier second pass (rare, ~8% of queries)
        answer = client.chat.completions.create(
            model="claude-opus-4-1",
            messages=[{"role": "user", "content": synthesize_prompt(user_query, top_n)}],
        )
        answer_text = answer.choices[0].message.content

    # Stage 6: Citation formatting (fast/cheap, inside the helper)
    return format_with_citations(answer_text, top_n)
```

Up to six LLM calls (the frontier second pass fires rarely), three different model tiers, cheapest-where-possible. The same pipeline on all-Sonnet costs 72% more; equivalently, the stratified version is 42% cheaper.
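The `verify_prompt` and `is_uncertain` helpers are where the classifier framing lives. A minimal sketch of one way to write them; the PASS/FAIL/UNCERTAIN protocol is my convention, not a gateway feature, and these take plain strings (extract `.choices[0].message.content` from the completion before calling them):

```python
def verify_prompt(answer_text: str, chunks: list[str]) -> str:
    """Ask the verifier for a one-token verdict, not an essay."""
    sources = "\n---\n".join(chunks)
    return (
        "You are a faithfulness checker. Given the SOURCES and the ANSWER, "
        "reply with exactly one word: PASS if every claim in the answer is "
        "supported by the sources, FAIL if any claim contradicts them, "
        "UNCERTAIN if you cannot tell.\n\n"
        f"SOURCES:\n{sources}\n\nANSWER:\n{answer_text}"
    )

def is_uncertain(verdict_text: str) -> bool:
    """Escalate to the frontier second pass unless the verdict is a clean PASS."""
    return verdict_text.strip().upper() != "PASS"
```

Capping the verifier at a one-word verdict is what makes the stage cheap: 30 output tokens instead of a paragraph of hedging.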

What to measure

You cannot optimize what you cannot measure. The minimum RAG cost dashboard has:

  • Cost per query, broken down by stage.
  • Quality score per stage (synthesis + verification agreement rates, re-ranking NDCG@5 vs. a held-out set).
  • Verification-failure rate and the distribution of which queries trigger it.
  • Model mix per stage over time (so you notice when kr-auto shifts because a new model landed).

The receipts stream from the gateway gets you most of this for free. The rest is 50 lines of analytics code.
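Most of that analytics code is a group-by over receipts. A sketch, assuming each receipt is a dict with `stage`, `model`, and `cost_usd` fields (field names hypothetical; use whatever your gateway actually emits):

```python
from collections import defaultdict

def cost_by_stage(receipts: list[dict]) -> dict[str, float]:
    """Aggregate per-request receipts into per-stage spend."""
    totals: dict[str, float] = defaultdict(float)
    for r in receipts:
        totals[r["stage"]] += r["cost_usd"]
    return dict(totals)

receipts = [
    {"stage": "synthesis",  "model": "claude-sonnet", "cost_usd": 0.0413},
    {"stage": "re_ranking", "model": "claude-haiku",  "cost_usd": 0.0028},
    {"stage": "synthesis",  "model": "claude-sonnet", "cost_usd": 0.0413},
]
print(cost_by_stage(receipts))  # per-stage spend totals
```

The same shape works for the model-mix metric: group by `(stage, model)` instead and watch the distribution over time.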

The throughline

RAG is not a single LLM problem. It is a pipeline of four to six LLM problems, each with different quality and cost profiles. The single-model reflex costs you 30-60% of your bill and adds zero quality. The stage-stratified pattern is the default architecture for production RAG in 2026.

If you want to prototype the pipeline, the playground lets you send the same prompt through kr-auto with different quality levels and see the model selection, cost, and latency change. Run your synthesis prompt through it and see where the default tier lands.

For the broader cost discipline that routing unlocks, The Unit Economics of AI Agents is the companion read.

Ready to route smarter?

KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.

Related Reading

The Unit Economics of AI Agents: A Cost Model That Actually Works

AI agents scale 10–100x model calls per user action. If you don't have a per-ticket, per-task, or per-conversation cost model, you are running a business on vibes. Here's how to build one — and what it reveals.

What kr-auto Does (and Why It Beats Hand-Rolled Routing)

kr-auto picks the right model for every request, gets smarter from your own traffic, and gives you a receipt for the decision. Here is what that actually buys you — and why teams who try to roll their own spend six months getting it wrong.

Why a Dedicated LLM Gateway Is Inevitable in 2026

Every org that crosses ten LLM-using teams builds the same thing: a gateway. Rate limits, key rotation, audit logs, cost attribution, compliance. The question is not whether you need one. It is whether you build it or buy it. Here is the calc.