Add Cost-Aware Routing to Your LangChain App in 10 Minutes
If your LangChain app is built around ChatOpenAI — and most of them are — you are ten minutes away from cutting 50–85% off your model spend. No new abstractions. No rewriting your chains. You change two constructor arguments, deploy, and the routing starts working on the very next request.
This guide walks through the full drop-in migration: basic chat, LCEL pipelines, tool calling, streaming, and the gotchas that trip people up (callbacks, token counting, and what happens when a provider goes down mid-stream). We will end with before/after cost numbers from a representative production chain.
Why LangChain + KairosRoute works
LangChain's ChatOpenAI wrapper does not actually require OpenAI. It just speaks the OpenAI wire protocol. KairosRoute exposes the exact same protocol at https://api.kairosroute.com/v1, so LangChain treats us as "OpenAI with a different base URL." All of the features you rely on — function calling, structured output, streaming deltas, with_structured_output, bind_tools — flow through unchanged.
The difference is what happens server-side. When you set model="kr-auto", we classify the incoming request (code vs. reasoning vs. extraction vs. chat vs. summarization vs. tool use), score it against our live price feed for all 45+ models, and dispatch to the cheapest one that clears the quality threshold for that task type. You get one invoice, one dashboard, and zero markup on provider costs — you pay Anthropic what Anthropic charges, OpenAI what OpenAI charges, and we take a thin gateway fee.
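As a mental model for that selection step (this is a toy illustration, not our actual implementation, and the scores and prices below are made up), the router filters the candidate models to those whose quality score for the detected task type clears the threshold, then picks the cheapest:

```python
# Toy sketch of the routing decision. Model names, prices, and quality
# scores are illustrative placeholders, not real KairosRoute data.

MODELS = [
    # (name, price per 1M tokens, quality score by task type)
    ("flash-lite", 0.30, {"chat": 0.78, "reasoning": 0.55}),
    ("haiku",      0.80, {"chat": 0.84, "reasoning": 0.62}),
    ("sonnet",     6.00, {"chat": 0.93, "reasoning": 0.90}),
]

def route(task_type: str, threshold: float) -> str:
    """Return the cheapest model whose score for this task clears the bar."""
    eligible = [(price, name) for name, price, scores in MODELS
                if scores.get(task_type, 0.0) >= threshold]
    return min(eligible)[1]

print(route("chat", 0.75))       # an easy chat turn lands on a cheap model
print(route("reasoning", 0.85))  # a hard reasoning query lands on a frontier model
```

The real router also factors in live price-feed updates, tool-use requirements, and context length, but the filter-then-minimize shape is the right intuition.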
Step 1: Install and swap your client
You probably already have langchain-openai. If not:
```bash
pip install langchain langchain-openai
```
Here is the before-and-after. Your existing code:
```python
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
)
```

becomes:

```python
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)
```

That is the entire migration. Every chain, every agent, every retriever that consumed `llm` now routes through KairosRoute. You can also pin a specific model if you want deterministic routing — pass `model="claude-sonnet-4.5"` or `model="gpt-5-mini"` or any of the 45+ IDs in our registry.
Step 2: Wire it into an LCEL chain
LCEL is LangChain's composition operator — the | pipe. Your chains do not change at all. Here is a RAG-style pipeline with a prompt template, the routed LLM, and an output parser:
```python
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer customer-support questions using the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

chain = prompt | llm | StrOutputParser()

answer = chain.invoke({
    "context": retrieved_docs_text,
    "question": "How do I export my data?",
})
```

Every invocation is routed. A short extraction question might land on Haiku or Gemini Flash for a fraction of a cent. A gnarly multi-step reasoning query might land on Sonnet or GPT-5. You do not have to think about it — the router does.
Step 3: Tool calling still works
This is the part most developers worry about. The answer: tool calling is part of the OpenAI wire protocol, so it works identically. Use bind_tools exactly as you would with OpenAI:
```python
import os

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return fetch_weather(city)

@tool
def search_orders(customer_id: str) -> list[dict]:
    """Look up recent orders for a customer."""
    return db.query_orders(customer_id)

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

llm_with_tools = llm.bind_tools([get_weather, search_orders])
result = llm_with_tools.invoke("What orders did customer 42 place this week?")
# result.tool_calls -> [{"name": "search_orders", "args": {"customer_id": "42"}, ...}]
```

Under the hood, the router notices the tools in the request and biases toward models that have strong tool-use track records on our internal eval harness. If you are on kr-auto, you will typically land on Sonnet, GPT-5, or Gemini 2.5 Pro for tool-heavy calls — never on a model that fumbles function-calling JSON.
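One thing LangChain does not do for you: `result.tool_calls` lists the calls the model wants, but nothing executes them. A minimal dispatch sketch, in plain Python with no live model (`run_tool_calls` and the stand-in `search_orders` here are illustrative, not part of any API):

```python
# Sketch: executing the tool calls the model returned. Each entry in
# result.tool_calls is shaped like {"name": ..., "args": ..., "id": ...}.
# We dispatch against a local registry and collect (id, output) pairs to
# send back as tool results on the next model turn.

def run_tool_calls(tool_calls, registry):
    """Dispatch each tool call to its Python function."""
    results = []
    for call in tool_calls:
        fn = registry[call["name"]]  # KeyError here means the model named an unknown tool
        results.append((call.get("id"), fn(**call["args"])))
    return results

# Stand-in implementation so the sketch is self-contained.
def search_orders(customer_id: str) -> list[dict]:
    return [{"order_id": "A1", "customer_id": customer_id}]

registry = {"search_orders": search_orders}

# A payload in the shape LangChain produces:
calls = [{"name": "search_orders", "args": {"customer_id": "42"}, "id": "call_1"}]
print(run_tool_calls(calls, registry))
```

In a real agent loop you would wrap each output in a `ToolMessage` keyed by the call id and invoke the model again; LangGraph and LangChain's agent runtimes do exactly this for you.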
Step 4: Streaming deltas
Streaming is load-bearing for chat UIs. KairosRoute streams Server-Sent Events in OpenAI's exact format, so .stream() and .astream() both work:
```python
async for chunk in llm.astream("Explain how vector embeddings work."):
    print(chunk.content, end="", flush=True)
```

First-token latency is typically under 300ms for fast models (Haiku, Flash, Groq-hosted Llama) and under 900ms for frontier models. We stream directly from the upstream provider — we do not buffer on our side — so there is no added delay past the initial routing decision (typically 8–15ms).
Before and after: a real LangChain cost comparison
Here is a representative workload from a customer-support LangChain agent that processes around 80,000 messages per month. Numbers are rounded but drawn from a real migration we helped with.
- Before (pinned to GPT-4o): 80K messages × ~1,400 tokens avg (in+out) × GPT-4o blended price = roughly $420/month.
- After (kr-auto): router sends ~55% of messages to Haiku/Flash, ~30% to Sonnet, ~15% to GPT-5 for the genuinely hard ones. Blended cost = roughly $84/month.
- Savings: 80%. Quality score on the eval harness stayed within 1.2 points of the pinned-GPT-4o baseline.
Because we charge zero markup on the provider bill, that savings flows directly to you. We make our money on the flat gateway fee (at the Team tier: $99/mo for 10M tokens included, $0.40 per 1M over). For a workload of this size, gateway fees are a small fraction of the provider savings.
Gotchas we have seen in the field
Callbacks and LangSmith tracing
LangSmith sees your request as it leaves the ChatOpenAI wrapper — it does not know or care that the base URL is ours. Traces record the model field you passed (kr-auto). If you want LangSmith traces to show which actual model the router picked, read the response.response_metadata["kr_routed_model"] field we populate on every response and log it yourself:
```python
import os

from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI

class RoutingLogger(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        for gen in response.generations[0]:
            meta = gen.message.response_metadata
            print(f"routed to: {meta.get('kr_routed_model')}")

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
    callbacks=[RoutingLogger()],
)
```

Token counting
LangChain's get_openai_callback() context manager counts tokens using OpenAI's tokenizer. If kr-auto picks a non-OpenAI model (Claude, Gemini, Llama), the local token count will be slightly off. The billable count — the one on your invoice — is what we return in response.usage, which we compute using each model's native tokenizer. Always trust the server-side count for billing.
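If you want to attribute spend per request, the server-side usage on each response is the number to use. A sketch of turning it into a dollar figure (the `input_tokens`/`output_tokens` keys match LangChain's standard `usage_metadata` fields on an `AIMessage`; the per-token prices below are made-up placeholders, not real rates):

```python
# Sketch: costing a request from the server-side usage counts. With
# LangChain you would do `msg = llm.invoke(...)` and read
# `msg.usage_metadata` (keys: input_tokens, output_tokens, total_tokens).
# Prices here are placeholders for illustration only.

PRICES_PER_1M = {"claude-haiku": (0.80, 4.00)}  # (input, output) USD per 1M tokens

def request_cost(model: str, usage: dict) -> float:
    """Dollar cost of one request from its token usage."""
    p_in, p_out = PRICES_PER_1M[model]
    return (usage["input_tokens"] * p_in + usage["output_tokens"] * p_out) / 1_000_000

usage = {"input_tokens": 1200, "output_tokens": 300, "total_tokens": 1500}
print(round(request_cost("claude-haiku", usage), 6))  # -> 0.00216
```

Pair this with the `kr_routed_model` metadata field and you can break spend down by the model the router actually chose.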
Structured output
with_structured_output() uses OpenAI's JSON schema mode under the hood. KairosRoute translates this to each provider's native equivalent (Anthropic tool-use JSON, Gemini responseSchema, etc.) so the same Pydantic model works across routes:
```python
from pydantic import BaseModel

class Ticket(BaseModel):
    priority: str
    category: str
    summary: str

structured_llm = llm.with_structured_output(Ticket)
result = structured_llm.invoke("Customer cannot log in, tried password reset twice.")
# result -> Ticket(priority="high", category="auth", summary="...")
```

Retries and fallbacks
LangChain's .with_fallbacks() still works if you want belt-and-suspenders. But KairosRoute already does provider-level fallback internally: if the primary route 5xxs or times out, we transparently retry on the next-cheapest model that meets the quality bar. You usually do not need LangChain-level fallbacks on top of ours.
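If you do want a client-side belt on top, `.with_fallbacks()` wraps a runnable with alternates tried in order. The semantics are simple enough to sketch in plain Python (the function names here are stand-ins; in LangChain you would write `primary_llm.with_fallbacks([backup_llm])` and get the same try-in-order behavior):

```python
# Sketch of fallback semantics, no live models involved.

def invoke_with_fallbacks(runnables, prompt):
    """Try each runnable in order; return the first success, re-raise the last failure."""
    last_err = None
    for run in runnables:
        try:
            return run(prompt)
        except Exception as err:  # LangChain lets you narrow this via exceptions_to_handle
            last_err = err
    raise last_err

def flaky(prompt):
    raise TimeoutError("upstream 504")

def backup(prompt):
    return f"answer to: {prompt}"

print(invoke_with_fallbacks([flaky, backup], "hi"))  # -> answer to: hi
```

Because KairosRoute already retries across providers before you ever see an error, reserve LangChain-level fallbacks for cases where the whole gateway is unreachable.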
A full working example: a routed RAG chain
Here is an end-to-end LCEL chain: embed, retrieve, answer, structured output — all routed.
```python
import os

from pydantic import BaseModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

# Embeddings can also be routed through KairosRoute.
embeddings = OpenAIEmbeddings(
    model="kr-embed-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

vectorstore = Chroma(persist_directory="./kb", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

class Answer(BaseModel):
    answer: str
    sources: list[str]
    confidence: float

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. Cite sources."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm.with_structured_output(Answer)
)

result = chain.invoke("What is the refund policy for annual plans?")
print(result.answer, result.sources, result.confidence)
```

That chain used to cost a particular startup around $1,100/month pinned to GPT-4o. On kr-auto, same quality bar, it runs at $210/month. The migration took one afternoon.
Production rollout checklist
- Swap the `ChatOpenAI` constructor arguments in a single config module so you can toggle the base URL with an env var.
- Deploy to staging with `model="kr-auto"`. Run your existing LangChain eval suite. Compare aggregate quality scores to the pinned-OpenAI baseline.
- Shadow traffic for a day. Log `response_metadata["kr_routed_model"]` to your analytics pipe so you can see the routing distribution.
- Cut over production. Watch the KairosRoute dashboard for the first 24 hours — savings, p95 latency, error rate, and routing breakdown are all visible.
- If a specific chain has a strict quality bar, pin it to a named model (`claude-sonnet-4.5`, `gpt-5`) and let everything else stay on `kr-auto`.
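The first checklist item can be as small as one module that every chain imports, with the routing toggled by an environment variable. A sketch under our own naming assumptions (the module name `llm_config.py` and the `USE_KAIROSROUTE`/`KR_MODEL` variables are just suggestions, not anything KairosRoute requires):

```python
# llm_config.py -- hypothetical single point of control for the swap.
import os

def make_llm_kwargs() -> dict:
    """Return ChatOpenAI constructor kwargs; flip USE_KAIROSROUTE to toggle routing."""
    if os.environ.get("USE_KAIROSROUTE", "1") == "1":
        return {
            "model": os.environ.get("KR_MODEL", "kr-auto"),
            "base_url": "https://api.kairosroute.com/v1",
            "api_key": os.environ.get("KAIROSROUTE_API_KEY", ""),
        }
    # Fallback: the original pinned-OpenAI configuration.
    return {
        "model": "gpt-4o",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    }

# Elsewhere in the app:
#   from llm_config import make_llm_kwargs
#   llm = ChatOpenAI(**make_llm_kwargs())
print(make_llm_kwargs()["model"])
```

With this in place, rolling back from staging is a one-variable change rather than a redeploy of every chain.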
Related reading
If you are curious about the routing machinery, read How kr-auto Works. If you came here from vanilla OpenAI and want the bare-metal version, see the OpenAI Migration Guide. Running multi-agent systems? The CrewAI per-agent routing guide shows how to assign a different model policy to each crew member.
Ready to try it?
Open the playground and paste your LangChain prompt — you will see the routed model, the latency, and the cost breakdown in real time. When you are ready to migrate for real, the full step-by-step is at docs/migration.