Add Cost-Aware Routing to Your LangChain App in 10 Minutes
If your LangChain app is built around ChatOpenAI — and most of them are — you are ten minutes away from cutting 50–85% off your model spend. No new abstractions. No rewriting your chains. You change two constructor arguments, deploy, and the routing starts working on the very next request.
This guide walks through the full drop-in migration: basic chat, LCEL pipelines, tool calling, streaming, and the gotchas that trip people up (callbacks, token counting, and what happens when a provider goes down mid-stream). We will end with before/after cost numbers from a representative production chain.
Why LangChain + KairosRoute works
LangChain's ChatOpenAI wrapper does not actually require OpenAI. It just speaks the OpenAI wire protocol. KairosRoute exposes the exact same protocol at https://api.kairosroute.com/v1, so LangChain treats us as "OpenAI with a different base URL." All of the features you rely on — function calling, structured output, streaming deltas, with_structured_output, bind_tools — flow through unchanged.
The difference is what happens server-side. When you set model="kr-auto", we classify the incoming request (code vs. reasoning vs. extraction vs. chat vs. summarization vs. tool use), score it against our live price feed for all 45+ models, and dispatch to the cheapest one that clears the quality threshold for that task type. You get one invoice, one dashboard, and zero markup on provider costs — you pay Anthropic what Anthropic charges, OpenAI what OpenAI charges, and we take a thin gateway fee.
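As a mental model for that selection step (this is a toy illustration, not our actual implementation, and the scores and prices below are made up), the router filters the candidate models to those whose quality score for the detected task type clears the threshold, then picks the cheapest:

```python
# Toy sketch of the routing decision. Model names, prices, and quality
# scores are illustrative placeholders, not real KairosRoute data.

MODELS = [
    # (name, price per 1M tokens, quality score by task type)
    ("flash-lite", 0.30, {"chat": 0.78, "reasoning": 0.55}),
    ("haiku",      0.80, {"chat": 0.84, "reasoning": 0.62}),
    ("sonnet",     6.00, {"chat": 0.93, "reasoning": 0.90}),
]

def route(task_type: str, threshold: float) -> str:
    """Return the cheapest model whose score for this task clears the bar."""
    eligible = [(price, name) for name, price, scores in MODELS
                if scores.get(task_type, 0.0) >= threshold]
    return min(eligible)[1]

print(route("chat", 0.75))       # an easy chat turn lands on a cheap model
print(route("reasoning", 0.85))  # a hard reasoning query lands on a frontier model
```

The real router also factors in live price-feed updates, tool-use requirements, and context length, but the filter-then-minimize shape is the right intuition.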
Step 1: Install and swap your client
You probably already have langchain-openai. If not:
```bash
pip install langchain langchain-openai
```
Here is the before-and-after. Your existing code:
```python
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
)
```

becomes:

```python
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)
```

That is the entire migration. Every chain, every agent, every retriever that consumed `llm` now routes through KairosRoute. You can also pin a specific model if you want deterministic routing — pass `model="claude-sonnet-4.5"` or `model="gpt-5-mini"` or any of the 45+ IDs in our registry.
Step 2: Wire it into an LCEL chain
LCEL is LangChain's composition operator — the | pipe. Your chains do not change at all. Here is a RAG-style pipeline with a prompt template, the routed LLM, and an output parser:
```python
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer customer-support questions using the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

chain = prompt | llm | StrOutputParser()

answer = chain.invoke({
    "context": retrieved_docs_text,
    "question": "How do I export my data?",
})
```

Every invocation is routed. A short extraction question might land on Haiku or Gemini Flash for a fraction of a cent. A gnarly multi-step reasoning query might land on Sonnet or GPT-5. You do not have to think about it — the router does.
Step 3: Tool calling still works
This is the part most developers worry about. The answer: tool calling is part of the OpenAI wire protocol, so it works identically. Use bind_tools exactly as you would with OpenAI:
```python
import os

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return fetch_weather(city)

@tool
def search_orders(customer_id: str) -> list[dict]:
    """Look up recent orders for a customer."""
    return db.query_orders(customer_id)

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

llm_with_tools = llm.bind_tools([get_weather, search_orders])
result = llm_with_tools.invoke("What orders did customer 42 place this week?")
# result.tool_calls -> [{"name": "search_orders", "args": {"customer_id": "42"}, ...}]
```

Under the hood, the router notices the tools in the request and biases toward models that have strong tool-use track records on our internal eval harness. If you are on kr-auto, you will typically land on Sonnet, GPT-5, or Gemini 2.5 Pro for tool-heavy calls — never on a model that fumbles function-calling JSON.
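One thing LangChain does not do for you: `result.tool_calls` lists the calls the model wants, but nothing executes them. A minimal dispatch sketch, in plain Python with no live model (`run_tool_calls` and the stand-in `search_orders` here are illustrative, not part of any API):

```python
# Sketch: executing the tool calls the model returned. Each entry in
# result.tool_calls is shaped like {"name": ..., "args": ..., "id": ...}.
# We dispatch against a local registry and collect (id, output) pairs to
# send back as tool results on the next model turn.

def run_tool_calls(tool_calls, registry):
    """Dispatch each tool call to its Python function."""
    results = []
    for call in tool_calls:
        fn = registry[call["name"]]  # KeyError here means the model named an unknown tool
        results.append((call.get("id"), fn(**call["args"])))
    return results

# Stand-in implementation so the sketch is self-contained.
def search_orders(customer_id: str) -> list[dict]:
    return [{"order_id": "A1", "customer_id": customer_id}]

registry = {"search_orders": search_orders}

# A payload in the shape LangChain produces:
calls = [{"name": "search_orders", "args": {"customer_id": "42"}, "id": "call_1"}]
print(run_tool_calls(calls, registry))
```

In a real agent loop you would wrap each output in a `ToolMessage` keyed by the call id and invoke the model again; LangGraph and LangChain's agent runtimes do exactly this for you.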
Step 4: Streaming deltas
Streaming is load-bearing for chat UIs. KairosRoute streams Server-Sent Events in OpenAI's exact format, so .stream() and .astream() both work:
```python
async for chunk in llm.astream("Explain how vector embeddings work."):
    print(chunk.content, end="", flush=True)
```

First-token latency is typically under 300ms for fast models (Haiku, Flash, Groq-hosted Llama) and under 900ms for frontier models. We stream directly from the upstream provider — we do not buffer on our side — so there is no added delay past the initial routing decision (typically 8–15ms).
Before and after: a real LangChain cost comparison
Here is a representative workload from a customer-support LangChain agent that processes around 80,000 messages per month. Numbers are rounded but drawn from a real migration we helped with.
- Before (pinned to GPT-4o): 80K messages × ~1,400 tokens avg (in+out) × GPT-4o blended price = roughly $420/month.
- After (kr-auto): router sends ~55% of messages to Haiku/Flash, ~30% to Sonnet, ~15% to GPT-5 for the genuinely hard ones. Blended cost = roughly $84/month.
- Savings: 80%. Quality score on the eval harness stayed within 1.2 points of the pinned-GPT-4o baseline.
Because we charge zero markup on the provider bill, that savings flows directly to you. We make our money on the flat gateway fee (at the Team tier: $99/mo for 10M tokens included, $0.40 per 1M over). For a workload of this size, gateway fees are a small fraction of the provider savings.
Gotchas we have seen in the field
Callbacks and LangSmith tracing
LangSmith sees your request as it leaves the ChatOpenAI wrapper — it does not know or care that the base URL is ours. Traces record the model field you passed (kr-auto). If you want LangSmith traces to show which actual model the router picked, read the response.response_metadata["kr_routed_model"] field we populate on every response and log it yourself:
```python
import os

from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI

class RoutingLogger(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        for gen in response.generations[0]:
            meta = gen.message.response_metadata
            print(f"routed to: {meta.get('kr_routed_model')}")

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
    callbacks=[RoutingLogger()],
)
```

Token counting
LangChain's get_openai_callback() context manager counts tokens using OpenAI's tokenizer. If kr-auto picks a non-OpenAI model (Claude, Gemini, Llama), the local token count will be slightly off. The billable count — the one on your invoice — is what we return in response.usage, which we compute using each model's native tokenizer. Always trust the server-side count for billing.
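If you want to attribute spend per request, the server-side usage on each response is the number to use. A sketch of turning it into a dollar figure (the `input_tokens`/`output_tokens` keys match LangChain's standard `usage_metadata` fields on an `AIMessage`; the per-token prices below are made-up placeholders, not real rates):

```python
# Sketch: costing a request from the server-side usage counts. With
# LangChain you would do `msg = llm.invoke(...)` and read
# `msg.usage_metadata` (keys: input_tokens, output_tokens, total_tokens).
# Prices here are placeholders for illustration only.

PRICES_PER_1M = {"claude-haiku": (0.80, 4.00)}  # (input, output) USD per 1M tokens

def request_cost(model: str, usage: dict) -> float:
    """Dollar cost of one request from its token usage."""
    p_in, p_out = PRICES_PER_1M[model]
    return (usage["input_tokens"] * p_in + usage["output_tokens"] * p_out) / 1_000_000

usage = {"input_tokens": 1200, "output_tokens": 300, "total_tokens": 1500}
print(round(request_cost("claude-haiku", usage), 6))  # -> 0.00216
```

Pair this with the `kr_routed_model` metadata field and you can break spend down by the model the router actually chose.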
Structured output
with_structured_output() uses OpenAI's JSON schema mode under the hood. KairosRoute translates this to each provider's native equivalent (Anthropic tool-use JSON, Gemini responseSchema, etc.) so the same Pydantic model works across routes:
```python
from pydantic import BaseModel

class Ticket(BaseModel):
    priority: str
    category: str
    summary: str

structured_llm = llm.with_structured_output(Ticket)
result = structured_llm.invoke("Customer cannot log in, tried password reset twice.")
# result -> Ticket(priority="high", category="auth", summary="...")
```

Retries and fallbacks
LangChain's .with_fallbacks() still works if you want belt-and-suspenders. But KairosRoute already does provider-level fallback internally: if the primary route 5xxs or times out, we transparently retry on the next-cheapest model that meets the quality bar. You usually do not need LangChain-level fallbacks on top of ours.
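If you do want a client-side belt on top, `.with_fallbacks()` wraps a runnable with alternates tried in order. The semantics are simple enough to sketch in plain Python (the function names here are stand-ins; in LangChain you would write `primary_llm.with_fallbacks([backup_llm])` and get the same try-in-order behavior):

```python
# Sketch of fallback semantics, no live models involved.

def invoke_with_fallbacks(runnables, prompt):
    """Try each runnable in order; return the first success, re-raise the last failure."""
    last_err = None
    for run in runnables:
        try:
            return run(prompt)
        except Exception as err:  # LangChain lets you narrow this via exceptions_to_handle
            last_err = err
    raise last_err

def flaky(prompt):
    raise TimeoutError("upstream 504")

def backup(prompt):
    return f"answer to: {prompt}"

print(invoke_with_fallbacks([flaky, backup], "hi"))  # -> answer to: hi
```

Because KairosRoute already retries across providers before you ever see an error, reserve LangChain-level fallbacks for cases where the whole gateway is unreachable.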
A full working example: a routed RAG chain
Here is an end-to-end LCEL chain: embed, retrieve, answer, structured output — all routed.
```python
import os

from pydantic import BaseModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

llm = ChatOpenAI(
    model="kr-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

# Embeddings can also be routed through KairosRoute.
embeddings = OpenAIEmbeddings(
    model="kr-embed-auto",
    base_url="https://api.kairosroute.com/v1",
    api_key=os.environ["KAIROSROUTE_API_KEY"],
)

vectorstore = Chroma(persist_directory="./kb", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

class Answer(BaseModel):
    answer: str
    sources: list[str]
    confidence: float

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. Cite sources."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm.with_structured_output(Answer)
)

result = chain.invoke("What is the refund policy for annual plans?")
print(result.answer, result.sources, result.confidence)
```

That chain used to cost a particular startup around $1,100/month pinned to GPT-4o. On kr-auto, same quality bar, it runs at $210/month. The migration took one afternoon.
Production rollout checklist
- Swap the `ChatOpenAI` constructor arguments in a single config module so you can toggle the base URL with an env var.
- Deploy to staging with `model="kr-auto"`. Run your existing LangChain eval suite. Compare aggregate quality scores to the pinned-OpenAI baseline.
- Shadow traffic for a day. Log `response_metadata["kr_routed_model"]` to your analytics pipe so you can see the routing distribution.
- Cut over production. Watch the KairosRoute dashboard for the first 24 hours — savings, p95 latency, error rate, and routing breakdown are all visible.
- If a specific chain has a strict quality bar, pin it to a named model (`claude-sonnet-4.5`, `gpt-5`) and let everything else stay on `kr-auto`.
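The first checklist item can be as small as one module that every chain imports, with the routing toggled by an environment variable. A sketch under our own naming assumptions (the module name `llm_config.py` and the `USE_KAIROSROUTE`/`KR_MODEL` variables are just suggestions, not anything KairosRoute requires):

```python
# llm_config.py -- hypothetical single point of control for the swap.
import os

def make_llm_kwargs() -> dict:
    """Return ChatOpenAI constructor kwargs; flip USE_KAIROSROUTE to toggle routing."""
    if os.environ.get("USE_KAIROSROUTE", "1") == "1":
        return {
            "model": os.environ.get("KR_MODEL", "kr-auto"),
            "base_url": "https://api.kairosroute.com/v1",
            "api_key": os.environ.get("KAIROSROUTE_API_KEY", ""),
        }
    # Fallback: the original pinned-OpenAI configuration.
    return {
        "model": "gpt-4o",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    }

# Elsewhere in the app:
#   from llm_config import make_llm_kwargs
#   llm = ChatOpenAI(**make_llm_kwargs())
print(make_llm_kwargs()["model"])
```

With this in place, rolling back from staging is a one-variable change rather than a redeploy of every chain.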
Related reading
If you are curious about the routing machinery, read How kr-auto Works. If you came here from vanilla OpenAI and want the bare-metal version, see the OpenAI Migration Guide. Running multi-agent systems? The CrewAI per-agent routing guide shows how to assign a different model policy to each crew member.
Ready to try it?
Open the playground and paste your LangChain prompt — you will see the routed model, the latency, and the cost breakdown in real time. When you are ready to migrate for real, the full step-by-step is at docs/migration.