Streaming LLM Responses with the Vercel AI SDK + KairosRoute
The Vercel AI SDK has quietly become the default way to ship streaming LLM UIs in Next.js. streamText, generateObject, and the useChat hook handle the parts that used to be painful: token-by-token streaming over HTTP, React Server Component streaming, structured output with Zod schemas, and multi-turn tool calls.
What the AI SDK does not do is pick the right model for each request. By default you hardcode openai('gpt-5') or anthropic('claude-sonnet-4.5') and pay frontier prices on every single call — including the ones that would have been just fine on Haiku or Gemini Flash. This guide wires KairosRoute into the AI SDK so every call routes to the cheapest model that meets your quality bar. The React side does not change.
Why this migration is two lines
The AI SDK's @ai-sdk/openai package is built around a createOpenAI factory. You can pass any baseURL you want. KairosRoute speaks the OpenAI wire protocol, so the SDK treats us as "OpenAI with a different URL and key." Every feature — streaming, tool calls, JSON mode, structured output, vision — flows through unchanged.
The one thing we change is the model string. Pass "kr-auto" to get cost-aware routing, or pass any model ID from our 45-model registry ("claude-sonnet-4.5", "gpt-5", "gemini-2.5-pro", etc.) to pin a specific model. Zero markup on provider costs — we take a flat gateway fee against your plan's token allotment.
Step 1: Create the KairosRoute provider
```bash
npm install ai @ai-sdk/openai zod
```
Create a tiny module that builds a shared KairosRoute-backed OpenAI provider. Do this once; import it everywhere.
```ts
// src/lib/kr.ts
import { createOpenAI } from '@ai-sdk/openai';

export const kr = createOpenAI({
  baseURL: 'https://api.kairosroute.com/v1',
  apiKey: process.env.KAIROSROUTE_API_KEY!,
  // The SDK adds "openai" as the provider ID in traces.
  // Override if you want traces to say "kairosroute":
  name: 'kairosroute',
});
```
That is the whole provider setup. Now kr('kr-auto') returns a model object you can hand to any AI SDK function.
Step 2: Routed chat completions
Here is a standard Next.js route handler that streams a chat response to the browser. Note that nothing about this code is KairosRoute-specific beyond the import — the SDK does not know or care where the tokens come from.
```ts
// src/app/api/chat/route.ts
import { streamText } from 'ai';
import { kr } from '@/lib/kr';

export const runtime = 'edge';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: kr('kr-auto'),
    messages,
    system: 'You are a concise, friendly assistant.',
  });

  return result.toDataStreamResponse();
}
```
Every incoming message is classified and routed server-side. A casual one-liner might land on Haiku. A dense technical question might land on Sonnet. A coding task might land on GPT-5. You pay provider cost on each, plus our thin gateway fee.
The React side stays identical
```tsx
// src/app/chat/page.tsx
'use client';

import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat();

  return (
    <div className="mx-auto max-w-2xl p-6">
      {messages.map((m) => (
        <div key={m.id} className="mb-4">
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask anything..."
          disabled={isLoading}
          className="w-full rounded border px-3 py-2"
        />
      </form>
    </div>
  );
}
```
Token-by-token streaming works out of the box. KairosRoute streams Server-Sent Events from the upstream provider directly to your route handler, which pipes them to the browser. First-token latency is typically under 300ms on fast models and under 900ms on frontier ones — the router adds roughly 8–15ms to pick the model.
Step 3: Tool calling (multi-step)
The AI SDK makes multi-step tool-use ergonomic with tools and maxSteps. These work through KairosRoute without modification. The router notices tool schemas in the request and biases toward models that score well on tool use in our internal evals.
```ts
import { streamText, tool } from 'ai';
import { z } from 'zod';
import { kr } from '@/lib/kr';
import { db } from '@/lib/db'; // your app's data layer (not shown here)

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: kr('kr-auto'),
    messages,
    tools: {
      getWeather: tool({
        description: 'Get current weather for a city.',
        parameters: z.object({ city: z.string() }),
        execute: async ({ city }) => {
          const data = await fetch(`https://weather.example/${city}`).then(
            (r) => r.json(),
          );
          return { tempF: data.temp, conditions: data.conditions };
        },
      }),
      searchOrders: tool({
        description: 'Look up a customer order by ID.',
        parameters: z.object({ orderId: z.string() }),
        execute: async ({ orderId }) => db.orders.find(orderId),
      }),
    },
    maxSteps: 5,
  });

  return result.toDataStreamResponse();
}
```
The SDK handles the tool-call loop automatically. Each step's completion is routed independently — the first step might pick a tool-reliable model, the final synthesis step might pick something cheaper. You can inspect which model served each step in the response metadata.
Step 4: Structured output with generateObject
generateObject is the SDK's best-kept secret. You hand it a Zod schema, and it returns a validated, typed object. KairosRoute translates OpenAI's JSON-schema mode into each provider's native equivalent, so your Zod schema works regardless of which model the router picks.
```ts
import { generateObject } from 'ai';
import { z } from 'zod';
import { kr } from '@/lib/kr';

const TicketSchema = z.object({
  priority: z.enum(['low', 'medium', 'high', 'urgent']),
  category: z.enum(['billing', 'auth', 'bug', 'feature', 'other']),
  summary: z.string().max(140),
  suggestedReply: z.string(),
});

export async function classifyTicket(body: string) {
  const { object } = await generateObject({
    model: kr('kr-auto'),
    schema: TicketSchema,
    prompt: `Classify this customer message:\n\n${body}`,
  });
  return object; // fully typed: { priority, category, summary, suggestedReply }
}
```
Structured-output workloads are where routing pays off the hardest. Classification and extraction rarely need frontier reasoning, so kr-auto almost always picks a Haiku/Flash-class model — and you get the exact same validated Zod object.
Step 5: React Server Components streaming
The AI SDK's streamUI / createStreamableUI primitives let you stream React components from a server action straight into your client tree. This also works over KairosRoute.
```tsx
// src/app/actions.tsx
'use server';

import { streamUI } from 'ai/rsc';
import { z } from 'zod';
import { kr } from '@/lib/kr';
import { db } from '@/lib/db'; // your app's data layer (not shown here)
import { WeatherCard, OrderTable, Spinner } from '@/components/stream-ui';

export async function ask(question: string) {
  const result = await streamUI({
    model: kr('kr-auto'),
    prompt: question,
    text: ({ content }) => <div>{content}</div>,
    tools: {
      showWeather: {
        description: 'Render a weather card.',
        parameters: z.object({ city: z.string() }),
        generate: async function* ({ city }) {
          yield <Spinner />;
          const data = await fetch(`https://weather.example/${city}`).then(
            (r) => r.json(),
          );
          return <WeatherCard city={city} {...data} />;
        },
      },
      showOrders: {
        description: 'Render a table of recent orders.',
        parameters: z.object({ customerId: z.string() }),
        generate: async function* ({ customerId }) {
          yield <Spinner />;
          const orders = await db.orders.forCustomer(customerId);
          return <OrderTable orders={orders} />;
        },
      },
    },
  });

  return result.value;
}
```
The client calls ask('What orders did customer 42 place?'), and React streams a <Spinner /> then an <OrderTable /> into place. The LLM's decision about which component to render is routed through kr-auto — you get cost savings even on UI-generation workloads.
Before and after: a real AI SDK cost comparison
Illustrative numbers from a Next.js SaaS product using streamText for its in-app assistant (~120K calls/month):
- Before (pinned to GPT-5): ~800 tokens avg × 120K calls = 96M tokens at GPT-5's blended rate ≈ $1,900/month.
- After (kr-auto): router distribution across chat, classify, tool-use, and summarize workloads. Blended cost ≈ $380/month.
- Savings: 80%. p50 latency actually dropped — lighter models answered faster on most requests.
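The arithmetic behind those numbers, with the blended per-million-token rates as illustrative assumptions rather than quoted prices:

```ts
// Back-of-envelope for the comparison above. The blended $/1M-token
// rates are assumptions chosen to match the article's totals.
const callsPerMonth = 120_000;
const avgTokensPerCall = 800;
const totalTokens = callsPerMonth * avgTokensPerCall; // 96,000,000

const pinnedRatePerM = 19.8; // assumed GPT-5 blended $/1M tokens
const routedRatePerM = 3.96; // assumed kr-auto blended $/1M tokens

const pinnedCost = (totalTokens / 1e6) * pinnedRatePerM; // ≈ $1,900/month
const routedCost = (totalTokens / 1e6) * routedRatePerM; // ≈ $380/month
const savings = 1 - routedCost / pinnedCost; // ≈ 0.80

console.log(pinnedCost.toFixed(0), routedCost.toFixed(0), savings.toFixed(2));
```

The savings ratio is just the ratio of blended rates, which is why shifting the routing distribution toward cheaper models moves the bill linearly.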
A common follow-on change is adding generateObject-based ticket triage or email classification. Those classification jobs are where routing shines brightest; expect 90%+ savings on those specific endpoints versus pinning a frontier model.
Gotchas we have seen in the field
Edge runtime vs. Node runtime
KairosRoute works on both. runtime = 'edge' is great for latency, but your tool execute functions cannot use Node-only APIs (fs, net, most database drivers). If a tool needs the Node runtime, set runtime = 'nodejs' on that route — streaming still works.
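For example, a route whose tools hit a Postgres driver might opt out of the edge runtime with Next.js route segment config (the path and reason here are illustrative):

```ts
// src/app/api/orders/route.ts
// This route's tools use a Node-only DB driver, so run it on Node.
export const runtime = 'nodejs';
```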
Tracking the actual model picked
The SDK's onFinish callback surfaces the raw response metadata. Every KairosRoute response carries an x-kr-routed-model header naming the model that actually served the request — log it to your analytics pipe so you can slice cost and latency by actual-model-used, not just by kr-auto.
```ts
const result = streamText({
  model: kr('kr-auto'),
  messages,
  onFinish: ({ rawResponse, usage }) => {
    const routed = rawResponse?.headers?.['x-kr-routed-model'];
    analytics.track('llm_call', {
      routedModel: routed,
      promptTokens: usage.promptTokens,
      completionTokens: usage.completionTokens,
    });
  },
});
```
Environment variable on Vercel
Set KAIROSROUTE_API_KEY in the Vercel project settings, not just locally. For preview deployments, give each environment its own key so you can revoke preview keys without touching production.
Client-side vs. server-side
Never put your KairosRoute key in a client component — the AI SDK is designed so the key lives on the server, and the browser only talks to your route handlers. If you are doing something exotic, use per-user scoped keys from our POST /v1/keys endpoint and rotate them aggressively.
Request-scoped model override
If a specific route needs a pinned model (say, a legal-doc summarizer that must use Sonnet), pass the model ID directly instead of kr-auto. You can mix and match in the same app — most routes on kr-auto, a few critical ones pinned.
```ts
// Critical route: pin to Sonnet for deterministic behavior.
const result = streamText({
  model: kr('claude-sonnet-4.5'),
  messages,
});
```
Rollout checklist
- Add KAIROSROUTE_API_KEY to your environment. Create the shared kr provider in src/lib/kr.ts.
- Swap openai('gpt-5') to kr('kr-auto') in one route handler. Deploy to preview. Verify the stream still works and the dashboard shows traffic.
- Do the same for every other streamText, generateText, generateObject, and streamUI call. Keep pinned models only where you need them.
- Add onFinish logging of x-kr-routed-model so you can audit the routing distribution.
- Run your eval suite (you have one, right?). Compare quality against the pinned-GPT-5 baseline. Cut over production when the numbers look right.
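As a sketch of that audit step: once x-kr-routed-model values are flowing into analytics, a small helper (the event shape and names below are assumptions, not part of the SDK) can summarize where kr-auto is actually sending traffic:

```ts
type LlmCallEvent = {
  routedModel: string;
  promptTokens: number;
  completionTokens: number;
};

// Share of calls and share of total tokens per routed model.
function routingDistribution(events: LlmCallEvent[]) {
  const byModel = new Map<string, { calls: number; tokens: number }>();
  let totalCalls = 0;
  let totalTokens = 0;
  for (const e of events) {
    const tokens = e.promptTokens + e.completionTokens;
    const entry = byModel.get(e.routedModel) ?? { calls: 0, tokens: 0 };
    entry.calls += 1;
    entry.tokens += tokens;
    byModel.set(e.routedModel, entry);
    totalCalls += 1;
    totalTokens += tokens;
  }
  return [...byModel.entries()]
    .map(([model, { calls, tokens }]) => ({
      model,
      callShare: calls / totalCalls,
      tokenShare: tokens / totalTokens,
    }))
    .sort((a, b) => b.tokenShare - a.tokenShare); // biggest spender first
}

// Example with three logged events:
const dist = routingDistribution([
  { routedModel: 'claude-haiku', promptTokens: 100, completionTokens: 100 },
  { routedModel: 'claude-haiku', promptTokens: 150, completionTokens: 50 },
  { routedModel: 'gpt-5', promptTokens: 300, completionTokens: 300 },
]);
console.log(dist);
```

Token share, not call share, is what tracks your bill: one frontier-model call can outweigh many cheap ones.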
Related reading
If you are building long-running chains instead of per-request streams, see the LangChain cost routing guide. For multi-agent workloads, the CrewAI per-agent routing guide covers the same pattern at a higher level. The shortest possible migration is the OpenAI Migration Guide.
Try it in your Next.js app
The playground lets you paste an AI SDK prompt and watch the routed stream in real time. Full end-to-end recipes — including a copy-paste Next.js starter template — live at docs/migration.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.