The State of Agent Infrastructure, 2026
Every year the AI industry produces a dozen "state of" reports that mostly cite each other. This is ours, and we've tried to make it useful rather than cite-able. The data comes from two places: our routing fabric, which dispatched roughly 1.2B tokens across 10 providers and 45+ models in Q1 2026, and the 600-plus onboarding and support conversations we've had with AI teams in the last 12 months.
The executive summary, in four lines
If you only read this far:
- The average production AI app runs on 4.1 models. Up from 2.3 a year ago. Multi-model is the default; single-model is the exception.
- 38% of teams now use some form of model routing. Up from 11%. The rest mostly still don't know what they're missing.
- Cost-per-outcome dropped ~60% YoY. Driven roughly half by falling token prices, half by better architecture (routing, caching, smaller models).
- Observability is the next bottleneck. Only 27% of teams can tell you, unprompted, what their AI costs per feature. That number is going up, and fast.
What teams are running
Model mix
The average team using KairosRoute routes across 4.1 distinct models, from an average of 2.7 providers. A year ago those numbers were 2.3 and 1.6 respectively. The "one provider, one model" era is over for anyone with real production traffic.
Number of models in production    % of teams (Q1 2026)    % a year ago
──────────────────────────────────────────────────────────────────────
1 model                                   14%                  41%
2 models                                  21%                  28%
3 models                                  24%                  17%
4-5 models                                27%                  11%
6+ models                                 14%                   3%
What's driving the fan-out? Two things. First, teams are segmenting by task — a fast model for classification, a balanced model for summarization, a frontier model for hard reasoning. Second, teams are segmenting by customer — enterprise tiers get premium-model routing; free tiers get cost-optimized routing. Both patterns are durable.
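For concreteness, here is a minimal sketch of what that two-axis segmentation can look like in code. The model aliases, tier names, and the pick_model helper are illustrative, not KairosRoute's API; a production router would also weigh latency, context length, and live quality signals.

```python
# Illustrative task- and customer-tier routing table. All names here are
# hypothetical placeholders for whatever models you actually run.
TASK_ROUTES = {
    "classification": "fast-tier-model",
    "summarization": "balanced-tier-model",
    "reasoning": "frontier-tier-model",
}

CUSTOMER_OVERRIDES = {
    "enterprise": {"min_tier": "balanced"},  # premium plans never drop below balanced
    "free": {"max_tier": "balanced"},        # free plans never pay frontier prices
}

TIER_ORDER = ["fast", "balanced", "frontier"]

def pick_model(task: str, plan: str) -> str:
    """Resolve a model for a request, applying per-plan tier bounds."""
    model = TASK_ROUTES.get(task, "balanced-tier-model")
    idx = TIER_ORDER.index(model.split("-")[0])
    bounds = CUSTOMER_OVERRIDES.get(plan, {})
    if "min_tier" in bounds:
        idx = max(idx, TIER_ORDER.index(bounds["min_tier"]))
    if "max_tier" in bounds:
        idx = min(idx, TIER_ORDER.index(bounds["max_tier"]))
    return f"{TIER_ORDER[idx]}-tier-model"

print(pick_model("reasoning", "free"))        # -> balanced-tier-model
print(pick_model("classification", "enterprise"))  # -> balanced-tier-model
```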
Workload shape
Rough distribution of production traffic types we see on our fabric:
Workload type                        % of requests    % of dollar spend
───────────────────────────────────────────────────────────────────────
Conversational (chatbots, etc.)           34%                22%
Extraction / structured output            22%                 8%
Classification / routing                  15%                 4%
Summarization                             11%                 9%
Agent tool-calling loops                   9%                27%
RAG-augmented generation                   6%                12%
Reasoning / multi-step workflows           3%                18%
Note the asymmetry at the bottom. Reasoning workloads are 3% of requests but 18% of spend. Agent loops are 9% of requests but 27% of spend. The tokens you spend in heavy-reasoning and multi-step-agent traffic dominate your bill, even though they're a small fraction of volume. If you're optimizing cost, you start there.
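The asymmetry falls straight out of the table above: dividing spend share by request share gives a per-workload spend multiplier, a quick way to rank where each marginal request hurts most.

```python
# Spend multiplier = share of dollars / share of requests, using the
# (request %, spend %) pairs from the table above. Values above 1.0 mean
# the workload costs more per request than the fleet average.
workloads = {
    "Conversational": (34, 22),
    "Extraction / structured output": (22, 8),
    "Classification / routing": (15, 4),
    "Summarization": (11, 9),
    "Agent tool-calling loops": (9, 27),
    "RAG-augmented generation": (6, 12),
    "Reasoning / multi-step": (3, 18),
}

for name, (req_pct, spend_pct) in sorted(
    workloads.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True
):
    print(f"{name:32s} {spend_pct / req_pct:4.1f}x")
# Reasoning lands at 6.0x and agent loops at 3.0x: each marginal request
# there costs the most, which is why cost work starts at the bottom rows.
```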
Context window usage
A year ago the median production request used about 2,100 input tokens. In Q1 2026, that number is 4,800 — more than double, in a year. RAG, tool-use prompts, and larger system messages all contribute. Output tokens, by contrast, are roughly flat at about 320 tokens median.
This is why prompt caching went from "nice to have" to table stakes. Most of that added input is a stable prefix (system messages, tool-use scaffolding, RAG boilerplate), and stable prefixes are exactly what prompt caches discount. Teams that have rolled out prompt caching consistently report 30-50% cost savings on agent workloads.
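A back-of-the-envelope model shows why. The 0.1x price for cached input tokens below is an assumption (cache-read discounts vary by provider), but the shape of the result is robust across realistic values:

```python
# Back-of-the-envelope prompt-caching model. Prices are normalized;
# the 90% cache-read discount is an assumption, not a quoted rate.
INPUT_PRICE = 1.0          # normalized $ per 1M input tokens
CACHED_READ_PRICE = 0.1    # assumed 90% discount on cache hits

def input_cost(total_tokens: int, cacheable_fraction: float, hit_rate: float) -> float:
    cached = total_tokens * cacheable_fraction * hit_rate
    uncached = total_tokens - cached
    return (uncached * INPUT_PRICE + cached * CACHED_READ_PRICE) / 1_000_000

# A request where 80% of a 4,800-token prompt is a stable prefix
# (system message + tool schemas) that hits cache 90% of the time:
baseline = input_cost(4_800, 0.0, 0.0)
cached = input_cost(4_800, 0.8, 0.9)
print(f"savings: {1 - cached / baseline:.0%}")  # ~65% on input tokens alone
```

That is savings on input tokens only; blended with uncached output tokens, it is consistent with the 30-50% whole-workload savings teams report.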
Model mix trends
The balanced tier is eating the world
A year ago, a huge share of production traffic went to frontier models by default. Today, the median team uses frontier models for 19% of dispatched tokens and balanced-tier models (Sonnet 4.7, GPT-5 mini, Gemini 3 Pro) for 61%. Fast-tier models pick up the remaining 20%.
Tier share of dispatched tokens    Q2 2025    Q1 2026
──────────────────────────────────────────────────────
Frontier                             41%        19%
Balanced                             38%        61%
Fast                                 19%        20%
Specialist                            2%        — (reported separately)
The frontier is where you spend when you have to, not where you live. Teams still use it for hard reasoning and genuinely difficult workloads, but defaulting to frontier for every request has become visibly wasteful as balanced-tier quality has closed the gap.
Open-weight models are real now
Open-weight models — Llama 4, Qwen3, DeepSeek, Mistral — accounted for 34% of dispatched tokens on our fabric in Q1 2026, up from 12% a year ago. This is not a story about ideology or cost alone; it's mostly about latency and availability. Groq serving Llama at an 85ms p50 is a latency profile no closed model currently offers. Providers like Fireworks and Together have made open weights operationally competitive for the first time.
Specialist models are growing fastest
Specialist models — coder models, vision models, embedding models, rerankers — grew from 2% of our fabric's tokens to 6% in twelve months. Growth rate is the highest of any tier. As teams get more sophisticated about architecture, specialists eat more of the workflow.
Observability adoption
We ask every team that onboards: can you tell me, right now, how much your AI costs per feature per month? The distribution:
Observability tier                                   % of teams
────────────────────────────────────────────────────────────────
Tier 0 — Invoice only. No instrumentation.              39%
Tier 1 — Per-request logging, no aggregation.           23%
Tier 2 — Dashboards by feature or task.                 27%
Tier 3 — Dashboards + quality signals + alerts.         11%
62% of AI teams cannot tell you, without running a script, what any given feature of their product costs to operate. That's a stat we expect to look absurd three years from now — and the reason we've been writing a lot about observability as the next APM.
The good news: Tier 3 adoption tripled in the last year (from ~4% to 11%). The teams leading this shift tend to be either explicitly cost-conscious startups or mature platform-engineering orgs at enterprise scale. The middle of the market is the laggard, which is usually how these shifts play out.
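Getting from Tier 0 to Tier 2 is less work than teams expect. A minimal sketch, assuming a request log where every row carries token counts, per-1M-token prices, and a feature tag (all field names here are hypothetical):

```python
# Minimal Tier 2 instrumentation: tag every request with the feature
# that triggered it, then aggregate cost per feature from the log.
from collections import defaultdict

def cost_per_feature(rows):
    """rows: iterable of dicts with 'feature', 'input_tokens',
    'output_tokens', 'input_price', 'output_price' (per 1M tokens)."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["feature"]] += (
            r["input_tokens"] * r["input_price"]
            + r["output_tokens"] * r["output_price"]
        ) / 1_000_000
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

rows = [
    {"feature": "ticket-triage", "input_tokens": 1200, "output_tokens": 40,
     "input_price": 0.25, "output_price": 1.25},
    {"feature": "doc-summarize", "input_tokens": 6400, "output_tokens": 350,
     "input_price": 1.00, "output_price": 5.00},
]
print(cost_per_feature(rows))  # spend ranked by feature, most expensive first
```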
Cost-per-outcome improvements
The most interesting metric isn't cost per token; it's cost per outcome. For a fixed workload — "classify a support ticket," "summarize a document," "extract structured data from an email" — what did it cost a year ago, and what does it cost today?
Workload                             Cost/outcome Q2 2025    Cost/outcome Q1 2026    Change
─────────────────────────────────────────────────────────────────────────────────────────────
Support-ticket classification              $0.00071                $0.00014            -80%
Product description summarization          $0.00340                $0.00083            -76%
Invoice field extraction                   $0.00180                $0.00041            -77%
Long-document Q&A (RAG)                    $0.02600                $0.00790            -70%
Multi-step agent task                      $0.11000                $0.04200            -62%
Complex reasoning (single shot)            $0.06400                $0.03600            -44%
Cost-per-outcome is dropping 40-80% a year across every workload category we measure. Token deflation, routing, caching, and smarter architectures compound, and together they push costs down faster than most businesses' usage grows. A year-old cost model is essentially fiction. A two-year-old cost model is genuinely misleading.
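If you want to compute this for your own workloads, the metric is nothing exotic: everything a workload spent, including retries, divided by the number of completed outcomes. A sketch with hypothetical numbers:

```python
# Cost per outcome: total workload spend divided by successful outcomes.
# All token counts and prices below are hypothetical illustrations.
def cost_per_outcome(input_tokens, output_tokens, in_price, out_price,
                     outcomes, retries=0):
    # Retries burn tokens without adding outcomes, so they raise the metric.
    calls = outcomes + retries
    spend = calls * (input_tokens * in_price + output_tokens * out_price) / 1e6
    return spend / outcomes

# A summarization task: 3,000 input / 250 output tokens per call on a
# balanced-tier model at $0.25 / $1.25 per 1M tokens, 5% retry overhead:
print(f"${cost_per_outcome(3_000, 250, 0.25, 1.25, outcomes=1000, retries=50):.5f}")
```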
Where the savings come from
Decomposing the 76% cost-per-outcome drop for "product description summarization" into contributing factors, based on our routing data (a quick arithmetic check follows the list):
- Raw token price deflation: ~24 percentage points
- Routing to cheaper-but-adequate models: ~31 percentage points
- Prompt caching adoption: ~13 percentage points
- Output-length reduction via better prompting: ~8 percentage points
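The attribution is reported in additive percentage points of the overall -76% move, which makes the check trivial and also makes each factor's share of the total drop easy to read off:

```python
# The four factors above, in additive percentage points of the -76% move.
factors = {
    "token price deflation": 24,
    "routing to cheaper-but-adequate models": 31,
    "prompt caching": 13,
    "output-length reduction": 8,
}
total = sum(factors.values())
assert total == 76  # matches the headline drop for this workload

for name, points in factors.items():
    print(f"{name:40s} {points:2d} pts  ({points / total:.0%} of the drop)")
# Routing alone is ~41% of the total drop; raw price deflation is ~32%.
```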
Routing alone contributed more than raw price deflation. This is the shape we see across most workloads, and it's why the router became a wedge product in the first place. The teams that routed didn't just capture the deflation; they captured the deflation plus the mix-shift opportunity.
Predictions for 2027
Timestamped forecasts are what make industry reports worth reading a year later. Here are ours.
1. The average production AI app will use 6+ models by Q2 2027. We're at 4.1 today and the slope is steep. Model specialization plus tier-based routing keeps the count growing.
2. Routing crosses 60% adoption. From 38% today. Every large framework (LangChain, CrewAI, Vercel AI SDK) has first-class routing primitives now. The "build it yourself" friction is gone.
3. Token deflation persists at 30-45% YoY. We see no reason to expect a plateau. Capacity is growing faster than demand on most tiers except the absolute frontier.
4. "AI cost per feature" becomes a standard CFO metric. Driven by the observability tier 3 adoption wave. By end of 2027 we expect most Series B+ AI companies to report this internally.
5. At least one major outage event re-centers the conversation around multi-provider architecture. The market is three major frontier labs deep. At some point one of them has a rough day and everything depending on it does too.
6. Specialist models cross 15% of fabric tokens. They're the fastest-growing tier on our fabric and the math of "small focused model beats big general model" is well-understood now.
7. Outcome-based pricing from at least one frontier lab. Tokens as a billing unit are economically unstable for frontier providers. Expect experimentation with flat-rate, reservation, and outcome-based pricing models.
Ten quote-worthy lines we're willing to stand behind
Short, declarative things we believe, and have data for:
- The average production AI app runs on 4.1 models. Multi-model is the default now.
- Tokens are getting cheaper by ~42% a year. Your pricing should assume it.
- 62% of AI teams cannot tell you what any given feature costs to operate.
- The balanced tier is eating the world — 61% of routed tokens, up from 38% a year ago.
- Open-weight models hit 34% of production traffic. The ideological debate is over.
- Zero providers on our fabric had zero outages last quarter. Multi-provider is not premium; it's basic.
- Cost-per-outcome fell 60-80% in a year across most workload types.
- Routing contributed more to cost savings than raw price deflation did.
- Agent loops are 9% of requests and 27% of spend. That's where the money actually goes.
- Observability is the next bottleneck. You cannot optimize what you cannot see.
What this means if you're building
- Design for multi-model from day one. Single-provider is a transition state, not a destination.
- Instrument before you optimize. Every savings project is 10x more effective after you can see per-feature costs.
- Re-forecast quarterly. Your cost model has a 90-day half-life. Plan around it.
- Don't over-index on the frontier. Most of your traffic belongs on the balanced tier, and that truth is getting truer.
- Build failover. It's not a premium feature; it's what separates a product from an outage story.
Methodology & caveats
Two data sources feed this report.
Routing telemetry. About 1.2B billed tokens crossed our gateway in Q1 2026, spanning 10 providers and 45+ models. This gives us visibility into actual model choices, actual dispatched volumes, actual latencies, and actual costs. It does not give us visibility into traffic that never touched our fabric — so think of this as a sample of cost-sensitive, routing-curious teams, not a cross-section of the whole AI industry.
Onboarding interviews. We talk to teams during onboarding, during support escalations, and in quarterly check-ins. Over the past year we've catalogued roughly 620 of these conversations with light structured notes (model mix, observability tier, workflow types). This is where our adoption statistics come from. The sample skews toward teams that signed up for a router in the first place, which is a real selection bias.
A few caveats, stated plainly:
- Selection bias. Our customers are disproportionately routing-curious. Industry-wide routing adoption is likely lower than the 38% we quote; the directional shape still holds.
- Workload classification. Task categories are assigned by our classifier, whose accuracy is in the mid-90s on labeled test sets. Edge cases bleed between buckets.
- Cost-per-outcome calculations. We fix a workload definition and compute median cost for the cheapest model that cleared the quality bar. Different quality bars move the number. Ours is the kr-auto default, which most, though not all, of our customers accept.
- Regional skew. Our fabric routes primarily from US-East and EU-West. APAC and LATAM teams will see different latencies and, to a smaller degree, different mix patterns.
- Sample size variation. Adoption statistics are based on ~620 onboarding interviews. Routing statistics are based on billions of events. Treat the quantitative numbers (token volumes, tier share, latencies) with more confidence than the qualitative adoption statistics.
We'll publish this report annually, with quarterly updates on the most time-sensitive metrics in the LLM Cost Index and Provider Latency Leaderboard. If you're willing to share anonymized telemetry from your fabric, or want to review the methodology document in depth, reach out.
If you'd rather stop guessing at your own AI infrastructure and start measuring it, try the playground. The dashboard shows you everything this report covers, but about your own traffic.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
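In client code, that is a one-line base-URL change. A minimal sketch using the standard openai Python client; the base URL below is a placeholder, and the feature-tag header is a hypothetical convention for feeding per-feature dashboards:

```python
# Minimal sketch: point the standard openai client at an OpenAI-compatible
# routing endpoint. The base_url is a placeholder; "kr-auto" is the
# router's default quality-bar policy mentioned in the methodology notes.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.kairosroute.example/v1",  # hypothetical endpoint
    api_key="YOUR_KAIROSROUTE_KEY",
)

resp = client.chat.completions.create(
    model="kr-auto",  # let the router pick the cheapest model that clears the bar
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
    # Hypothetical header: tagging requests by feature is what makes
    # per-feature cost dashboards possible later.
    extra_headers={"X-Feature": "ticket-triage"},
)
print(resp.choices[0].message.content)
```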
Related Reading
- Quarterly benchmark of median $/1M tokens across 10 providers and 45+ models, broken down by tier and task type. Plus our first read on the token deflation rate.
- p50/p95/p99 time-to-first-token across 10 providers, regional variation, outage minutes, and a new latency-adjusted cost metric. Sourced from KairosRoute routing telemetry.
- Application performance monitoring gave every engineering team a dashboard for what their services are doing. Agent observability is the same shift, happening now, for AI-native products. Here is the thesis.