The Unit Economics of AI Agents: A Cost Model That Actually Works
There is a specific question every AI founder should be able to answer in their sleep: what is the model-API cost to serve one unit of value?
For a support agent, that's cost-per-ticket. For a code assistant, cost-per-PR. For a research tool, cost-per-report. For a coaching agent, cost-per-session. The unit varies; the discipline is the same. If you can't answer within ±30%, your unit economics aren't a spreadsheet — they're a vibe. And vibes don't scale.
This post walks through a cost model we've refined across dozens of KairosRoute customer calls. It generalizes; you can apply it to your product today.
Why agents are different
A single-shot chatbot makes one model call per user message. Easy cost math: one call × average tokens × model price = per-message cost. Multiply by messages/user. Done.
Agents don't work that way. An agent makes a loop of calls — plan, dispatch tools, observe, revise, plan, dispatch, observe, revise — until the goal is met. A 10x multiplier is common; 100x isn't rare. We have customers whose p95 agent trace hits 180 model calls.
This matters for two reasons. First, the absolute cost per unit is often much larger than people assume. A support ticket that felt like "one API call" is actually 30 API calls across reasoning, tool use, memory updates, and summary generation. Second, the variance is huge. A well-defined ticket might take 8 calls; an ambiguous one might take 80. If your pricing is flat, you eat the variance.
The cost model, in spreadsheet form
Five numbers define your per-unit cost:
- N = number of model calls per unit (avg, p50, p95).
- I = average input tokens per call.
- O = average output tokens per call.
- P_in = input price per 1M tokens for the routed model.
- P_out = output price per 1M tokens for the routed model.
Cost_per_unit = N × (I × P_in / 1M + O × P_out / 1M)
Example: a support agent
N = 25 calls/ticket (avg)
I = 1,200 tokens avg input (includes loaded context)
O = 240 tokens avg output
P_in = $3 /1M (Sonnet)
P_out = $15 /1M (Sonnet)
Cost/ticket = 25 × (1200 × 3/1M + 240 × 15/1M)
= 25 × ($0.0036 + $0.0036)
= 25 × $0.0072
= $0.18 per ticket
Gross margin at $4/ticket revenue = ($4 - $0.18) / $4 = 95.5%
Gross margin at $0.50/ticket revenue = ($0.50 - $0.18) / $0.50 = 64%
Gross margin at $0.20/ticket revenue = ($0.20 - $0.18) / $0.20 = 10% (DANGER)

This looks trivial. Most teams have never done it. Three reasons they dodge it: (1) they don't track N per unit, (2) their task distribution is skewed, so the average lies, (3) they haven't picked a stable routing policy, so prices change week to week.
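The five-number model is small enough to live in a spreadsheet, or in a few lines of code. A minimal sketch, using the support-agent numbers from above (Sonnet list prices):

```python
def cost_per_unit(n_calls, in_tokens, out_tokens, p_in, p_out):
    """Per-unit model-API cost. Prices are USD per 1M tokens."""
    per_call = in_tokens * p_in / 1e6 + out_tokens * p_out / 1e6
    return n_calls * per_call

def gross_margin(revenue, cost):
    return (revenue - cost) / revenue

# The support-agent example: 25 calls/ticket, 1,200 in / 240 out tokens per call.
ticket = cost_per_unit(n_calls=25, in_tokens=1200, out_tokens=240, p_in=3.0, p_out=15.0)
print(f"${ticket:.2f}/ticket")                       # $0.18/ticket
print(f"{gross_margin(4.00, ticket):.1%} margin")    # 95.5% margin at $4/ticket
```

Run the same function against your p50 and p95 call counts, not just the average; the next section shows why.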
The N distribution matters more than the average
We have a customer running a research agent. The average was 18 calls per research task, which looked reasonable: cost per task was ~$0.30. But the p95 was 180 calls, costing ~$3.10, against a $5/task customer tier. The long tail was eating 60% of the margin on 5% of tasks.
The fix was two-part. First, a call-budget safety net: agent tasks that exceed 50 calls trigger a human-in-the-loop check rather than burning another 130 calls. Second, routing — the long-tail tasks got downgraded from Opus to Sonnet once the classifier figured out the bulk of the work was synthesis, not frontier reasoning.
You can only make these fixes once you have the distribution visible. Not the average.
The tokens-per-call trap
A common self-deception: "my prompts are small." In an agent, the effective prompt on call N is the accumulated context from calls 1 through N-1. By the 20th step, your "small" prompt is 8K tokens of memory, tool outputs, and intermediate reasoning.
This is why input tokens dominate cost in most agent workloads. Output tokens are small and bounded; input tokens grow with trace depth. The two levers that actually move input tokens are:
- Summarization checkpoints. Every 10 steps, collapse the history to a running summary. This keeps per-call input roughly bounded instead of growing linearly with trace depth.
- Prompt caching. If 80% of the prompt is a stable system message + tool schemas, provider-side caching drops the effective input price by 80–90% on cache hits.
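The checkpoint lever can be sketched in a few lines. `summarize` here is a hypothetical helper (itself usually one cheap model call); the thresholds are illustrative:

```python
CHECKPOINT_EVERY = 10  # collapse history once it reaches this many turns
KEEP_RECENT = 2        # most recent turns stay verbatim

def compact_history(history, summarize):
    """Collapse older turns into one summary message so per-call input
    stays roughly flat instead of growing with trace depth.
    `summarize` is a hypothetical helper (usually a cheap model call)."""
    if len(history) < CHECKPOINT_EVERY:
        return history
    summary = summarize(history[:-KEEP_RECENT])
    return ([{"role": "system", "content": f"Summary so far: {summary}"}]
            + history[-KEEP_RECENT:])
```

Call it at the top of each loop iteration; by step 20 the agent carries one summary plus two recent turns instead of 8K tokens of raw history.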
We do semantic prompt caching at the routing layer. Cache hits return in under 20ms and cost roughly zero. On cache-heavy workloads, this is where the 85% end of the savings range comes from. Details in the RAG cost optimization pattern.
The routing lever
Once you have the cost model, you can ask a more interesting question: if I route this category of call to a cheaper model, does the quality hold?
In most agent loops, roughly 70% of the calls are mechanical — parse a tool response, compose a query, format an output, update memory. A $0.14/M DeepSeek or a $0.80/M Haiku handles these fine. The remaining 30% are the ones that actually need reasoning muscle. Route accordingly and the agent's effective weighted price drops by 3–5x with no measurable quality degradation.
| Scenario | Avg calls/ticket | Weighted $/1M in | Weighted $/1M out | Cost/ticket |
|---|---|---|---|---|
| All Opus | 25 | $15 | $75 | $0.90 |
| All Sonnet | 25 | $3 | $15 | $0.18 |
| Routed (70/30 Haiku/Sonnet) | 25 | $1.46 | $7.30 | $0.087 |
| Routed + cache hits on 40% of calls | 25 | $1.05 | $5.20 | $0.062 |
Same ticket. Same quality (measured by ticket-resolution rate and CSAT). Different cost. Routing + caching together is the lever.
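The blended prices in the routed rows fall out of a simple weighted average. A quick check of the 70/30 row, using the Haiku ($0.80/$4 per 1M) and Sonnet ($3/$15) prices from the table:

```python
def blended_price(mix):
    """mix: list of (share, p_in, p_out) per model; shares sum to 1.
    Returns the traffic-weighted input/output price per 1M tokens."""
    p_in = sum(share * pi for share, pi, _ in mix)
    p_out = sum(share * po for share, _, po in mix)
    return p_in, p_out

# 70% Haiku, 30% Sonnet -- the "Routed" row in the table.
p_in, p_out = blended_price([(0.7, 0.80, 4.0), (0.3, 3.0, 15.0)])
print(p_in, p_out)  # 1.46 7.3
```

Plug the blended prices back into the cost-per-unit formula and you reproduce the $0.087/ticket figure.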
Three cost-model traps to avoid
Trap 1: Averaging across tiers
If your free-tier users generate 80% of agent calls but only 10% of revenue, averaging gross margin across all users tells you nothing. Separate the cost model per tier. "Free-tier cost per user" and "paid-tier cost per user" are different numbers and they move for different reasons.
Trap 2: Ignoring hit rates
Revenue is per unit of value delivered. Cost is per attempt. If your agent succeeds 80% of the time and fails 20%, your cost per successful unit is 25% higher than your cost per attempt. Model that explicitly; it changes the price you need to charge.
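The adjustment is one division, but it belongs in the model explicitly. Using the $0.18/ticket figure from earlier and an 80% resolution rate:

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Failed attempts still burn tokens; revenue only arrives on successes."""
    return cost_per_attempt / success_rate

print(cost_per_success(0.18, 0.80))  # 0.225 -- 25% above the per-attempt cost
```

Price against cost-per-success, not cost-per-attempt, or the failure tax comes straight out of margin.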
Trap 3: Ignoring amortization
Some agents load an expensive RAG or indexing step on the first call per user-session, then ride on cheap follow-ups. Your cost-per-call model breaks here. Amortize setup costs across the session's expected call count. If a user abandons after one call, your amortized cost is much higher than the per-call average suggests.
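A minimal sketch of the amortization, with illustrative numbers (a $0.40 one-time indexing step, the $0.0072 per-call cost from the earlier example):

```python
def amortized_cost_per_call(setup_cost, per_call_cost, expected_calls):
    """Spread a one-time session setup (RAG indexing, etc.) over the
    number of calls the session is expected to make."""
    return setup_cost / expected_calls + per_call_cost

print(amortized_cost_per_call(0.40, 0.0072, 10))  # 0.0472 -- healthy session
print(amortized_cost_per_call(0.40, 0.0072, 1))   # 0.4072 -- one-call abandon
```

The gap between those two numbers is why expected session length belongs in the cost model, not just in the retention dashboard.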
The dashboard question to answer every week
Every week, one chart: cost-per-unit trendline for each major product line, overlaid with revenue-per-unit. Two lines per product. If the cost line crosses above a threshold you set (say, 30% of revenue), raise an alarm. Don't wait for the quarterly board deck.
We surface this in the KairosRoute dashboard out of the box — feature-level cost attribution updates continuously, and you can tag any API key with a feature_id to get the separation. More on the minimum observability stack.
Key takeaways
- Cost-per-unit is five numbers. Know them. Publish them. Update weekly.
- Agent costs live in the tail — p95, not avg, is where the margin bleeds.
- Input tokens grow with trace depth; output tokens don't. Optimize memory and caching.
- Routing the mechanical 70% to cheaper models is the biggest single cost lever.
- Separate cost models per tier, per feature, per agent. Don't average.
If you want this built for you, that's the product. Sign up and you'll see your cost-per-unit attribution within 24 hours of routing traffic through us.
Ready to route smarter?
KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.
Related Reading
The OpenAI invoice tells you what you spent. It does not tell you what it was spent on. Here is the observability gap that costs AI teams 30–50% of their margin, and the minimum stack to close it.
At 5K tickets your cost-per-ticket on a frontier model feels fine. At 100K, it is an existential threat. Here is the cost-per-ticket math, the quality guardrails, and the shadow-eval workflow that keeps CSAT up while you cut spend by 70%.
A seed-stage founder walked into a board meeting with an $80K/mo AI bill eating 40% of runway. Three days later, the number was $30K. The router was the easy part. Here is the full play.