Engineering · 11 min read

The Agent Telemetry Stack: What to Log and Where

Ask a platform engineer who has shipped LLM features at scale what they wish they'd built earlier, and you will hear some variation of: proper telemetry. Not "logs" — logs are too granular to aggregate and too coarse to replay. Telemetry, in the APM sense: structured traces, spans, metrics, and events, wired into something you can query.

This post gives you the schema. Copy it. Adapt it. It'll save you the six months of trial-and-error that every team before you went through.

The five layers

  1. Request trace — every LLM call, with provider, model, tokens, latency, cost.
  2. Agent trace — a parent span grouping all model calls in one agent run, with task metadata.
  3. Tool call span — each tool invocation inside an agent run.
  4. Quality signal events — user feedback, regenerates, downstream retries.
  5. Cost attribution — the rollup layer that joins all of the above into $/feature/user/day.

Layer 1: request trace schema

The minimum fields on every LLM request:

json
{
  "request_id": "req_01HZZZ...",
  "parent_agent_run_id": "agent_01HZZZ...",    // null for one-shot calls
  "user_id": "u_...",
  "workspace_id": "ws_...",
  "feature_id": "support-assistant",
  "task_type": "summarization",                 // from your classifier
  "provider": "anthropic",
  "model": "claude-haiku-4.5",
  "pinned": false,                              // true if user forced a model
  "input_tokens": 1247,
  "output_tokens": 183,
  "cached_tokens": 920,                         // prompt cache hits
  "prompt_cache_hit": true,
  "latency_ms": 812,
  "first_token_ms": 181,                        // TTFT
  "cost_usd": 0.00042,
  "fallback_hops": 0,
  "route_reason": "cheapest-meeting-floor-0.85",
  "classifier_confidence": 0.94,
  "finish_reason": "stop",
  "error": null,
  "timestamp": "2026-04-20T09:12:44.127Z"
}

Every field has a reason. feature_id and task_type are the two that most teams forget and most regret. Without them, you can't build the cost-per-feature and cost-per-task-type charts that drive routing decisions.
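
The `cost_usd` field is derived, not reported by the provider. A minimal sketch of the derivation, assuming illustrative per-million-token prices and a cached-input discount (the numbers below are placeholders, not real rates):

```python
# Illustrative price table: (input, cached_input, output) USD per million
# tokens. These numbers are assumptions for the sketch, not published rates.
PRICES_PER_MTOK = {
    "claude-haiku-4.5": (0.80, 0.08, 4.00),
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int,
                     cached_tokens: int = 0) -> float:
    """Compute the cost_usd field from the token counts on a request trace."""
    in_price, cached_price, out_price = PRICES_PER_MTOK[model]
    uncached = input_tokens - cached_tokens  # cache hits bill at the discount rate
    cost = (uncached * in_price
            + cached_tokens * cached_price
            + output_tokens * out_price) / 1_000_000
    return round(cost, 6)
```

Computing this at write time, rather than joining against a price table later, means the trace stays correct even after providers change their pricing.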

Layer 2: agent run trace

One row per agent run. All the request traces above join to this via parent_agent_run_id.

json
{
  "agent_run_id": "agent_01HZZZ...",
  "agent_type": "support-l1",
  "user_id": "u_...",
  "workspace_id": "ws_...",
  "started_at": "2026-04-20T09:12:40Z",
  "ended_at": "2026-04-20T09:13:55Z",
  "duration_ms": 74812,
  "total_calls": 17,
  "total_input_tokens": 22317,
  "total_output_tokens": 2931,
  "total_cost_usd": 0.031,
  "primary_task_type": "support-resolve",
  "outcome": "resolved",              // resolved|escalated|abandoned|errored
  "success_metric": 1.0,              // domain-specific
  "tool_calls_made": 8,
  "tool_calls_failed": 1,
  "frontier_retry": false,
  "user_feedback": null
}

The shape: start time, end time, outcome, resource totals, success metric. You can't compute cost-per-ticket without this row. You can't compute resolution rate without this row. It's the atomic unit of agent analytics.
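
The resource totals on that row can be rolled up mechanically from the request traces. A sketch, assuming the input is the list of request-trace dicts from Layer 1 (fields like `outcome` and `success_metric` come from the agent framework, not from this rollup):

```python
def build_agent_run_row(run_id: str, requests: list[dict]) -> dict:
    """Aggregate a run's request traces into the agent-run row.
    Outcome and success_metric are set by the agent framework separately."""
    requests = sorted(requests, key=lambda r: r["timestamp"])
    return {
        "agent_run_id": run_id,
        "started_at": requests[0]["timestamp"],
        "ended_at": requests[-1]["timestamp"],
        "total_calls": len(requests),
        "total_input_tokens": sum(r["input_tokens"] for r in requests),
        "total_output_tokens": sum(r["output_tokens"] for r in requests),
        "total_cost_usd": round(sum(r["cost_usd"] for r in requests), 6),
    }
```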

Layer 3: tool call span

Each invocation of a tool inside an agent run:

json
{
  "tool_span_id": "span_01HZZZ...",
  "parent_agent_run_id": "agent_01HZZZ...",
  "parent_request_id": "req_01HZZZ...",   // the LLM call that made this tool call
  "step_index": 4,
  "tool_name": "crm.get_customer",
  "arguments_valid": true,
  "arguments_json": "{...}",
  "latency_ms": 231,
  "return_status": "ok",                    // ok|error|timeout
  "agent_used_result": true,                // did the next step reference the result?
  "timestamp": "..."
}

The non-obvious field is agent_used_result. The model can call a tool, get a valid result back, and then ignore it in the next step. That's a regression signal — the model wasn't smart enough to use the data it asked for. Track it.
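
One crude but workable way to populate `agent_used_result` is to check whether the next model step echoes any distinctive token from the tool's output. This heuristic is an assumption for illustration; production systems may compare structured fields instead of raw text:

```python
import re

def _tokens(text: str, min_len: int) -> set[str]:
    """Lowercased word tokens of at least min_len characters."""
    return {t.lower() for t in re.findall(r"\w+", text) if len(t) >= min_len}

def agent_used_result(tool_result: str, next_step_text: str,
                      min_len: int = 4) -> bool:
    """Heuristic: did the next step reference anything from the tool result?
    Short tokens are dropped to cut down on accidental overlap."""
    return bool(_tokens(tool_result, min_len) & _tokens(next_step_text, min_len))
```

It will occasionally false-positive on common words, but as a trend metric across thousands of runs, that noise washes out.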

Layer 4: quality signal events

Sparse. One row per user-facing action that implies quality.

json
{
  "event_id": "evt_01HZZZ...",
  "agent_run_id": "agent_01HZZZ...",
  "event_type": "thumbs_down",               // or regenerate|abandon|thumbs_up|comment
  "payload": { "reason": "wrong answer" },
  "timestamp": "..."
}

Emit from the app layer, not from the LLM call. Quality is what the user experienced, not what the model said it did.

Layer 5: cost attribution rollup

Daily job. Join all of the above into a materialized view that answers the five questions CFOs, PMs, and on-call engineers care about:

  • How much did each feature spend today?
  • How much did each task type cost, per feature?
  • Which customers / workspaces are expensive?
  • What's the cost-per-outcome (per resolved ticket, per report, per session)?
  • How has cost-per-outcome moved week-over-week?

This is the layer that matters to leadership. Layers 1–4 are for engineers. Layer 5 is for the strategy meeting.
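
The join logic is simple enough to sketch in stdlib Python (a stand-in for the dbt/SQL materialization; in production this runs in the warehouse):

```python
from collections import defaultdict

def daily_rollup(request_traces: list[dict], agent_runs: list[dict]) -> dict:
    """Answer the first, second, and fourth questions above:
    cost per feature, cost per (feature, task_type), cost per resolved run."""
    cost_by_feature = defaultdict(float)
    cost_by_feature_task = defaultdict(float)
    for r in request_traces:
        cost_by_feature[r["feature_id"]] += r["cost_usd"]
        cost_by_feature_task[(r["feature_id"], r["task_type"])] += r["cost_usd"]
    resolved = sum(1 for run in agent_runs if run["outcome"] == "resolved")
    total_cost = sum(run["total_cost_usd"] for run in agent_runs)
    return {
        "cost_by_feature": dict(cost_by_feature),
        "cost_by_feature_task": dict(cost_by_feature_task),
        "cost_per_resolved": round(total_cost / resolved, 6) if resolved else None,
    }
```

The week-over-week question is just this job's output compared against itself seven days apart.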

Where to store each layer

Tooling opinions. Swap in your stack's equivalent.

  • Request traces + tool spans → OpenTelemetry spans to your APM (Datadog, Honeycomb, Grafana Tempo). Also pipe to a columnar store (ClickHouse, BigQuery) for ad-hoc queries. APM is for real-time investigations; columnar is for analytics.
  • Agent runs → Your warehouse (BigQuery, Snowflake). One row per run. Joins to request traces by agent_run_id.
  • Quality events → Segment / Rudderstack / Posthog → warehouse.
  • Cost rollup → dbt-style materialization in the warehouse. Refresh daily.

The single most common mistake: putting everything in Datadog and nothing in a warehouse. Datadog is great for p95 latency alerts; it is terrible for "show me the 30-day trend of cost-per-feature split by customer tier."

OpenTelemetry conventions (emerging)

The OpenTelemetry community has been converging on semantic conventions for LLM calls — attributes like gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens. If you're starting fresh, follow those. You get free integration with whatever APM you're on, and your future self thanks you when the standard solidifies.

The conventions don't yet cover agent runs or tool-call spans; the schemas above map onto their structure as parent/child spans with custom attributes.
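
A sketch of that mapping: request-trace fields that have a `gen_ai.*` convention name use it, everything else falls back to a custom namespace. The `gen_ai.*` keys below follow the still-evolving spec; the `kairos.*` prefix is an assumed custom namespace for the uncovered fields:

```python
# Convention keys per the emerging gen_ai.* semantic conventions; the
# kairos.* fallback namespace is an assumption for fields the spec
# does not yet cover.
GENAI_ATTR_MAP = {
    "provider": "gen_ai.system",
    "model": "gen_ai.request.model",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
}

def to_otel_attributes(trace_row: dict) -> dict:
    """Convert a Layer-1 request trace into OTel span attributes."""
    attrs = {}
    for field, value in trace_row.items():
        if value is None:
            continue  # OTel attributes must be non-null
        attrs[GENAI_ATTR_MAP.get(field, f"kairos.{field}")] = value
    return attrs
```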

Common pitfalls

Over-logging the prompt

Don't dump full prompts and completions into your primary trace store. Pricey, slow, and a PII nightmare. Instead: keep a hash, keep the first 200 chars for debuggability, and put the full payload in a separate PII-scoped bucket with a short retention. The APM and warehouse should never hold raw user prompts by default.
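
What the trace row carries instead of the raw prompt, as a small sketch:

```python
import hashlib

def redact_prompt(prompt: str, preview_chars: int = 200) -> dict:
    """Fields safe for the primary trace store: a stable hash for
    dedup/joins against the PII-scoped bucket, a short preview for
    debugging, and the original length."""
    return {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_preview": prompt[:preview_chars],
        "prompt_chars": len(prompt),
    }
```

The hash is what lets you later answer "how many requests reused this exact prompt" without ever storing the prompt in the warehouse.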

Sampling at the wrong layer

Don't sample request traces before cost rollup. You'll produce wrong cost attribution. Sample tool-call spans by all means if you're high volume; keep request traces complete. They're tiny rows.
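
If you do sample tool-call spans, sample deterministically by run id so a run's spans are kept or dropped together (partial runs are useless for debugging). A sketch of one way to do it:

```python
import zlib

def keep_tool_span(agent_run_id: str, sample_rate: float) -> bool:
    """Deterministic head sampling keyed on the run id: every span in a
    run gets the same keep/drop decision. Request traces bypass this
    entirely, since they feed cost attribution."""
    bucket = zlib.crc32(agent_run_id.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000
```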

Missing fields

A trace without feature_id, user_id, or task_type is a trace you can't aggregate. Make those three mandatory in your LLM client wrapper: required parameters that fail hard at call time, not warnings buried in a log.
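
A hypothetical wrapper-level guard, sketched in a few lines:

```python
REQUIRED_FIELDS = ("feature_id", "user_id", "task_type")

def validate_trace(trace: dict) -> dict:
    """Refuse to emit a trace missing any of the three mandatory
    aggregation keys. Lives inside the LLM client wrapper, so a bad
    call fails immediately instead of producing an unjoinable row."""
    missing = [f for f in REQUIRED_FIELDS if not trace.get(f)]
    if missing:
        raise ValueError(f"trace missing required fields: {missing}")
    return trace
```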

No classifier = no task_type

Most teams punt on classification because the classifier is a whole project. That leaves task_type unpopulated and the cost-per-task-type chart impossible. If you're not ready to build a classifier, use a zero-shot LLM classifier (GPT-5.4-mini or a small Claude model) for asynchronous post-hoc tagging, not inline on the hot path. It's cheap because it's offline.

What you get from KairosRoute if you route through us

All five layers above, populated automatically, surfaced in the dashboard, and exportable to your warehouse via a webhook or a daily Parquet drop. You still populate feature_id and user_id (via header x-kr-feature-id / x-kr-user-id), because only you know what those mean. Everything else — classifier output, provider, model, tokens, cost, cache hits, fallback hops, latency — is included automatically.

For the conceptual framing of why this matters, see Flying Blind on LLM Costs. For a deeper dive on the "LLM APM" thesis, Agent Observability Is the New APM.

Ready to route smarter?

KairosRoute gives you a single OpenAI-compatible endpoint that routes every request to the cheapest model meeting your quality bar — plus the observability, A/B testing, and cost analytics that turn cheaper infrastructure into a durable margin.

Related Reading

You're Flying Blind on LLM Costs (And It's Expensive)

The OpenAI invoice tells you what you spent. It does not tell you what it was spent on. Here is the observability gap that costs AI teams 30–50% of their margin, and the minimum stack to close it.

Silent Quality Regression: The LLM Bug You Never Notice

Your model bill went down 20%. Nobody complained. Three weeks later, your agent's resolution rate has quietly dropped 12%. This is silent quality regression — and it is the single most dangerous failure mode in LLM ops.

Agent Observability Is the New APM

Application performance monitoring gave every engineering team a dashboard for what their services are doing. Agent observability is the same shift, happening now, for AI-native products. Here is the thesis.