2026-05-19 HexSaga

Observability for AI Agents: Logs, Traces, Tokens, and Error Types

A practical guide to AI agent observability, covering request ids, traces, token ledgers, tool calls, error categories, retries, timelines, and retention policy.

When a traditional API fails, you can usually inspect request logs, database queries, stack traces, and metrics. When an AI agent fails without observability, the diagnosis often becomes: "It did something strange."

That is not enough.

An agent is not just a chat completion. One user request may contain several model calls, tool calls, retrieval steps, planning, reflection, retries, permission checks, and a final answer. If you only record the final response, you cannot explain why it happened, how much it cost, where it failed, or whether it can be reproduced.

Track the run, not just the message

The core unit of agent observability should be the run.

A run represents one user intent from start to finish. It may contain one model call or many steps. Every run should have a stable request_id or trace_id.

Think of the structure like this:

run
  step 1: model call
  step 2: tool call
  step 3: model call
  step 4: final answer

If your system already uses distributed tracing, connect the agent run to that trace. The user request, business service, model gateway, tool service, and database query should be linkable. If you do not have tracing yet, at least make sure every log line carries the same request id.
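As a minimal sketch of the fallback described above (no tracing system, but every log line carries the same request id), the following uses Python's standard `logging` and `contextvars` modules. The names `current_request_id`, `RequestIdFilter`, `build_logger`, and `handle_run` are hypothetical, and the step messages are placeholders rather than real model or tool calls.

```python
import contextvars
import io
import logging
import uuid

# Holds the id of the run currently being processed (hypothetical name).
current_request_id = contextvars.ContextVar("current_request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = current_request_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose format starts with the request id."""
    logger = logging.getLogger("agent")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    logger.handlers = [handler]
    return logger


def handle_run(logger: logging.Logger, user_message: str) -> str:
    """Assign one request id to the run and log each placeholder step."""
    request_id = uuid.uuid4().hex
    token = current_request_id.set(request_id)
    try:
        logger.info("run started, input_chars=%d", len(user_message))
        logger.info("step 1: model call")
        logger.info("step 2: tool call")
        logger.info("step 3: final answer")
        return request_id
    finally:
        # Restore the previous value so ids never leak across runs.
        current_request_id.reset(token)


buffer = io.StringIO()
agent_logger = build_logger(buffer)
run_id = handle_run(agent_logger, "where is my order?")
print(buffer.getvalue())
```

Because the id lives in a context variable, code deeper in the call stack can log without receiving the id as an argument. Passing the same id to downstream services is a separate step that this sketch does not cover.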

Record summaries, outputs, and status for every step

Do not automatically store every full prompt, tool result, and model output forever. That creates privacy, compliance, and storage risk. But recording no content at all makes debugging difficult.

A practical middle ground is structured metadata plus redacted summaries:

Field           Purpose
step_id         step number inside the run
step_type       model, tool, retrieval, policy, final
input_summary   redacted summary or hash of the input
output_summary  redacted summary or structured result
status          success, failed, skipped, retried
latency_ms      time spent in this step
error_type      normalized error category

In debugging environments or user-approved sessions, you may temporarily store full content. Production defaults should be more conservative: keep structured fields that locate the problem, not permanent copies of sensitive text.

This belongs in the team's AI policy as well. See A Practical Team AI Usage Policy.
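One way to represent the step fields listed above is a small dataclass plus a summarizing helper. This is a sketch under assumptions: the `summarize` helper, its email-only redaction rule, and the 80-character truncation are illustrative choices, not a complete redaction strategy.

```python
import hashlib
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional

# Illustrative redaction rule: only email addresses are masked here.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def summarize(text: str, max_chars: int = 80) -> str:
    """Redact emails, truncate, and append a short hash of the original text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    redacted = EMAIL_RE.sub("[email]", text)
    return f"{redacted[:max_chars]} (sha256:{digest})"


@dataclass
class StepRecord:
    step_id: int
    step_type: str        # model, tool, retrieval, policy, final
    input_summary: str
    output_summary: str
    status: str           # success, failed, skipped, retried
    latency_ms: int
    error_type: Optional[str] = None


record = StepRecord(
    step_id=2,
    step_type="tool",
    input_summary=summarize("look up orders for alice@example.com"),
    output_summary=summarize("3 records"),
    status="success",
    latency_ms=120,
)
print(json.dumps(asdict(record), indent=2))
```

The hash lets two steps be compared for identical input without keeping the input itself. It is computed over the unredacted text, so treat it as sensitive if the input space is small enough to guess.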

Token ledgers need both run totals and step detail

Agent cost is often surprising because one visible user request can hide many model calls. The final answer's token count does not explain the spend.

Record usage at two levels:

  • Step usage: input tokens, output tokens, cache fields, model name, and estimated cost for each model call.
  • Run usage: total input, total output, total cost, number of model calls, and the most expensive step.

This lets you answer:

  • Which tool result caused the context to grow?
  • Which model step produced too much output?
  • Which task type retries most often?
  • Is the cost driven by user input or by the agent's intermediate work?
  • Is spend high because of model choice or because the loop ran too many times?
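The two-level ledger can be sketched as per-step usage records rolled up into a run summary. The model names and prices below are made-up placeholders, and cached tokens are recorded but not priced, because pricing rules for cache fields vary by provider.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StepUsage:
    step_id: int
    model: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0  # recorded only; not priced in this sketch


# Hypothetical prices in USD per million tokens: (input, output).
PRICES: Dict[str, tuple] = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}


def step_cost(usage: StepUsage) -> float:
    """Estimated cost of one model call in USD."""
    in_price, out_price = PRICES[usage.model]
    return (usage.input_tokens * in_price + usage.output_tokens * out_price) / 1_000_000


def summarize_run(steps: List[StepUsage]) -> dict:
    """Roll step usage up into run totals and find the most expensive step."""
    costs = {s.step_id: step_cost(s) for s in steps}
    most_expensive: Optional[int] = max(costs, key=costs.get) if costs else None
    return {
        "model_calls": len(steps),
        "total_input_tokens": sum(s.input_tokens for s in steps),
        "total_output_tokens": sum(s.output_tokens for s in steps),
        "total_cost_usd": round(sum(costs.values()), 6),
        "most_expensive_step": most_expensive,
    }


steps = [
    StepUsage(step_id=1, model="small-model", input_tokens=1_000, output_tokens=200),
    StepUsage(step_id=2, model="large-model", input_tokens=20_000, output_tokens=500),
    StepUsage(step_id=3, model="small-model", input_tokens=1_500, output_tokens=300),
]
run_usage = summarize_run(steps)
print(run_usage)
```

In this made-up run, step 2 dominates the cost because its input is much larger than the others, which is the kind of pattern a tool result inflating the context would produce.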

If you already manage AI API spend, this ledger should connect to the budgets, limits, and usage logs described in An AI API Cost Control Playbook.

Error categories matter more than raw error text

Agent failures are varied. If you only keep raw error messages, you will struggle to measure and fix patterns.

Start with a small normalized set:

Error type          Meaning
provider_error      model provider returned an error or timed out
rate_limited        quota or rate limit blocked the call
policy_denied       permission, data, or safety policy rejected the step
tool_error          tool execution failed
tool_bad_input      the agent passed invalid arguments to a tool
parse_error         model output could not be parsed
validation_failed   output parsed but failed business validation
max_steps_exceeded  the run hit its step limit
user_cancelled      user cancelled the run

With categories, you can see trends. A high tool_bad_input rate may mean the tool schema is unclear or the prompt does not constrain arguments. A high max_steps_exceeded rate may mean the agent lacks a convergence condition. A high policy_denied rate may mean the product flow does not explain permission boundaries early enough.
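A sketch of how raw exceptions might be normalized into these categories and then counted. The exception classes `PolicyDenied`, `ToolBadInput`, and `MaxStepsExceeded` are hypothetical, and mapping every `TimeoutError` to `provider_error` is an assumption; a real system would map its own SDK and tool exceptions. Unmapped exceptions return `None` so they surface for triage instead of being guessed.

```python
import json
from collections import Counter
from enum import Enum
from typing import Optional


class ErrorType(str, Enum):
    PROVIDER_ERROR = "provider_error"
    RATE_LIMITED = "rate_limited"
    POLICY_DENIED = "policy_denied"
    TOOL_ERROR = "tool_error"
    TOOL_BAD_INPUT = "tool_bad_input"
    PARSE_ERROR = "parse_error"
    VALIDATION_FAILED = "validation_failed"
    MAX_STEPS_EXCEEDED = "max_steps_exceeded"
    USER_CANCELLED = "user_cancelled"


# Hypothetical application exceptions, defined here for illustration only.
class PolicyDenied(Exception): ...
class ToolBadInput(Exception): ...
class MaxStepsExceeded(Exception): ...


# Checked in order; the first matching class wins.
EXCEPTION_MAP = [
    (PolicyDenied, ErrorType.POLICY_DENIED),
    (ToolBadInput, ErrorType.TOOL_BAD_INPUT),
    (MaxStepsExceeded, ErrorType.MAX_STEPS_EXCEEDED),
    (json.JSONDecodeError, ErrorType.PARSE_ERROR),
    (TimeoutError, ErrorType.PROVIDER_ERROR),  # assumption: timeouts come from the provider
]


def classify(exc: Exception) -> Optional[ErrorType]:
    """Return the normalized category, or None if the exception is unmapped."""
    for exc_class, error_type in EXCEPTION_MAP:
        if isinstance(exc, exc_class):
            return error_type
    return None


failures = [
    ToolBadInput("missing order_id"),
    ToolBadInput("order_id must be a string"),
    MaxStepsExceeded("limit 10"),
    json.JSONDecodeError("Expecting value", "not json", 0),
]
counts = Counter(classify(exc).value for exc in failures)
print(counts.most_common())
```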

Tool calls must be auditable

The highest risk is often not that the agent thought something. It is that the agent did something. Sending email, changing configuration, querying internal data, calling production APIs, committing code, or creating orders must be auditable.

For each tool call, record:

  • tool name and version
  • redacted arguments
  • user or actor represented by the call
  • permission check result
  • external system status
  • idempotency key or business object id
  • whether the call performed a write
  • whether retry is allowed

Write actions deserve special care. Observability is not only for debugging. It is also for audit: who asked the agent to do this, which arguments were passed, why did the system allow it, and which object changed?
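The checklist above could be captured by a wrapper that writes an audit record around every tool call. This is a sketch, not a complete audit system: `call_tool`, `AUDIT_LOG`, the `SENSITIVE_KEYS` set, and the status strings are hypothetical, the permission decision is passed in rather than computed, and an in-memory list stands in for durable storage.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolCallAudit:
    tool_name: str
    tool_version: str
    redacted_args: Dict[str, Any]
    actor: str                      # user or actor represented by the call
    permission_allowed: bool
    is_write: bool
    retry_allowed: bool
    idempotency_key: Optional[str] = None
    external_status: Optional[str] = None


AUDIT_LOG: List[ToolCallAudit] = []          # stand-in for durable storage
SENSITIVE_KEYS = {"email", "card_number", "token"}  # illustrative only


def redact(args: Dict[str, Any]) -> Dict[str, Any]:
    """Mask argument values whose keys are considered sensitive."""
    return {k: ("[redacted]" if k in SENSITIVE_KEYS else v) for k, v in args.items()}


def call_tool(tool: Callable[..., Any], *, name: str, version: str,
              args: Dict[str, Any], actor: str, allowed: bool,
              is_write: bool, retry_allowed: bool = False) -> Any:
    """Run a tool and record who called it, with what, and what happened."""
    record = ToolCallAudit(
        tool_name=name, tool_version=version, redacted_args=redact(args),
        actor=actor, permission_allowed=allowed, is_write=is_write,
        retry_allowed=retry_allowed,
        idempotency_key=uuid.uuid4().hex if is_write else None,
    )
    AUDIT_LOG.append(record)  # appended first so denied calls are audited too
    if not allowed:
        record.external_status = "not_called"
        return None
    try:
        result = tool(**args)
        record.external_status = "ok"
        return result
    except Exception:
        record.external_status = "error"
        raise


def refund(order_id: str, email: str) -> str:
    """Placeholder write tool."""
    return f"refunded {order_id}"


result = call_tool(refund, name="refund", version="1",
                   args={"order_id": "o1", "email": "a@example.com"},
                   actor="user_123", allowed=True, is_write=True)
print(result, AUDIT_LOG[-1])
```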

A trace should read like a timeline

A useful agent trace should not be only a pile of JSON. It should read like a timeline:

10:00:01 run started by user_123
10:00:02 model chose tool: search_orders
10:00:03 tool search_orders success, 3 records
10:00:04 model chose tool: refund_check
10:00:05 policy_denied: user lacks refund permission
10:00:06 final answer returned

The timeline helps engineers, product managers, and operators understand what happened quickly. The underlying fields still need to be structured for search, aggregation, and alerts.
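To illustrate keeping structured fields underneath a readable view, this sketch renders a list of event dictionaries as the timeline shown above. The event `type` values and field names are hypothetical; the point is that the text is derived from structured data rather than stored in its place.

```python
from datetime import datetime
from typing import Dict, List


def describe(event: Dict) -> str:
    """Turn one structured event into a short human-readable phrase."""
    kind = event["type"]
    if kind == "run_started":
        return f"run started by {event['actor']}"
    if kind == "tool_chosen":
        return f"model chose tool: {event['tool']}"
    if kind == "tool_result":
        return f"tool {event['tool']} {event['status']}, {event['records']} records"
    if kind == "policy_denied":
        return f"policy_denied: {event['reason']}"
    if kind == "final":
        return "final answer returned"
    return kind  # fall back to the raw type for unknown events


def render_timeline(events: List[Dict]) -> str:
    """Sort events by timestamp and render one line per event."""
    ordered = sorted(events, key=lambda e: e["ts"])
    return "\n".join(f"{e['ts'].strftime('%H:%M:%S')} {describe(e)}" for e in ordered)


def at(second: int) -> datetime:
    """Helper for the example data: 10:00:SS on an arbitrary date."""
    return datetime(2026, 1, 1, 10, 0, second)


events = [
    {"ts": at(1), "type": "run_started", "actor": "user_123"},
    {"ts": at(2), "type": "tool_chosen", "tool": "search_orders"},
    {"ts": at(3), "type": "tool_result", "tool": "search_orders",
     "status": "success", "records": 3},
    {"ts": at(4), "type": "tool_chosen", "tool": "refund_check"},
    {"ts": at(5), "type": "policy_denied", "reason": "user lacks refund permission"},
    {"ts": at(6), "type": "final"},
]
timeline = render_timeline(events)
print(timeline)
```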

If only developers can read the trace, incident response will be slow. Agents often affect business workflows, so non-engineering roles also need a basic view of the process.

Alerts should watch more than error rate

Traditional services watch error rate, latency, and QPS. Agents need additional signals:

  • average model calls per run
  • average tokens and cost per run
  • max-step hit rate
  • parse error and validation failure rate
  • tool_bad_input rate
  • policy_denied rate
  • number of high-cost runs
  • failures and retries for write-action tools

These signals can reveal problems before ordinary 500 errors do. Error rate may stay flat while average steps grow from 3 to 9, increasing both cost and latency. Parse errors may rise when output format becomes unstable.
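A rough sketch of computing a few of these signals from run records and flagging drift against a baseline window. The record fields, the `HIGH_COST_USD` threshold, and the "at least 2x the baseline" rule are all assumptions chosen for illustration; real alerting would usually live in a metrics system rather than application code.

```python
from statistics import mean
from typing import Dict, List

HIGH_COST_USD = 0.50  # hypothetical threshold for a "high-cost run"


def run_signals(runs: List[Dict]) -> Dict[str, float]:
    """Aggregate agent-specific signals over a non-empty window of runs."""
    if not runs:
        raise ValueError("need at least one run")
    total = len(runs)

    def rate(error_type: str) -> float:
        return sum(r["error_type"] == error_type for r in runs) / total

    return {
        "avg_model_calls": mean(r["model_calls"] for r in runs),
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "max_step_hit_rate": rate("max_steps_exceeded"),
        "tool_bad_input_rate": rate("tool_bad_input"),
        "policy_denied_rate": rate("policy_denied"),
        "high_cost_runs": sum(r["cost_usd"] > HIGH_COST_USD for r in runs),
    }


def drifted(current: Dict[str, float], baseline: Dict[str, float],
            factor: float = 2.0) -> List[str]:
    """Names of signals that grew to at least `factor` times the baseline."""
    return [
        name for name, value in current.items()
        if baseline.get(name, 0) > 0 and value >= factor * baseline[name]
    ]


# No run fails in either window, yet the average step count triples.
baseline_runs = [{"model_calls": 3, "cost_usd": 0.01, "error_type": None}] * 4
current_runs = [{"model_calls": 9, "cost_usd": 0.03, "error_type": None}] * 4

alerts = drifted(run_signals(current_runs), run_signals(baseline_runs))
print(alerts)
```

The example mirrors the case in the text: the error-based rates are zero in both windows, so an error-rate alert would stay silent while calls and cost per run grow.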

Retention needs tiers

Observability data is not better just because you store more of it.

Use retention tiers:

  • Metric aggregates: keep long term for trend analysis.
  • Usage records: keep longer for billing, audit, and cost analysis.
  • Trace metadata: keep for a medium period for debugging and quality review.
  • Full prompts, outputs, and tool results: short term by default, redacted, sampled, or enabled only when needed.

The exact retention period depends on business, compliance, and cost. The important principle is not to permanently store user input, internal documents, and tool results by accident.
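The tiers can be expressed as a small configuration that a cleanup job consults. The periods below are placeholders only, not recommendations; as noted above, the real values depend on business, compliance, and cost. The tier names and the `expires_at` and `is_expired` helpers are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Placeholder retention periods per tier; replace with values your
# compliance and cost constraints actually require.
RETENTION = {
    "metric_aggregate": timedelta(days=730),  # long term, trend analysis
    "usage_record": timedelta(days=400),      # billing, audit, cost analysis
    "trace_metadata": timedelta(days=90),     # debugging, quality review
    "full_content": timedelta(days=7),        # prompts, outputs, tool results
}


def expires_at(tier: str, created_at: datetime) -> datetime:
    """When a record in the given tier becomes eligible for deletion."""
    return created_at + RETENTION[tier]


def is_expired(tier: str, created_at: datetime, now: datetime) -> bool:
    """True if the record has outlived its tier's retention period."""
    return now >= expires_at(tier, created_at)


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
now = datetime(2026, 2, 1, tzinfo=timezone.utc)
for tier in RETENTION:
    print(tier, is_expired(tier, created, now))
```

An unknown tier raises `KeyError` here, which forces new data types to be assigned a tier explicitly instead of being kept forever by default.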

Bottom line

AI agent observability is about making automated execution explainable.

Use the run as the tracing unit. Connect model calls, tools, and business systems with a request id or trace id. Record step status, tokens, latency, and error category. Keep tool calls auditable. Present traces as timelines. Design retention by sensitivity.

Without this foundation, it is hard to know whether a failure came from the model, prompt, tool, permission layer, retry behavior, or business data. Observability is not decoration after launch. It is part of making agents production-ready.