Distributed Tracing for Multi-Agent Systems
When five agents collaborate, a single trace is the only way to debug. The instrumentation, the span layout, and the queries that find the slow specialist.
Why one trace
When five agents collaborate, per-agent logs are not enough: debugging requires a single trace that shows every span. OpenTelemetry (OTel) is the standard, so use it. The agent framework should integrate with OTel by default; if it does not, wrap it. Create one trace per user-visible operation: the triage-then-remediate flow is one trace with a sub-span per agent.
- Single trace for multi-agent. Per-agent logs insufficient; one trace shows all spans.
- OpenTelemetry standard. Use it; framework integrates by default or wrap it.
- One trace per user operation. Triage-then-remediate is one trace with agent sub-spans.
- Per-flow trace boundary. The user-visible flow defines the trace; agents are sub-spans.
Span layout
The span hierarchy has three levels. The root span is the user-visible operation (“handle alert X”); its direct children are the agent invocations; the grandchildren are the tool calls and model calls inside each agent. Each span carries agent_role, model_name, tokens, and cost as attributes, which makes per-span analysis trivial. Use semantic conventions where they exist, such as the OTel semantic conventions for generative AI.
- Root: user-visible operation. “Handle alert X”; the trace boundary.
- Children: agent invocations. Each agent gets a span.
- Grandchildren: tools and models. Inside-agent calls; the deep detail.
- Per-span attributes: role, model, tokens, cost. Makes per-span analysis trivial.
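The three-level layout can be sketched without any tracing library. The following is a minimal illustration, not the OTel API: `Span`, `start_span`, `FINISHED`, and `handle_alert` are hypothetical names, and the model name, token count, and cost are placeholder values. In a real system the OTel SDK would supply the span objects and the parent tracking.

```python
import contextvars
import secrets
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical in-memory span record (stand-in for an SDK span)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)


FINISHED = []  # spans, in the order they finish
_current = contextvars.ContextVar("current_span", default=None)


@contextmanager
def start_span(name, **attributes):
    """Open a span as a child of whichever span is currently active."""
    parent = _current.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
    )
    token = _current.set(span)
    try:
        yield span
    finally:
        _current.reset(token)
        FINISHED.append(span)


def handle_alert(alert_id):
    """Root = user-visible operation; children = agents; grandchildren = calls."""
    with start_span("handle alert", alert_id=alert_id) as root:
        for role in ("triage", "remediate"):
            with start_span("agent.invoke", agent_role=role):
                # Placeholder values: model name, tokens, and cost are made up.
                with start_span("model.call", agent_role=role,
                                model_name="example-model", tokens=800, cost=0.004):
                    pass
                with start_span("tool.call", agent_role=role,
                                tool_name="example_tool"):
                    pass
    return root
```

Calling `handle_alert("X")` leaves seven spans in `FINISHED`, all sharing one trace_id: one root, two agent invocations, and four model or tool calls.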
Useful queries
Three queries dominate trace usage. “Slow agent runs” selects traces with duration above p99; the slow runs are where bugs hide. “Specialist used most often” counts spans by agent_role, which shows which specialists the system depends on most. “Trace with error” selects traces with at least one error span; that error span is the starting point for debugging.
- Slow runs > p99. Where bugs hide; the debugging starting point.
- Span count by role. Dependency graph between specialists; the architecture view.
- Traces with error span. Debugging starting point; the failure surface.
- Queries stored as views. Commit each query as a saved view so it can be reused.
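The three queries can be sketched over a flat span export. This is an illustration only: the record shape (dicts with `trace_id`, `parent_id`, `agent_role`, `duration_ms`, `error`) is an assumption, and a real trace backend would express the same logic in its own query language. The percentile uses the nearest-rank method, so a p99 cutoff only selects anything once there are more than 100 traces.

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct/100 * n)
    return ordered[rank - 1]


def slow_traces(spans, pct=99):
    """Trace IDs whose root-span duration exceeds the given percentile."""
    durations = {
        s["trace_id"]: s["duration_ms"]
        for s in spans
        if s["parent_id"] is None  # root span carries the whole-trace duration
    }
    if not durations:
        return []
    threshold = percentile(durations.values(), pct)
    return sorted(t for t, d in durations.items() if d > threshold)


def span_count_by_role(spans):
    """Number of spans per agent_role, ignoring spans without a role."""
    return Counter(s["agent_role"] for s in spans if s.get("agent_role"))


def traces_with_error(spans):
    """Trace IDs that contain at least one error span."""
    return sorted({s["trace_id"] for s in spans if s.get("error")})
```

Keeping each query as a small named function mirrors the idea of storing queries as views: the definition is committed once and reused.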
Context propagation
Propagation keeps the trace whole. Trace context (trace_id, span_id) is passed between agents: every inter-agent message carries it, and every tool call inherits it. Propagation breaks when an agent runs in a separate process and does not receive the context, so use the standard OTel propagators (W3C TraceContext) to keep everything connected. Test propagation, because a trace that drops a span at an agent boundary is a broken integration.
- Inter-agent context. trace_id, span_id; every message carries it; every tool call inherits it.
- OTel propagators. W3C TraceContext; standard for cross-process.
- Test propagation. Span-drop at agent boundary is broken integration; fix early.
- Per-boundary integration test. Verify each agent boundary separately so propagation stays correct as the system changes.
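To show what crosses an agent boundary, here is a hand-rolled sketch of the W3C TraceContext `traceparent` header (version 00: `00-<trace_id>-<parent_id>-<flags>`). In practice the OTel propagators do this work; the helpers `inject`, `extract`, and `receive`, and the message shape with a `headers` dict, are assumptions made for illustration.

```python
import re
import secrets

# Version 00 only: 32 hex chars of trace_id, 16 of parent_id, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id, span_id, message, sampled=True):
    """Return a copy of the message with a traceparent header attached."""
    headers = dict(message.get("headers", {}))
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return {**message, "headers": headers}


def extract(message):
    """Parse the traceparent header; return None if missing or invalid."""
    value = message.get("headers", {}).get("traceparent", "")
    match = _TRACEPARENT.match(value)
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid in the W3C format
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def receive(message):
    """Receiving agent: continue the caller's trace, or start a new one."""
    ctx = extract(message)
    return {
        "trace_id": ctx["trace_id"] if ctx else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": ctx["parent_id"] if ctx else None,
    }
```

A per-boundary test then amounts to sending a message through `inject` and asserting that the span created by `receive` has the sender's trace_id and the sender's span_id as its parent. A message without the header produces a span with no parent, which is exactly the dropped-span symptom described above.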
Cost of tracing
Tracing has predictable costs. Sampling: use 10% trace sampling for high-volume agents and 100% for low-volume ones; 10% is fine for aggregates, while 100% is needed for debugging individual issues. Storage: traces are voluminous, so bound retention at 30 days hot and 90 days warm. Latency: tracing adds <5% overhead when configured correctly, but bad configurations can add 30%, so profile and tune.
- 10% high-volume, 100% low-volume. 10% is enough for aggregates; debugging individual issues needs 100%.
- Storage retention. 30 days hot, 90 days warm; the standard window.
- <5% overhead well-configured. 30% with bad configuration; profile and tune.
- Per-deploy overhead check. Measure tracing overhead on each deploy to catch regressions.
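The sampling split can be sketched as a deterministic decision keyed on the trace ID, so every agent that sees the same trace makes the same choice. This is similar in spirit to ratio-based samplers in tracing SDKs but is not the OTel API; `SAMPLE_RATES`, the volume-class names, and `should_sample` are hypothetical.

```python
# Hypothetical policy table: sampling rate per volume class.
SAMPLE_RATES = {"high_volume": 0.10, "low_volume": 1.0}


def should_sample(trace_id, rate):
    """Decide from the trace ID alone whether to keep a trace.

    The low 64 bits of the 128-bit hex trace ID are compared against
    rate * 2**64, so a given trace ID always gets the same answer.
    """
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    bound = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def should_sample_for(trace_id, volume_class):
    """Look up the rate for a volume class and apply it."""
    return should_sample(trace_id, SAMPLE_RATES[volume_class])
```

Because the decision depends only on the trace ID, a sampled trace is kept whole across agent boundaries rather than losing spans at random.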