Agentic SRE Advanced By Samson Tanimawo, PhD Published Jun 14, 2026 5 min read

Distributed Tracing for Multi-Agent Systems

When five agents collaborate, a single trace is the only way to debug. The instrumentation, the span layout, and the queries that find the slow specialist.

Why one trace

When five agents collaborate, the only practical way to debug the system is a single trace that shows every span. Per-agent logs are insufficient because they cannot show how one agent's work caused or delayed another's. OpenTelemetry (OTel) is the standard, so use it: the agent framework should integrate with OTel by default, and if it does not, wrap it. Emit one trace per user-visible operation; the triage-then-remediate flow, for example, is one trace with a sub-span per agent.

Span layout

The span hierarchy has three levels. The root span is the user-visible operation (“handle alert X”); its direct children are the agent invocations; the grandchildren are the tool calls and model calls inside each agent. Each span carries agent_role, model_name, tokens, and cost, and having those attributes on the span itself makes per-span analysis trivial. Use semantic conventions where they exist, such as the OTel semantic conventions for AI.
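The three-level layout can be sketched with a small in-memory model. This is an illustration using only the standard library, not the OTel SDK: the Span class, the span names, the tool names, and the attribute values are all hypothetical. With OTel itself, the same shape would come from nesting spans created by the tracer.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    """Hypothetical in-memory span; a real system would use OTel spans."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    attributes: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, **attributes) -> "Span":
        # A child shares the trace_id and records this span as its parent.
        span = Span(name=name, trace_id=self.trace_id,
                    parent_id=self.span_id, attributes=attributes)
        self.children.append(span)
        return span


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for c in span.children:
        yield from walk(c)


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the hierarchy as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


# Root: the user-visible operation.
root = Span(name="handle alert X", trace_id=secrets.token_hex(16))

# Children: one span per agent invocation.
triage = root.child("agent:triage", agent_role="triage")
remediate = root.child("agent:remediate", agent_role="remediate")

# Grandchildren: model calls and tool calls inside each agent.
triage.child("model_call", agent_role="triage",
             model_name="example-model", tokens=1200, cost=0.004)
triage.child("tool_call:fetch_metrics", agent_role="triage")
remediate.child("model_call", agent_role="remediate",
                model_name="example-model", tokens=1800, cost=0.006)
remediate.child("tool_call:restart_service", agent_role="remediate")

print("\n".join(render(root)))

# Per-span analysis: total model cost for the whole operation.
total_cost = sum(s.attributes.get("cost", 0) for s in walk(root))
```

Because cost and tokens sit on the spans themselves, totals per trace or per agent_role are a single pass over the tree with no join against another data source.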

Useful queries

Three queries dominate trace usage. “Slow agent runs”: traces with duration above p99, because the slow runs are where bugs hide. “Specialist used most often”: span count grouped by agent_role, which shows which specialists the rest of the system leans on. “Trace with error”: traces with at least one error span; that error span is the starting point for debugging.
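A trace backend would normally run these queries in its own query language. As a backend-neutral sketch, here are the same three queries over a flat list of exported span records. The record fields (trace_id, parent_id, agent_role, duration_ms, error) and the synthetic data are assumptions made for illustration, and the percentile uses the nearest-rank method.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile of a collection of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_traces(spans, pct=99):
    """Trace ids whose root-span duration is above the given percentile."""
    roots = {s["trace_id"]: s["duration_ms"]
             for s in spans if s["parent_id"] is None}
    threshold = percentile(roots.values(), pct)
    return sorted(t for t, d in roots.items() if d > threshold)


def usage_by_role(spans):
    """Span count per agent_role (root spans carry no role and are skipped)."""
    return Counter(s["agent_role"] for s in spans if s["agent_role"])


def traces_with_errors(spans):
    """Trace ids that contain at least one error span."""
    return sorted({s["trace_id"] for s in spans if s["error"]})


# Synthetic data: 200 traces, each with a root span and a triage span;
# every fourth trace also invokes the remediate agent, and one of those fails.
spans = []
for i in range(200):
    trace_id = f"t{i:03d}"
    spans.append({"trace_id": trace_id, "parent_id": None, "agent_role": None,
                  "duration_ms": 100 + i, "error": False})
    spans.append({"trace_id": trace_id, "parent_id": "root", "agent_role": "triage",
                  "duration_ms": 40, "error": False})
    if i % 4 == 0:
        spans.append({"trace_id": trace_id, "parent_id": "root",
                      "agent_role": "remediate", "duration_ms": 60,
                      "error": i == 120})

print(slow_traces(spans))         # ['t198', 't199']
print(usage_by_role(spans))       # Counter({'triage': 200, 'remediate': 50})
print(traces_with_errors(spans))  # ['t120']
```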

Context propagation

Propagation keeps the trace whole. Trace context (trace_id, span_id) is passed between agents: every inter-agent message carries it, and every tool call inherits it. Propagation breaks when an agent runs in a separate process that never receives the context, so use the standard OTel propagators (W3C TraceContext) to keep everything connected. Test propagation explicitly, because a trace that drops a span at an agent boundary is a broken integration.
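To make the mechanism concrete, the sketch below hand-rolls the W3C TraceContext traceparent header (version, 32-hex trace id, 16-hex parent span id, 2-hex flags) and carries it on an inter-agent message. The message shape and the inject and extract helpers are hypothetical, and a production system should rely on the OTel propagators rather than this code. The sketch only shows what has to cross the agent boundary and how a round-trip test might look.

```python
import re
import secrets

# traceparent: version "00", trace id, parent span id, trace flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(message, trace_id, span_id, sampled=True):
    """Return a copy of the message with trace context in its headers."""
    headers = dict(message.get("headers", {}))
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return {**message, "headers": headers}


def extract(message):
    """Return the trace context from a message, or None if absent or invalid."""
    header = message.get("headers", {}).get("traceparent", "")
    match = TRACEPARENT_RE.match(header)
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero ids are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 1)}


# Sending agent: attach its current span's context to the outgoing message.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
triage_span_id = "00f067aa0ba902b7"
outgoing = inject({"body": "restart service"}, trace_id, triage_span_id)

# Receiving agent (possibly another process): continue the same trace.
ctx = extract(outgoing)
remediate_span = {"trace_id": ctx["trace_id"],
                  "parent_span_id": ctx["parent_span_id"],
                  "span_id": secrets.token_hex(8)}

# Propagation test: the child must stay in the sender's trace.
assert remediate_span["trace_id"] == trace_id
assert remediate_span["parent_span_id"] == triage_span_id
```

A message that arrives without the header makes extract return None. That is the broken-integration case worth asserting on in tests, because the receiving agent would otherwise start a new, disconnected trace.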

Cost of tracing

Tracing has predictable costs. Sampling: 10% trace sampling for high-volume agents and 100% for low-volume ones (10% is enough for aggregates; 100% is needed to debug individual issues). Storage: traces are voluminous, so set retention tiers (30 days hot, 90 days warm). Latency: tracing adds less than 5% overhead when configured correctly, but a bad configuration can add 30%, so profile and tune.
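One way to get the 10% and 100% rates is deterministic, trace-id-based ratio sampling, decided once at the root span so that every agent in the trace agrees. The sketch below is loosely modeled on that idea and is not the OTel sampler itself. The volume tiers and the SAMPLING_RATIO mapping are hypothetical names chosen to mirror the rates above.

```python
# Hypothetical mapping from volume tier to sampling ratio.
SAMPLING_RATIO = {"high_volume": 0.10, "low_volume": 1.00}


def should_sample(trace_id_hex, ratio):
    """Keep a trace if the low 64 bits of its id fall below ratio * 2**64.

    The decision depends only on the trace id, so every process that sees
    the same trace id reaches the same answer without coordination.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


def decide(trace_id_hex, volume_tier):
    """Sampling decision for a new root span in the given volume tier."""
    return should_sample(trace_id_hex, SAMPLING_RATIO[volume_tier])


# Low-volume flows keep every trace; high-volume flows keep roughly one in ten.
print(decide("4bf92f3577b34da6a3ce929d0e0e4736", "low_volume"))  # True
```

The decision is made at the root and travels with the trace context (the sampled flag), so downstream agents honor it instead of sampling again and leaving holes in the trace.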