Tracking Tool-Call Failures: A Dashboard That Matters
Tool failures cause more agent regressions than model regressions. The five panels, the alert thresholds, and the runbook entry that brings the on-call up to speed.
Five panels
Five panels cover the failure space the on-call needs at a glance. More panels invite eye drift; fewer panels hide the structure of the problem.
- Total failure rate over time. One number with a 30-day trend. The headline metric for the platform team.
- Failure rate by tool. Which tools are flakiest right now. The ranked list is your remediation backlog.
- Failure rate by error class. 4xx versus 5xx versus timeout. Different classes have different owners and different fixes.
- Per-tenant failure rate plus blocked-runs count. Catches integrations that have gone bad for a single customer, and surfaces the downstream impact in agent runs the on-call actually cares about.
Alerts
Three alerts wired off the dashboard route to three different teams. Routing matters more than thresholds; the wrong team paged is wasted time.
- Per-tool failure above 5 percent. Sustained for 5 minutes; page the team that owns the failing tool.
- Aggregate failure above 1 percent. Sustained for 1 minute; page the agent platform team rather than any one tool owner.
- Per-tenant failure above 20 percent. Sustained for 1 minute; page the integrations team that handles the customer-specific wiring.
- Silent escalation. If the first owner does not ack within 10 minutes, escalate one tier and post in #agent-platform; the dashboard view is the audit trail.
Runbook for the on-call
The runbook is three steps. Anything more is a sign the dashboard is not surfacing the right things.
- Step 1: read the dashboard. Which panel is red? The colour tells you which team to pull in before you read the logs.
- Step 2: open the tool’s own dashboard. Linked from the agent dashboard. Is the tool itself broken, or is the agent’s wrapper at fault?
- Step 3: throttle the agent. Reduce the agent’s call rate to the failing tool while diagnostics happen. Throttling prevents the cascade where retries amplify the load on an already-degraded tool.
- Step 4: post the cause. When the failure is contained, post the root cause in the incident channel; future on-call learns from the writeup, not from grep.
Retry vs hard fail
Retry policy lives in the wrapper, not in the agent prompt. The model should not be reasoning about whether to retry a write.
- Idempotent tools. Retry up to 3 times with exponential backoff. Most read tools are idempotent.
- Non-idempotent tools. Do not retry. A retry on a write can cause duplicate effects, double-charges, or split-brain incidents.
- Wrapper-level tag. Each tool carries a retry policy tag. The decision is enforced at the wrapper, not at the prompt.
- Backoff cap. Total retry budget tops out at 10 seconds even for idempotent tools. A retry storm hurts more than the original failure.
Why this dashboard matters most
Of all the agent dashboards, this one earns its real estate first because tool failures are the most common silent regression.
- More frequent than model regressions. The model is mostly stable; the tools change constantly. Most agent regressions trace to a tool, not a prompt.
- Hides as model failure. When the agent reasons over wrong data because a tool returned garbage, the symptom looks like a model error. The dashboard is what catches it correctly.
- Underinvested. Most teams ship the agent without it; the investment pays back in fewer mysterious regressions and fewer pages to the prompt team.
- Cheap to build. The wrapper layer already has the data. The dashboard is one query against the tool-call log table.