Agentic SRE | Advanced | By Samson Tanimawo, PhD | Published Jun 2, 2026 | 5 min read

Tracking Tool-Call Failures: A Dashboard That Matters

Tool failures cause more agent regressions than model regressions. This post covers the five panels, the alert thresholds, and the runbook entry that brings the on-call up to speed.

Five panels

Five panels cover the failure space the on-call needs at a glance. More panels invite eye drift; fewer panels hide the structure of the problem.

Alerts

Three alerts wired off the dashboard route to three different teams. Routing matters more than thresholds: paging the wrong team wastes time.
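One way to keep routing explicit is to store the owning team next to each threshold, so an alert cannot exist without a destination. The sketch below is illustrative only: the alert names, thresholds, and team names are hypothetical placeholders, not the three alerts this post has in mind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str         # metric the alert watches (hypothetical names below)
    threshold: float  # fire when the metric is strictly above this value
    team: str         # team that gets paged (hypothetical names below)


# Assumed example alerts; replace with whatever the dashboard really exposes.
ALERTS = (
    Alert("tool_error_rate", 0.05, "tool-owners"),
    Alert("tool_timeout_rate", 0.02, "platform"),
    Alert("schema_validation_failure_rate", 0.01, "agent-team"),
)


def teams_to_page(metrics: dict[str, float]) -> list[tuple[str, str]]:
    """Return (team, alert name) pairs for every alert whose threshold is exceeded.

    Metrics missing from the input are treated as 0.0, so they never fire.
    """
    return [
        (alert.team, alert.name)
        for alert in ALERTS
        if metrics.get(alert.name, 0.0) > alert.threshold
    ]
```

Calling `teams_to_page` with the latest metric snapshot yields only the teams that own a breached alert, which keeps the routing decision in one reviewable table rather than scattered across alert definitions.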

Runbook for the on-call

The runbook is three steps. Anything more is a sign the dashboard is not surfacing the right things.

Retry vs hard fail

Retry policy lives in the wrapper, not in the agent prompt. The model should not be reasoning about whether to retry a write.
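A minimal sketch of what "in the wrapper" could look like, assuming each tool call is tagged as a read or a write and that transient failures surface as a dedicated exception type. `TransientToolError`, `ToolCallFailed`, `RetryPolicy`, and `call_tool` are hypothetical names introduced for illustration, and the exponential backoff is one reasonable choice rather than a prescribed one.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


class TransientToolError(Exception):
    """Hypothetical error for failures that may succeed on retry (e.g. timeouts)."""


class ToolCallFailed(Exception):
    """Raised to the agent once the wrapper has given up."""


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3
    base_delay_s: float = 0.5
    retry_writes: bool = False  # writes hard-fail unless explicitly marked safe


def call_tool(
    fn: Callable[[], Any],
    *,
    is_write: bool,
    policy: RetryPolicy = RetryPolicy(),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run a tool call; the retry decision is made here, never by the model."""
    retryable = (not is_write) or policy.retry_writes
    attempts = max(1, policy.max_attempts) if retryable else 1
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientToolError as exc:
            if attempt == attempts:
                raise ToolCallFailed(f"gave up after {attempt} attempt(s)") from exc
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(policy.base_delay_s * 2 ** (attempt - 1))
```

With this shape, a read that times out is retried with backoff, while a write that times out fails after a single attempt and the agent only ever sees `ToolCallFailed`. Any exception that is not a `TransientToolError` propagates unchanged on the first attempt.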

Why this dashboard matters most

Of all the agent dashboards, this one earns its real estate first because tool failures are the most common silent regression.