The Agent Cost Bomb: Pre-emptive Token Budgets
One stuck agent can burn $400 in an hour. A budget enforcement layer stops it before it gets that far, and alerting wakes you up when budgets blow up across runs.
Four budget dimensions
Token budgets need four dimensions because the failure modes operate at different scopes. A single budget cap protects against one failure shape and misses the others.
- Per-run budget. How much a single invocation can cost. Caps the worst-case run.
- Per-tenant budget. How much a single user or service can trigger in a window. Prevents abuse and runaway integrations.
- Aggregate budget. How much the agent fleet can spend per day. Catches rare but expensive scenarios that slip past per-run caps.
- Per-action-type budget. Some actions (long-context summarisation, vector search) have their own per-day cap independent of the per-run total.
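The scopes above can be sketched as one set of limits checked together. This is a minimal illustration, not a reference implementation: the names `BudgetLimits`, `Spend`, and `violated_scopes`, and every dollar figure, are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BudgetLimits:
    """Caps in USD for each scope. All values are illustrative assumptions."""
    per_run: float = 2.00
    per_tenant_window: float = 50.00
    aggregate_daily: float = 1000.00
    per_action_daily: Dict[str, float] = field(
        default_factory=lambda: {"long_context_summarisation": 100.00}
    )


@dataclass
class Spend:
    """Running totals for the scopes that apply to one pending call."""
    run: float = 0.0
    tenant_window: float = 0.0
    aggregate_today: float = 0.0
    action_today: float = 0.0


def violated_scopes(limits: BudgetLimits, spend: Spend,
                    projected_cost: float, action_type: str) -> List[str]:
    """Return the name of every scope the projected call would exceed."""
    violations = []
    if spend.run + projected_cost > limits.per_run:
        violations.append("per_run")
    if spend.tenant_window + projected_cost > limits.per_tenant_window:
        violations.append("per_tenant")
    if spend.aggregate_today + projected_cost > limits.aggregate_daily:
        violations.append("aggregate")
    # Only action types with an explicit cap are checked at this scope.
    action_cap = limits.per_action_daily.get(action_type)
    if action_cap is not None and spend.action_today + projected_cost > action_cap:
        violations.append("per_action_type")
    return violations
```

Returning every violated scope, rather than stopping at the first, lets the caller route the event to the right alert for each scope.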
Enforcement layer
Enforcement happens before the model call, not after. After-the-fact accounting is billing, not safety.
- Track tokens in the loop. Before each model call, check whether the next call would exceed the budget. If yes, abort or escalate without calling.
- Cheap check. A counter and a comparison. Far cheaper than the model call it might prevent.
- Real-time accounting. Cost is computed live, not retrospectively. Pre-call accounting is the only way enforcement can prevent overspend.
- Audit on rejection. Every budget-prevented call is logged with the prompt size and the projected cost. The audit is what justifies the budget at review time.
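A pre-call check along these lines might look like the sketch below. The class `RunBudget`, the exception `BudgetExceeded`, and the per-1k-token prices passed in are hypothetical. Because the output length is unknown before the call, the sketch assumes the projection uses the caller's maximum output tokens as a worst case.

```python
import logging

audit_log = logging.getLogger("budget_audit")


class BudgetExceeded(Exception):
    """Raised instead of making a model call that would exceed the cap."""


class RunBudget:
    def __init__(self, cap_usd, usd_per_1k_input, usd_per_1k_output):
        self.cap_usd = cap_usd
        self.usd_per_1k_input = usd_per_1k_input
        self.usd_per_1k_output = usd_per_1k_output
        self.spent_usd = 0.0

    def cost(self, input_tokens, output_tokens):
        """Cost in USD for a given token count at the configured prices."""
        return (input_tokens / 1000 * self.usd_per_1k_input
                + output_tokens / 1000 * self.usd_per_1k_output)

    def check(self, prompt_tokens, max_output_tokens):
        """Call BEFORE the model call; raises if the call would overspend."""
        projected = self.cost(prompt_tokens, max_output_tokens)
        if self.spent_usd + projected > self.cap_usd:
            # Audit record: prompt size and projected cost of the blocked call.
            audit_log.warning(
                "budget_rejected prompt_tokens=%d projected_usd=%.4f "
                "spent_usd=%.4f cap_usd=%.4f",
                prompt_tokens, projected, self.spent_usd, self.cap_usd,
            )
            raise BudgetExceeded(
                f"projected {projected:.4f} USD would exceed cap {self.cap_usd:.2f} USD"
            )
        return projected

    def record(self, input_tokens, output_tokens):
        """Call AFTER a successful model call with the actual token usage."""
        self.spent_usd += self.cost(input_tokens, output_tokens)
```

In an agent loop, `check` would run immediately before each model call and `record` immediately after, so the running total reflects actual usage while the gate uses the worst-case projection.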
Alerting on budget excursions
Budget alerts route differently by scope. Per-run hits are individually noisy; aggregate hits are individually critical.
- Per-run hit rate. Page on the rate, not individual events. “More than 1 percent of runs hit the cap today” is the alert.
- Aggregate hits page hard. The fleet is misbehaving; someone looks immediately.
- Per-tenant dashboards. Reviewed daily. Patterns reveal abuse, runaway integrations, or legitimate cases where the budget needs raising.
- Trend alerting. Week-over-week growth in budget consumption is a leading indicator. A 20 percent jump without a release pages the platform team.
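The rate-based and trend-based rules above reduce to a few comparisons. The sketch below assumes the 1 percent and 20 percent thresholds from the text; the function names and the `released_this_week` flag are hypothetical, and how run counts and weekly spend are collected is left out.

```python
def cap_hit_rate(total_runs, capped_runs):
    """Fraction of runs that hit the per-run cap."""
    if total_runs == 0:
        return 0.0
    return capped_runs / total_runs


def should_page_on_hit_rate(total_runs, capped_runs, threshold=0.01):
    """Page on the rate of capped runs, never on an individual capped run."""
    return cap_hit_rate(total_runs, capped_runs) > threshold


def week_over_week_growth(this_week_usd, last_week_usd):
    """Relative growth in spend, or None when there is no baseline."""
    if last_week_usd <= 0:
        return None
    return (this_week_usd - last_week_usd) / last_week_usd


def should_page_on_trend(this_week_usd, last_week_usd,
                         released_this_week, threshold=0.20):
    """Page when spend jumps by the threshold and no release explains it."""
    growth = week_over_week_growth(this_week_usd, last_week_usd)
    if growth is None:
        return False
    return growth >= threshold and not released_this_week
```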
Calibrating the budgets
Budget calibration is empirical. Four steps move it from arbitrary to defensible.
- Sample p95. Take p95 of normal runs and round up. That is the starting per-run budget.
- Month-one review. Every run hitting the cap is reviewed. If it should have completed, raise the cap; if it was misbehaving, leave the cap where it is.
- Quarterly re-calibrate. Re-anchor against the latest p95. Models get cheaper; budgets should drop. A budget unchanged for 12 months is probably loose.
- Per-tier sizing. Triage agents and postmortem agents need different budgets. Calibrate per role rather than applying one number across the fleet.
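The starting point for calibration, p95 rounded up and computed per role, can be sketched as follows. The nearest-rank percentile method and the 0.25 USD rounding step are assumptions chosen for the example; the text does not prescribe either.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of run costs."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic, avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def starting_budget(run_costs_usd, round_to=0.25):
    """p95 of normal runs, rounded up to the next multiple of round_to."""
    return math.ceil(p95(run_costs_usd) / round_to) * round_to


def calibrate_per_role(costs_by_role, round_to=0.25):
    """One starting budget per agent role rather than one fleet-wide number."""
    return {
        role: starting_budget(costs, round_to)
        for role, costs in costs_by_role.items()
    }
```

Running the same function again each quarter on fresh samples gives the re-anchored value to compare against the current cap.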
Fail closed on budget exhaustion
When the budget runs out, the agent stops. Failing open invites runaway runs and the cost-bomb shape this whole pattern exists to prevent.
- Hard stop. The agent stops. It does not try harder, skip steps, or continue without verification.
- Escalate with partial state. The escalation includes what was learned before the budget ran out. Operators get warm context.
- Resist the bump reflex. Each escalation is a chance to ask why the budget was inadequate. Usually the answer is a prompt that needs work, not a budget that needs raising.
- Document raises. When a raise is granted, log the reason. The log surfaces patterns where the same agent keeps asking for more budget without producing more value.
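A fail-closed loop that escalates with partial state might look like the sketch below. The `RunResult` type and the representation of steps as `(name, projected_cost_usd, action)` tuples are hypothetical simplifications; a real agent would project the cost of each call from its token counts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, float, Callable[[], str]]


@dataclass
class RunResult:
    run_id: str
    completed: bool
    findings: List[str]            # partial state handed to the operator
    spent_usd: float
    stopped_before: Optional[str]  # step that was refused, if any


def run_agent(run_id: str, steps: List[Step], cap_usd: float) -> RunResult:
    """Execute steps in order; stop hard at the first one that would overspend."""
    spent = 0.0
    findings: List[str] = []
    for name, projected_cost, action in steps:
        if spent + projected_cost > cap_usd:
            # Fail closed: no retries, no skipping ahead to later steps.
            return RunResult(run_id, False, findings, spent, stopped_before=name)
        findings.append(action())
        spent += projected_cost
    return RunResult(run_id, True, findings, spent, stopped_before=None)
```

The returned `findings` are what give operators warm context: everything learned before the budget ran out travels with the escalation.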