The Multi-Window Multi-Burn-Rate Alert
The Google SRE pattern: alert on burn rate over multiple windows simultaneously. Why it works, with the configuration.
The idea
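Multi-window multi-burn-rate alerting is the SRE-workbook standard for SLO-driven alerting. Burn rate is how fast the service consumes its error budget relative to the rate that would spend the budget exactly over the SLO period; a 1x burn rate uses up the whole budget in exactly that period. Three burn-rate alerts at different windows produce balanced detection: fast spikes are caught quickly, sustained drift is caught reliably, and transient noise is filtered out. The pattern is the foundation of mature SLO alerting.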
What the pattern looks like:
- Three burn-rate alerts at different windows: A fast window (1 hour), a medium window (6 hours), and a slow window (3 days), each with its own alert.
- Each fires when burn over its window exceeds a threshold: The burn rate is averaged over the window; if it exceeds that window's threshold, the alert fires. Shorter windows have higher thresholds.
- Faster windows catch acute spikes: A severe latency or error spike that resolves in 30 minutes can trip the 1-hour alert, while the slower windows average it over a longer period and may never cross their thresholds.
- Slower windows catch sustained drift: A small but persistent increase in error rate stays under the fast threshold, but the burn accumulates over hours or days until the slow window detects it.
- The combination covers both kinds of incident: Neither fast-spike detection nor slow-drift detection covers all incidents on its own; together the windows cover both.
The pattern is what mature SLO alerting looks like. The three windows together produce balanced detection.
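To make the per-window calculation concrete, here is a minimal Python sketch. It assumes the burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO target) and that per-minute `(bad, total)` request counts are available; the function names and the sample data are hypothetical, not taken from any particular monitoring system.

```python
from typing import Sequence, Tuple


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / (1.0 - slo_target)


def window_burn_rate(
    samples: Sequence[Tuple[int, int]], window_minutes: int, slo_target: float
) -> float:
    """Burn rate over the most recent `window_minutes` per-minute samples."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    recent = samples[-window_minutes:]
    bad = sum(b for b, _ in recent)
    total = sum(t for _, t in recent)
    return burn_rate(bad, total, slo_target)


# Hypothetical data: 6 hours of per-minute counts for a 99.9% SLO,
# with a 2% error ratio during the last 30 minutes only.
samples = [(0, 1000)] * 330 + [(20, 1000)] * 30

fast = window_burn_rate(samples, 60, 0.999)     # roughly 10x
medium = window_burn_rate(samples, 360, 0.999)  # roughly 1.7x
print(round(fast, 2), round(medium, 2))
```

The same 30-minute incident looks very different depending on the window: the 1-hour average is about six times the 6-hour average, which is why each window needs its own threshold.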
Standard config
The standard configuration comes from the SRE workbook. Each threshold is derived from the share of the 30-day error budget consumed within the window, so the team can adopt the values as they are rather than reinventing them.
- 1-hour window, alert at 14.4x burn rate: Fires when 2% of the 30-day error budget is consumed in 1 hour (0.02 × 720 hours / 1 hour = 14.4).
- 6-hour window, 6x: Fires when 5% of the budget is consumed in 6 hours (0.05 × 720 / 6 = 6).
- 3-day window, 1x: Fires when 10% of the budget is consumed in 3 days (0.10 × 720 / 72 = 1). This is the slow-drift threshold.
- The two fast windows page: In the workbook's configuration, the 1-hour and 6-hour alerts are high-priority (page). Each is also paired with a short window one-twelfth its length (5 minutes and 30 minutes), and both the long and the short window must breach the same threshold before the alert fires. That keeps the signal high-confidence and lets the alert reset soon after the problem stops.
- The slow window fires low-priority: The 3-day alert, paired with a 6-hour short window, opens a ticket rather than paging. The burn is slow enough that investigation can wait for working hours.
The standard config produces predictable behavior. The team adopts it; the alerts work as documented.
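The thresholds above follow from simple arithmetic: burn rate threshold = budget fraction × SLO period / window length. The Python sketch below reproduces the three values for a 30-day period and checks observed burn rates against them. The names `TIERS`, `burn_rate_threshold`, and `breached_windows` are illustrative, and routing to pages or tickets is deliberately left out.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period, 720 hours


def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_hours: float = SLO_PERIOD_HOURS
) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in one window."""
    return budget_fraction * period_hours / window_hours


# window name -> (window length in hours, budget fraction consumed)
TIERS = {"1h": (1, 0.02), "6h": (6, 0.05), "3d": (72, 0.10)}

THRESHOLDS = {
    name: burn_rate_threshold(fraction, hours)
    for name, (hours, fraction) in TIERS.items()
}


def breached_windows(observed: dict) -> list:
    """Names of the windows whose observed burn rate exceeds its threshold."""
    return [
        name for name, limit in THRESHOLDS.items()
        if observed.get(name, 0.0) > limit
    ]


for name, limit in THRESHOLDS.items():
    print(name, round(limit, 1))

# Hypothetical observed burn rates per window.
print(breached_windows({"1h": 20.0, "6h": 4.0, "3d": 1.5}))
```

Changing the SLO period or the budget fractions changes the thresholds, so a team on a 28-day period, for example, would recompute them rather than reuse 14.4 and 6 directly.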
Why this works
The pattern works because it captures the operational reality of SLOs. Different incidents produce different burn signatures; the multi-window approach matches them all.
- Reduces noise (brief, minor spikes do not fire): A 5-minute issue must burn at more than about 173x to lift the 1-hour average above 14.4x, so small transients stay below the threshold at every window. A severe short outage can still trip the fast window, which is the intended behavior.
- Sustained issues do fire: How long an issue must persist depends on its burn rate. A burn of about 29x sustained for 30 minutes trips the 1-hour window; a burn just above 6x needs most of 6 hours to trip the 6-hour window. Magnitude and duration together determine which window catches the issue.
- Reduces blind spots (slow drift is caught before the budget is exhausted): A drift too small for the fast windows eventually trips the slow window. At a burn just above 1x, roughly 10% of the budget is gone when the 3-day alert fires, which leaves time for remediation.
- Statistical balance: The combination of windows gives a known balance between false positives and false negatives. The workbook documents the trade-offs among precision, recall, detection time, and reset time, so the behavior is predictable.
- Standard means broadly applicable: Other teams' tooling already supports the standard, so the team can use community resources rather than reinventing them, which lowers the operational cost of adoption.
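The trade-off between detection speed and noise can be estimated with a little arithmetic. The sketch below is a simplified model, not the workbook's own analysis: it assumes a constant burn rate that starts against a previously clean window with uniform traffic, so the window average after t minutes is burn × t / window. The function names are hypothetical.

```python
from typing import Optional


def minutes_to_detect(
    burn: float, threshold: float, window_minutes: float
) -> Optional[float]:
    """Minutes of constant burn before the window average reaches the threshold.

    Returns None when the burn never exceeds the threshold, however long it lasts.
    """
    if burn <= threshold:
        return None
    return threshold * window_minutes / burn


def hours_to_exhaust_budget(burn: float, period_hours: float = 30 * 24) -> float:
    """Hours until a constant burn rate consumes the entire error budget."""
    if burn <= 0:
        raise ValueError("burn must be positive")
    return period_hours / burn


# Full outage on a 99.9% SLO: error ratio 1.0, so burn rate is about 1000x.
print(minutes_to_detect(1000, 14.4, 60))   # under one minute on the 1-hour window

# Slow drift at 2x burn: invisible to the 1-hour window...
print(minutes_to_detect(2, 14.4, 60))      # None
# ...but the 3-day window (4320 minutes, 1x threshold) catches it after 36 hours,
print(minutes_to_detect(2, 1, 4320) / 60)
# well before the budget runs out at 360 hours (15 days).
print(hours_to_exhaust_budget(2))
```

Under these assumptions, every burn rate above 1x is detected by some window before the budget is gone, while the fast window responds within minutes to severe outages.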
Multi-window multi-burn-rate alerting is one of those alerting disciplines where the SRE workbook standard is genuinely the right answer for most teams. Nova AI Ops integrates with SLO platforms and alerting systems, supports the standard configuration, and produces the multi-window alerts that mature SLO operations require.