Alert Deduplication Strategy
Same incident, multiple alerts. Dedupe early.
Why dedupe
One real failure produces dozens of pages: pod crash, container restart, service unavailable, dependent service error-rate spike, downstream timeout. Same root cause, five different symptoms. Without dedupe, the on-call clears 30 alerts to find the one that matters; with dedupe, the on-call sees one grouped page with the dependent symptoms attached.
- One failure, many pages. Pod crash, container restart, service unavailable, dependent error spike, downstream timeout; same root cause.
- Without dedupe: 30 alerts. On-call clears them to find the one that matters; signal-to-noise destroyed.
- With dedupe: one grouped page. Dependent symptoms attached; the on-call has the full picture.
- Three layers. Grouping, inhibition, correlation; each addresses a different dedupe need.
Alertmanager grouping
Alertmanager grouping is the first layer. Set group_by on shared labels (service, cluster, severity); a group_wait of 30 seconds collapses near-simultaneous events; a group_interval of 5 minutes batches follow-ups. Regex matchers on routes work, but keep them narrow, because overly broad groups merge unrelated incidents.
- group_by shared labels. Service, cluster, severity; the natural grouping axis.
- group_wait 30s. Collapses near-simultaneous events into one notification.
- group_interval 5m. Batches follow-ups so the same incident doesn’t re-page every minute.
- Narrow label matching. Regex matchers work, but keep them narrow; overly broad groups merge unrelated incidents.
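To make the grouping behaviour concrete, here is a minimal Python sketch of the idea, not Alertmanager's implementation. It buckets alerts by the group_by labels named above and treats everything that arrives within group_wait of a group's first alert as one notification. The alert names, label values, the `at` timestamp field, and both helper functions are hypothetical.

```python
from collections import defaultdict

GROUP_BY = ("service", "cluster", "severity")  # grouping labels named in the text
GROUP_WAIT = 30  # seconds, the group_wait value from the text


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def first_notification(group, group_wait=GROUP_WAIT):
    """Names of the alerts that land within group_wait of the group's first alert."""
    ordered = sorted(group, key=lambda a: a["at"])
    start = ordered[0]["at"]
    return [a["labels"]["alertname"] for a in ordered if a["at"] - start <= group_wait]


# Hypothetical alerts; "at" is seconds since the first symptom appeared.
alerts = [
    {"at": 0, "labels": {"alertname": "PodCrashLooping", "service": "checkout",
                         "cluster": "eu-1", "severity": "page"}},
    {"at": 5, "labels": {"alertname": "ContainerRestarting", "service": "checkout",
                         "cluster": "eu-1", "severity": "page"}},
    {"at": 12, "labels": {"alertname": "ServiceUnavailable", "service": "checkout",
                          "cluster": "eu-1", "severity": "page"}},
    {"at": 20, "labels": {"alertname": "HighErrorRate", "service": "cart",
                          "cluster": "eu-1", "severity": "page"}},
]

groups = group_alerts(alerts)
checkout_page = first_notification(groups[("checkout", "eu-1", "page")])
```

In this sketch the three checkout alerts collapse into one notification, while the cart alert stays in its own group because its service label differs. Dropping service from the grouping labels would merge the two, which is the over-broad grouping the text warns about.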
Inhibition rules
Inhibition handles parent-child relationships. When a cluster-wide alert fires, inhibit the per-pod alerts because the cluster issue subsumes them; when upstream X is down, inhibit downstream Y’s error-rate page because the downstream symptom is expected. Document inhibitions in the alert catalog because hidden inhibitions confuse responders during partial failures.
- Cluster-wide subsumes per-pod. Cluster alert fires, per-pod alerts inhibit; the parent issue dominates.
- Upstream-down inhibits downstream. Downstream error-rate is expected when upstream is down; suppress the symptom.
- Documented in catalog. Hidden inhibitions confuse responders during partial failures.
- Per-rule review cadence. Review each inhibition on the alert audit cadence so stale rules do not keep suppressing alerts that now matter.
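The parent-child logic can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Alertmanager code: a rule has a source matcher, a target matcher, and a list of labels that must be equal on both alerts before the target is suppressed. The alert names, the `scope` label, and the rule contents are hypothetical.

```python
def matches(labels, matchers):
    """True when every matcher label has exactly the given value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rules):
    """True when some other firing alert suppresses target under any rule."""
    for rule in rules:
        if not matches(target, rule["target"]):
            continue
        for source in firing:
            # An alert never inhibits itself.
            if source is target or not matches(source, rule["source"]):
                continue
            if all(source.get(name) == target.get(name) for name in rule["equal"]):
                return True
    return False


# Hypothetical rules mirroring the two cases in the text.
RULES = [
    # A cluster-wide alert subsumes per-pod alerts in the same cluster.
    {"source": {"alertname": "ClusterDown"},
     "target": {"scope": "pod"},
     "equal": ["cluster"]},
    # Upstream auth being down explains checkout's error rate.
    {"source": {"alertname": "UpstreamDown", "service": "auth"},
     "target": {"alertname": "HighErrorRate", "service": "checkout"},
     "equal": ["cluster"]},
]

firing = [
    {"alertname": "ClusterDown", "scope": "cluster", "cluster": "eu-1"},
    {"alertname": "PodCrashLooping", "scope": "pod", "cluster": "eu-1"},
    {"alertname": "PodCrashLooping", "scope": "pod", "cluster": "us-1"},
    {"alertname": "UpstreamDown", "scope": "service", "service": "auth", "cluster": "us-1"},
    {"alertname": "HighErrorRate", "scope": "service", "service": "checkout", "cluster": "us-1"},
]

to_page = [(a["alertname"], a["cluster"])
           for a in firing if not is_inhibited(a, firing, RULES)]
```

The equal list is what keeps a partial failure visible: the us-1 pod alert still pages because the cluster-wide alert is firing in eu-1 only.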
Correlation across signals
Correlation is the AIOps layer. Tools such as BigPanda, Moogsoft, and Nova cluster events by topology and time, which is useful when alert sources are heterogeneous. Topology comes from a CMDB or a service mesh; without topology, correlation is just timestamp clustering. Validate clusters during incident review, because a cluster that hides a real second incident is worse than no clustering.
- AIOps layer. BigPanda, Moogsoft, Nova; cluster events by topology and time.
- Topology source. CMDB or service mesh; without topology, correlation is just timestamp clustering.
- Heterogeneous sources. Useful when alert sources span multiple tools; cross-tool correlation is the value.
- Validate clusters. A cluster that hides a real second incident is worse than no clustering at all.
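As a rough illustration of what "topology and time" means, the Python sketch below merges events only when they are close in time and their services are the same or directly connected. It is a toy model, not how any of the named products work. The dependency map, the 120-second window, the source names, and the event fields are all assumptions.

```python
def correlate(events, topology, window=120):
    """Cluster events that are near in time AND adjacent in the topology."""

    def related(a, b):
        near_in_time = abs(a["at"] - b["at"]) <= window
        adjacent = (
            a["service"] == b["service"]
            or b["service"] in topology.get(a["service"], ())
            or a["service"] in topology.get(b["service"], ())
        )
        return near_in_time and adjacent

    clusters = []
    for event in sorted(events, key=lambda e: e["at"]):
        linked, rest = [], []
        for cluster in clusters:
            bucket = linked if any(related(event, other) for other in cluster) else rest
            bucket.append(cluster)
        # Merge every cluster the new event touches into a single cluster.
        merged = [e for cluster in linked for e in cluster] + [event]
        clusters = rest + [merged]
    return clusters


# Hypothetical dependency map: service -> services it calls.
TOPOLOGY = {"checkout": {"payments"}, "payments": {"ledger-db"}}

# Hypothetical events from different monitoring tools; "at" is in seconds.
events = [
    {"at": 0, "source": "tool-a", "service": "ledger-db"},
    {"at": 40, "source": "tool-b", "service": "payments"},
    {"at": 60, "source": "tool-c", "service": "search"},
    {"at": 75, "source": "tool-d", "service": "checkout"},
]

clusters = correlate(events, TOPOLOGY)
services = sorted(sorted(e["service"] for e in c) for c in clusters)
```

Here the search event falls inside the same time window as the others but stays in its own cluster because nothing in the map connects it to them. With an empty topology every event ends up alone, and with time as the only criterion the search event would be absorbed, which is the hidden second incident the text warns about.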
Layer in this order
The layering order is opinionated. Start with Alertmanager group_by because it’s free and handles 60% of dedupe needs; next, add inhibition for known parent-child relationships and document each rule; reach for AIOps correlation only when the catalog spans multiple alert sources and the topology is well known.
- Start with group_by. Free; handles 60% of dedupe needs; the highest-leverage first step.
- Add inhibition next. Known parent-child relationships; document them in the catalog.
- AIOps correlation last. Multiple alert sources and well-known topology; the build-or-buy threshold.
- Per-layer measurement. Measure each layer’s dedupe contribution; the numbers justify the ordering.
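One simple way to measure each layer's contribution is to count notifications before and after it, then express each drop as a share of the raw volume. The Python sketch below does that arithmetic. The counts are made-up illustration values, not measurements, and the function name is hypothetical.

```python
def layer_contributions(counts):
    """Share of raw alert volume removed by each layer.

    counts is an ordered list of (layer, notifications remaining after that
    layer) pairs, and the first entry is the raw, undeduplicated volume.
    """
    raw = counts[0][1]
    shares = {}
    for (_, before), (layer, after) in zip(counts, counts[1:]):
        shares[layer] = (before - after) / raw
    return shares


# Made-up weekly counts, for illustration only.
counts = [
    ("raw", 200),
    ("grouping", 80),
    ("inhibition", 50),
    ("correlation", 40),
]

shares = layer_contributions(counts)
total_reduction = sum(shares.values())
```

With these invented numbers grouping removes 60% of the raw volume, inhibition 15%, and correlation 5%. Rerunning the same calculation on real counts shows whether the ordering still holds for a given alert catalog.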