By Samson Tanimawo, PhD. Published Jan 28, 2026.

Alert Deduplication Strategy

Same incident, multiple alerts. Dedupe early.

Why dedupe

One real failure can produce dozens of alerts: pod crash, container restart, service unavailable, a dependent service's error-rate spike, downstream timeouts. They all share one root cause, yet each one pages separately. Without dedupe, the on-call clears 30 alerts to find the one that matters; with dedupe, the on-call sees one grouped page with the dependent symptoms attached.

Alertmanager grouping

Alertmanager grouping is the first layer. Set group_by on shared labels (service, cluster, severity); a group_wait of 30 seconds collapses near-simultaneous alerts into one notification, and a group_interval of 5 minutes batches follow-ups for a group that has already notified. Regex matchers on routes work, but match labels narrowly, because overly broad groups merge unrelated incidents.
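To make the effect of group_by concrete, here is a minimal Python sketch of label-based bucketing. It is a conceptual model only, not Alertmanager's implementation; the alert names, label values, and the group_alerts helper are all hypothetical, and timing (group_wait, group_interval) is left out.

```python
from collections import defaultdict

# Hypothetical alerts; the label names follow the group_by example above.
ALERTS = [
    {"alertname": "PodCrashLooping", "service": "checkout", "cluster": "prod-1", "severity": "page"},
    {"alertname": "ContainerRestarts", "service": "checkout", "cluster": "prod-1", "severity": "page"},
    {"alertname": "ServiceUnavailable", "service": "checkout", "cluster": "prod-1", "severity": "page"},
    {"alertname": "HighErrorRate", "service": "search", "cluster": "prod-1", "severity": "page"},
]

def group_alerts(alerts, group_by):
    """Bucket alert names by the values of the group_by labels (missing label -> "")."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert["alertname"])
    return dict(groups)

# Narrow grouping: one notification per (service, cluster, severity).
narrow = group_alerts(ALERTS, ["service", "cluster", "severity"])

# Overly broad grouping: everything in the cluster lands in one notification.
broad = group_alerts(ALERTS, ["cluster"])

for key, names in narrow.items():
    print(key, names)
```

With the narrow key, the three checkout alerts collapse into one group and the search alert stays separate. With the broad key, all four merge, which is the unrelated-incident merge the paragraph warns about.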

Inhibition rules

Inhibition handles parent-child relationships. When a cluster-wide alert fires, inhibit the per-pod alerts because the cluster issue subsumes them; when upstream X is down, inhibit downstream Y’s error-rate page because the downstream symptom is expected. Document inhibitions in the alert catalog because hidden inhibitions confuse responders during partial failures.
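The parent-child logic can be sketched in a few lines of Python. This is a simplified model under stated assumptions: matchers are exact-match only, an alert never inhibits itself, and the rule shape (source, target, equal) loosely mirrors the idea of an inhibit rule rather than reproducing Alertmanager's behavior. The alert names and the apply_inhibition helper are hypothetical.

```python
# Hypothetical firing alerts: one cluster-wide parent and three per-pod children.
ALERTS = [
    {"alertname": "ClusterDown", "cluster": "prod-1"},
    {"alertname": "PodCrashLooping", "cluster": "prod-1", "pod": "api-0"},
    {"alertname": "PodCrashLooping", "cluster": "prod-1", "pod": "api-1"},
    {"alertname": "PodCrashLooping", "cluster": "prod-2", "pod": "api-0"},
]

# Hypothetical rule: a firing ClusterDown mutes PodCrashLooping in the same cluster.
RULES = [
    {
        "source": {"alertname": "ClusterDown"},
        "target": {"alertname": "PodCrashLooping"},
        "equal": ["cluster"],
    }
]

def matches(alert, matchers):
    """Exact-match check; real matchers can also be negative or regex."""
    return all(alert.get(k) == v for k, v in matchers.items())

def is_inhibited(target, alerts, rules):
    """True if some other firing alert matches a rule's source side and
    agrees with the target on every label listed in 'equal'."""
    for rule in rules:
        if not matches(target, rule["target"]):
            continue
        for source in alerts:
            if source is target or not matches(source, rule["source"]):
                continue
            if all(source.get(l) == target.get(l) for l in rule["equal"]):
                return True
    return False

def apply_inhibition(alerts, rules):
    """Split alerts into (notified, inhibited)."""
    notified, inhibited = [], []
    for alert in alerts:
        (inhibited if is_inhibited(alert, alerts, rules) else notified).append(alert)
    return notified, inhibited

notified, inhibited = apply_inhibition(ALERTS, RULES)
```

Here the two prod-1 pod alerts are muted while the prod-2 pod alert still notifies, because the equal label scopes the inhibition to the affected cluster. The inhibited list is also what belongs in the alert catalog, so responders can see what was suppressed and why.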

Correlation across signals

Correlation is the AIOps layer. BigPanda, Moogsoft, and Nova cluster events by topology and time, which is useful when alert sources are heterogeneous. The topology should come from a CMDB or service mesh; without it, correlation is just timestamp clustering. Validate clusters during incident review, because a cluster that hides a real second incident is worse than no clustering.
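The difference between topology-aware correlation and plain timestamp clustering can be illustrated with a small Python sketch. It does not reflect how any of the named products work internally; the events, the dependency edges, the 60-second window, and the correlate helper are all hypothetical.

```python
from itertools import combinations

# Hypothetical dependency edges, e.g. exported from a CMDB or service mesh.
TOPOLOGY = {("checkout", "payments"), ("payments", "ledger-db")}

# Hypothetical events; ts is seconds since the first event.
EVENTS = [
    {"id": "e1", "service": "ledger-db", "ts": 0},
    {"id": "e2", "service": "payments", "ts": 20},
    {"id": "e3", "service": "checkout", "ts": 45},
    {"id": "e4", "service": "search", "ts": 30},  # unrelated service, same time window
]

def related(a, b, topology):
    """Two services are related if they are the same or share a direct edge."""
    return a == b or (a, b) in topology or (b, a) in topology

def correlate(events, window_s, topology=None):
    """Cluster events that are close in time and, if a topology is given,
    directly connected in it. Uses a small union-find to merge pairs."""
    parent = {e["id"]: e["id"] for e in events}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(events, 2):
        close = abs(a["ts"] - b["ts"]) <= window_s
        linked = topology is None or related(a["service"], b["service"], topology)
        if close and linked:
            parent[find(a["id"])] = find(b["id"])

    clusters = {}
    for e in events:
        clusters.setdefault(find(e["id"]), []).append(e["id"])
    return sorted(sorted(ids) for ids in clusters.values())

with_topology = correlate(EVENTS, window_s=60, topology=TOPOLOGY)
time_only = correlate(EVENTS, window_s=60)
```

With the topology, the ledger-db, payments, and checkout events form one cluster and the search event stays on its own. With time alone, all four merge into a single cluster, which is exactly the case where a real second incident gets hidden.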

Layer in this order

The layering order is opinionated. Start with Alertmanager group_by because it’s free and handles 60% of dedupe needs. Next, add inhibition rules for known parent-child relationships, and document them. Reach for AIOps correlation only when the catalog spans multiple alert sources and the topology is well known.