Debug: Alert Firing Incorrectly

Runbook for diagnosing alerts that fire when nothing is wrong, or fail to fire when something is.

Symptom: an alert fires but the system is healthy, or the system is clearly broken and no alert fired. On-call is woken for nothing, or an incident is missed entirely.


Quick Diagnosis

Pattern                                     Likely cause
Alert fires during deploys only             Threshold too tight — catches transient deploy spike
Alert fires then self-resolves              No evaluation window — single data point triggering alert
Alert never fires despite visible errors    Threshold too high, wrong metric, or alert not routing correctly
Alert fires for one user, not all           Aggregate metric masking per-customer degradation
Alert storm — everything fires at once      Cascading dependency failure triggering every downstream alert

Likely Causes (ranked by frequency)

  1. Threshold set without a baseline — guessed value, not derived from historical data
  2. No evaluation window — alert fires on a single spike instead of sustained condition
  3. Wrong metric — measuring the wrong thing; latency alert on a queue-based system misses backlog
  4. Alert routes to wrong channel or is silenced
  5. Aggregate hides individual failure — p50 healthy while p99 is on fire (see the sketch below)
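
To make cause 5 concrete, here is a minimal Python sketch (hypothetical latency numbers, standard library only) showing how a healthy-looking p50 can coexist with a p99 far past any sane threshold: a small slice of slow requests barely moves the median but dominates the tail.

```python
import random
import statistics

random.seed(42)

# Hypothetical latency sample: 97% of requests are fast (~80 ms),
# 3% hit a degraded dependency and take 2-4 seconds.
latencies_ms = (
    [random.gauss(80, 10) for _ in range(9700)]
    + [random.uniform(2000, 4000) for _ in range(300)]
)

quantiles = statistics.quantiles(latencies_ms, n=100)
p50, p99 = quantiles[49], quantiles[98]

print(f"p50 = {p50:.0f} ms")   # looks healthy
print(f"p99 = {p99:.0f} ms")   # tail is on fire

# An alert on p50 never fires here, even though 3% of users
# are seeing multi-second responses.
```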

First Checks (fastest signal first)

  • Check the alert condition — what metric, what threshold, what evaluation window?
  • Check historical baseline — is the threshold above or below normal operating values? (a quick comparison is sketched after this list)
  • Confirm the alert is routing correctly — does it reach the right channel and the right person?
  • Check the evaluation window — is it 1 minute, 5 minutes? An alert that triggers on a single data point is noise
  • Check whether p99 and p50 diverge — aggregate metrics can hide tail degradation
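
One quick way to do the baseline check: pull the metric's recent history and compare the configured threshold against its normal operating percentiles. A minimal sketch, assuming hypothetical data and a hypothetical configured threshold (in practice, query 1-2 weeks of the metric from your metrics backend):

```python
import statistics

# Hypothetical excerpt of the metric's recent history (error rate, %).
history = [0.2, 0.3, 0.25, 0.8, 0.4, 1.6, 0.3, 0.5, 0.35, 0.9, 0.3, 0.4]
configured_threshold = 1.0  # the value currently in the alert rule

baseline_p99 = statistics.quantiles(history, n=100)[98]
peak = max(history)

if configured_threshold <= baseline_p99:
    print(f"Threshold {configured_threshold} is at or below the normal p99 "
          f"({baseline_p99:.2f}) -- expect constant false positives.")
elif configured_threshold > peak * 5:
    print(f"Threshold {configured_threshold} is far above the historical peak "
          f"({peak:.2f}) -- it may never fire.")
else:
    print("Threshold sits above normal operation but within reach of real incidents.")
```

The 5x-peak check is an arbitrary illustrative cutoff, not a standard rule; the point is to compare the threshold against what the metric actually does.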

Signal example: an error-rate alert fires every Tuesday morning. The deploy happens Tuesday at 9am, and during the rolling restart the error rate briefly spikes to 2%. The threshold is 1% and the evaluation window is 1 minute with no minimum duration, so the alert fires on a transient deploy spike.
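
The fix for that example is a minimum duration: require the condition to hold across consecutive evaluations before firing. A minimal sketch of the logic, using hypothetical per-minute samples that mimic the deploy spike (not any particular alerting engine's API):

```python
# Per-minute error-rate samples (%) around a 9:00 deploy: one transient
# spike to 2%, then back to normal.
samples = [0.3, 0.4, 2.0, 0.5, 0.3, 0.4, 0.3, 0.3]
THRESHOLD = 1.0    # %
MIN_DURATION = 5   # consecutive samples the condition must hold

def should_fire(samples, threshold, min_duration):
    """Fire only if the condition holds for min_duration consecutive samples."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_duration:
            return True
    return False

# Single-point evaluation: fires on the transient deploy spike.
print(any(v > THRESHOLD for v in samples))             # True
# With a 5-sample minimum duration: the spike is ignored.
print(should_fire(samples, THRESHOLD, MIN_DURATION))   # False
```

In Prometheus-style rules this corresponds to the `for` clause on an alerting rule; most alerting systems have an equivalent minimum-duration setting.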


Drill Paths

Suspect                                  Go to
Alert threshold and evaluation design    cs-fundamentals/observability-se
Metrics and SLO configuration            cloud/cloud-monitoring
Observability stack setup                cloud/observability-stack
Reducing alert fatigue                   observability/tracing

Fix Patterns

  • Set thresholds from historical data — use p99 of normal operating baseline plus a buffer, not a round number
  • Add a minimum duration — alert only if the condition persists for 5 minutes, not a single data point
  • Use p99 not p50 for latency alerts — p50 can be healthy while users are experiencing failures
  • Add a burn rate alert for SLOs — measures how fast you are consuming the error budget, not just the current error rate (the arithmetic is sketched after this list)
  • Silence alerts during deploys with a maintenance window — do not wake on-call for known transients
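
The burn-rate idea: alert on how fast the error budget is being spent rather than on the raw error rate. A burn rate of 1 means the budget is consumed exactly over the SLO window; anything higher exhausts it early. A minimal sketch of the arithmetic, with a hypothetical SLO target and observed error rate:

```python
SLO_TARGET = 0.999              # 99.9% success over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / ERROR_BUDGET

# Example: 0.5% of requests failing over the last hour.
rate = burn_rate(0.005)
print(f"burn rate = {rate:.1f}x")   # 5.0x -> budget gone in ~6 days, not 30

# A common pattern: page only on fast burn (e.g. >14x over 1h) and open a
# ticket on slow burn (e.g. >1x sustained over days).
if rate > 14:
    print("page: budget will be exhausted within hours")
elif rate > 1:
    print("ticket: burning faster than sustainable")
```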

When This Is Not the Issue

If the alert threshold and routing are correct but alerts are still noisy:

  • The system genuinely is degraded during those periods — investigate whether the transient is acceptable or should be fixed
  • Check whether the metric has a known collection lag — alerting on a metric that arrives late makes the alert fire and resolve behind real time (a freshness check is sketched below)
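
One way to confirm collection lag is to compare the timestamp of the newest datapoint with the current time; if the gap regularly exceeds the alert's evaluation window, the alert is reasoning about stale data. A minimal sketch with hypothetical datapoints (in practice, fetch the most recent samples of the alerting metric):

```python
import time

# Hypothetical (timestamp, value) pairs as returned by a metrics query.
datapoints = [(time.time() - 420, 0.4), (time.time() - 360, 0.5)]

EVALUATION_WINDOW_S = 60
newest_ts = max(ts for ts, _ in datapoints)
lag = time.time() - newest_ts

if lag > EVALUATION_WINDOW_S:
    print(f"Metric is {lag:.0f}s behind real time; "
          f"alerts evaluate stale data and fire/resolve late.")
```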

Pivot to cs-fundamentals/observability-se to redesign the alerting strategy around SLOs and error budgets rather than raw metric thresholds.


Connections

cs-fundamentals/observability-se · cloud/cloud-monitoring · cloud/observability-stack · observability/tracing

Open Questions

  • What has changed since this synthesis was written that would alter the conclusions?
  • What evidence would cause you to revise the key recommendation here?