Debug: CI Pipeline Failing
Runbook for diagnosing builds and tests that pass locally but fail in CI.
Symptom: Build or tests pass locally but fail in CI. Pipeline was passing, now failing without code changes. Intermittent failures blocking merges.
Quick Diagnosis
| Pattern | Likely cause |
|---|---|
| Fails in CI, always passes locally | Environment difference — OS, Node version, missing env var |
| Was passing, now failing without code change | Dependency version unpinned, external service changed |
| Fails intermittently | Flaky test, network call to external service, race condition |
| Fails only on one branch | Branch-specific env var or secret not configured |
| Fails after adding more tests | Parallelism conflict or resource exhaustion on CI runner |
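To separate the "fails intermittently" row from a consistent failure, it can help to rerun the same command a few times and count the failures. The following is a minimal sketch, not a prescribed tool: the function name `classify_failure` and the default of three runs are illustrative assumptions, and the command passed in is whatever the CI job actually executes.

```python
import subprocess
import sys

def classify_failure(cmd: list[str], runs: int = 3) -> str:
    """Run cmd `runs` times; return 'passing', 'consistent', or 'intermittent'.

    Hypothetical helper: a nonzero exit code counts as a failure.
    """
    failures = 0
    for _ in range(runs):
        # Each run is a fresh process, so no state leaks between attempts.
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    if failures == 0:
        return "passing"
    return "consistent" if failures == runs else "intermittent"

if __name__ == "__main__":
    # Usage: python classify.py <test command...>
    print(classify_failure(sys.argv[1:]))
```

A "consistent" result points toward environment or dependency causes; "intermittent" points toward flaky tests, network calls, or race conditions.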
Likely Causes (ranked by frequency)
- Missing environment variable or secret not configured in CI
- Unpinned dependency pulled a breaking version
- Flaky test — timing or shared state issue that CI exposes under load
- Different OS or runtime version between local and CI
- External service call in tests — rate limited or unavailable in CI environment
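For the unpinned-dependency cause, one quick check is to scan `requirements.txt` for entries that are not pinned to an exact version. This is a rough sketch under stated assumptions: the helper name `find_unpinned` is hypothetical, only `==` pins without wildcards count as pinned, and pip option lines (those starting with `-`) are skipped rather than analyzed.

```python
import re

# A requirement pinned to one exact version, e.g. "requests==2.31.0",
# optionally with extras ("pkg[extra]==1.0") and an environment marker.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^*\s;]+\s*(;.*)?$"
)

def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with an exact '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Any line this returns can resolve to a different version on the next CI run, which matches the "was passing, now failing without code change" pattern.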
First Checks (fastest signal first)
- Read the exact error in CI logs — do not assume it is the same failure as local; read it fresh
- Check whether required env vars are set in the CI environment: compare against local `.env`
- Check whether the failure is consistent or intermittent: run the job 3 times to confirm
- Check the dependency lock file: was `package-lock.json` or `requirements.txt` updated recently?
- Check whether the test makes any network calls: external calls are a common source of CI-specific failures
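The env var comparison above can be sketched as a small script that lists variable names defined in a `.env`-style file but absent from the current environment. The function names are hypothetical, and the parser is deliberately simple (plain `KEY=value` lines, optional `export ` prefix); it is an assumption that the project keeps such a file, or a committed template of it, where the CI job can read it.

```python
import os

def dotenv_keys(dotenv_text: str) -> set[str]:
    """Collect variable names from .env-style text (KEY=value lines)."""
    keys = set()
    for raw in dotenv_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys

def missing_in_env(dotenv_text: str, environ=None) -> list[str]:
    """Return names defined in the .env text but not set in the environment."""
    env = os.environ if environ is None else environ
    return sorted(k for k in dotenv_keys(dotenv_text) if k not in env)
```

Running `missing_in_env(open(".env").read())` as an early CI step prints only names, never values, so it should be safe to leave in the logs.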
Signal example: Tests fail in CI with connection refused on a database call — CI workflow does not have the Postgres service container configured; works locally because a local DB is running.
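A connection-refused failure like this one can be confirmed before the test suite starts by probing the port directly. The sketch below uses only the standard library; the helper name `service_reachable` is hypothetical, and the host and port to probe (for example `localhost` and Postgres's default port 5432) are assumptions about the workflow.

```python
import socket

def service_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False
```

If `service_reachable("localhost", 5432)` is False in CI but True locally, the service container is the likely gap rather than the test code.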
Drill Paths
| Suspect | Go to |
|---|---|
| Service containers in GitHub Actions | cloud/github-actions |
| Flaky test investigation | synthesis/debug-flaky-test |
| Dependency version pinning | python/npm-pnpm-ecosystem |
| Test environment differences | qa/test-environments |
| Secrets and env vars in CI | cloud/secrets-management |
Fix Patterns
- Pin all dependency versions in lock files and commit them; never rely on `latest` in CI
- Add service containers to the workflow for DB, Redis, or any other dependency tests need
- Mock all external HTTP calls in tests — never hit real external APIs in CI
- Match CI runtime versions exactly to production — specify Node, Python, Java versions explicitly in the workflow
- Cache dependencies between runs — reduces variance from network fetches and speeds up the pipeline
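As an illustration of the mocking pattern above, here is a minimal sketch using `unittest.mock` from the standard library. The function `fetch_status` and the URL are hypothetical stand-ins for real code under test; a project using a different HTTP client would patch that client's call instead.

```python
import json
import urllib.request
from unittest import mock

def fetch_status(url: str) -> str:
    """Hypothetical code under test: fetch JSON and return its 'status' field."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["status"]

def test_fetch_status_without_network():
    # Fake response object that works as a context manager and has .read().
    fake = mock.MagicMock()
    fake.__enter__.return_value = fake
    fake.read.return_value = b'{"status": "ok"}'

    # Replace urlopen for the duration of the test; no real request is made.
    with mock.patch("urllib.request.urlopen", return_value=fake) as patched:
        result = fetch_status("https://api.example.invalid/health")

    assert result == "ok"
    patched.assert_called_once()
```

Because the patch replaces the network call itself, the test behaves the same on a laptop and on a CI runner with no outbound access.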
When This Is Not the Issue
If the environment matches and dependencies are pinned but CI still fails:
- The test itself has an assumption that only holds locally — check for hardcoded file paths, absolute URLs, or local-only config
- Check CI runner resource limits: tests may be correct but time out on an underpowered runner
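One way to verify that the environment really matches, and to see how much CPU the runner offers, is to print a small fingerprint both locally and in CI and compare the two. This is a sketch only: the function names and the chosen fields are assumptions, and it reports the Python runtime rather than Node or Java versions.

```python
import os
import platform

def runtime_fingerprint() -> dict[str, str]:
    """Collect basic facts about the machine and runtime executing the tests."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "cpu_count": str(os.cpu_count()),
    }

def diff_fingerprints(local: dict[str, str], ci: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return only the keys whose values differ, as (local, ci) pairs."""
    keys = sorted(set(local) | set(ci))
    return {
        k: (local.get(k, ""), ci.get(k, ""))
        for k in keys
        if local.get(k) != ci.get(k)
    }

if __name__ == "__main__":
    # Print one key per line so the output is easy to diff between runs.
    for key, value in runtime_fingerprint().items():
        print(f"{key}={value}")
```

A much lower `cpu_count` on the runner than on a developer machine is a hint that timeouts, not logic errors, are behind the failure.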
Pivot to technical-qa/ci-cd-quality-gates to audit the pipeline configuration for missing gates and environment parity checks.
Connections
cloud/github-actions · qa/test-environments · synthesis/debug-flaky-test · technical-qa/ci-cd-quality-gates · cloud/secrets-management
Open Questions
- What has changed since this synthesis was written that would alter the conclusions?
- What evidence would cause you to revise the key recommendation here?