Debugging Runbooks

Index of 32 production debugging runbooks organised by failure domain. Each runbook is a step-by-step guide for isolating and fixing a specific class of production failure.

These runbooks complement cs-fundamentals/debugging-systems, which covers the underlying methodology. Where that guide is general, the runbooks here are deliberately concrete and opinionated.


AI / LLM Issues

Problems specific to language model behaviour, RAG pipelines, and model quality.

Agent Issues

Failures in agentic loops and tool-calling systems.

Database

Query, schema, and transaction failures.

Infrastructure / Cloud

Compute, container, and scaling failures.

Network / Security

Connectivity, TLS, DNS, auth, and secrets failures.

APIs / Services

Upstream dependencies and inter-service communication.

Data / Cache

Stale or inconsistent data across layers.

Observability

When the signals you rely on are missing or wrong.

CI/CD

Build, test, and deployment pipeline failures.


How to use these runbooks

Each runbook follows the same structure:

  1. Symptom — what you observe
  2. Immediate checks — the first 2–3 things to look at before diving deeper
  3. Isolation steps — narrow the failure domain systematically
  4. Common causes — ranked by frequency
  5. Fix — concrete commands or code changes
  6. Verify — how to confirm the issue is resolved
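
As a rough illustration of the six-part structure above, a runbook could be modelled as a small record whose fields appear in the same order as the sections. This is only a sketch: the `Runbook` class, its field names, the `missing_sections` helper, and the sample draft content are all hypothetical and do not describe how the runbooks are actually stored.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    """Hypothetical record mirroring the six runbook sections, in order."""
    symptom: str                 # 1. what you observe
    immediate_checks: list[str]  # 2. first things to look at
    isolation_steps: list[str]   # 3. narrow the failure domain
    common_causes: list[str]     # 4. most frequent cause first
    fix: list[str]               # 5. commands or code changes
    verify: list[str]            # 6. how to confirm the fix

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def section_titles() -> list[str]:
    """Human-readable section titles, in runbook order."""
    return [f.name.replace("_", " ").capitalize() for f in fields(Runbook)]


# Illustrative draft with placeholder content (not taken from a real runbook).
draft = Runbook(
    symptom="p99 latency doubles after a deploy",
    immediate_checks=["recent deploys", "error-rate dashboard"],
    isolation_steps=[],
    common_causes=[],
    fix=[],
    verify=[],
)

print(section_titles())
print(draft.missing_sections())
```

A check like `missing_sections` is one way to flag a draft that skips a section before it joins the index. The type hints assume Python 3.9 or later.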

For the underlying methodology — correlation IDs, distributed tracing, structured log reading, hypothesis-driven debugging — see cs-fundamentals/debugging-systems.
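
To make the methodology concrete, here is a minimal sketch of reading structured logs by correlation ID, assuming one JSON object per log line and a field named `correlation_id`. Both are assumptions made for illustration; the field name, the sample log lines, and the `records_for_correlation_id` helper are hypothetical, and cs-fundamentals/debugging-systems may use different conventions.

```python
import json


def records_for_correlation_id(log_lines, correlation_id, key="correlation_id"):
    """Return parsed log records whose `key` field equals `correlation_id`.

    Lines that are not valid JSON objects are skipped rather than raising,
    since production logs often mix structured and unstructured output.
    """
    matches = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line, ignore it
        if isinstance(record, dict) and record.get(key) == correlation_id:
            matches.append(record)
    return matches


# Hypothetical sample log lines.
sample_logs = [
    '{"correlation_id": "req-1", "service": "gateway", "msg": "received"}',
    '{"correlation_id": "req-2", "service": "gateway", "msg": "received"}',
    'plain text line emitted by a startup script',
    '{"correlation_id": "req-1", "service": "orders", "msg": "timeout"}',
]

trail = records_for_correlation_id(sample_logs, "req-1")
print([r["service"] for r in trail])
```

Following one request across services this way is typically part of the isolation step, before any hypothesis about the cause is tested.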


Open Questions

  • Which runbooks are missing from the set? Candidates: debug-rate-limiting, debug-model-context-overflow, debug-vector-store-index-corrupt.
  • Should the AI/LLM runbooks be split into their own hub under synthesis/ai-debugging/?