Debug: Deadlock
Runbook for diagnosing DB or code deadlocks where requests hang indefinitely waiting for locks.
Symptom: Requests hang indefinitely. DB raises deadlock errors. Transactions rolled back unexpectedly. Service unresponsive under concurrent load.
Quick Diagnosis
| Pattern | Likely cause |
|---|---|
| DB logs show explicit deadlock errors | Two transactions acquiring locks in opposite order |
| Requests hang but no DB error | Application-level lock or async deadlock |
| Only happens under concurrent load | Race condition — single-threaded tests would not catch it |
| Specific endpoint always hangs | That endpoint's transaction locks rows another concurrent transaction needs |
| Hang resolves after timeout | A lock wait timeout is set, or the DB's deadlock detector is aborting one of the transactions |
Likely Causes (ranked by frequency)
- Two transactions acquiring the same rows in opposite order
- Long-running transaction holding locks while doing slow work (API call, file write)
- Application-level lock (mutex, semaphore) acquired but never released on exception path
- ORM loading related objects inside a transaction, triggering additional locks
- Async task awaiting a result that is itself waiting for the first task
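The third cause above (a lock that is never released on an exception path) can be illustrated with a minimal sketch. The function names `risky_update`, `leaky`, `safe`, and `lock_leaked_after` are hypothetical and exist only for this illustration; the point is the difference between a manual `acquire()`/`release()` pair and a `with` block.

```python
import threading

lock = threading.Lock()

def risky_update(fail: bool) -> None:
    """Hypothetical critical section that may raise."""
    if fail:
        raise ValueError("simulated failure inside the critical section")

def leaky(fail: bool) -> None:
    # Bug: if risky_update raises, release() is never reached,
    # so every later caller blocks forever on acquire().
    lock.acquire()
    risky_update(fail)
    lock.release()

def safe(fail: bool) -> None:
    # The context manager releases the lock on normal exit and on exception.
    with lock:
        risky_update(fail)

def lock_leaked_after(fn) -> bool:
    """Call fn with a forced failure and report whether the lock stayed held."""
    try:
        fn(True)
    except ValueError:
        pass
    leaked = lock.locked()
    if leaked:
        lock.release()  # reset so the demonstration can continue
    return leaked
```

Calling `lock_leaked_after(leaky)` reports a held lock, while `lock_leaked_after(safe)` does not. When auditing real code, any `acquire()` without a matching `try`/`finally` or `with` is worth a closer look.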
First Checks (fastest signal first)
- Check DB logs for deadlock errors — Postgres logs the exact transactions and rows involved
- Run `SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock'` to list the sessions that are blocked; call `pg_blocking_pids(pid)` on a blocked session to see which sessions hold the lock it is waiting for
- Check transaction scope: are transactions wrapping more than just DB operations?
- Check lock acquisition order across all code paths — do any two paths lock the same resources in reverse order?
- Check for async circular waits — task A awaits task B which awaits task A
Signal example: Order creation deadlocks under load — transaction 1 locks users then orders; transaction 2 locks orders then users; each waits for the other to release, neither proceeds.
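The opposite-order pattern in the signal example can be reproduced without a database. The sketch below is an analogy only: two in-process `threading.Lock` objects stand in for the row locks on `users` and `orders`, and the function name `reproduce_opposite_order` is hypothetical. A timeout on the second acquisition replaces the indefinite hang so the script terminates.

```python
import threading

def reproduce_opposite_order(timeout: float = 0.2) -> dict:
    """Two workers take the same two locks in opposite order."""
    users, orders = threading.Lock(), threading.Lock()
    both_hold_first = threading.Barrier(2)    # both workers hold their first lock
    both_tried_second = threading.Barrier(2)  # both workers finished their attempt
    results = {}

    def txn(name, first, second):
        with first:
            both_hold_first.wait()
            # Without the timeout this call would block forever: a deadlock.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep the first lock until the other worker has also given up.
            both_tried_second.wait()
            if got:
                second.release()

    t1 = threading.Thread(target=txn, args=("txn1", users, orders))
    t2 = threading.Thread(target=txn, args=("txn2", orders, users))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Both entries in the returned dictionary are `False`: each worker held the lock the other one needed. The same shape in application code or SQL is what to look for when comparing lock order across code paths.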
Drill Paths
| Suspect | Go to |
|---|---|
| DB transaction isolation and locking | cs-fundamentals/database-transactions |
| Async deadlock in Python | python/nodejs-async |
| Concurrency primitives and race conditions | cs-fundamentals/concurrency |
| ORM generating unexpected lock queries | python/sqlalchemy |
| Distributed lock patterns | cs-fundamentals/distributed-systems |
Fix Patterns
- Enforce consistent lock acquisition order — always lock resource A before resource B across all code paths
- Shorten transaction scope — commit before doing slow work, not after; never hold a DB lock during an HTTP call
- Use `SELECT ... FOR UPDATE SKIP LOCKED` for queue-like patterns: avoids contention on the same rows
- Set a lock timeout: `SET lock_timeout = '5s'` makes a blocked statement fail fast rather than hang indefinitely
- Use optimistic locking for low-contention writes: check a version column instead of locking rows
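The optimistic-locking pattern from the list above can be sketched as follows. This is a minimal illustration, not a production implementation: it uses an in-memory SQLite database only so the example runs standalone, and the `orders` table, its `version` column, and the helpers `make_db` and `update_status` are all hypothetical names. The same `UPDATE ... WHERE version = ?` shape applies to other SQL databases.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    """Create a throwaway database with one order at version 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, status TEXT, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO orders (id, status, version) VALUES (1, 'new', 1)")
    conn.commit()
    return conn

def update_status(conn: sqlite3.Connection, order_id: int,
                  new_status: str, expected_version: int) -> bool:
    """Apply the write only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE orders SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, order_id, expected_version),
    )
    conn.commit()
    # rowcount is 0 when the version moved on: the caller should re-read and retry.
    return cur.rowcount == 1
```

If two writers both read the row at version 1, the first `update_status(conn, 1, "paid", 1)` succeeds and bumps the version to 2; the second writer's call with `expected_version=1` returns `False` instead of waiting on a lock, and it can re-read and decide whether to retry.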
When This Is Not the Issue
If there are no DB lock errors and pg_stat_activity shows nothing blocked:
- The hang is not a DB deadlock — check application-level locks, thread pools, or async event loops
- Check connection pool exhaustion — all connections may be in use, not deadlocked
Pivot to synthesis/debug-api-timeout to check whether the hang is a timeout or pool exhaustion rather than a true deadlock.
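When the database looks clean and an async circular wait is suspected, wrapping the suspect awaits in a timeout turns a silent hang into a visible signal. The sketch below assumes Python's `asyncio`; the coroutine `detects_circular_wait` and the two inner tasks are hypothetical and exist only to show the shape of the problem.

```python
import asyncio

async def detects_circular_wait(timeout: float = 0.2) -> bool:
    """Return True if two mutually dependent tasks fail to finish in time."""
    loop = asyncio.get_running_loop()
    a_done = loop.create_future()
    b_done = loop.create_future()

    async def task_a():
        await b_done             # A needs B's result first
        a_done.set_result("a")

    async def task_b():
        await a_done             # B needs A's result first
        b_done.set_result("b")

    try:
        # wait_for cancels both tasks once the timeout expires.
        await asyncio.wait_for(asyncio.gather(task_a(), task_b()), timeout)
        return False
    except asyncio.TimeoutError:
        return True
```

Running `asyncio.run(detects_circular_wait())` returns `True` here because neither task can make progress. In a real service, a timeout like this only confirms that something is stuck; the await chain still has to be traced to find the cycle.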
Connections
cs-fundamentals/database-transactions · cs-fundamentals/concurrency · cs-fundamentals/distributed-systems · python/sqlalchemy · synthesis/debug-api-timeout
Open Questions
- What has changed since this synthesis was written that would alter the conclusions?
- What evidence would cause you to revise the key recommendation here?
Related reading