Debug: LLM High Latency
Runbook for diagnosing slow LLM responses, stalling streams, or high time-to-first-token.
Symptom: LLM responses are slow to start, the stream stalls mid-response, or total generation time exceeds the SLA, often after previously acceptable performance.
Quick Diagnosis
| Pattern | Likely cause |
|---|---|
| High time-to-first-token (TTFT) | Model overloaded, prompt cache miss, cold start |
| Stream starts fast then stalls | Output token limit too high, network congestion |
| Consistently slow on large inputs | Prompt too long — context processing overhead |
| Slow only at peak hours | Provider rate limiting or capacity pressure |
| Was fast, now slow after prompt change | Prompt cache invalidated by prefix change |
Likely Causes (ranked by frequency)
- Prompt cache miss — prefix changed, full context reprocessed on every call
- Provider capacity pressure at peak — API returning 529 or throttling
- Prompt too large — unnecessary context inflating input tokens
- Output `max_tokens` set too high — model generates to the limit even when done
- No streaming — waiting for the full response before returning anything to the user
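For the top cause, the prefix only has to differ by a few bytes to force full context reprocessing. A hypothetical before/after sketch, assuming the dynamic value can move out of the cached prefix and into the user turn:

```python
from datetime import datetime, timezone

# Cache-hostile: the timestamp changes on every call, so the cached prefix never matches.
system_prompt = f"You are a support assistant. Current time: {datetime.now(timezone.utc).isoformat()}"

# Cache-friendly: the system prompt is byte-identical across calls;
# anything dynamic goes into the (uncached) user turn instead.
system_prompt = "You are a support assistant."
user_turn = f"Current time: {datetime.now(timezone.utc).isoformat()}\n\nQuestion: ..."
```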
First Checks (fastest signal first)
- Check TTFT separately from total generation time — is the model slow to start or slow to finish?
- Check whether prompt cache is hitting — look for `cache_read_input_tokens` in the response
- Check whether streaming is enabled — if not, the user waits for the full completion before seeing anything
- Check prompt token count — is it growing unboundedly with conversation history?
- Check provider status page and error logs for 529 or rate limit responses
Signal example: TTFT spikes from 400ms to 4s after a system prompt change — the cache prefix no longer matches, so every call pays full context processing cost instead of the cached rate.
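A minimal sketch of the first two checks, assuming the Anthropic Python SDK; the model name, prompt contents, and the SYSTEM_PROMPT placeholder are illustrative:

```python
import time

import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # placeholder: your static system prompt, identical on every call

start = time.monotonic()
ttft = None

# Stream so time-to-first-token can be measured separately from total generation time.
with client.messages.stream(
    model="claude-sonnet-4-5",  # assumption: use whatever model you actually call
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # mark the static prefix as cacheable
    }],
    messages=[{"role": "user", "content": "Summarize the incident timeline."}],
) as stream:
    for text in stream.text_stream:
        if ttft is None:
            ttft = time.monotonic() - start
    final = stream.get_final_message()

total = time.monotonic() - start
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
# Zero cache reads on repeated calls means the prefix is not matching the cache.
print("cache_read_input_tokens:", getattr(final.usage, "cache_read_input_tokens", 0))
print("input_tokens:", final.usage.input_tokens)
```

If TTFT is small but total time is large, the problem is generation length or a mid-stream stall rather than context processing.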
Drill Paths
| Suspect | Go to |
|---|---|
| Prompt cache not hitting | apis/anthropic-api |
| Prompt growing too large | prompting/techniques |
| Tracing latency per step | observability/tracing |
| Rate limiting and retry strategy | cs-fundamentals/error-handling-patterns |
| Streaming implementation | web-frameworks/fastapi |
Fix Patterns
- Enable streaming — user sees first token in <1s even if total generation takes 10s
- Pin the cache prefix — keep system prompt and static context identical across calls; only the user turn changes
- Truncate conversation history — keep last N turns only; do not pass full history on every call
- Set `max_tokens` to a realistic ceiling for the task — not the model maximum
- Add a timeout on the stream — detect stalls and retry rather than waiting indefinitely
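A sketch combining the pinned prefix, history truncation, a realistic `max_tokens`, and a stall timeout, again assuming the Anthropic Python SDK; the helper names, turn limit, and timeout values are illustrative assumptions:

```python
import anthropic
import httpx

SYSTEM_PROMPT = "..."   # placeholder: static prefix, byte-identical across calls
MAX_HISTORY_TURNS = 6   # assumption: keep only the most recent turns

client = anthropic.Anthropic(
    # If any read (including a stream chunk) takes longer than 30s, raise instead of hanging.
    timeout=httpx.Timeout(connect=5.0, read=30.0, write=10.0, pool=5.0),
)

def build_messages(history: list[dict], user_input: str) -> list[dict]:
    """Truncate history so the prompt does not grow without bound."""
    return history[-MAX_HISTORY_TURNS:] + [{"role": "user", "content": user_input}]

def ask(history: list[dict], user_input: str) -> str:
    chunks = []
    with client.messages.stream(
        model="claude-sonnet-4-5",   # assumption
        max_tokens=512,              # realistic ceiling for this task, not the model maximum
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,   # unchanged prefix, so the cache can hit
            "cache_control": {"type": "ephemeral"},
        }],
        messages=build_messages(history, user_input),
    ) as stream:
        for text in stream.text_stream:
            chunks.append(text)      # forward each chunk to the user as it arrives
    return "".join(chunks)
```

A retry wrapper around `ask` (see cs-fundamentals/error-handling-patterns) can then re-issue the call when the timeout fires.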
When This Is Not the Issue
If TTFT is fast but the user experience still feels slow:
- The bottleneck is downstream of the model — check how your application processes and forwards the stream
- Check whether you are buffering the full response before sending it to the client (see the sketch below)
- Check network latency between your server and the provider endpoint
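A common culprit is an endpoint that collects the entire completion and only then returns it. A minimal sketch of forwarding chunks as they arrive, assuming FastAPI and the Anthropic Python SDK; the route and model name are illustrative:

```python
import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.Anthropic()

@app.post("/chat")
def chat(prompt: str):
    def token_stream():
        # Yield each chunk to the client immediately instead of buffering the full response.
        with client.messages.stream(
            model="claude-sonnet-4-5",  # assumption
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            yield from stream.text_stream

    return StreamingResponse(token_stream(), media_type="text/plain")
```

Note that a reverse proxy in front of the app can still buffer the stream; check its configuration as well.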
Pivot to observability/langfuse to break down latency per component across the full call chain.
Connections
apis/anthropic-api · observability/tracing · observability/langfuse · prompting/techniques · cs-fundamentals/error-handling-patterns
Open Questions
- What has changed since this synthesis was written that would alter the conclusions?
- What evidence would cause you to revise the key recommendation here?