Debug: RAG Returning Wrong Context

Runbook for diagnosing RAG pipelines returning irrelevant, incomplete, or hallucinated answers.

Symptom: LLM answers are wrong, hallucinated, or too generic despite relevant documents existing in the knowledge base. Retrieved chunks do not match the question.


Quick Diagnosis

Pattern → Likely cause
Answer is hallucinated despite docs existing → Retrieval is not finding the right chunks
Answer is partially right but incomplete → Chunks are too small or split across a boundary
Answer degrades over time → Index is stale — new documents not ingested
Answer correct for simple queries, wrong for complex → Embedding model cannot handle multi-concept queries
Answer ignores retrieved context entirely → Context window too full — retrieved chunks are lost in the middle

Likely Causes (ranked by frequency)

  1. Retrieval returning irrelevant chunks — embedding similarity is not matching intent
  2. Chunk boundaries splitting key information across two chunks
  3. Index stale — documents updated but embeddings not regenerated
  4. Retrieved context too long — model ignores middle chunks
  5. Wrong retrieval strategy — dense-only when hybrid (BM25 + dense) would work better

First Checks (fastest signal first)

  • Log the retrieved chunks — are the right documents actually being returned before the LLM sees them? (see the logging sketch below)
  • Check retrieval score thresholds — are low-similarity chunks being passed through?
  • Confirm index freshness — when was the last embed and index run?
  • Check chunk size — are answers split across chunk boundaries?
  • Check context window usage — are retrieved chunks being truncated before reaching the model?

Signal example: the LLM says "I don't have information on X" even though the document exists — the logged retrieved chunks show that the top 3 results are all unrelated, with similarity scores below 0.5 on every one.
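
A minimal sketch of the first two checks, assuming a retriever object whose `search(query, k)` call returns `(chunk_text, similarity_score)` pairs; that method name, the 0.5 cutoff, and the logger name are illustrative assumptions, not any particular library's API.

```python
import logging

logger = logging.getLogger("rag.retrieval")

SCORE_THRESHOLD = 0.5  # assumed cutoff; tune it against your own eval queries

def retrieve_with_logging(retriever, query: str, k: int = 5):
    """Log what the LLM is about to see, so retrieval failures are visible."""
    # Assumption: retriever.search returns a list of (chunk_text, similarity_score) pairs.
    results = retriever.search(query, k=k)

    for rank, (chunk, score) in enumerate(results, start=1):
        logger.info("query=%r rank=%d score=%.3f chunk=%r", query, rank, score, chunk[:120])

    kept = [(chunk, score) for chunk, score in results if score >= SCORE_THRESHOLD]
    if not kept:
        # The signal described above: nothing clears the threshold, so the
        # failure is in retrieval, not generation.
        logger.warning("query=%r returned no chunks above %.2f", query, SCORE_THRESHOLD)
    return kept
```

Logging the chunk prefix alongside the score makes the "top 3 results are all unrelated" signal visible in production without rerunning the query by hand.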


Drill Paths

Suspect → Go to
Retrieval not finding the right documents → rag/embeddings
Chunk boundaries breaking coherent answers → rag/chunking
Scores too low across all queries → rag/pipeline
Want to add reranking to improve precision → rag/pipeline
Evaluating whether retrieval is actually working → evals/methodology

Fix Patterns

  • Add reranking after retrieval — single biggest precision gain; use Cohere Rerank or Jina Reranker v3 (sketched after this list)
  • Switch to hybrid retrieval (BM25 + dense) — keyword matching catches what semantic search misses (sketched after this list)
  • Increase chunk overlap — prevents key information from being split at boundaries
  • Lower similarity threshold cautiously — too low returns noise; too high returns nothing
  • Log retrieved chunks on every query in production — invisible retrieval failures are the most common RAG bug
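
A minimal sketch of the reranking step. The runbook names Cohere Rerank and Jina as options; to keep this self-contained, the sketch uses a local cross-encoder from sentence-transformers as a stand-in for whichever reranker you choose, so the model name and the retrieve-broad-then-keep-5 numbers are assumptions, not recommendations.

```python
from sentence_transformers import CrossEncoder

# Stand-in reranker; swap in Cohere Rerank, Jina, or any other scoring API.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly and keep only the best top_n.

    Retrieval can stay broad (say, top 30 by similarity); the reranker
    restores precision before the chunks reach the prompt."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```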
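
A minimal sketch of the hybrid fix, merging a BM25 ranking with a dense ranking via reciprocal rank fusion; the doc IDs are hypothetical, the input rankings are assumed to be lists of IDs already ordered by each retriever, and k=60 is just the conventional RRF constant, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs, rewarding documents that appear
    near the top of any list; k=60 is the commonly used RRF constant."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse the keyword (BM25) ranking with the dense ranking.
bm25_ranking = ["doc-14", "doc-3", "doc-27"]
dense_ranking = ["doc-3", "doc-41", "doc-14"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# doc-3 and doc-14 end up on top because both retrievers agree on them.
```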

When This Is Not the Issue

If retrieved chunks are correct and relevant but the answer is still wrong:

  • The problem is in generation, not retrieval
  • Check whether the prompt instructs the model to use only the provided context
  • Check for lost-in-the-middle failure — relevant chunk may be retrieved but buried in a long context window

Pivot to prompting/techniques to tighten the prompt's instruction to ground answers in retrieved context only.
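
A minimal sketch of that grounding instruction, assuming the retrieved chunks are passed in as plain strings; the exact wording and the function name are illustrative, not a prescribed prompt.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context
    and gives it an explicit way to decline when the context is insufficient."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Placing the most relevant chunk first (or last) in the assembled context also helps with the lost-in-the-middle failure mentioned above.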


Connections

rag/pipeline · rag/chunking · rag/embeddings · evals/methodology · prompting/techniques · llms/hallucination

Open Questions

  • What has changed since this synthesis was written that would alter the conclusions?
  • What evidence would cause you to revise the key recommendation here?