RAG — Retrieval-Augmented Generation

Production RAG pipeline — hybrid BM25+dense retrieval, Cohere reranking (10-25% precision gain), RAGAS evaluation (faithfulness >0.9, context precision >0.8), and when GraphRAG beats standard retrieval.

The production-proven pattern for grounding LLMs in external knowledge without fine-tuning. The LLM is given retrieved context at query time rather than having knowledge baked in at training time.

[Source: Perplexity research, 2026-04-29]


Why RAG Beats Fine-Tuning for Most Use Cases

Fine-tuning bakes knowledge into weights: expensive, slow to update, and opaque. RAG keeps knowledge in a retrievable store: cheap to update, inspectable, and citable. 57% of orgs that build AI systems don't fine-tune at all. RAG is the first thing to reach for when the problem is "the model doesn't know X."

RAG is the right choice when:

  • Knowledge changes frequently (product docs, code, pricing)
  • You need citations and verifiability
  • You have < $10K compute budget
  • Domain knowledge is proprietary

Fine-tuning wins when:

  • You need a specific style or format not achievable with prompting
  • Inference latency matters more than update frequency
  • You're building a specialized task model (code completion, legal classification)

See fine-tuning/decision-framework for the full decision criteria.


The RAG Pipeline

Query
  ↓
[Query Processing]   embed, expand, rewrite
  ↓
[Retrieval]          BM25 + dense vector search (hybrid)
  ↓
[Reranking]          Cohere Rerank / cross-encoder
  ↓
[Context Assembly]   top-k chunks → prompt
  ↓
[Generation]         LLM with retrieved context
  ↓
Answer + Citations
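
In code, the stages map to a thin orchestration function. The sketch below is illustrative: search, rerank, and generate are placeholder callables for whichever hybrid retriever, cross-encoder, and LLM client you actually wire in.

```python
from typing import Callable

def answer(
    query: str,
    search: Callable[[str, int], list[dict]],         # hybrid BM25 + dense retriever
    rerank: Callable[[str, list[dict]], list[dict]],  # cross-encoder reranker
    generate: Callable[[str, str], str],              # LLM call: (system, user) -> text
    k: int = 5,
) -> dict:
    """Retrieve, rerank, assemble context, and generate a grounded answer."""
    candidates = search(query, 20)                    # retrieval: fetch top-20 candidates
    top = rerank(query, candidates)[:k]               # reranking: keep the best k
    context = "\n\n".join(c["text"] for c in top)     # context assembly
    completion = generate(
        "Answer only from the provided context. If it is not there, say so.",
        f"Context:\n{context}\n\nQuestion: {query}",
    )
    return {"answer": completion, "citations": [c.get("source") for c in top]}
```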

Chunking

How you split documents determines retrieval quality more than the retrieval algorithm.

Strategy | Description | Accuracy | When to use
Recursive / fixed-size | Split at 512 tokens, 10–20% overlap | ~69% | Default; works well for prose
Semantic | Split at topic boundaries (sentence embeddings) | Better for complex docs | Technical docs, long-form content
Metadata-aware | Preserve headers, code blocks, tables as atomic units | High for structured content | Codebases, API docs, spreadsheets
Late chunking | Embed the full document first, then chunk in embedding space | Best for long-doc retrieval | Research papers, books

512 tokens with 10–20% overlap is the production default for most use cases. Smaller chunks (128–256) improve precision; larger chunks (1,024+) improve recall but add noise.
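
A minimal sketch of that default, assuming LangChain's RecursiveCharacterTextSplitter with token-based lengths (the import path varies across LangChain versions, and the file name is illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # measure chunk length in tokens, not characters
    chunk_size=512,               # the production default
    chunk_overlap=64,             # ~12% overlap, inside the 10-20% band
)

with open("docs/handbook.md") as f:   # hypothetical source document
    chunks = splitter.split_text(f.read())
```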

Parent document retrieval (retrieve small chunks for precision, expand to parent chunk for context) is a common trick to get both.

Data as a System — the pipeline behind RAG: data freshness SLAs, lineage tracking, contracts between producer and retrieval layer, and what happens when embeddings go stale.


Embedding Models

Model | MTEB Score | Notes
Cohere embed-v4 | 65.2 | Best overall; multilingual; supports binary quantisation
OpenAI text-embedding-3-large | 64.6 | Widely used; good multilingual
BGE-M3 | 63.0 | Open-source; runs locally; best open model
fastembed | ~62.0 | Local, fast; good for dev/CI

For most production systems: Cohere embed-v4 if you want managed, BGE-M3 if you need self-hosted or zero-cost.
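
For the self-hosted path, a sketch with BGE-M3 through sentence-transformers (model ID and sample texts are illustrative; the managed options are a single API call instead). Whatever you choose, embed queries and documents with the same model and normalisation; see the failure cases at the end of this note.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)   # unit vectors: dot product == cosine
query_vec = model.encode("how long do refunds take?", normalize_embeddings=True)

scores = doc_vecs @ query_vec   # cosine similarity of the query against each chunk
```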


Hybrid Retrieval

  • BM25 (lexical) — keyword overlap, exact matches, handles rare/proper nouns well.
  • Dense vector search — semantic similarity via embeddings, handles paraphrasing and synonyms.
  • Hybrid — combine the BM25 and dense rankings with reciprocal rank fusion (RRF), sketched below.

Hybrid is the production default. BM25 alone misses semantic variations; dense alone misses exact-match keywords. Hybrid outperforms either alone on most benchmarks.
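
RRF itself is only a few lines: it fuses by rank position rather than raw score, so BM25 and cosine similarity never need to be put on a common scale. A sketch:

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs (best first); k=60 is the conventional constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse([bm25_ids, dense_ids])[:20]  -> candidates for the reranker
```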

Vector store options: infra/vector-stores. Pgvector (Postgres-native, easiest), Chroma (local dev), Qdrant (production self-hosted), Pinecone (fully managed).


Reranking

The single biggest precision lever in a RAG pipeline. After retrieval, pass top-20 candidates through a cross-encoder reranker and keep top-5.

Reranker | Notes
Cohere Rerank v4.0 Pro | Best quality; 10–25% precision gain; API
Jina Reranker v3 | Open-source option; good quality
BGE Reranker | Local, no API cost

Reranking adds ~200ms latency for most workloads. Worth it unless latency is the primary constraint.
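
A sketch of the local, no-API-cost route, treating BGE Reranker as a cross-encoder via sentence-transformers (model ID is illustrative); the Cohere option replaces this with a single rerank API call.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep the best few.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# top5 = rerank(query, top20_from_hybrid_search)
```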


GraphRAG

For queries requiring multi-hop reasoning across entities and relationships ("how does X relate to Y?"), graph-based retrieval outperforms naive chunk retrieval.

Full GraphRAG (Microsoft):

  1. LLM extracts entities and relationships from all documents → knowledge graph
  2. At query time, traverse the graph to find relevant communities and relationships
  3. Summarise community reports → answer

Cost: very high (many LLM calls for graph construction). Use when complex cross-document reasoning is the primary use case.

LazyGraphRAG (Microsoft, 2024):

  • Builds minimal graph at index time; constructs community reports lazily at query time
  • 0.1% of the cost of full GraphRAG
  • 70–80% of the quality on most benchmarks

For most use cases: start with hybrid retrieval + reranking. Add LazyGraphRAG if complex multi-hop queries are failing.


Agentic RAG

Rather than a static retrieve-once pipeline, agentic RAG lets the LLM:

  • Issue multiple retrieval queries
  • Decide when it has enough context
  • Reformulate queries when results are poor
  • Synthesise across retrieved sets

Implemented as an agents/langgraph graph node or as a tool the agent calls. The agent loop typically runs 2–4 retrieval iterations before answering.
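
A sketch of the bounded loop; every callable is a placeholder for your own retriever and LLM calls, and the cap mirrors the 2–4 iteration guideline (see the failure case on non-converging loops below).

```python
from typing import Callable

def agentic_answer(
    query: str,
    retrieve: Callable[[str], list[str]],              # one retrieval pass
    is_sufficient: Callable[[str, list[str]], bool],   # LLM check: enough context yet?
    reformulate: Callable[[str, list[str]], str],      # rewrite the query given what's missing
    generate: Callable[[str, list[str]], str],         # final grounded generation
    max_iters: int = 4,
) -> str:
    context: list[str] = []
    q = query
    for _ in range(max_iters):
        context.extend(retrieve(q))
        if is_sufficient(query, context):
            return generate(query, context)
        q = reformulate(query, context)                # results were poor: try a new query
    return "I don't have enough information to answer that."
```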


Evaluation with RAGAS

RAGAS is the standard evaluation framework for RAG pipelines. Four metrics:

Metric | Measures
Faithfulness | Does the answer stick to the retrieved context? (no hallucination)
Answer Relevancy | Is the answer actually relevant to the question?
Context Precision | Are the retrieved chunks relevant?
Context Recall | Did retrieval find all necessary information?

Run RAGAS on a golden set of question/answer/context triples. Target: faithfulness > 0.9, context precision > 0.8 before production.
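
A sketch of that gate with the ragas library (this follows the ragas 0.1.x API, which has shifted between releases; the golden-set rows are illustrative, and evaluate needs an LLM judge configured, OpenAI by default).

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

golden = Dataset.from_dict({
    "question":     ["How long do refunds take?"],
    "answer":       ["Refunds are processed within 5 business days."],
    "contexts":     [["Refunds are processed within 5 business days."]],
    "ground_truth": ["Refunds take up to 5 business days."],
})

scores = evaluate(golden, metrics=[faithfulness, answer_relevancy,
                                   context_precision, context_recall])
print(scores)   # gate deploys on faithfulness > 0.9 and context_precision > 0.8
```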

See evals/methodology for how RAG evaluation fits into the broader eval strategy.


Common Failure Modes

Failure | Cause | Fix
LLM contradicts retrieved context | System prompt lacks a grounding instruction | Add explicit "answer only from context" instruction
Good chunks retrieved but wrong answer | Chunking loses cross-chunk logic | Parent doc retrieval, larger chunks
Correct knowledge exists but not retrieved | Low recall | Hybrid search, query expansion
Top-k includes noise | No or weak reranking | Add reranker; reduce top-k
Answers contain hallucinations | Model fills gaps | Enable citations; check faithfulness

Key Facts

  • RAG vs fine-tuning default: 57% of orgs don't fine-tune; reach for RAG first unless you need style/format changes
  • Chunking default: 512 tokens + 10-20% overlap; LangChain's RecursiveCharacterTextSplitter at 512 tokens scored 69% end-to-end accuracy — highest of 7 strategies tested [Source: FloTorch 2026 benchmark via Vecta, Feb 2026]
  • Reranking gain: 10-25% NDCG improvement; Cohere Rerank v4.0 Pro is the default production choice
  • RAGAS targets before production: faithfulness >0.9, context precision >0.8
  • Agentic RAG: typically 2-4 retrieval iterations; LangGraph tool node or standalone tool
  • LazyGraphRAG: 0.1% of full GraphRAG cost; add it when multi-hop synthesis queries are failing

Common Failure Cases

Retrieval returns irrelevant chunks despite correct query
Why: embedding model mismatch. Query and documents were embedded with different models or different normalisation.
Detect: RAGAS context precision drops below 0.6; manual inspection shows chunks semantically unrelated to query.
Fix: re-embed all documents with the same model used for queries; verify cosine similarity scores are in expected range.

LLM answer contradicts retrieved context
Why: system prompt lacks an explicit grounding instruction, so the model blends parametric knowledge with retrieved text.
Detect: RAGAS faithfulness below 0.85; answers contain claims not present in any retrieved chunk.
Fix: add "Answer only from the provided context. If the context does not contain the answer, say so." to the system prompt.

Reranker makes results worse
Why: reranker was trained on a domain different from your corpus, or top-20 candidates fed to it contain too much noise.
Detect: context precision drops after adding reranker; run RAGAS with and without reranking on the same query set.
Fix: evaluate domain-matched reranker (BGE Reranker for generic, Cohere for multilingual); widen initial retrieval to top-30 before reranking.

Embeddings go stale after knowledge base update
Why: documents were re-chunked or metadata changed without re-embedding; stale vectors no longer map to current content.
Detect: retrieval returns chunks whose stored content doesn't match what's in the source document; freshness checks fail.
Fix: trigger a re-index pipeline on every document update; use content hash to detect changed documents.
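
The hash check is a few lines; a sketch, assuming the hash is stored alongside each document's vectors at index time.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(doc_text: str, stored_hash: str | None) -> bool:
    # Never indexed, or source content changed since the vectors were written.
    return stored_hash is None or content_hash(doc_text) != stored_hash
```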

Token budget exceeded assembling top-k chunks
Why: chunk size × k exceeds context window; common when chunks are large (1,024+ tokens) and k=10.
Detect: 413 errors or silent truncation; generation quality drops because context is cut mid-sentence.
Fix: reduce chunk size or k; use a context assembly step that trims to fit the budget rather than hard-cutting.
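
A sketch of budget-aware assembly, assuming tiktoken for counting and chunks arriving best-first from the reranker.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assemble_context(chunks: list[str], budget: int = 3000) -> str:
    kept, used = [], 0
    for chunk in chunks:                    # best-first order from the reranker
        n = len(enc.encode(chunk))
        if used + n > budget:
            break                           # drop whole chunks rather than cutting one mid-sentence
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```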

Agentic RAG loops without converging
Why: the agent re-queries on every turn because retrieval never fully satisfies the stopping condition.
Detect: trace shows >4 retrieval iterations on a single query; token cost spikes on that query class.
Fix: add an explicit "sufficient context" check node; cap iterations at 4 and fall back to "I don't have enough information."

Connections

Open Questions

  • When agentic RAG runs 2-4 retrieval iterations, how does cost compare to GraphRAG for the same query?
  • What is the practical faithfulness ceiling for RAG systems on adversarial or ambiguous queries?
  • Does the RAG vs fine-tuning decision change as inference cost drops toward zero?