RAG — Retrieval-Augmented Generation

Production RAG pipeline — hybrid BM25+dense retrieval, Cohere reranking (10-25% precision gain), RAGAS evaluation (faithfulness >0.9, context precision >0.8), and when GraphRAG beats standard retrieval.

The production-proven pattern for grounding LLMs in external knowledge without fine-tuning. The LLM is given retrieved context at query time rather than having knowledge baked in at training time.

[Source: Perplexity research, 2026-04-29]


Why RAG Beats Fine-Tuning for Most Use Cases

Fine-tuning bakes knowledge into weights: expensive, slow to update, and opaque. RAG keeps knowledge in a retrievable store: cheap to update, inspectable, and citable. 57% of orgs that build AI systems don't fine-tune at all. RAG is the first thing to reach for when the problem is "the model doesn't know X."

RAG is the right choice when:

  • Knowledge changes frequently (product docs, code, pricing)
  • You need citations and verifiability
  • You have < $10K compute budget
  • Domain knowledge is proprietary

Fine-tuning wins when:

  • You need a specific style or format not achievable with prompting
  • Inference latency matters more than update frequency
  • You're building a specialized task model (code completion, legal classification)

See fine-tuning/decision-framework for the full decision criteria.


The RAG Pipeline

Query
  ↓
[Query Processing]   embed, expand, rewrite
  ↓
[Retrieval]          BM25 + dense vector search (hybrid)
  ↓
[Reranking]          Cohere Rerank / cross-encoder
  ↓
[Context Assembly]   top-k chunks → prompt
  ↓
[Generation]         LLM with retrieved context
  ↓
Answer + Citations
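
In code, the stages map to a thin orchestration function. The sketch below is illustrative: search, rerank, and generate are placeholder callables for whichever hybrid retriever, cross-encoder, and LLM client you actually wire in.

```python
from typing import Callable

def answer(
    query: str,
    search: Callable[[str, int], list[dict]],         # hybrid BM25 + dense retriever
    rerank: Callable[[str, list[dict]], list[dict]],  # cross-encoder reranker
    generate: Callable[[str, str], str],              # LLM call: (system, user) -> text
    k: int = 5,
) -> dict:
    """Retrieve, rerank, assemble context, and generate a grounded answer."""
    candidates = search(query, 20)                    # retrieval: fetch top-20 candidates
    top = rerank(query, candidates)[:k]               # reranking: keep the best k
    context = "\n\n".join(c["text"] for c in top)     # context assembly
    completion = generate(
        "Answer only from the provided context. If it is not there, say so.",
        f"Context:\n{context}\n\nQuestion: {query}",
    )
    return {"answer": completion, "citations": [c.get("source") for c in top]}
```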

Chunking

How you split documents determines retrieval quality more than the retrieval algorithm.

Strategy | Description | Accuracy | When to use
Recursive / fixed-size | Split at 512 tokens, 10–20% overlap | ~69% | Default; works well for prose
Semantic | Split at topic boundaries (sentence embeddings) | Better for complex docs | Technical docs, long-form content
Metadata-aware | Preserve headers, code blocks, tables as atomic units | High for structured content | Codebases, API docs, spreadsheets
Late chunking | Embed the full document first, then chunk in embedding space | Best for long-doc retrieval | Research papers, books

512 tokens with 10–20% overlap is the production default for most use cases. Smaller chunks (128–256) improve precision; larger chunks (1,024+) improve recall but add noise.
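
A minimal sketch of that default, assuming LangChain's RecursiveCharacterTextSplitter with token-based lengths (the import path varies across LangChain versions, and the file name is illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # measure chunk length in tokens, not characters
    chunk_size=512,               # the production default
    chunk_overlap=64,             # ~12% overlap, inside the 10-20% band
)

with open("docs/handbook.md") as f:   # hypothetical source document
    chunks = splitter.split_text(f.read())
```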

Parent document retrieval (retrieve small chunks for precision, expand to parent chunk for context) is a common trick to get both.

Data as a System — the pipeline behind RAG: data freshness SLAs, lineage tracking, contracts between producer and retrieval layer, and what happens when embeddings go stale.


Embedding Models

Model | MTEB Score | Notes
Cohere embed-v4 | 65.2 | Best overall; multilingual; supports binary quantisation
OpenAI text-embedding-3-large | 64.6 | Widely used; good multilingual
BGE-M3 | 63.0 | Open-source; runs locally; best open model
fastembed | ~62.0 | Local, fast; good for dev/CI

For most production systems: Cohere embed-v4 if you want managed, BGE-M3 if you need self-hosted or zero-cost.
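
For the self-hosted path, a sketch with BGE-M3 through sentence-transformers (model ID and sample texts are illustrative; the managed options are a single API call instead). Whatever you choose, embed queries and documents with the same model and normalisation; see the failure cases at the end of this note.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)   # unit vectors: dot product == cosine
query_vec = model.encode("how long do refunds take?", normalize_embeddings=True)

scores = doc_vecs @ query_vec   # cosine similarity of the query against each chunk
```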


Hybrid Retrieval

  • BM25 (lexical) — keyword overlap, exact matches, handles rare/proper nouns well.
  • Dense vector search — semantic similarity via embeddings, handles paraphrasing and synonyms.
  • Hybrid — combine the BM25 and dense rankings with reciprocal rank fusion (RRF), sketched below.

Hybrid is the production default. BM25 alone misses semantic variations; dense alone misses exact-match keywords. Hybrid outperforms either alone on most benchmarks.
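
RRF itself is only a few lines: it fuses by rank position rather than raw score, so BM25 and cosine similarity never need to be put on a common scale. A sketch:

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs (best first); k=60 is the conventional constant."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse([bm25_ids, dense_ids])[:20]  -> candidates for the reranker
```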

Vector store options: infra/vector-stores. Pgvector (Postgres-native, easiest), Chroma (local dev), Qdrant (production self-hosted), Pinecone (fully managed).


Reranking

The single biggest precision lever in a RAG pipeline. After retrieval, pass top-20 candidates through a cross-encoder reranker and keep top-5.

Reranker | Notes
Cohere Rerank v4.0 Pro | Best quality; 10–25% precision gain; API
Jina Reranker v3 | Open-source option; good quality
BGE Reranker | Local, no API cost

Reranking adds ~200ms latency for most workloads. Worth it unless latency is the primary constraint.
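
A sketch of the local, no-API-cost route, treating BGE Reranker as a cross-encoder via sentence-transformers (model ID is illustrative); the Cohere option replaces this with a single rerank API call.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep the best few.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# top5 = rerank(query, top20_from_hybrid_search)
```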


GraphRAG

For queries requiring multi-hop reasoning across entities and relationships ("how does X relate to Y?"), graph-based retrieval outperforms naive chunk retrieval.

Full GraphRAG (Microsoft):

  1. LLM extracts entities and relationships from all documents → knowledge graph
  2. At query time, traverse the graph to find relevant communities and relationships
  3. Summarise community reports → answer

Cost: very high (many LLM calls for graph construction). Use when complex cross-document reasoning is the primary use case.

LazyGraphRAG (Microsoft, 2024):

  • Builds minimal graph at index time; constructs community reports lazily at query time
  • 0.1% of the cost of full GraphRAG
  • 70–80% of the quality on most benchmarks

For most use cases: start with hybrid retrieval + reranking. Add LazyGraphRAG if complex multi-hop queries are failing.


Agentic RAG

Rather than a static retrieve-once pipeline, agentic RAG lets the LLM:

  • Issue multiple retrieval queries
  • Decide when it has enough context
  • Reformulate queries when results are poor
  • Synthesise across retrieved sets

Implemented as an agents/langgraph graph node or as a tool the agent calls. The agent loop typically runs 2–4 retrieval iterations before answering.
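
A sketch of the bounded loop; every callable is a placeholder for your own retriever and LLM calls, and the cap mirrors the 2–4 iteration guideline (see the failure case on non-converging loops below).

```python
from typing import Callable

def agentic_answer(
    query: str,
    retrieve: Callable[[str], list[str]],              # one retrieval pass
    is_sufficient: Callable[[str, list[str]], bool],   # LLM check: enough context yet?
    reformulate: Callable[[str, list[str]], str],      # rewrite the query given what's missing
    generate: Callable[[str, list[str]], str],         # final grounded generation
    max_iters: int = 4,
) -> str:
    context: list[str] = []
    q = query
    for _ in range(max_iters):
        context.extend(retrieve(q))
        if is_sufficient(query, context):
            return generate(query, context)
        q = reformulate(query, context)                # results were poor: try a new query
    return "I don't have enough information to answer that."
```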


Evaluation with RAGAS

RAGAS is the standard evaluation framework for RAG pipelines. Four metrics:

Metric | Measures
Faithfulness | Does the answer stick to the retrieved context? (no hallucination)
Answer Relevancy | Is the answer actually relevant to the question?
Context Precision | Are the retrieved chunks relevant?
Context Recall | Did retrieval find all necessary information?

Run RAGAS on a golden set of question/answer/context triples. Target: faithfulness > 0.9, context precision > 0.8 before production.
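
A sketch of that gate with the ragas library (this follows the ragas 0.1.x API, which has shifted between releases; the golden-set rows are illustrative, and evaluate needs an LLM judge configured, OpenAI by default).

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

golden = Dataset.from_dict({
    "question":     ["How long do refunds take?"],
    "answer":       ["Refunds are processed within 5 business days."],
    "contexts":     [["Refunds are processed within 5 business days."]],
    "ground_truth": ["Refunds take up to 5 business days."],
})

scores = evaluate(golden, metrics=[faithfulness, answer_relevancy,
                                   context_precision, context_recall])
print(scores)   # gate deploys on faithfulness > 0.9 and context_precision > 0.8
```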

See evals/methodology for how RAG evaluation fits into the broader eval strategy.


Common Failure Modes

Failure | Cause | Fix
LLM contradicts retrieved context | System prompt lacks a grounding instruction | Add explicit "answer only from context" instruction
Good chunks retrieved but wrong answer | Chunking loses cross-chunk logic | Parent doc retrieval, larger chunks
Correct knowledge exists but not retrieved | Low recall | Hybrid search, query expansion
Top-k includes noise | No or weak reranking | Add reranker; reduce top-k
Answers contain hallucinations | Model fills gaps | Enable citations; check faithfulness

Key Facts

  • RAG vs fine-tuning default: 57% of orgs don't fine-tune; reach for RAG first unless you need style/format changes
  • Chunking default: 512 tokens + 10-20% overlap; LangChain's RecursiveCharacterTextSplitter at 512 tokens scored 69% end-to-end accuracy — highest of 7 strategies tested [Source: FloTorch 2026 benchmark via Vecta, Feb 2026]
  • Reranking gain: 10-25% NDCG improvement; Cohere Rerank v4.0 Pro is the default production choice
  • RAGAS targets before production: faithfulness >0.9, context precision >0.8
  • Agentic RAG: typically 2-4 retrieval iterations; LangGraph tool node or standalone tool
  • LazyGraphRAG: 0.1% of full GraphRAG cost; add it when multi-hop synthesis queries are failing

Common Failure Cases

Retrieval returns irrelevant chunks despite correct query
Why: embedding model mismatch. Query and documents were embedded with different models or different normalisation.
Detect: RAGAS context precision drops below 0.6; manual inspection shows chunks semantically unrelated to query.
Fix: re-embed all documents with the same model used for queries; verify cosine similarity scores are in expected range.

LLM answer contradicts retrieved context
Why: system prompt lacks an explicit grounding instruction, so the model blends parametric knowledge with retrieved text.
Detect: RAGAS faithfulness below 0.85; answers contain claims not present in any retrieved chunk.
Fix: add "Answer only from the provided context. If the context does not contain the answer, say so." to the system prompt.

Reranker makes results worse
Why: reranker was trained on a domain different from your corpus, or top-20 candidates fed to it contain too much noise.
Detect: context precision drops after adding reranker; run RAGAS with and without reranking on the same query set.
Fix: evaluate domain-matched reranker (BGE Reranker for generic, Cohere for multilingual); widen initial retrieval to top-30 before reranking.

Embeddings go stale after knowledge base update
Why: documents were re-chunked or metadata changed without re-embedding; stale vectors no longer map to current content.
Detect: retrieval returns chunks whose stored content doesn't match what's in the source document; freshness checks fail.
Fix: trigger a re-index pipeline on every document update; use content hash to detect changed documents.
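
The hash check is a few lines; a sketch, assuming the hash is stored alongside each document's vectors at index time.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(doc_text: str, stored_hash: str | None) -> bool:
    # Never indexed, or source content changed since the vectors were written.
    return stored_hash is None or content_hash(doc_text) != stored_hash
```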

Token budget exceeded assembling top-k chunks
Why: chunk size × k exceeds context window; common when chunks are large (1,024+ tokens) and k=10.
Detect: 413 errors or silent truncation; generation quality drops because context is cut mid-sentence.
Fix: reduce chunk size or k; use a context assembly step that trims to fit the budget rather than hard-cutting.
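
A sketch of budget-aware assembly, assuming tiktoken for counting and chunks arriving best-first from the reranker.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assemble_context(chunks: list[str], budget: int = 3000) -> str:
    kept, used = [], 0
    for chunk in chunks:                    # best-first order from the reranker
        n = len(enc.encode(chunk))
        if used + n > budget:
            break                           # drop whole chunks rather than cutting one mid-sentence
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```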

Agentic RAG loops without converging
Why: the agent re-queries on every turn because retrieval never fully satisfies the stopping condition.
Detect: trace shows >4 retrieval iterations on a single query; token cost spikes on that query class.
Fix: add an explicit "sufficient context" check node; cap iterations at 4 and fall back to "I don't have enough information."

Connections

Open Questions

  • When agentic RAG runs 2-4 retrieval iterations, how does cost compare to GraphRAG for the same query?
  • What is the practical faithfulness ceiling for RAG systems on adversarial or ambiguous queries?
  • Does the RAG vs fine-tuning decision change as inference cost drops toward zero?