RAG vs Fine-Tuning

RAG manages knowledge (external, updateable, citable); fine-tuning shapes behaviour (style, terminology, format) — 57% of LLM-deploying organisations use RAG without fine-tuning; the most powerful pattern combines both.

Updated Invalid Date·

rag fine-tuning decision-guide comparison architecture

The question every AI engineer gets asked. The answer: they solve different problems and often work best together. RAG manages knowledge; fine-tuning shapes behaviour.

The Core Distinction

RAG (Retrieval-Augmented Generation): gives the model access to external knowledge at inference time. The model itself doesn't change. The knowledge lives in a vector store.

Fine-tuning: changes the model's weights. The model learns new behaviour, style, or domain-specific patterns. The knowledge is encoded in the parameters.

When RAG Is the Right Choice

Frequently updating knowledge

Company documentation, product catalogs, pricing, news, support tickets. Anything that changes weekly or daily. Updating a RAG index takes minutes. Fine-tuning takes hours to days and requires redeployment.

Long-tail factual knowledge

Facts that appear rarely in training data but matter for accuracy. The model can retrieve the exact text; it can't memorise facts it saw once in fine-tuning.

Attribution and citation

RAG can show exactly which document supported which claim. Fine-tuned knowledge has no provenance. The model produces outputs it "knows" without being able to cite where.

Large knowledge corpus

A 10M-document knowledge base can be indexed in a vector store. Fine-tuning on 10M documents requires compute that costs hundreds of thousands of dollars.

Regulated environments

Audit trails, data residency requirements, right-to-be-forgotten. RAG externalises the knowledge; you can delete a document and it's gone from retrieval immediately.

When Fine-Tuning Is the Right Choice

Consistent style and format

You need every response to follow a specific format, tone, or style. RAG adds knowledge but doesn't change how the model responds. Fine-tuning trains the style into the weights.

Domain-specific terminology

The model needs to understand and use jargon correctly ("adjudication", "escrow", "VLAN trunk", "FIX protocol"). RAG can retrieve definitions; fine-tuning makes the model fluent in the domain.

Behaviours, not facts

"Always respond in Spanish", "never use bullet points", "when the user asks X always confirm Y first". These are behavioural patterns. RAG can't teach behaviour; fine-tuning can.

Reducing prompt length

A fine-tuned model knows your domain and doesn't need extensive few-shot examples in every prompt. Can cut prompt length by 50-80%.

Latency-sensitive, no retrieval budget

If you can't afford 100-200ms for retrieval, bake the knowledge into the model. Fine-tuned inference is the same speed regardless of corpus size.

Structured output compliance

If you need the model to reliably output a specific JSON schema or follow a complex format, fine-tuning on examples of correct outputs is more reliable than prompting.

When to Use Both

The most powerful pattern: fine-tune for behaviour + RAG for knowledge.

Fine-tuned layer:
  - Responds in your brand voice
  - Understands domain terminology
  - Follows your output format
  - Knows when to escalate to a human

RAG layer:
  - Current product documentation
  - Policy documents
  - Customer-specific context
  - Real-time data

Example: a customer support bot fine-tuned on historical support tickets (learns tone, escalation patterns, domain jargon) + RAG over current product documentation (knows what features exist today).

Decision Matrix

Question	RAG	Fine-tune	Both
Knowledge changes frequently?	✓
Need citations/attribution?	✓
Need consistent style/tone?		✓
Need domain fluency?		✓
Large knowledge corpus (>100K docs)?	✓
Behavioural patterns to teach?		✓
Need to reduce prompt length?		✓
Best possible quality?			✓
Limited budget/time?	✓
Need both updated knowledge + trained behaviour?			✓

Cost Comparison

RAG operating costs

Embedding model:  $0.13/M tokens (text-embedding-3-large)
Vector store:     $0.10/GB/month (Qdrant Cloud)
Reranker:         $2.00/1K searches (Cohere)
LLM call:         $3.00/M input tokens (Claude Sonnet 4.6)

Typical per-query cost:
  Embed query:     $0.00002  (150 tokens)
  Rerank 50 docs:  $0.002
  LLM (10K tokens): $0.03
  Total:           ~$0.032/query

Fine-tuning costs

Dataset preparation:  $500-5,000 (labelling, cleaning)
Fine-tuning compute:  $50-500 (Axolotl on Lambda, 7B model, 1-3 hours)
Fine-tuning API:      $3-50 (OpenAI/Anthropic hosted fine-tuning)
Inference (no RAG):   Lower token count → lower per-call cost

Fine-tuning has higher upfront cost, potentially lower per-query cost at scale.

Quality Comparison

RAG quality depends on: retrieval precision, chunk quality, reranking. Fine-tuning quality depends on: dataset quality, objective choice, hyperparameters.

Metric	RAG	Fine-tuning
Factual accuracy (recent data)	High	Low (data cutoff)
Factual accuracy (static domain)	Medium	High
Response consistency	Low-medium	High
Hallucination rate	Lower (grounded)	Higher (no grounding)
Out-of-domain generalisation	Good (retrieval finds it)	Poor

The 57% Number

A widely-cited 2024 survey: 57% of organisations deploying LLMs use RAG but do not fine-tune. Fine-tuning is perceived as high-effort, high-risk. RAG is the default because:

It's reversible (update index, not model)
No ML expertise required
Results are explainable (show the retrieved source)

Fine-tuning adoption is growing as tooling (Axolotl, Unsloth, QLoRA) has made it accessible.

Key Facts

RAG: knowledge lives in vector store; model weights unchanged; update index in minutes
Fine-tuning: knowledge encoded in weights; no provenance — model can't cite where it learned it
57% of LLM-deploying organisations use RAG without fine-tuning (2024 survey)
Fine-tuning can reduce prompt length by 50-80% by eliminating few-shot examples once behaviour is trained in
Fine-tuning compute cost: $50-500 for a 7B model on Axolotl on Lambda (1-3 hours)
RAG per-query cost breakdown: embed ($0.00002) + rerank ($0.002) + LLM 10K tokens ($0.03) ≈ $0.032/query
RAG hallucination rate: lower (grounded in retrieved text); fine-tuned hallucination rate: higher (no grounding)
Regulatory/audit use case: RAG wins — delete a document from the index and it's immediately gone from retrieval
Best pattern: fine-tune for behaviour (tone, format, jargon) + RAG for knowledge (current docs, policies)

Connections

rag/pipeline — full RAG implementation end to end
fine-tuning/decision-framework — detailed fine-tuning decision tree
fine-tuning/lora-qlora — LoRA and QLoRA; cheapest path to fine-tuning
prompting/techniques — always try prompting before either RAG or fine-tuning
synthesis/llm-decision-guide — the broader decision tree where RAG vs fine-tuning fits
rag/reranking — the single biggest RAG quality improvement after basic retrieval works

Open Questions

Does the 57% RAG-only statistic reflect genuine product fit, or are most teams avoiding fine-tuning because the tooling still feels intimidating?
For the "fine-tune for behaviour + RAG for knowledge" combined pattern, how do you prevent fine-tuning from interfering with the model's ability to follow RAG context faithfully?
Will Anthropic's hosted fine-tuning API change the economics enough that fine-tuning becomes the first thing teams try rather than the last?