Fine-Tuning
Fine-tuning decision framework — 57% of AI organisations never fine-tune; the decision tree runs prompting, then RAG, then SFT, then DPO/GRPO, escalating only when the prior approach genuinely fails.
Updating a pretrained model's weights on a domain-specific dataset to change its behaviour, style, or capabilities. The last resort after prompting and RAG have been exhausted, but the right tool when they genuinely can't solve the problem.
The Decision Framework
1. Can better prompting solve it? → Try XML structuring, few-shot, CoT (see [prompting/techniques](/prompting/techniques))
2. Can RAG solve it? → Add a knowledge store (see [rag/pipeline](/rag/pipeline))
3. Is it a format/style problem? → SFT fine-tuning (collect 500–5,000 examples)
4. Is it a values/preference problem? → DPO/GRPO (collect preference pairs)
5. Is it a capability gap? → Full fine-tuning or pretraining on domain data
57% of organisations that build AI products don't fine-tune at all. For most knowledge retrieval and reasoning tasks, prompting + RAG is enough. Fine-tune when:
- You need a consistent output format that prompting can't lock in
- Proprietary domain knowledge must be in the weights (can't be in context at inference time)
- Inference cost reduction: a fine-tuned small model can match a large model on narrow tasks
- Style: the model must sound like your brand, not like a generic assistant
Training Objectives
Supervised Fine-Tuning (SFT)
Train on (input, output) pairs. The model learns to produce outputs that look like the training examples. Simplest approach.
When to use: Style alignment, format compliance, instruction following for specific tasks.
Dataset size: 500–5,000 examples for narrow tasks. 50,000+ for broad capability improvements.
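A minimal sketch using TRL's SFTTrainer. The dataset path, model choice, and hyperparameters are illustrative, and argument names shift between TRL releases:

```python
# Hypothetical SFT run with TRL; each JSONL record is an (input, output) pair
# rendered as chat messages, e.g. {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="sft_examples.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # recent TRL accepts a model id string
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-out", num_train_epochs=3, learning_rate=2e-5),
)
trainer.train()
```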
DPO (Direct Preference Optimisation)
Train on (prompt, preferred_response, rejected_response) triples. Optimises the policy directly to prefer one output over another. No reward model required.
Why DPO over PPO: No RL instability, no reward model to train separately, 2–3x cheaper to run. Near-equivalent results for most alignment tasks. See fine-tuning/dpo-grpo.
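A corresponding DPO sketch with TRL, assuming a JSONL file of (prompt, chosen, rejected) records; field and argument names vary by TRL version:

```python
# Hypothetical DPO run with TRL; each record looks like
# {"prompt": ..., "chosen": ..., "rejected": ...}.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL releases call this argument `tokenizer`
    train_dataset=dataset,
    args=DPOConfig(output_dir="dpo-out", beta=0.1, learning_rate=5e-6),
)
trainer.train()
```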
GRPO (Group Relative Policy Optimisation)
Used in DeepSeek-R1. Samples a group of responses for each prompt, ranks them, optimises toward better-ranked outputs. Strong for reasoning tasks (math, code). No value/critic model needed.
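The core of GRPO is the group-relative advantage: each response's reward is normalised against the other responses sampled for the same prompt, which is why no learned value/critic model is needed. A minimal sketch with an illustrative verifier reward:

```python
# Group-relative advantages for one prompt. Rewards here come from a hypothetical
# correctness verifier: 1.0 if the sampled answer is right, 0.0 otherwise.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    # Normalise within the group; epsilon guards against a zero-variance group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])  # 8 samples for one prompt
advantages = group_relative_advantages(rewards)
# Positive advantages up-weight the correct responses in the policy-gradient update.
```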
ORPO (Odds Ratio Preference Optimisation)
Combines SFT and preference learning in a single loss. Simpler pipeline than DPO. No separate SFT warmup required.
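A sketch of the combined objective following the ORPO formulation: the SFT negative log-likelihood on the chosen response plus an odds-ratio penalty. The λ weight and the length-normalised log-probabilities passed in are illustrative:

```python
# ORPO-style loss sketch: loss = NLL(chosen) - lambda * log_sigmoid(log-odds ratio).
import torch
import torch.nn.functional as F

def orpo_loss(nll_chosen, logp_chosen, logp_rejected, lam=0.1):
    # logp_* are length-normalised sequence log-probabilities under the policy.
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp))  # log(p / (1 - p))
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return nll_chosen - lam * F.logsigmoid(ratio)
```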
KTO (Kahneman-Tversky Optimisation)
Works with scalar labels (good/bad) rather than pairwise preferences. Useful when collecting preference pairs is hard.
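Illustrative records showing the difference in data requirements (exact field names vary by framework):

```python
# Pairwise preference record (DPO, ORPO): chosen and rejected must share a prompt.
dpo_record = {
    "prompt": "Summarise this contract clause...",
    "chosen": "The clause limits liability to...",
    "rejected": "Sure! Contracts are documents that...",
}

# Scalar-label records (KTO): independent good/bad judgements, no pairing required.
kto_records = [
    {"prompt": "Summarise this contract clause...", "completion": "The clause limits liability to...", "label": True},
    {"prompt": "Summarise this contract clause...", "completion": "Sure! Contracts are documents that...", "label": False},
]
```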
LoRA: The Default PEFT Method
See fine-tuning/lora-qlora for full treatment. Short version:
- Freeze base model weights
- Add low-rank adapter matrices A (r × k) and B (d × r) to attention layers
- Only train A and B (~0.1–1% of total parameters)
- At inference, merge: W_effective = W + (α/r)·BA
Typical hyperparameters:
- Rank r: 8–64 (higher = more capacity = more compute)
- Alpha α: typically 2× rank
- Target modules: `q_proj, v_proj` minimum; `q_proj, k_proj, v_proj, o_proj` for full attention fine-tuning
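A minimal PEFT setup matching these hyperparameters (the model choice and dropout value are illustrative):

```python
# LoRA via HuggingFace PEFT: freeze the base model, train only the adapter matrices.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                                           # typically 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # on the order of 0.1-1% of total parameters
# After training, model.merge_and_unload() folds (alpha/r)·BA back into W for inference.
```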
QLoRA
LoRA on a 4-bit quantised base model. Enables fine-tuning 7B models on a single RTX 4070 Ti (12GB VRAM), 70B models on a single A100 80GB.
The base model is loaded in NF4 (4-bit NormalFloat quantisation), while the LoRA adapters are trained in bf16. Whenever the frozen base weights are needed in the forward or backward pass, they are dequantised on the fly.
Quality penalty vs full LoRA: ~1–3% on most benchmarks. Almost always worth it for the hardware savings.
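A sketch of loading the base model in NF4 with bitsandbytes through transformers; the model choice is illustrative, and LoRA adapters are then attached exactly as in the PEFT example above:

```python
# QLoRA base-model load: 4-bit NF4 storage, bf16 compute dtype for dequantised ops.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights dequantised to bf16 on the fly
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```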
Frameworks
| Framework | Strength | When to use |
|---|---|---|
| Axolotl (v0.29) | Widest objective coverage, config-file driven | Production fine-tuning; supports all objectives |
| TRL | Canonical RLHF/DPO/PPO; HuggingFace ecosystem | When you need deep customisation |
| Unsloth | 2–4x faster training, lower memory | When speed/cost matters; wraps TRL |
| LLaMA-Factory | GUI + code; many models supported | Quick experiments, non-expert users |
| HuggingFace PEFT | Foundation library for LoRA/QLoRA | Used under the hood by all others |
Axolotl config example (minimal DPO):
```yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
- path: my_preference_dataset.jsonl
type: chatml.intel
rl: dpo
learning_rate: 5e-5
num_epochs: 3
```
Hardware Guide
| Model size | Method | Minimum VRAM |
|---|---|---|
| 7B–8B | QLoRA | 12GB (RTX 4070 Ti) |
| 7B–8B | LoRA (fp16) | 24GB (RTX 3090/4090) |
| 13B | QLoRA | 16GB (RTX 4080) |
| 70B | QLoRA | 80GB (A100 80GB) or 2× A6000 48GB |
| 405B | QLoRA | 4× A100 80GB |
For most practical fine-tuning tasks: rent A100 40GB or H100 instances on Lambda Labs, RunPod, or Vast.ai. Single-run cost for 7B QLoRA: ~$5–15.
Evaluation After Fine-Tuning
Never deploy a fine-tuned model without comparing it to the base model on your eval suite. Common failure modes:
- Catastrophic forgetting — model gets better at the fine-tuned task but worse at everything else
- Sycophancy increase — preference tuning with reward hacking teaches the model to flatter
- Format overfitting — model applies the fine-tuned format to every response
Run evals/methodology before and after. A 2% improvement on the target task plus 5% degradation on general tasks is often not worth it.
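A minimal before/after comparison sketch; the task names and scores are illustrative stand-ins for your own eval suite:

```python
# Compare base vs fine-tuned scores per task to surface catastrophic forgetting.
def compare(base_scores: dict, tuned_scores: dict) -> None:
    for task in base_scores:
        delta = tuned_scores[task] - base_scores[task]
        print(f"{task:<16} {base_scores[task]:.3f} -> {tuned_scores[task]:.3f} ({delta:+.3f})")

compare(
    base_scores={"target_task": 0.71, "mmlu": 0.66, "ifeval": 0.78},
    tuned_scores={"target_task": 0.83, "mmlu": 0.61, "ifeval": 0.74},
)
# +12 points on the target task against -5 on MMLU and -4 on IFEval: weigh the trade-off before shipping.
```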
Key Facts
- 57% of AI organisations do not fine-tune at all
- SFT dataset size: 500-5,000 examples for narrow tasks; 50,000+ for broad capability improvement
- QLoRA on 7B: 12GB VRAM (RTX 4070 Ti); 70B: 80GB (A100 80GB) or 2× A6000 48GB
- Single QLoRA run on a 7B model: ~$5-15 on rented A100
- Catastrophic forgetting is the most common post-fine-tuning failure — always run evals before and after
- Quality penalty for QLoRA vs full LoRA: ~1-3% on most benchmarks
Common Failure Cases
Fine-tuning a model for format compliance when few-shot prompting would have worked
Why: teams jump to fine-tuning for output formatting tasks (JSON, markdown tables) without first testing with 3-5 few-shot examples; fine-tuning adds cost and latency for something prompting handles reliably.
Detect: running the same task with 5 formatted examples in the prompt achieves >90% format compliance; the fine-tuned model adds only marginal improvement.
Fix: always test prompting + few-shot before fine-tuning; fine-tune only when format compliance with prompting falls below 85% on a representative test set.
SFT fine-tuning improves target task metrics but degrades general instruction-following capability
Why: catastrophic forgetting — training only on narrow domain examples reduces the model's performance on tasks not represented in the training set; the gradient updates overwrite general capabilities.
Detect: target task accuracy improves by 10-15% but MMLU or instruction-following benchmarks degrade 5-10% post-fine-tuning.
Fix: include 5-10% general instruction-following examples in the SFT dataset to preserve broad capability; use a lower learning rate and fewer epochs; evaluate on a general benchmark before and after.
Fine-tuning to inject proprietary knowledge into model weights when RAG would have been correct
Why: fine-tuning bakes a snapshot of knowledge into weights; when the underlying knowledge changes (prices, policies, product specs), the fine-tuned model is immediately stale and requires retraining.
Detect: fine-tuned model answers based on training-time information when users ask about recently updated facts; RAG with the same knowledge base answers correctly.
Fix: use RAG for knowledge that changes more frequently than your retraining cycle; reserve fine-tuning for stable stylistic or structural patterns.
QLoRA training chosen because of hardware constraints, but quality gap is unacceptable for production
Why: QLoRA's 1-3% quality penalty on most benchmarks is acceptable for general tasks but can be significant for narrow precision-critical tasks (legal, medical, code generation with correctness checks).
Detect: QLoRA-tuned model achieves 82% pass@1 on your code eval; full LoRA on the same adapter achieves 89%; the gap matters for your use case.
Fix: evaluate QLoRA vs full LoRA on your specific eval suite before committing to hardware; rent an A100 80GB for full LoRA if the quality gap is critical and the model is 7-13B.
Connections
- fine-tuning/lora-qlora — LoRA and QLoRA parameter mechanics in depth
- fine-tuning/dpo-grpo — preference optimisation algorithms (DPO, GRPO, ORPO, KTO)
- fine-tuning/frameworks — Axolotl, TRL, Unsloth setup and config
- rag/pipeline — the alternative to fine-tuning for knowledge tasks
- data/synthetic-data — generating fine-tuning datasets with LLMs
- evals/methodology — evaluating before and after to catch catastrophic forgetting
Open Questions
- At what point does prompting + RAG genuinely fail to match a fine-tuned model for format compliance?
- Is ORPO now the default over DPO for combined SFT+preference pipelines, or are results inconsistent?
- What is the minimum dataset size for QLoRA to measurably shift a 7B model's style without overfitting?