DPO, GRPO, and Preference Optimisation
DPO is now the standard preference optimisation method (no reward model needed, roughly 2-3x cheaper than PPO); GRPO, introduced by DeepSeek and popularised by DeepSeek-R1, is the frontier method for verifiable reasoning tasks like math and code.
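A minimal sketch of GRPO's central trick, group-relative advantages in place of a learned critic (the function name and tensor shapes are illustrative, not taken from any library):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the style of GRPO.

    rewards: (num_prompts, group_size) tensor of scalar rewards, one row per
    prompt, one column per sampled completion. Each completion's advantage is
    its reward normalised against the other completions for the same prompt,
    which stands in for the value/critic network that PPO would otherwise need.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled completions each, binary "answer verified correct" rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Because the baseline is just the group's own mean reward, GRPO needs no value network, which is much of its appeal when the reward is cheap to verify (unit tests, exact-match answers).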
Fine-Tuning
Fine-tuning decision framework — 57% of AI organisations never fine-tune; the decision tree runs from prompting to RAG to SFT to DPO/GRPO, escalating only when the prior approach genuinely fails.
Fine-Tuning Frameworks
Fine-tuning framework selection guide — Axolotl for production (config-file driven, covers all the common training objectives), TRL for custom training loops, Unsloth for maximum single-GPU speed; all three build on Hugging Face's PEFT library underneath.
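As an illustration of the TRL route, a minimal supervised fine-tuning run is sketched below, assuming a recent TRL release; the model and dataset identifiers are placeholders, and some argument names have shifted between TRL versions:

```python
# Minimal SFT run with TRL's SFTTrainer; swap in your own model and dataset.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any dataset with a plain "text" column works for this sketch.
train_dataset = load_dataset("stanfordnlp/imdb", split="train[:1%]")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",          # model name string or a preloaded model object
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="sft-demo",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
)
trainer.train()
```

Axolotl wraps essentially the same machinery behind a YAML config file, which is why it tends to win for repeatable production runs while TRL wins when you need to customise the loop itself.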
LoRA and QLoRA
LoRA freezes the base weights and trains small low-rank adapter matrices added alongside them, typically 0.1-1% of the model's parameter count; QLoRA additionally quantises the frozen base model to 4-bit, enabling 7B-scale fine-tuning on a 12GB consumer GPU.
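A hedged sketch of the QLoRA recipe using Hugging Face Transformers, PEFT, and bitsandbytes; the model name, target modules, and hyperparameters are illustrative defaults rather than prescriptions:

```python
# QLoRA-style setup: 4-bit quantised frozen base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantisation from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",                      # placeholder; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # fp32 norms, input grads enabled for k-bit training

lora = LoraConfig(
    r=16,                                    # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()           # typically well under 1% of total parameters
```

Dropping the `BitsAndBytesConfig` lines turns the same script into plain LoRA on a full-precision base model.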
RLHF and DPO
RLHF trains a reward model from human preferences, then uses PPO to optimise against it — powerful but complex. DPO skips the reward model entirely and optimises directly on preference pairs, making it 3-5x simpler with comparable results on most tasks.
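The whole DPO objective fits in a few lines, which is much of the point. Below is a sketch with illustrative names; each input is a sequence-level log-probability of a response under either the policy being trained or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss for one batch of preference pairs.

    The implicit "reward" of a response is beta * (policy logp - reference logp);
    the loss pushes the chosen response's implicit reward above the rejected
    one's, with no reward model or RL loop required.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up summed log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```

The `beta` term controls how far the policy is allowed to drift from the reference model; smaller values keep it closer, larger values let preferences dominate.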