DPO, GRPO, and Preference Optimisation
DPO is now the standard preference optimisation method (no reward model needed, roughly 2-3x cheaper than PPO); GRPO, introduced by DeepSeek and popularised by DeepSeek-R1, is the frontier method for verifiable reasoning tasks like math and code.
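A minimal sketch of GRPO's central trick, group-relative advantages in place of a learned critic (the function name and tensor shapes are illustrative, not taken from any library):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the style of GRPO.

    rewards: (num_prompts, group_size) tensor of scalar rewards, one row per
    prompt, one column per sampled completion. Each completion's advantage is
    its reward normalised against the other completions for the same prompt,
    which stands in for the value/critic network that PPO would otherwise need.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled completions each, binary "answer verified correct" rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Because the baseline is just the group's own mean reward, GRPO needs no value network, which is much of its appeal when the reward is cheap to verify (unit tests, exact-match answers).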
Fine-Tuning
Fine-tuning decision framework — 57% of AI organisations never fine-tune; the decision tree runs from prompting to RAG to SFT to DPO/GRPO, escalating only when the prior approach genuinely fails.
Fine-Tuning Frameworks
Fine-tuning framework selection guide — Axolotl for production (config-file driven, covers all the common training objectives), TRL for custom training loops, Unsloth for maximum single-GPU speed; all three build on Hugging Face's PEFT library underneath.
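As an illustration of the TRL route, a minimal supervised fine-tuning run is sketched below, assuming a recent TRL release; the model and dataset identifiers are placeholders, and some argument names have shifted between TRL versions:

```python
# Minimal SFT run with TRL's SFTTrainer; swap in your own model and dataset.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any dataset with a plain "text" column works for this sketch.
train_dataset = load_dataset("stanfordnlp/imdb", split="train[:1%]")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",          # model name string or a preloaded model object
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="sft-demo",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
)
trainer.train()
```

Axolotl wraps essentially the same machinery behind a YAML config file, which is why it tends to win for repeatable production runs while TRL wins when you need to customise the loop itself.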
LoRA and QLoRA
LoRA freezes the base weights and trains small low-rank adapter matrices added alongside them, typically 0.1-1% of the model's parameter count; QLoRA additionally quantises the frozen base model to 4-bit, enabling 7B-scale fine-tuning on a 12GB consumer GPU.
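A hedged sketch of the QLoRA recipe using Hugging Face Transformers, PEFT, and bitsandbytes; the model name, target modules, and hyperparameters are illustrative defaults rather than prescriptions:

```python
# QLoRA-style setup: 4-bit quantised frozen base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantisation from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",                      # placeholder; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # fp32 norms, input grads enabled for k-bit training

lora = LoraConfig(
    r=16,                                    # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()           # typically well under 1% of total parameters
```

Dropping the `BitsAndBytesConfig` lines turns the same script into plain LoRA on a full-precision base model.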
RLHF and DPO
RLHF trains a reward model from human preferences, then uses PPO to optimise against it — powerful but complex. DPO skips the reward model entirely and optimises directly on preference pairs, making it 3-5x simpler with comparable results on most tasks.
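The whole DPO objective fits in a few lines, which is much of the point. Below is a sketch with illustrative names; each input is a sequence-level log-probability of a response under either the policy being trained or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss for one batch of preference pairs.

    The implicit "reward" of a response is beta * (policy logp - reference logp);
    the loss pushes the chosen response's implicit reward above the rejected
    one's, with no reward model or RL loop required.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up summed log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```

The `beta` term controls how far the policy is allowed to drift from the reference model; smaller values keep it closer, larger values let preferences dominate.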