Direct Preference Optimization (Rafailov et al., 2023)
Citation: Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
One sentence: DPO shows that the RLHF reward model and PPO optimisation loop can be eliminated — the LLM itself encodes an implicit reward function, allowing direct optimisation on preference pairs with a simple classification-style loss.
What Problem It Solved
Standard RLHF requires three separate stages: SFT → reward model training → PPO fine-tuning. PPO is notoriously unstable and computationally expensive, and the reward model is a separately trained network whose learned reward can diverge from true human preferences.
DPO collapses the reward model + PPO into a single supervised loss applied directly to the language model, using only (prompt, chosen, rejected) triplets.
Core Insight — The LM Is the Reward Model
RLHF optimises the policy π to maximise expected reward under a KL constraint to the reference policy, where the expectation is over prompts x from the data and responses y ~ π(·|x):
max_π E[r(x,y)] - β · KL(π || π_ref)
The paper shows this optimisation has an analytic solution: the optimal policy is π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β), and rearranging for the reward gives:
r*(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)
The optimal reward function is implicit in the ratio of policy probabilities. Plugging this back into the Bradley-Terry preference model (humans prefer y₁ over y₂ with probability σ(r*(x, y₁) - r*(x, y₂))), the intractable partition function Z(x) cancels because both rewards share the same prompt:
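Spelled out, the cancellation looks like this (a short sketch following the paper's definitions; σ is the logistic function):

```latex
% Bradley-Terry with the implicit reward; the two \beta \log Z(x) terms
% are identical for the shared prompt x and drop out.
\[
p^*(y_w \succ y_l \mid x)
  = \sigma\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr)
  = \sigma\!\left(
      \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
\]
```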
The DPO loss:
L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l) ~ D} [
    log σ(
        β · log(π_θ(y_w|x) / π_ref(y_w|x))
      - β · log(π_θ(y_l|x) / π_ref(y_l|x))
    )
]
Where:
- y_w = the preferred (winning) response
- y_l = the rejected (losing) response
- π_ref = reference (SFT) model — frozen
- β = temperature controlling KL constraint
The loss increases the likelihood of preferred responses and decreases the likelihood of rejected responses, relative to the reference model.
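As a concrete reading of the loss, here is a minimal PyTorch sketch (illustrative, not the paper's or TRL's code); each input is assumed to be a batch of summed per-token log-probabilities for a whole response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument has shape (batch,): the log-probability of a full response
    (sum of per-token log-probs) under the policy or the frozen reference."""
    # Implicit rewards: beta-scaled log-ratio of policy vs reference.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary cross-entropy on the reward margin: -log sigma(margin).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In a full training step the reference log-probabilities are computed with the frozen SFT model under torch.no_grad(), so gradients flow only into π_θ.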
Training Procedure
- Start with an SFT model (same as RLHF Stage 1)
- Collect preference data: (prompt, chosen_response, rejected_response) triplets
- Compute the DPO loss with the SFT model as the reference
- Optimise with Adam — standard supervised training
No reward model. No PPO. No KL penalty as a separate term. It's implicit in the loss.
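In practice this is usually run through TRL's DPOTrainer. The sketch below shows the rough shape of such a run; exact argument names (for example tokenizer vs processing_class, and where beta is set) have shifted across TRL releases, and the model and file names are placeholders:

```python
# Hedged sketch of DPO fine-tuning with Hugging Face TRL.
# Argument names vary across TRL versions; check the installed version's docs.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "path/to/sft-model"  # placeholder: the SFT model to align

model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)      # policy (trained)
ref_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)  # reference (frozen)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Preference data with "prompt", "chosen", "rejected" columns.
train_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

args = DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer TRL releases call this processing_class
)
trainer.train()
```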
Comparison to RLHF
| | RLHF | DPO |
|---|---|---|
| Stages | SFT → RM training → PPO | SFT → DPO loss |
| Reward model | Explicit (separate model) | Implicit (in π ratio) |
| Optimiser | PPO (complex, unstable) | Adam (standard) |
| Memory | 2-4 models in memory | 2 models (policy + frozen ref) |
| Stability | Low (PPO sensitive to lr) | High (supervised loss) |
| Data requirement | Preference pairs for RM, then PPO rollouts | Preference pairs only |
| Compute | High | ~2× SFT training |
Impact
- Became the dominant alignment method for open-source fine-tuning within 12 months of publication
- Implemented in TRL (`DPOTrainer`), Axolotl (`dpo` training type), and all major fine-tuning frameworks
- Led to variants: IPO (Identity Preference Optimisation), KTO (Kahneman-Tversky Optimisation), ORPO (Odds Ratio Preference Optimisation), SimPO
- Used to train Zephyr, Tülu 2, Llama 3 Instruct, and many other open instruction-following models (Llama 2 Chat, by contrast, was aligned with PPO-based RLHF)
- GRPO (introduced in DeepSeekMath, 2024; central to DeepSeek-R1, 2025): Group Relative Policy Optimisation, a later simplification of PPO-style RL used for reasoning training
Limitations
- Requires high-quality preference data — poor labels produce poor alignment
- No online learning — uses fixed offline preference dataset; can't improve with live human feedback
- Reference model dependency — the KL constraint relative to π_ref can limit how far the policy can move
- Doesn't handle multi-turn as naturally as RL approaches
Key Facts
- Published May 2023 (arXiv); NeurIPS 2023; Stanford authors
- Loss function: binary cross-entropy on log-ratio of policy vs reference probabilities
- β (temperature): typically 0.1–0.5; higher β → stronger KL constraint → stays closer to reference
- TRL implementation: `DPOTrainer(model, ref_model, args, train_dataset, tokenizer)`
- Dataset format: `{"prompt": "...", "chosen": "...", "rejected": "..."}` (sample script below)
- GRPO (2025): keeps RL but drops the separate value/critic model; each response's reward is baselined against other responses sampled for the same prompt
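For concreteness, a purely illustrative Python snippet that writes a couple of records in that format to the preferences.jsonl file assumed in the TRL sketch above (the example texts are made up):

```python
# Write illustrative (prompt, chosen, rejected) records as JSONL.
import json

examples = [
    {
        "prompt": "Explain what DPO is in one sentence.",
        "chosen": "DPO fine-tunes a language model directly on preference pairs "
                  "with a classification-style loss, without a reward model or PPO.",
        "rejected": "DPO is a database optimisation technique.",
    },
]

with open("preferences.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```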
Connections
papers/key-papers · papers/rlhf · papers/constitutional-ai · fine-tuning/lora-qlora · fine-tuning/frameworks · math/optimisation
Open Questions
- What claims in this paper have since been challenged or superseded by follow-up work?
- What did later research reveal about the limitations of this approach?