RLHF: Reinforcement Learning from Human Feedback
Two papers that define RLHF as an alignment technique: Stiennon et al. (2020) demonstrated it at scale for summarisation; Ouyang et al. (2022, InstructGPT) applied it to make GPT-3 follow instructions.
Stiennon et al. (2020) — Learning to Summarize with Human Feedback
Citation: Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., ... & Christiano, P. F. (2020). Learning to summarize with human feedback. NeurIPS 2020.
One sentence: Collect human preferences between pairs of model summaries, train a reward model on those preferences, then use PPO to fine-tune the language model to maximise the reward.
The Three-Stage RLHF Pipeline
This paper formalised RLHF for language models. The pipeline is the foundation of all subsequent alignment work:
Stage 1 — Supervised Fine-Tuning (SFT) Fine-tune the base model on high-quality demonstrations of the target behaviour. Gives the model a starting point in the right distribution.
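A minimal sketch of the SFT objective, assuming a Hugging Face-style causal LM and tokenizer; masking the prompt tokens out of the loss is a common convention, not something the paper specifies:

```python
def sft_loss(model, tokenizer, prompt, demonstration):
    """Cross-entropy on the demonstration tokens; prompt tokens are masked out."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the CE loss
    out = model(input_ids=full_ids, labels=labels)  # HF shifts labels internally
    return out.loss
```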
Stage 2 — Reward Model Training Collect human preference data: show a labeller two model outputs, ask which is better. Train a reward model (RM) to predict human preferences:
RM(x, y) → scalar reward score
where x is the input and y is the model output. The RM learns to assign higher scores to outputs humans prefer.
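The training loss is pairwise: for each comparison, the RM should score the human-preferred output above the rejected one. A sketch of the Bradley-Terry-style loss, with `rm` as a placeholder callable returning scalar scores:

```python
import torch.nn.functional as F

def rm_loss(rm, x, y_chosen, y_rejected):
    """Pairwise preference loss: -log sigmoid(RM(x, y_w) - RM(x, y_l))."""
    r_w = rm(x, y_chosen)    # scalar score for the preferred output
    r_l = rm(x, y_rejected)  # scalar score for the rejected output
    return -F.logsigmoid(r_w - r_l).mean()
```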
Stage 3 — RL Fine-Tuning with PPO Use Proximal Policy Optimisation to fine-tune the SFT model to maximise expected reward:
objective = E[RM(x, y)] - β · KL(π_RL || π_SFT)
The KL penalty keeps the policy from drifting too far from the SFT model, preserving fluency and mitigating (though not eliminating) reward hacking.
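In practice the KL term is estimated per token from the sampled response and folded into the reward that PPO sees. A sketch, assuming per-token log-probs from both policies are available; the β value is illustrative:

```python
def ppo_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """KL-penalised reward: RM score minus beta * KL(pi_RL || pi_SFT) estimate.

    logp_rl, logp_sft: per-token log-probs of the sampled response under the
    current policy and the frozen SFT policy (same shape).
    """
    kl_est = (logp_rl - logp_sft).sum(dim=-1)  # trajectory-level KL estimate
    return rm_score - beta * kl_est
```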
InstructGPT (Ouyang et al., 2022) — RLHF at Scale for Instruction Following
Citation: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
One sentence: Applying the RLHF pipeline to GPT-3, with 40 contractors labelling instruction-following quality, produced InstructGPT; even its 1.3B variant was preferred by humans to the raw 175B GPT-3.
Key Findings
A 1.3B model outperformed 175B GPT-3. Human raters preferred InstructGPT-1.3B outputs to raw GPT-3-175B outputs despite a 100× parameter gap (the 175B InstructGPT was preferred to raw GPT-3 roughly 85% of the time). Alignment matters more than raw scale for practical usefulness.
The labelling task: Contractors were given prompts and asked to rank several model outputs, judged primarily on helpfulness, with honesty and harmlessness also assessed. The reward model was trained on comparisons from ~33,000 prompts; the SFT stage used ~13,000 demonstrations.
Alignment tax is small. RLHF slightly degraded performance on some public NLP benchmarks (the model was optimised for human preferences, not benchmark metrics), and the paper's PPO-ptx variant, which mixes in pretraining gradients, recovered most of the loss. The absolute drop was small: under 5% on most tasks.
Instruction Categories Tested
- Question answering
- Summarisation
- Open-ended generation
- Classification
- Brainstorming
- Code generation
InstructGPT improved substantially across all categories over raw GPT-3.
Why RLHF Works (And Its Problems)
Why it works:
- The base model has capability but no alignment to human intent
- SFT teaches the format; reward model teaches the values; PPO optimises for both
- Human preferences encode nuance that is difficult to specify as a rule
Problems:
| Problem | Description |
|---|---|
| Reward hacking | Model exploits the RM: long verbose answers score well even if unhelpful |
| RM over-optimisation | Optimising a fixed RM too hard makes its scores keep rising while true human judgments of quality fall (Goodhart's law) |
| Labeller variance | Different labellers have different standards; preferences are noisy |
| Scalability | Human labelling is expensive; doesn't scale to all tasks |
| RLHF instability | PPO is notoriously sensitive to hyperparameters |
DPO (2023) removed the explicit reward model and PPO, replacing them with a direct loss on preference pairs. Constitutional AI (2022) replaced human harm labels with AI-generated labels.
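For contrast, a sketch of the DPO loss: it trains the policy directly on (chosen, rejected) pairs against a frozen reference model, with no explicit RM and no sampling loop. Argument names here are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid of the beta-scaled implicit-reward margin.

    Each argument is the summed log-prob of the chosen (w) or rejected (l)
    response under the trained policy (logp_*) or the reference model (ref_*).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```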
Key Facts
- Stiennon et al. (2020): first large-scale RLHF for language models; demonstrated on TL;DR summarisation
- InstructGPT (2022): ChatGPT's direct predecessor; 40 labellers, comparisons from ~33K prompts for the reward model
- Three stages: SFT → Reward Model → PPO fine-tuning
- KL penalty: β · KL(π_RL || π_SFT) limits drift from the SFT policy and mitigates reward hacking
- 1.3B InstructGPT > 175B GPT-3 by human preference — alignment > scale for practical use
- Later simplified by DPO (no explicit RM or PPO, more stable); supplemented by Constitutional AI (scalable AI-generated labels)
Connections
papers/key-papers · papers/constitutional-ai · papers/dpo · fine-tuning/rlhf-dpo · fine-tuning/frameworks · safety/constitutional-ai
Open Questions
- What claims in this paper have since been challenged or superseded by follow-up work?
- What did later research reveal about the limitations of this approach?