GPT-3: Language Models are Few-Shot Learners (Brown et al., 2020)
Citation: Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. NeurIPS 2020.
One sentence: Scaling a decoder-only Transformer to 175B parameters with 300B tokens of training data produced a model that could perform new tasks from a handful of examples in the prompt — without any gradient updates.
What Problem It Solved
Pre-GPT-3, adapting a language model to a new task required fine-tuning: collecting labelled data, running gradient updates, maintaining a separate checkpoint per task. This was expensive and inflexible.
GPT-3 showed that a large enough model, given a few examples in its context window, could perform competitively on new tasks with zero weight updates. The "programming interface" shifted from fine-tuning to prompting.
Key Contributions
1. In-Context Learning (ICL)
Provide examples of the desired behaviour directly in the prompt. The model generalises from those examples without updating its weights.
```text
Translate English to French:
sea otter → loutre de mer
peppermint → menthe poivrée
cheese → ?
```
The model outputs "fromage" with no fine-tuning required. This is few-shot learning: a handful of demonstrations in the prompt (the paper typically uses 10–100). One-shot provides a single example; zero-shot omits examples entirely.
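A minimal sketch of how such a few-shot prompt can be assembled programmatically; the `build_few_shot_prompt` helper and the call at the bottom are illustrative assumptions, not part of the paper.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt: task description,
    k worked demonstrations, then the query for the model to complete."""
    lines = [instruction]
    for source, target in examples:      # in-context demonstrations
        lines.append(f"{source} → {target}")
    lines.append(f"{query} →")           # the model continues from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)  # sent to the model verbatim; no weights are updated
```

Zero-shot drops the examples list entirely; one-shot passes a single pair.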
2. Scale Unlocks Capability
GPT-3 demonstrated that capability is a function of scale, not just architecture. Three prompting settings were tested:
| Setting | Description | Works well at |
|---|---|---|
| Zero-shot | Just instruction, no examples | Large models only |
| One-shot | One example | Medium+ models |
| Few-shot | 10–100 examples | Consistently good at 175B |
Smaller models (the 1.3B-parameter GPT-3 variant, roughly GPT-2 scale) showed poor few-shot generalisation; performance jumps non-linearly with scale, an early hint of emergent abilities.
3. 175B Parameters, 300B Training Tokens
GPT-3 was trained on a mixture of Common Crawl (filtered), WebText2, Books1, Books2, and English Wikipedia. The curation effort showed that data quality matters alongside raw scale: filtered, deduplicated web data outperforms unfiltered Common Crawl.
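A rough sketch of how such a weighted corpus mixture can be sampled per training document. The weights below are the approximate sampling proportions reported in the paper (rounded; `random.choices` normalises them); everything else is an illustrative assumption.

```python
import random

# Approximate per-document sampling weights reported in the GPT-3 paper;
# higher-quality corpora are oversampled relative to their raw token count.
MIXTURE = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_corpus(rng)] += 1
print(counts)  # counts roughly track the mixture weights
```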
4. Decoder-Only Architecture
GPT-3 is a pure decoder-only Transformer. Causal attention masks prevent tokens from attending to future positions. Essentially every major modern generative LLM (GPT-4, Claude, Llama, Mistral) follows this architecture.
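A compact NumPy sketch of the causal (decoder-only) attention pattern: position i can attend only to positions at or before i. The single-head setup and toy dimensions are simplifications for illustration.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.
    q, k, v: arrays of shape (seq_len, d_head)."""
    seq_len, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)                 # (seq_len, seq_len)
    # Causal mask: entry (i, j) with j > i is blocked, so no token
    # can attend to a future position.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
print(causal_self_attention(x, x, x).shape)  # (4, 8)
```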
Impact
- Made "prompt engineering" a legitimate discipline
- Demonstrated that a single general model could approach or match fine-tuned task-specific models on many benchmarks
- Launched the commercial LLM race (ChatGPT, Claude, Gemini all descend conceptually from this work)
- Revealed in-context learning as an emergent behaviour not present in smaller models
- The scaling recipe (more parameters + more data = more capability) held — and was later refined by Chinchilla
Limitations
- No weight updates — ICL leaves the weights untouched, so nothing is permanently learned; performance degrades on complex multi-step tasks
- Context window bound — can only use as many examples as fit in the context (2,048 tokens in 2020); see the packing sketch after this list
- Hallucination — the model confabulates factual answers confidently
- Prompt sensitivity — performance varies dramatically with example order and phrasing
- Not instruction-tuned — raw GPT-3 is hard to interact with conversationally (InstructGPT fixed this)
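A small sketch of the practical consequence of the fixed context window: demonstrations must be packed into a token budget. The 4-characters-per-token estimate and the `pack_examples` helper are illustrative assumptions, not a real tokenizer.

```python
CONTEXT_TOKENS = 2048        # GPT-3's context window in 2020
RESERVED_FOR_OUTPUT = 128    # leave room for the completion

def rough_token_count(text: str) -> int:
    """Crude estimate (~4 characters per token); a real tokenizer differs."""
    return max(1, len(text) // 4)

def pack_examples(instruction: str, examples: list[str], query: str) -> list[str]:
    """Greedily keep demonstrations until the context budget is spent."""
    budget = CONTEXT_TOKENS - RESERVED_FOR_OUTPUT
    budget -= rough_token_count(instruction) + rough_token_count(query)
    kept = []
    for example in examples:
        cost = rough_token_count(example)
        if cost > budget:
            break
        kept.append(example)
        budget -= cost
    return kept

demos = [f"review {i}: great phone, fast delivery → positive" for i in range(500)]
kept = pack_examples("Classify the sentiment:", demos, "review X: screen cracked →")
print(f"{len(kept)} of {len(demos)} demonstrations fit in the window")
```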
Key Facts
- 175B parameters; 300B training tokens; 96 Transformer decoder layers; trained by OpenAI
- Released via API only (not open-sourced); paper first posted May 2020 and published at NeurIPS 2020
- Few-shot GPT-3 roughly matched a fine-tuned BERT-Large baseline on SuperGLUE without any fine-tuning, though it remained below the fine-tuned state of the art
- The direct lineage: GPT-3 → InstructGPT (RLHF) → ChatGPT → GPT-4
- Zero-shot performance is weak even at 175B; adding few-shot examples substantially closes the gap
Connections
papers/key-papers · papers/scaling-laws · papers/rlhf · llms/transformer-architecture · prompting/techniques · llms/model-families
Open Questions
- What claims in this paper have since been challenged or superseded by follow-up work?
- What did later research reveal about the limitations of this approach?