Attention Is All You Need (Vaswani et al., 2017)
The 2017 paper that replaced RNNs with parallel self-attention — enabling BERT, GPT, and every LLM since; also covers the key architectural changes from 2017 to 2026 (RoPE, Pre-LN, SwiGLU, GQA).
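The core mechanism fits in a few lines. A minimal pure-Python sketch of scaled dot-product attention — illustrative only; real implementations are batched matrix multiplications with masking:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted mixture of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```

When a query aligns strongly with one key, the softmax puts nearly all weight on that key's value, which is the "soft lookup" intuition behind the architecture.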
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Adding intermediate reasoning steps to few-shot examples — "chain-of-thought" — dramatically improves LLM performance on multi-step reasoning tasks, but only emerges at large model scale (~100B parameters).
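The technique itself is pure prompt construction. A sketch, where the exemplar is adapted from the paper's canonical tennis-ball example and the answer-extraction regex is my own assumption, not from the paper:

```python
import re

# One worked exemplar showing intermediate reasoning before the answer.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question):
    # Prepend the worked exemplar so the model imitates step-by-step reasoning.
    return COT_EXEMPLAR + f"Q: {question}\nA:"

def extract_answer(completion):
    # Pull the final integer after the "The answer is" marker (assumed format).
    m = re.search(r"The answer is (-?\d+)", completion)
    return int(m.group(1)) if m else None
```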
Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, 2022)
Instead of collecting human labels for harmful outputs, train a model to critique and revise its own responses using a written set of principles (a "constitution"), then use those AI-generated preference labels to train the final model.
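The critique-and-revise loop can be sketched as follows; the `llm` callable is a hypothetical text-in/text-out interface, not Anthropic's actual pipeline:

```python
def constitutional_revision(llm, principles, response):
    """One critique-then-revise pass per constitutional principle.
    `llm` is any callable mapping a prompt string to a completion string."""
    for principle in principles:
        # Ask the model to critique its own response against the principle.
        critique = llm(
            f"Critique the response below against this principle: {principle}\n"
            f"Response: {response}"
        )
        # Then ask it to rewrite the response to address the critique.
        response = llm(
            f"Rewrite the response to address the critique.\n"
            f"Response: {response}\nCritique: {critique}"
        )
    return response
```

The original and revised responses then form an AI-generated preference pair for training, replacing human harmlessness labels.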
Direct Preference Optimization (Rafailov et al., 2023)
DPO shows that the RLHF reward model and PPO optimisation loop can be eliminated — the LLM itself encodes an implicit reward function, allowing direct optimisation on preference pairs with a simple classification-style loss.
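The loss is simple enough to state inline. A per-pair sketch in pure Python; variable names are mine, and the paper works with batched sequence-level log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    logp_* are log-probs of the chosen (w) and rejected (l) responses under
    the policy; ref_logp_* are the same under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The implicit-reward insight is visible here: the log-ratio against the frozen reference plays the role of the reward, so no separate reward model or PPO loop is needed.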
GPT-3: Language Models are Few-Shot Learners (Brown et al., 2020)
Scaling a decoder-only Transformer to 175B parameters with 300B tokens of training data produced a model that could perform new tasks from a handful of examples in the prompt — without any gradient updates.
GPT-4 Technical Report
OpenAI (2023) — GPT-4 is a large-scale multimodal model trained with RLHF. Passes the bar exam in the top 10%, demonstrates emergent capabilities, and introduces a systematic safety evaluation methodology with a published system card. The template for how frontier labs now report model capabilities.
Key Papers Reading List
Curated reading list for senior AI engineers — 22 papers across architecture, alignment, reasoning, RAG, efficient training, safety, and scaling, with a one-day and one-week priority order.
Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023)
Llama 2 (Touvron et al., Meta + Microsoft, July 2023) adds RLHF-tuned chat models (7B–70B), doubles the pretraining budget to 2T tokens, extends context to 4096 tokens, and introduces Ghost Attention for multi-turn consistency — with a commercial licence covering up to 700M monthly users.
LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
LLaMA (Touvron et al., Meta AI, Feb 2023) proved that a 13B model trained on public data only can outperform GPT-3 (175B) on most benchmarks — igniting the open-source LLM ecosystem.
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
Instead of fine-tuning all model weights, freeze the original weights and inject trainable low-rank decomposition matrices into the attention layers — achieving up to 10,000× fewer trainable parameters than full fine-tuning of GPT-3, with no added inference latency because the low-rank update can be merged back into the frozen weights.
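A minimal pure-Python sketch of the forward pass; class and variable names are mine, and real implementations operate on tensors:

```python
import random

def matvec(M, x):
    # Dense matrix-vector product over nested lists.
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r=2, alpha=4):
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen, never updated
        # A starts small-random, B starts at zero, so the update is exactly
        # zero at initialisation and training begins from the base model.
        self.A = [[random.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        low = matvec(self.A, x)      # project down to the r-dim bottleneck
        delta = matvec(self.B, low)  # project back up to d_out
        return [b + self.scale * d for b, d in zip(base, delta)]
```

Only `A` and `B` would receive gradients; after training, `scale * B @ A` can be added into `W` once, which is why inference costs nothing extra.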
Mechanistic Interpretability — Core Papers
The research programme of understanding what computations neural networks actually implement.
Mistral 7B and Mixtral 8x7B
Mistral 7B (Oct 2023) combined sliding-window attention (SWA) and grouped-query attention (GQA) to beat Llama 2 13B at 7B parameters; Mixtral 8x7B (Dec 2023) applied sparse mixture-of-experts (MoE) — 8 experts, 2 active per token — to match GPT-3.5 Turbo with 12.9B active from 46.7B total parameters.
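The top-2 routing that makes sparse MoE cheap can be sketched in a few lines; this is a simplified per-token view with my own names, omitting load balancing:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Sparse MoE layer for one token: run only the top-k experts and mix
    their outputs by router weights renormalised over the selected k.
    `experts` is a list of callables; only k of them are ever invoked."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    gate = softmax([router_logits[i] for i in topk])
    outs = [experts[i](x) for i in topk]
    return [sum(g * o[j] for g, o in zip(gate, outs)) for j in range(len(outs[0]))]
```

Compute per token scales with `k`, not with the number of experts, which is how 46.7B total parameters run at a 12.9B-active cost.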
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
Interleave chain-of-thought reasoning with tool-use actions in a single generation loop — the model reasons about what to do, takes an action, observes the result, reasons again — enabling LLMs to complete tasks requiring external information retrieval.
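The loop itself is a small driver around the model. A sketch, where the `llm` and tool interfaces and the `Action:`/`Answer:` line format are assumptions for illustration:

```python
def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct driver. The model is assumed to emit one line per call:
    either 'Action: tool[input]' or a final 'Answer: ...'."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step[len("Answer:"):].strip()
        if step.startswith("Action:"):
            # Parse 'Action: name[argument]', run the tool, feed back the result.
            name, _, arg = step[len("Action:"):].strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without an answer
```

The key design point is that observations are appended to the transcript, so the next reasoning step conditions on what the tool actually returned.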
RLHF: Reinforcement Learning from Human Feedback
Two papers that define RLHF as an alignment technique: Stiennon et al. (2020) demonstrated it at scale for summarisation; Ouyang et al. (2022) applied it to general instruction following with InstructGPT, the direct precursor to ChatGPT.
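The reward model at the centre of the pipeline is trained with a Bradley-Terry pairwise loss on human comparisons; a per-pair sketch with my own variable names:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected),
    where r_* are scalar rewards the model assigns to the human-preferred
    and dispreferred responses."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The trained reward model then scores rollouts during the PPO stage, standing in for a human rater.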
Scaling Laws for Neural Language Models (Kaplan et al., 2020) + Chinchilla (Hoffmann et al., 2022)
Two papers that define how LLM performance scales with compute, parameters, and data. Chinchilla corrected Kaplan's key conclusion — compute-optimal models need far more data than Kaplan's law implied, roughly 20 tokens per parameter — and changed how all subsequent models are trained.
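The Chinchilla rule of thumb makes a handy back-of-envelope calculator. A sketch using the standard approximations C ≈ 6·N·D and D ≈ 20·N (the exact exponents in the paper differ slightly):

```python
def chinchilla_optimal(compute_flops):
    """Compute-optimal parameter/token split under the rule of thumb:
    C = 6 * N * D with D = 20 * N  =>  N = sqrt(C / 120), D = 20 * N."""
    n_params = (compute_flops / 120.0) ** 0.5
    n_tokens = 20.0 * n_params
    return n_params, n_tokens
```

Plugging in Chinchilla's own budget (6 × 70B params × 1.4T tokens of FLOPs) recovers roughly 70B parameters and 1.4T tokens, the configuration the paper trained.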
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024)
A benchmark of 2,294 real GitHub issues from 12 popular Python repositories — to resolve each issue, a model must understand a full codebase, write a patch, and pass the existing test suite.
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick et al. (Meta, 2023) — language models can teach themselves to call external APIs by self-generating training data. The conceptual origin of tool use in LLMs before ChatGPT plugins or function calling.
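The self-supervision hinges on a filtering rule: a self-generated API call is kept for training only if conditioning on its result makes the following tokens easier to predict. A simplified sketch of that criterion (parameter names are mine):

```python
def keep_api_call(loss_without, loss_with, tau=1.0):
    """Simplified Toolformer filter: keep a self-generated API call only if
    inserting the call's result reduces the LM loss on the subsequent
    tokens by at least the threshold tau."""
    return (loss_without - loss_with) >= tau
```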