Attention Is All You Need (Vaswani et al., 2017)

The 2017 paper that replaced RNNs with parallel self-attention — enabling BERT, GPT, and every LLM since; key changes from 2017 to 2026 (RoPE, Pre-LN, SwiGLU, GQA).

The paper that introduced the Transformer architecture. Published June 2017. Every large language model in existence descends from this work.

Citation: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.


What It Proposed

Before this paper, sequence-to-sequence models (machine translation, text generation) used RNNs (LSTMs, GRUs). RNNs process sequences token by token: each step depends on the previous hidden state, so computation is fundamentally sequential and cannot be parallelised across the sequence.

The Transformer dispenses with recurrence entirely. It processes entire sequences in parallel using attention mechanisms. This unlocked massive parallelisation on GPUs.


Key Contributions

1. Self-Attention

For every token in a sequence, compute attention scores against every other token. The token attends to relevant context regardless of distance.

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Compared to RNNs:

  • RNN: information about the token at position 1 must travel through every intermediate hidden state to reach position 512; the longer the path, the more the signal degrades.
  • Transformer: position 1 can directly attend to position 512. No distance penalty.
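
A minimal NumPy sketch of the scaled dot-product formula above (single head, single sequence, no masking or batching; dimensions are toy values):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (seq_len, d_k) arrays for one sequence and one head.
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)             # row-wise softmax
        return weights @ V                                     # weighted sum of value vectors

    # Toy usage: 4 tokens, d_k = 8
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    out = scaled_dot_product_attention(Q, K, V)                # shape (4, 8)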

2. Multi-Head Attention

Run h attention operations ("heads") in parallel, each free to learn different relationships (syntactic, semantic, co-reference). Concatenate the head outputs and project them back to the model dimension.

"The cat sat on the mat". One head attends to subject-verb relationships, another to noun-pronoun co-reference, another to positional proximity.

3. Positional Encoding

Attention itself is order-agnostic (permuting the input tokens just permutes the output), so position information is injected via a sinusoidal encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
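
A direct NumPy translation of these two formulas (assumes an even d_model):

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, None]                   # (max_len, 1)
        i = np.arange(d_model // 2)[None, :]                 # (1, d_model // 2)
        angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model // 2)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                          # even dimensions
        pe[:, 1::2] = np.cos(angles)                          # odd dimensions
        return pe

    pe = sinusoidal_positional_encoding(max_len=512, d_model=64)   # shape (512, 64)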

The encoding is added elementwise to the token embeddings. Most modern models use RoPE instead, which applies position-dependent rotations to queries and keys rather than adding a vector, and handles relative position and longer contexts better. The paper's additive sinusoidal encoding is now largely a historical artefact.
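
A simplified sketch of that rotary scheme; pairing conventions differ between implementations, and this is one common form:

    import numpy as np

    def apply_rope(x, base=10000.0):
        # x: (seq_len, d) query or key matrix, d even.
        seq_len, d = x.shape
        pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
        freqs = base ** (-np.arange(0, d, 2) / d)             # (d // 2,)
        angles = pos * freqs[None, :]                          # (seq_len, d // 2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]                        # paired dimensions
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin                     # 2-D rotation per pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out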

4. Encoder-Decoder Architecture

The original Transformer was designed for translation (encoder encodes source, decoder generates target). For language modelling (GPT family), only the decoder is used. For understanding tasks (BERT), only the encoder.

Modern LLMs are decoder-only Transformers with causal (masked) attention. Each token only attends to prior tokens.
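
A sketch of causal masking on top of scaled dot-product attention: score entries for future positions are set to -inf before the softmax, so token t can only attend to tokens 1..t.

    import numpy as np

    def causal_attention(Q, K, V):
        seq_len, d_k = Q.shape
        scores = Q @ K.T / np.sqrt(d_k)
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # entries with j > i
        scores = np.where(future, -np.inf, scores)             # block attention to the future
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        return weights @ V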

5. Feed-Forward Sublayers

After attention, each token passes through a position-wise FFN independently. This is where much of the model's factual knowledge is thought to be stored (see llms/transformer-architecture).
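
The original block is just two linear layers with a ReLU in between, applied to each position separately (inner width is 4× d_model in the paper); a sketch:

    import numpy as np

    def position_wise_ffn(X, W1, b1, W2, b2):
        # X: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
        # The same weights are reused at every position; tokens do not mix here.
        return np.maximum(0.0, X @ W1 + b1) @ W2 + b2          # ReLU(X W1 + b1) W2 + b2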


Impact

  • Replaced RNNs and CNNs as the default architecture for NLP tasks
  • Enabled BERT (2018), GPT (2018), GPT-2 (2019), GPT-3 (2020), and every subsequent LLM
  • Extended to vision (ViT, 2020), audio, protein structure prediction (AlphaFold 2)
  • ~120,000 citations as of 2026 — among the most cited papers in computer science history

What's Changed Since 2017

The core architecture survives intact. The engineering details have evolved:

Component             2017 paper            Modern LLMs
Positional encoding   Sinusoidal (fixed)    RoPE (rotary, relative positions)
Normalisation         Post-LayerNorm        Pre-LayerNorm (more stable training)
Activation            ReLU                  SwiGLU
FFN width             4× model dim          ~4× (often ≈8/3× with gated SwiGLU)
Attention             Multi-head            Multi-head + GQA (inference efficiency)
Architecture          Encoder-decoder       Decoder-only (generative models)
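
For contrast with the original ReLU FFN, a sketch of the SwiGLU feed-forward block used by many modern LLMs; the hidden width is often reduced to roughly 8/3 × d_model so the gated variant keeps a comparable parameter count (details vary by model):

    import numpy as np

    def swiglu_ffn(X, W_gate, W_up, W_down):
        # X: (seq_len, d_model); W_gate, W_up: (d_model, d_hidden); W_down: (d_hidden, d_model).
        gate = X @ W_gate
        silu = gate / (1.0 + np.exp(-gate))                    # SiLU (swish) activation
        return (silu * (X @ W_up)) @ W_down                    # gated hidden state, projected back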

Key Facts

  • Published: June 2017 (arXiv); presented at NIPS 2017 (now NeurIPS)
  • Authors: 8 co-authors at Google Brain / Google Research, including Noam Shazeer
  • Citations: ~120,000 as of 2026 — among the most cited papers in computer science
  • Core formula: Attention(Q,K,V) = softmax(QK^T / √d_k) · V
  • Modern LLMs are decoder-only transformers with causal attention — only the decoder half of the original design
  • What changed: sinusoidal PE → RoPE; Post-LN → Pre-LN; ReLU → SwiGLU; MHA → MHA+GQA
  • Extended to: ViT (vision, 2020), AlphaFold 2 (protein, 2021), audio models

Connections

Open Questions

  • Will the decoder-only architecture remain dominant through the next generation of frontier models, or will encoder-decoder hybrids return for specific tasks?
  • The original sinusoidal encoding is now replaced by RoPE — are there properties of sinusoidal encoding that are actually lost?
  • How does the transition from ReLU to SwiGLU activation affect the interpretability of FFN layers?