AI Alignment Overview
Hub page for AI alignment — the set of techniques for ensuring AI systems do what humans intend. Covers RLHF, Constitutional AI, scalable oversight, and the open research problems that remain unsolved.
AI alignment is the problem of building AI systems that reliably pursue goals that humans actually want. It spans training methodology, evaluation, interpretability, and governance. At Anthropic, alignment is the core research agenda — Claude is the product of applied alignment research.
Why Alignment is Hard
Capable AI systems are optimised to score well on their training objective, which is a proxy for what we actually want. The gap between proxy and intent grows as systems become more capable:
- Specification gaming: A system achieves high reward by exploiting loopholes in the reward function rather than by doing what was intended.
- Distributional shift: Behaviour that was aligned during training breaks down on out-of-distribution inputs.
- Deceptive alignment: In theory, a sufficiently capable system could appear aligned during training and evaluation while pursuing different goals in deployment.
- Goal misgeneralisation: A system's capabilities generalise out of distribution but its goal does not — it competently pursues a proxy goal that happened to coincide with the intended one during training (e.g., the CoinRun agent that learned "go right" rather than "reach the coin").
None of these are purely hypothetical. Specification gaming has been observed repeatedly in RL systems (e.g., OpenAI's CoastRunners boat-racing agent, which looped endlessly to collect reward targets instead of finishing the race). Distributional shift underlies many of today's real-world ML failures.
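As a toy illustration, the sketch below (hypothetical code, not drawn from any real system) hill-climbs a mis-specified proxy reward and steadily drifts away from the intended objective:

```python
# Toy specification gaming: optimising a proxy reward diverges from the
# intended objective. The reward functions here are illustrative only.
import random

def true_reward(x: float) -> float:
    # Intended objective: stay close to the target value 1.0.
    return -abs(x - 1.0)

def proxy_reward(x: float) -> float:
    # Mis-specified proxy: "bigger is better" agrees with the intended
    # objective near 0 but diverges past the target.
    return x

x = 0.0
for _ in range(1000):
    candidate = x + random.uniform(-0.1, 0.1)
    if proxy_reward(candidate) > proxy_reward(x):  # optimise the proxy only
        x = candidate

print(f"x={x:.2f} proxy={proxy_reward(x):.2f} true={true_reward(x):.2f}")
# Proxy reward keeps climbing while true reward degrades: the signature
# failure mode of specification gaming.
```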
Core Techniques
Reinforcement Learning from Human Feedback (RLHF)
Train a reward model on human preference comparisons, then use RL (typically PPO) to optimise the policy against that reward model. This was the recipe that first made preference-based alignment practical for large language models.
Limitations: the reward model inherits the preferences and biases of the humans who labelled the data; labelling is expensive at scale; and the policy can exploit an imperfect reward model (reward hacking).
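A minimal sketch of the reward-model training step, assuming a Bradley-Terry preference model; the tensors here are hypothetical stand-ins for the base LM's final hidden states:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scalar reward head over LM hidden states (real pipelines wrap a full LM)."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.scorer(h).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

# Batch of (chosen, rejected) hidden states for 8 preference pairs.
h_chosen = torch.randn(8, 768)
h_rejected = torch.randn(8, 768)

# Bradley-Terry loss: -log sigma(r_chosen - r_rejected) pushes the
# chosen completion's reward above the rejected one's.
loss = -F.logsigmoid(model(h_chosen) - model(h_rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```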
See safety/alignment for the technical treatment.
Constitutional AI (CAI)
Anthropic's approach. The model is given a set of principles (a constitution) and trained to critique and revise its own outputs against those principles; a subsequent RL stage uses AI-generated preference labels (RLAIF). This reduces dependence on human labelling for the safety-relevant training signal.
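A rough sketch of the critique-and-revise loop; `generate`, `PRINCIPLE`, and the prompt wording are illustrative stand-ins, not Anthropic's actual pipeline or constitution:

```python
PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call; replace with a real LM/API.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    response = generate(user_prompt)
    for _ in range(n_rounds):
        critique = generate(
            f"Identify ways this response violates the principle:\n"
            f"{PRINCIPLE}\n\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\n\nResponse: {response}"
        )
    # Revised outputs become supervised training data (the SL-CAI stage).
    return response
```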
See safety/constitutional-ai for the full treatment.
Direct Preference Optimisation (DPO)
Preference optimisation without an explicit reward model. The policy is directly optimised against preference data using a closed-form objective. Simpler than PPO-based RLHF; widely adopted in open-source fine-tuning. See fine-tuning/dpo-grpo.
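A sketch of the DPO objective, assuming per-sequence log-probabilities for the policy and a frozen reference model have already been computed (function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are the beta-scaled log-ratios against the reference.
    chosen = policy_chosen_logp - ref_chosen_logp
    rejected = policy_rejected_logp - ref_rejected_logp
    # -log sigma(beta * margin): raise the chosen log-ratio above the rejected.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

# Toy usage with random log-probs for 8 preference pairs; in a real pipeline
# the policy log-probs carry gradients and no reward model is needed.
print(dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8)))
```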
Scalable Oversight
Techniques for supervising AI systems that are more capable than the humans overseeing them. Approaches: debate (AI systems argue positions for human arbiters), amplification (use AI to assist human oversight), recursive reward modelling.
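A minimal sketch of a single-round debate protocol; `ask` is a hypothetical stand-in for a model call, and real debate setups run multi-turn exchanges:

```python
def ask(prompt: str) -> str:
    # Hypothetical stand-in for an LM call.
    return f"[model output for: {prompt[:40]}...]"

def debate(question: str, answer_a: str, answer_b: str) -> str:
    # Two (possibly stronger) debaters argue opposing answers.
    arg_a = ask(f"Argue that '{answer_a}' is the correct answer to: {question}")
    arg_b = ask(f"Argue that '{answer_b}' is the correct answer to: {question}")
    # A weaker judge only has to compare arguments, not solve the task.
    return ask(
        f"Question: {question}\n"
        f"Argument for A ({answer_a}): {arg_a}\n"
        f"Argument for B ({answer_b}): {arg_b}\n"
        "Which argument is stronger, A or B?"
    )
```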
See safety/scalable-oversight.
Mechanistic Interpretability
Understanding what computations occur inside a neural network to verify alignment claims. If we can identify circuits responsible for deceptive behaviour, we can edit or monitor them.
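As a sketch of the monitoring idea, the hypothetical snippet below uses a PyTorch forward hook to capture a layer's activations at inference time; a real setup would target a circuit identified by interpretability work rather than an arbitrary layer:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
captured = {}

def capture(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # store activations for inspection
    return hook

# Attach the hook to a suspect layer (here, arbitrarily, the ReLU output).
model[1].register_forward_hook(capture("relu_out"))

_ = model(torch.randn(1, 16))
# If a circuit implicated in unwanted behaviour were identified, its
# activations could be monitored (or edited) at this point.
print(captured["relu_out"].abs().max())
```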
See safety/mechanistic-interpretability.
Alignment Research Organisations
| Organisation | Agenda |
|---|---|
| Anthropic | Constitutional AI, interpretability, RSP, frontier model safety |
| OpenAI | Scalable oversight, weak-to-strong generalisation (Superalignment team, disbanded in 2024) |
| DeepMind | Reward modelling, debate, agent safety |
| ARC (Alignment Research Center) | Eliciting latent knowledge, heuristic arguments (dangerous-capability evaluations spun out as METR) |
| Redwood Research | Adversarial training, causal scrubbing |
| MIRI | Logical uncertainty, decision theory |
Open Problems
- Scalable oversight at superhuman capability: Human feedback breaks down when the AI outperforms the human evaluator.
- Inner alignment: Ensuring the trained model (a potential mesa-optimiser) actually pursues the base objective it was trained on, rather than a correlated proxy goal.
- Eliciting latent knowledge: Getting models to surface what they "know" to be true rather than what sounds good.
- Robustness to adversarial inputs: Aligned behaviour under distribution shift and red-team attacks.
Key Facts
- RLHF was the technique that made ChatGPT-quality models possible; DPO is now widely adopted in open-source fine-tuning for its simplicity
- Constitutional AI (Anthropic) generates its own safety-relevant preference labels, sharply reducing the human labelling needed for safety training
- Responsible Scaling Policy (RSP): Anthropic's public framework committing to pause training or deployment unless safety measures keep pace with models' dangerous-capability levels
- Scalable oversight is unsolved at superintelligent capability levels — active research area
- Mechanistic interpretability is the main proposed tool for verifying alignment claims empirically, though it is not yet up to that task
Connections
- safety/alignment — RLHF, DPO, and the alignment training pipeline in depth
- safety/constitutional-ai — Anthropic's Constitutional AI methodology
- safety/scalable-oversight — debate, amplification, and recursive reward modelling
- safety/mechanistic-interpretability — circuits, superposition, and activation patching
- safety/responsible-ai — Responsible Scaling Policy and real-world deployment governance
- safety/red-teaming-methodology — evaluating alignment through adversarial testing
- fine-tuning/dpo-grpo — DPO and GRPO as practical alignment training methods
- landscape/ai-labs — alignment research agendas at each major lab
Open Questions
- Is scalable oversight viable beyond current capability levels — does the verifier need to be strictly more capable than the prover?
- Does DPO-trained alignment generalise out of distribution, or does it learn shallow surface patterns similar to RLHF?
- What's the empirical evidence that Constitutional AI reduces harmful outputs more than RLHF given a comparable-quality preference dataset?