AI Alignment Overview
Hub page for AI alignment — the set of techniques for ensuring AI systems do what humans intend. Covers RLHF, Constitutional AI, scalable oversight, and the open research problems that remain unsolved.
AI Safety
AI safety landscape — Anthropic's RSP (ASL-1 through ASL-4 capability thresholds), Constitutional AI for harmlessness, mechanistic interpretability, and the four core alignment failure modes.
Constitutional AI (CAI)
CAI replaces human harmlessness labellers with AI self-critique guided by a public 16-principle constitution — Phase 1 (SL-CAF) generates revised responses, Phase 2 (RLAIF) generates preference labels.
Mechanistic Interpretability
Sparse autoencoders decompose polysemantic neurons into millions of monosemantic features — Scaling Monosemanticity (2024) applied SAEs to Claude 3 Sonnet; activation steering enables direct behavioural intervention.
Red Teaming Methodology
Red teaming is structured adversarial testing — humans or automated systems trying to break an AI model before deployment. It finds failure modes that benchmarks miss and is required for any serious safety programme.
Responsible AI
Responsible AI — the FATE framework (Fairness, Accountability, Transparency, Explainability) plus safety, privacy, and robustness. AWS tooling: Clarify (bias/SHAP), Guardrails (safety), A2I (human oversight), Model Cards (accountability). AIF-C01 Domain 4 core.
Scalable Oversight
Scalable oversight designs verification mechanisms that work even when the AI is smarter than the human checking it. Key approaches — debate, critique, recursive reward modeling, prover-verifier games — are Anthropic's second-highest priority research area as of 2025.