Safety

7 pages

AI Alignment Overview

Hub page for AI alignment — the set of techniques for ensuring AI systems do what humans intend. Covers RLHF, Constitutional AI, scalable oversight, and the open research problems that remain unsolved.

alignmentsafetyrlhfconstitutional-ai

AI Safety

AI safety landscape — Anthropic's RSP (ASL-1 through ASL-4 capability thresholds), Constitutional AI for harmlessness, mechanistic interpretability, and the four core alignment failure modes.

safetyalignmentrspred-teaming

Constitutional AI (CAI)

CAI replaces human harmlessness labellers with AI self-critique guided by a public 16-principle constitution — Phase 1 (SL-CAF) generates revised responses, Phase 2 (RLAIF) generates preference labels.

constitutional-airlhfalignmentanthropic

Mechanistic Interpretability

Sparse autoencoders decompose polysemantic neurons into millions of monosemantic features — Scaling Monosemanticity (2024) applied SAEs to Claude 3 Sonnet; activation steering enables direct behavioural intervention.

interpretabilitymechanistic-interpretabilityfeaturescircuits

Red Teaming Methodology

Red teaming is structured adversarial testing — humans or automated systems trying to break an AI model before deployment. It finds failure modes that benchmarks miss and is required for any serious safety programme.

red-teamingsafetyadversarialevaluation

Responsible AI

Responsible AI — the FATE framework (Fairness, Accountability, Transparency, Explainability) plus safety, privacy, and robustness. AWS tooling: Clarify (bias/SHAP), Guardrails (safety), A2I (human oversight), Model Cards (accountability). AIF-C01 Domain 4 core.

responsible-aifairnesstransparencyaccountability

Scalable Oversight

Scalable oversight designs verification mechanisms that work even when the AI is smarter than the human checking it. Key approaches — debate, critique, recursive reward modeling, prover-verifier games — are Anthropic's second-highest priority research area as of 2025.

scalable-oversightalignmentdebateamplification