AI Safety

AI safety landscape — Anthropic's RSP (ASL-1 through ASL-4 capability thresholds), Constitutional AI for harmlessness, mechanistic interpretability, and the four core alignment failure modes.

Updated Invalid Date·

safety alignment rsp red-teaming interpretability anthropic agi-safety

The technical and policy work to ensure powerful AI systems behave in ways that are safe and beneficial. Distinct from AI security (which is about protecting systems from external attackers) — safety is about the model itself.

The Core Problem

Advanced AI systems might pursue objectives that diverge from human values, either because:

Specification failure — the objective we trained doesn't actually capture what we want
Distribution shift — the model behaves well in training distribution but badly in novel situations
Deceptive alignment — a sufficiently capable model might learn to appear aligned during evaluation but pursue other goals in deployment
Emergent capabilities — capabilities the model has that we didn't intend and may not have evaluated

The concern is not that AI is "evil" but that misaligned objectives at high capability levels could cause irreversible harm.

Anthropic's Approach

Responsible Scaling Policy (RSP)

A public commitment to evaluate models for dangerous capabilities before deployment, and pause if a safety threshold is crossed.

AI Safety Levels (ASL):

Level	Description	Current status
ASL-1	No potential for large-scale harm	All current models below ASL-2 by definition
ASL-2	Marginal uplift to bioweapons; basic CBRN knowledge	Claude 4.x evaluated at ASL-2 boundary
ASL-3	Substantial uplift to CBRN weapons; advanced cyberattacks	Threshold that would trigger deployment pause
ASL-4	Weapons of mass destruction, destabilising capabilities	Theoretical; not yet approached

If a model passes ASL-3 evaluations, Anthropic commits to either deploy with additional mitigations or not deploy at all until mitigations are in place.

Constitutional AI

See safety/constitutional-ai. The training methodology that produces Claude's harmlessness.

Mechanistic Interpretability

Understanding what computations are happening inside the model. See safety/mechanistic-interpretability.

Red Teaming

Systematically trying to elicit dangerous capabilities or alignment failures from models.

Types:

Automated red-teaming — LLM generates adversarial prompts, tests them, iterates
Human red-teaming — domain experts (biosecurity, cybersecurity, etc.) probe for specific uplift
Capability elicitation — finding whether a model has a capability, even if it's hidden

Evaluation criteria:

Does the model provide meaningful uplift (assistance beyond what's freely available)?
Is the assistance specific enough to be actionable?
Does the model refuse when it should?

False negatives (model has capability, doesn't reveal it) are as dangerous as true positives. Capability elicitation aims to find the maximum capability, not the average behaviour.

Alignment Research Areas

Scalable Oversight

How do humans supervise AI systems that are smarter than them? If the model can write code that humans can't review, how do we know if the code is correct/safe?

Approaches:

Debate — two AIs argue, human judges; the cheating AI's argument will be harder to defend
Amplification — recursively break tasks into sub-tasks humans can evaluate
Constitutional AI — AI judges AI using a written constitution

Superalignment

OpenAI's (now disbanded) team's goal: use AI to help align more powerful AI. The idea: a present-day aligned model trains a future more powerful model. Currently theoretical.

Model Organisms of Misalignment

Building models that deliberately exhibit misalignment to study it. If we can create a model that deceptively aligns, we can study how to detect and prevent it.

Interpretability for Safety

Using mechanistic interpretability findings to build safety tools: feature-based classifiers, activation steering, circuit-level understanding of refusal behaviour.

Safety vs Helpfulness Trade-off

Overtly safe models are less useful. Overly helpful models are more dangerous. Anthropic's thesis: this is mostly a false trade-off at the frontier. A well-designed AI can be both helpful and safe. Evidence: Claude scores high on both helpfulness benchmarks and safety evaluations.

The real tension is at the edge cases: requests with both legitimate and illegitimate uses. Claude handles this by considering the full population of people likely to ask a given question.

Other Labs' Approaches

Lab	Approach	Key commitments
Anthropic	RSP, CAI, interpretability, pause commitment	Most safety-explicit
OpenAI	Staged deployment, red-teaming, O-series reasoning	Safety team conflict (March 2023 departure of Ilya)
Google DeepMind	Safety research, EU AI Act compliance	AlphaProof: maths verification
Meta FAIR	Responsible release, safety paper publishing	Open weights with some capability restrictions

Key Facts

RSP ASL levels: ASL-2 (marginal CBRN uplift), ASL-3 (substantial CBRN uplift — deployment pause trigger), ASL-4 (WMD-level — theoretical)
Claude 4.x: evaluated at ASL-2 boundary
Four alignment failure modes: specification failure, distribution shift, deceptive alignment, emergent capabilities
Capability elicitation: false negatives (hidden capability) as dangerous as true positives — red-teaming finds maximum capability, not average
Safety vs helpfulness: Anthropic's thesis is mostly a false trade-off; Claude scores high on both

Connections

safety/constitutional-ai — Anthropic's training methodology for harmlessness
safety/mechanistic-interpretability — understanding model internals
llms/claude — RSP levels for Claude models
security/owasp-llm-top10 — external attack threat model (distinct from safety)
security/red-teaming — red-teaming methodology
safety/scalable-oversight — designing oversight that works even when AI exceeds human capability

Open Questions

How will ASL-3 evaluation criteria evolve as models become more capable at CBRN-adjacent tasks?
Is deceptive alignment empirically observable in current frontier models, or still theoretical?
Does interpretability (reading activations) provide a reliable path to detecting model misalignment before deployment?