Train harmless AI using AI feedback instead of human labels — a constitution of principles guides self-critique, revision, and preference learning at scale.
By 2022, RLHF had become the dominant recipe for making language models safe and helpful. The pipeline was clear: collect thousands of human preference labels comparing model responses, train a reward model on those labels, then optimize the LM with PPO against that reward.
But there was a deep problem with this approach, specifically for harmlessness.
Teaching a model to be harmless required an army of crowdworkers to read toxic, offensive, and dangerous model outputs, then label which response was "less harmful." This was expensive, slow, psychologically taxing, and didn't scale. Worse, the resulting models had a frustrating failure mode: when trained too aggressively for harmlessness, they became evasive.
Ask a sensitive question like "Why are prisons full of Black and Brown people?" and the model would respond: "Sorry, I cannot respond to this content." Not harmful, sure. But also completely useless. The crowdworkers had learned to reward evasion as the safest strategy, and the model internalized that.
Supervision cost vs. harmlessness quality for each approach. RLHF requires massive human labeling effort and still produces evasive models.
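The core issues with RLHF for harmlessness follow directly: human labeling is expensive, slow, and psychologically taxing; it does not scale; and the models it produces drift toward evasion.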
Here is the idea that changes everything: instead of hiring humans to label harmful outputs, what if we asked the AI itself to evaluate harmfulness, guided by a set of written principles?
Think about it. A large language model already "knows" that helping someone make a bomb is wrong. It can articulate why racism is harmful. It can explain what makes a response evasive versus genuinely helpful. The knowledge is there in the weights from pretraining. The problem was that we had no way to extract and apply this knowledge systematically during training.
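Constitutional AI solves this with a radically simple idea: write the principles down once, then have the model apply them to its own outputs.
The method has two phases, directly mirroring how a person might improve: a supervised learning (SL) phase, in which the model critiques and revises its own responses and is finetuned on the revisions, and a reinforcement learning (RL) phase, in which AI-generated preference labels train the preference model that supplies the reward.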
For harmlessness, the human's role shifts from labeling hundreds of thousands of examples to writing ~16 principles. That's it. A few dozen sentences replace an entire crowdworker operation for that part of the pipeline; helpfulness still relies on human labels.
The full Constitutional AI pipeline. SL phase bootstraps good behavior; RL phase refines it with AI-generated preference labels.
What exactly goes into the constitution? It's a collection of principles — short natural language instructions that tell the model what to look for when evaluating harmfulness. The paper used 16 principles for the SL phase and 16 for the RL phase.
Here are some actual principles from the paper, used during the SL critique phase:
And for the RL preference evaluation phase:
A few critical design choices stand out:
This is the heart of the supervised stage, and it's beautifully simple. Let's walk through a concrete example from the paper, step by step.
Start with a helpful-only RLHF model (trained to follow instructions, not trained to refuse harmful requests). Feed it a red-team prompt designed to elicit harmful content:
The model happily complies because it was only trained to be helpful, not harmless.
Now we append a critique request, randomly drawn from the constitution:
Then we ask for a revision:
Notice the revision is not evasive. It doesn't say "I can't help with that." It explains why the request is problematic. This is exactly the behavior we want.
This critique-revision cycle can be applied multiple times, with a fresh principle sampled each time. The paper found that harmlessness improves monotonically with each revision (measured by preference model scores), though with diminishing returns after 2-3 rounds.
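The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `PRINCIPLES`, `critique_and_revise`, and `fake_generate` are hypothetical names, the principle wording and the prompt layout are placeholders, and `generate` stands in for whatever call samples text from the helpful-only model.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical (critique_request, revision_request) pairs standing in for the
# constitution. The wording is illustrative, not quoted from the paper.
PRINCIPLES: List[Tuple[str, str]] = [
    ("Identify ways in which the last response is harmful.",
     "Rewrite the response to remove the harmful content."),
    ("Point out anything in the last response that is unethical or illegal.",
     "Rewrite the response so that it is ethical and legal."),
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    n_rounds: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return [initial_response, revision_1, ..., revision_n]."""
    rng = rng or random.Random()
    response = generate(f"Human: {prompt}\n\nAssistant:")
    history = [response]
    for _ in range(n_rounds):
        # A fresh principle is sampled for every round.
        critique_req, revision_req = rng.choice(PRINCIPLES)
        context = f"Human: {prompt}\n\nAssistant: {response}\n\n"
        critique_prompt = context + f"Critique Request: {critique_req}\n\nCritique:"
        critique = generate(critique_prompt)
        revision_prompt = (
            critique_prompt
            + f" {critique}\n\nRevision Request: {revision_req}\n\nRevision:"
        )
        # The revision replaces the response that the next round will critique.
        response = generate(revision_prompt)
        history.append(response)
    return history


def fake_generate(text: str) -> str:
    """Stand-in for a language model call, used only to exercise the loop."""
    if text.endswith("Critique:"):
        return "The response could cause harm."
    if text.endswith("Revision:"):
        return "revised response"
    return "initial response"


if __name__ == "__main__":
    for step, text in enumerate(critique_and_revise("example prompt", fake_generate, n_rounds=2)):
        print(step, text)
```

Keeping every intermediate revision in `history` makes it easy to score each round separately, which is how an improvement-per-revision curve would be measured.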
The full critique-revision cycle. Each revision progressively removes harmful content while maintaining engagement.
The final revised responses (paired with their original prompts) become supervised training data. The paper used 182,831 red-team prompts with 4 revisions each, plus 135,296 helpfulness prompts to maintain the model's general capabilities. Finetuning a pretrained model on this data produces the SL-CAI model.
A critical finding: critiques improve results for smaller models but make little difference for larger ones (Figure 7 of the paper). Even when critiques were factually inaccurate or overstated, the revisions were still more harmless. The critique step acts as a scaffold — it helps the model "think through" the problem, even if the thinking is imperfect.
The SL phase gets the model to a reasonable starting point, but RL can push it much further. The standard RLHF recipe uses human preferences to train a preference model (PM), then uses PPO to optimize against it. CAI replaces the human preferences with AI preferences for harmlessness — this is RLAIF (RL from AI Feedback).
Take the SL-CAI model and generate two responses to each red-team prompt. These response pairs become the candidates for comparison.
Present each pair to a feedback model (a pretrained LM) in a multiple choice format:
The feedback model computes log probabilities for the tokens "(A)" and "(B)". Normalized so that the two sum to 1, these become soft preference labels: not binary 0/1, but probabilities like 0.73 for A and 0.27 for B. This is crucial: soft labels preserve the feedback model's uncertainty and produce better-calibrated preference models.
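Here is a small sketch of that labeling step, under stated assumptions. The prompt template in `build_comparison_prompt` only approximates a multiple choice layout and its exact wording is an assumption, and the function names are hypothetical. `soft_label` takes the two log probabilities however they were obtained and normalizes them with a two-way softmax.

```python
import math


def build_comparison_prompt(
    conversation: str, principle: str, response_a: str, response_b: str
) -> str:
    """Assemble a multiple choice prompt for the feedback model (assumed layout)."""
    return (
        "Consider the following conversation between a human and an assistant:\n"
        f"{conversation}\n"
        f"{principle}\n"
        "Options:\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )


def soft_label(logprob_a: float, logprob_b: float) -> float:
    """Probability that (A) is preferred, normalized over the two options."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    return exp_a / (exp_a + exp_b)


if __name__ == "__main__":
    # The raw token probabilities need not sum to 1; normalization fixes that.
    p_a = soft_label(math.log(0.365), math.log(0.135))
    print(round(p_a, 2), round(1 - p_a, 2))
```

Normalizing over just the two option tokens discards whatever probability mass the feedback model put on other continuations, so the label for B is always one minus the label for A.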
The AI-generated harmlessness labels are mixed with human helpfulness labels (135,296 from crowdworkers). This produces a hybrid PM: human feedback for helpfulness, AI feedback for harmlessness.
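One way to see how the two label sources combine is through the training loss. The sketch below assumes the preference model outputs a scalar score per response and is trained with ordinary cross-entropy on the score difference; that loss form and the names `pm_loss` and `batch` are assumptions for illustration, not details taken from the paper. Human helpfulness labels enter as hard targets (0 or 1) and AI harmlessness labels as soft targets.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pm_loss(score_a: float, score_b: float, target_a: float) -> float:
    """Cross-entropy between sigmoid(score_a - score_b) and the target P(A preferred)."""
    d = score_a - score_b
    # -log(sigmoid(d)) == softplus(-d) and -log(1 - sigmoid(d)) == softplus(d)
    return target_a * softplus(-d) + (1.0 - target_a) * softplus(d)


# Hypothetical mixed batch of (score_a, score_b, target_a) triples.
batch = [
    (1.2, 0.3, 1.0),   # human helpfulness comparison: hard label, A preferred
    (0.1, 0.9, 0.27),  # AI harmlessness comparison: soft label
]

if __name__ == "__main__":
    mean_loss = sum(pm_loss(a, b, t) for a, b, t in batch) / len(batch)
    print(mean_loss)
```

With a soft target, the loss is minimized when the model's predicted preference equals the target probability, so the preference model is not pushed toward extreme score gaps on pairs the feedback model was unsure about.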
The last step is standard PPO optimization against the hybrid PM, starting from the SL-CAI model. The SL phase was critical here: it moved the model "on distribution" so that RL didn't need to explore as far.
The RLAIF pipeline: AI generates comparison labels using constitutional principles, these train a preference model, which provides the reward signal for RL.
One of the paper's most interesting findings: when the feedback model is asked to "think step by step" before choosing between responses, the quality of its evaluations improves dramatically.
Here's how it works. Instead of directly choosing (A) or (B), the feedback model first generates a chain-of-thought explanation:
The results were striking. On a combined HHH (helpful, honest, harmless) evaluation of 438 binary comparisons:
There was a practical problem with CoT labels: the chain-of-thought reasoning typically commits to one answer, making the final probabilities extremely confident (near 0 or 1). These overconfident labels caused RL to produce extreme, preachy responses.
The solution: clamp the CoT probabilities to the 40-60% range. Even when the CoT reasoning strongly favors one response, the training label never exceeds 60% confidence. Clamping mitigates reward hacking and produces more balanced, natural responses.
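The clamping step is small enough to show directly. This is a minimal sketch; the function name `clamp_label` is hypothetical, and the default bounds are the 40-60% range mentioned above.

```python
def clamp_label(p_a: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp the probability that (A) is preferred into [low, high]."""
    return min(max(p_a, low), high)


if __name__ == "__main__":
    # Near-certain CoT labels are pulled back to the edge of the range;
    # labels already inside the range pass through unchanged.
    for raw in (0.98, 0.02, 0.55):
        print(raw, "->", clamp_label(raw))
```

The direction of the preference survives the clamp, but its strength is capped, which limits how hard any single overconfident label can push the preference model.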
How evaluation accuracy scales with model size. Chain-of-thought dramatically closes the gap with human-feedback preference models.
Beyond improving label quality, CoT has a transparency benefit: the model's reasoning about harmfulness is visible. You can read exactly why the model preferred one response over another. This is fundamentally more interpretable than a preference model's opaque scalar score, and it aligns with the paper's goal of making AI decision-making more legible.
The results tell a clear story: Constitutional AI produces models that are harmless and non-evasive, resolving the core tension of RLHF.
The paper used crowdworker evaluations to compute Elo scores for helpfulness and harmlessness. Models were compared in open-ended conversation, and workers were instructed to prefer thoughtful engagement over evasion. The key findings:
Elo scores from crowdworker evaluations. Points further right are later in RL training. RL-CAI achieves a Pareto improvement: more harmless at equal helpfulness.
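To read Elo differences as preference rates, it helps to convert them to win probabilities. The sketch below assumes the standard Elo convention, in which a 400-point gap corresponds to 10:1 odds; the function name and the example scores are illustrative, not numbers from the paper.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical scores: a 100-point lead for model A.
    print(elo_win_probability(100.0, 0.0))
```

Only differences matter here: adding the same constant to every model's score leaves all win probabilities unchanged, so Elo plots are meaningful up to an overall shift.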
Compare the same prompt across models:
The RL-CAI model doesn't refuse. It engages thoughtfully, acknowledges the question's seriousness, and provides a substantive, harmless answer. This pattern held across PALMS, LaMDA, and InstructGPT prompt sets.
The paper also found that RL-CAI could be over-trained, producing Goodharting behavior: responses that included boilerplate phrases like "you are valid, valued, and cared for" regardless of context. This was mitigated by the anti-preachy constitutional principles and probability clamping.
How robust is the CAI model against adversarial attacks? The paper used extensive red teaming — both human-written and model-generated prompts — to stress-test the system.
The training data included 42,496 human-written red-team prompts from crowdworkers who were specifically tasked with baiting the model into saying something harmful. On top of that, 140,335 additional prompts were generated by few-shot prompting a pretrained model, for a total of 182,831. This is automated red teaming at scale.
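Few-shot generation of new prompts can be sketched as follows. This is only an illustration of the general technique: the function name, the "Prompt N:" layout, and the number of shots are assumptions, and the example strings are placeholders rather than real red-team data.

```python
import random
from typing import List, Optional


def build_few_shot_prompt(
    examples: List[str], k: int = 3, rng: Optional[random.Random] = None
) -> str:
    """Show k sampled human-written prompts, then leave a slot for the model to fill."""
    rng = rng or random.Random()
    shots = rng.sample(examples, k)
    lines = [f"Prompt {i}: {shot}" for i, shot in enumerate(shots, start=1)]
    # The trailing open slot is what the pretrained model is asked to continue.
    lines.append(f"Prompt {k + 1}:")
    return "\n".join(lines)


if __name__ == "__main__":
    human_written = [f"<human-written red-team prompt {i}>" for i in range(1, 6)]
    print(build_few_shot_prompt(human_written, k=3, rng=random.Random(0)))
```

Resampling the shots on every call varies the context, which is one simple way to encourage diversity in the generated prompts.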
Beyond relative Elo comparisons, the paper also collected absolute harmfulness scores on a 0-4 scale (where higher is more harmful), using 64 hand-picked red-team prompts and 256 model responses per prompt.
How absolute harmfulness changes during RL training. Helpful RLHF gets MORE harmful; CAI models get progressively LESS harmful.
The results were stark:
The paper recognized that robustness remained an open challenge. While CAI models resisted direct harmful requests well, more subtle attacks — implicit harms, context-dependent manipulations, multi-turn baiting — were harder. The authors explicitly motivated future work on using chain-of-thought to reason through "hidden risks of certain behaviors, in order to mitigate increasingly subtle and implicit harms."
Constitutional AI is not just a training technique. It's a statement about how AI alignment could work at scale.
The central premise: as AI systems become more capable, we need them to help supervise other AI systems. Humans can't review every model output. But humans can write principles, and AI can apply those principles at scale. This is the core of scalable oversight — using AI to amplify human supervision.
CAI takes a radical step in this direction: the human's input for harmlessness is reduced to ~16 written principles. Everything else (the critique, the revision, the preference labeling) is done by the AI itself. The human provides direction; the AI provides labor.
The constitution creates a clear hierarchy of authority:
This hierarchy is intentionally transparent. The constitution is readable by anyone. The critiques and chain-of-thought reasoning are auditable. Compare this to RLHF, where the training objective is implicit in hundreds of thousands of opaque binary labels.
The SL phase of CAI is a form of self-improvement: the model critiques and revises its own outputs, then trains on the improved versions. This is a controlled, bounded version of the recursive self-improvement concept from AI safety literature. The constitution acts as a guardrail — the model can only improve along directions specified by human-written principles.
The paper honestly addresses the dual-use risks. By lowering the barrier to controlling AI behavior, CAI also makes it easier to train AI for harmful purposes. The SL method is particularly accessible since it doesn't require sophisticated RL infrastructure. And by reducing the need for human feedback, CAI makes it possible to deploy undertested systems. These are genuine risks that come with the efficiency gains.
RLHF (Christiano et al., 2017): The foundational framework. CAI keeps the RLHF structure (preference model + RL) but replaces human harmlessness labels with AI-generated ones guided by principles.
Training a Helpful and Harmless Assistant (Bai et al., 2022): Anthropic's prior work on RLHF that identified the helpfulness-harmlessness tension and evasion problem. CAI was designed specifically to resolve these issues.
Chain-of-thought prompting (Wei et al., 2022): The reasoning technique that dramatically improved AI evaluation quality. CAI's CoT evaluations approached human-feedback PM quality.
Red teaming LMs with LMs (Perez et al., 2022): Automated adversarial testing. CAI's non-evasive models enabled more effective red teaming at scale.
DPO (Rafailov et al., 2023): DPO removes the explicit reward model and the RL loop by optimizing the policy directly on preference pairs. It is complementary to CAI, since the preference data it consumes can come from humans or from AI feedback.
Claude's character: Claude's system prompt and training are descendants of the constitutional approach. The idea that a model's behavior should be governed by readable, debatable principles — not opaque training data — became core to Anthropic's alignment strategy.
RLAIF at scale: CAI showed that AI feedback could replace human feedback for specific behavioral dimensions. The approach has since spread: Llama, Gemini, and other model families are reported to use AI-generated training signals for various aspects of alignment.