Constitutional AI
by Benjamin Mann • Co-founder at Anthropic
Former architect of GPT-3 at OpenAI, now leads product engineering and safety alignment at Anthropic.
🎙️ Episode Context
Benjamin Mann discusses the trajectory of AI development towards superintelligence by 2027-2028, the critical importance of AI safety, and Anthropic's unique approach to alignment. He details the implementation of Constitutional AI, the Responsible Scaling Policy (ASL levels), and the 'Resting in Motion' mindset for navigating high-stakes work.
Problem It Solves
Scales safety alignment without relying on extensive human labor; prevents the 'Monkey's Paw' problem, where an AI satisfies the letter of a request while misunderstanding human intent.
Framework Overview
A method for aligning AI models by training them to follow a set of natural language principles (a 'Constitution') using AI feedback (RLAIF), rather than relying solely on human contractors.
⚡ Step-by-Step Framework
Define Principles: Establish a constitution of values (e.g., helpful, harmless, honest, human rights).
Generate & Critique: The model generates a response, then critiques itself based on the constitution.
Recursive Revision: If the response violates principles, the model rewrites it.
Supervised Learning: The model is fine-tuned on these revised, compliant outputs.
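The steps above can be sketched as a simple critique-and-revision loop. This is an illustrative sketch, not Anthropic's implementation: `query_model` is a hypothetical stand-in for a real LLM API call, stubbed out here so the example is self-contained and runnable, and the principles and responses are toy placeholders.

```python
# Illustrative sketch of the Constitutional AI critique-and-revision loop.
# `query_model` is a hypothetical stand-in for a real LLM call; here it is
# stubbed so the example runs without any external model or API.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response that most respects human rights.",
]

def query_model(prompt: str) -> str:
    """Stub LLM. A real implementation would call a model API here."""
    if "Critique" in prompt:
        # Pretend the model flags a violation on the first draft only.
        return "VIOLATION" if "rude draft" in prompt else "OK"
    if "Rewrite" in prompt:
        return "polite revision"
    return "rude draft"

def constitutional_revision(user_prompt: str, max_rounds: int = 3) -> str:
    # Step 1 happens offline: the constitution above defines the principles.
    # Step 2: generate an initial response.
    response = query_model(user_prompt)
    for _ in range(max_rounds):
        revised = False
        for principle in CONSTITUTION:
            # Step 2 (cont.): the model critiques its own response
            # against each applicable principle.
            critique = query_model(
                f"Critique this response against the principle "
                f"'{principle}': {response}"
            )
            if "VIOLATION" in critique:
                # Step 3: the model rewrites the violating response.
                response = query_model(
                    f"Rewrite to satisfy '{principle}': {response}"
                )
                revised = True
        if not revised:
            break  # No principle violated; the response is compliant.
    # Step 4 happens offline: collect (prompt, response) pairs like this
    # one and fine-tune the model on the revised, compliant outputs.
    return response

print(constitutional_revision("Tell me about my coworker."))
```

In a real RLAIF pipeline the revised outputs feed a supervised fine-tuning stage, and AI-generated preference labels replace most human contractor labels; the loop structure, however, is exactly the generate-critique-rewrite cycle described above.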
When to Use
When training Large Language Models (LLMs) to ensure they adhere to complex human values and safety guidelines.
Common Mistakes
Relying solely on human preference feedback (RLHF), which can reward sycophancy rather than honesty; failing to define explicit values up front.
Real World Example
Anthropic uses this to train Claude, incorporating principles from the UN Declaration of Human Rights and other sources.
First we figure out which ones might apply... then we ask the model itself to critique itself and rewrite its own response in light of the principle.
— Benjamin Mann