by Benjamin Mann
A method for aligning AI models by training them to follow a set of natural language principles (a 'Constitution') using AI feedback (RLAIF), rather than relying solely on human contractors.
Core Principles
- 1.Define Principles: Establish a constitution of values (e.g., helpful, harmless, honest, human rights).
- 2.Generate & Critique: The model generates a response, then critiques itself based on the constitution.
- 3.Recursive Revision: If the response violates principles, the model rewrites it.
- +1 more...
"First we figure out which ones might apply... then we ask the model itself to critique itself and rewrite its own response in light of the principle."