The Benevolent Dictator Protocol
by Hamel Husain & Shreya Shankar • Co-creators of the 'Build Your Own Evals' course • AI consulting / UC Berkeley
Hamel is a machine learning engineer with experience at GitHub and Airbnb, now a leading AI consultant. Shreya is a computer scientist and researcher at UC Berkeley, specializing in ML operationalization. Together, they run the top-rated course on Maven about building AI evaluations.
🎙️ Episode Context
This episode demystifies 'Evals' (evaluations) for AI products, arguing they are the highest ROI activity for AI teams. Hamel and Shreya demonstrate a practical workflow starting from manual error analysis ('open coding') to building automated 'LLM-as-a-Judge' systems. They challenge the misconception that evals are just unit tests, framing them instead as a continuous data analysis process that replaces traditional PRDs for AI agents.
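The workflow starts with manual error analysis: read real traces, write free-form notes ('open coding'), then tally recurring failure modes to decide which checks to automate first. A minimal sketch of that tallying step, using hypothetical hand-labeled traces and failure-mode names:

```python
from collections import Counter

# Hypothetical open-coding output: one free-form note per reviewed trace,
# later condensed into a failure-mode label (None = no failure observed).
notes = [
    {"trace_id": 1, "failure_mode": "generic_opening"},
    {"trace_id": 2, "failure_mode": "hallucinated_detail"},
    {"trace_id": 3, "failure_mode": "generic_opening"},
    {"trace_id": 4, "failure_mode": None},
]

# Tally failure modes: the most frequent ones are the first candidates
# for automated evals (code checks or an LLM-as-a-Judge).
counts = Counter(n["failure_mode"] for n in notes if n["failure_mode"])
print(counts.most_common())  # most frequent failure modes first
```

The point is that eval categories emerge from the data, not from a spec written up front.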
Problem It Solves
Analysis paralysis where teams debate what counts as a 'good' response, slowing down the iteration cycle.
Framework Overview
Instead of design-by-committee, appoint one domain expert (often the Product Manager) to define the 'ground truth' for evaluations. Their taste becomes the standard to align the model against initially.
🧠 Framework Structure
Single Source of Truth: One person's judgment defines what counts as a good or bad output.
Domain Expertise: The dictator must understand the domain, not just the engineering.
Speed over Consensus: Prioritize getting a working standard quickly over group agreement.
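Once the dictator has labeled a sample of outputs, an automated judge is aligned against those labels rather than against group consensus. A minimal sketch, with hypothetical pass/fail labels, of measuring how often an LLM judge agrees with the single source of truth:

```python
# Hypothetical labels on the same sample of outputs:
# the "benevolent dictator" (domain expert) vs. an automated LLM judge.
dictator = [True, False, True, True, False, True]
judge    = [True, False, False, True, False, True]

# Raw agreement rate with the single source of truth. Iterate on the
# judge's prompt until this is high enough to trust the judge at scale.
agreement = sum(d == j for d, j in zip(dictator, judge)) / len(dictator)
print(f"{agreement:.0%}")  # 83%
```

A low agreement rate means the judge prompt needs work, not that the dictator's labels should be averaged away.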
When to Use
Early stage development or when establishing a new class of evaluations for subjective tasks.
Common Mistakes
Letting engineers define product quality without domain context, or trying to average opinions from a group.
Real World Example
Deciding that a recruiting email starting with 'Given your background...' is bad/generic, based solely on the PM's taste, and optimizing against that.
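A taste judgment like this can often be codified as a cheap programmatic check before reaching for an LLM judge. A sketch, where the list of "generic" openers is hypothetical and would be grown from the PM's labels:

```python
# Hypothetical openers the PM has labeled as bad/generic.
GENERIC_OPENERS = ("given your background", "i hope this email finds you well")

def is_generic_opening(email: str) -> bool:
    """Code-based eval derived from the PM's taste: flag emails whose
    first line starts with a phrase labeled as generic."""
    first_line = email.strip().splitlines()[0].lower()
    return first_line.startswith(GENERIC_OPENERS)

print(is_generic_opening("Given your background in ML, ..."))  # True
print(is_generic_opening("I loved your talk on evals."))       # False
```

Each flagged email can then be optimized against until the failure mode disappears from new traces.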
You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste you trust.
— Hamel Husain & Shreya Shankar