
InsightHunt

Hunt the Insights


Hamel Husain & Shreya Shankar

Co-Founders of the 'Build Your Own Evals' Course

Consulting / UC Berkeley


Key Takeaways

  • 1. Evals are not just tests; they are systematic data analytics for your AI application.
  • 2. Start with manual 'Open Coding': read 50-100 traces to identify real failure modes before writing code.
  • 3. Avoid 1-5 rating scales for LLM judges; use binary (Pass/Fail) criteria to force decision-making.
  • 4. Assign a 'Benevolent Dictator' (usually the PM) to define quality standards, rather than relying on design by committee.
  • 5. Validate your automated judges by calculating their agreement rate with human labels on a sample set.
  • 6. Evals function as dynamic Product Requirements Documents (PRDs), defining exactly how the AI should behave.
  • 7. Use AI to help categorize errors (Axial Coding), but keep a human in the loop for the final taxonomy.

Methodologies (3)

The Open-to-Axial Analysis Loop

by Hamel Husain & Shreya Shankar

🔍 User Research

A qualitative research method adapted for AI logs. Instead of guessing failure modes, builders manually review production traces ('Open Coding') to tag issues freely, then cluster these tags into broader categories ('Axial Coding') to identify high-leverage areas for improvement.

Core Principles

  • 1. Read traces manually until 'Theoretical Saturation' (the point at which you stop finding new types of errors).
  • 2. Write 'Open Codes' (free-form notes) on the first upstream error you see in each trace.
  • 3. Use an LLM to synthesize Open Codes into 'Axial Codes' (categories) to find the top failure modes.
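The loop above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the open codes and the axial mapping below are invented examples, and in practice an LLM proposes the mapping while a human curates the final taxonomy.

```python
from collections import Counter

# Hypothetical open codes: free-form notes written while manually
# reading traces, one note per trace (first upstream error only).
open_codes = [
    "ignored user's budget constraint",
    "hallucinated a listing address",
    "ignored stated move-in date",
    "failed to hand off to a human agent",
    "hallucinated amenity details",
]

# Axial coding: cluster the free-form notes into broader categories.
# Hand-written here for illustration; normally LLM-proposed, human-edited.
axial_mapping = {
    "ignored user's budget constraint": "ignores user constraints",
    "ignored stated move-in date": "ignores user constraints",
    "hallucinated a listing address": "hallucination",
    "hallucinated amenity details": "hallucination",
    "failed to hand off to a human agent": "handoff failure",
}

# Count failure modes to surface the highest-leverage categories.
axial_counts = Counter(axial_mapping[code] for code in open_codes)
print(axial_counts.most_common())
```

Sorting categories by frequency is what makes the loop actionable: you fix the most common failure mode first, then re-read traces to see what surfaces next.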

"To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in."

#open-to-axial #analysis #research
Binary LLM Judge Framework

by Hamel Husain & Shreya Shankar

Execution

A strict method for building automated evaluators. Instead of asking an LLM to 'rate this response,' you define a specific failure mode and ask a binary (True/False) question. You then validate this judge against human decisions.

Core Principles

  • 1. Binary Scoring: Force a decision (True/False) instead of a Likert scale (1-5) to remove ambiguity.
  • 2. Specific Scope: Judge one specific failure mode (e.g., 'Did it fail to hand off?') per prompt, not overall quality.
  • 3. Alignment Check: Measure the agreement between the LLM judge and a human expert before deploying.
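The alignment check reduces to a simple comparison. The sketch below assumes you already have binary judge verdicts and human labels on the same validation sample; the labels are invented for illustration, and a real judge would be an LLM call answering one True/False question.

```python
# Validate a binary LLM judge against human expert labels
# (True = the specific failure mode was found in the trace).

def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge matches the human expert."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labels on a small validation sample.
human_labels = [True, False, False, True, False, True, False, False]
judge_labels = [True, False, True, True, False, True, False, False]

rate = agreement_rate(judge_labels, human_labels)
print(f"Judge-human agreement: {rate:.0%}")
```

Because the scores are binary, disagreements are easy to inspect one by one; with a 1-5 scale you would instead be arguing about whether a 3 should have been a 4.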

"If the judge says it's wrong, don't just accept it as the gospel... When people lose trust in your evals, they lose trust in you."

#binary #judge #execution
The Benevolent Dictator Protocol

by Hamel Husain & Shreya Shankar

👥 Team & Culture

Instead of design-by-committee, appoint one domain expert (often the Product Manager) to define the 'ground truth' for evaluations. Their taste becomes the initial standard against which the model is aligned.

Core Principles

  • 1. Single Source of Truth: One person's judgment defines the 'Gold Label' for the initial dataset.
  • 2. Domain Expertise: The dictator must understand the user's goal deeply (e.g., a leasing expert for a real estate bot).
  • 3. Speed over Consensus: Prioritize getting a signal to iterate on over getting everyone to agree on nuances of language.

"You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste you trust."

#benevolent #dictator #protocol