InsightHunt
Hamel Husain & Shreya Shankar

Episode #116

Co-Founders of the 'Build Your Own Evals' Course

Consulting / UC Berkeley

🔍 User Research · Execution · 👥 Team & Culture

📝 Full Transcript

19,277 words
Lenny Rachitsky (00:00:00): To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in. Hamel Husain (00:00:05): This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you're building an AI application, you just learn a lot. Lenny Rachitsky (00:00:12): What's cool about this is you don't need to do this many, many times. For most products, you do this process once and then you build on it. Shreya Shankar (00:00:18): The goal is not to do evals perfectly, it's to actionably improve your product. Lenny Rachitsky (00:00:23): I did not realize how much controversy and drama there is around evals. There's a lot of people with very strong opinions. Shreya Shankar (00:00:28): People have been burned by evals in the past. People have done evals badly, so then they didn't trust it anymore, and then they're like, "Oh, I'm anti evals." Lenny Rachitsky (00:00:36): What are a couple of the most common misconceptions people have with evals? Hamel Husain (00:00:39): The top one is, "We live in the age of AI. Can't the AI just eval it?" But it doesn't work. Lenny Rachitsky (00:00:45): A term that you used in your posts that I love is this idea of a benevolent dictator. Hamel Husain (00:00:49): When you're doing this open coding, a lot of teams get bogged down in having a committee do this. For a lot of situations, that's wholly unnecessary. You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste that you trust. It should be the person with domain expertise. Oftentimes, it is the product manager. Lenny Rachitsky (00:01:09): Today, my guests are Hamel Husain and Shreya Shankar. One of the most trending topics on this podcast over the past year has been the rise of evals. Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders. And since then, ...

💡 Key Takeaways

  1. Evals are not just tests; they are systematic data analytics for your AI application.
  2. Start with manual 'Open Coding': read 50-100 traces to identify real failure modes before writing code.
  3. Avoid 1-5 rating scales for LLM judges; use binary (Pass/Fail) criteria to force decision-making.
  4. Assign a 'Benevolent Dictator' (usually the PM) to define quality standards, rather than relying on design by committee.
  5. Validate your automated judges by calculating the agreement rate with human labels on a sample set.
  6. Evals function as dynamic Product Requirements Documents (PRDs), defining exactly how the AI should behave.
  7. Use AI to help categorize errors (Axial Coding), but keep a human in the loop for the final taxonomy.

📚 Methodologies (3)

The Open-to-Axial Analysis Loop

by Hamel Husain & Shreya Shankar

🔍 User Research

A qualitative research method adapted for AI logs. Instead of guessing failure modes, builders manually review production traces ('Open Coding') to tag issues freely, then cluster these tags into broader categories ('Axial Coding') to identify high-leverage areas for improvement.

Core Principles

  1. Read traces manually until 'Theoretical Saturation' (when you stop finding new types of errors).
  2. Write 'Open Codes' (free-form notes) on the first upstream error you see per trace.
  3. Use an LLM to synthesize Open Codes into 'Axial Codes' (categories) to find the top failure modes.
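The loop above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the open codes, the keyword map, and the category names are all hypothetical, and in practice the clustering step is done with an LLM (plus human review of the resulting taxonomy) rather than keyword matching.

```python
from collections import Counter

# Hypothetical open codes: one free-form note per reviewed trace,
# written on the first upstream error in that trace.
open_codes = [
    "quoted wrong tour time from calendar",
    "failed to hand off to human agent",
    "quoted wrong unit price",
    "failed to hand off to human agent",
    "hallucinated amenity not in listing",
    "quoted wrong tour time from calendar",
]

# Axial coding: collapse free-form notes into broader categories.
# (Illustrative keyword map; a real pass would use an LLM here.)
AXIAL_KEYWORDS = {
    "hand off": "handoff-failure",
    "quoted wrong": "factual-error",
    "hallucinated": "hallucination",
}

def axial_code(note: str) -> str:
    """Map one open code to its axial category."""
    for keyword, category in AXIAL_KEYWORDS.items():
        if keyword in note:
            return category
    return "uncategorized"

# Count failure modes to find the high-leverage areas to fix first.
failure_modes = Counter(axial_code(n) for n in open_codes)
print(failure_modes.most_common())
# → [('factual-error', 3), ('handoff-failure', 2), ('hallucination', 1)]
```

The ranked counts are the payoff: they tell you which failure mode to attack first instead of guessing.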

"To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in."

#open-to-axial #analysis #research
Binary LLM Judge Framework

by Hamel Husain & Shreya Shankar

Execution

A strict method for building automated evaluators. Instead of asking an LLM to 'rate this response,' you define a specific failure mode and ask a binary (True/False) question. You then validate this judge against human decisions.

Core Principles

  1. Binary Scoring: Force a decision (True/False) instead of a Likert scale (1-5) to remove ambiguity.
  2. Specific Scope: Judge one specific failure mode (e.g., 'Did it fail to handoff?') per prompt, not overall quality.
  3. Alignment Check: Measure the agreement matrix between the LLM Judge and a Human expert before deploying.
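The alignment check in step 3 reduces to simple bookkeeping over paired binary labels. A minimal sketch, with hypothetical labels (`True` = pass): compute the raw agreement rate plus a 2x2 confusion matrix keyed by (human, judge), so disagreements can be split into judge-too-harsh and judge-too-lenient cases.

```python
from collections import Counter

def agreement_report(human_labels, judge_labels):
    """Compare binary human labels against LLM-judge labels.

    Returns (agreement_rate, confusion) where confusion counts
    (human, judge) pairs, e.g. (True, False) = human passed it,
    judge failed it.
    """
    assert len(human_labels) == len(judge_labels) > 0
    confusion = Counter(zip(human_labels, judge_labels))
    agree = confusion[(True, True)] + confusion[(False, False)]
    return agree / len(human_labels), confusion

# Hypothetical sample: 10 traces labeled by the domain expert and the judge.
human = [True, True, False, True, False, True, True, False, True, True]
judge = [True, True, False, False, False, True, True, True, True, True]

rate, matrix = agreement_report(human, judge)
print(f"agreement: {rate:.0%}")                    # → agreement: 80%
print("judge too harsh:", matrix[(True, False)])   # → 1
print("judge too lenient:", matrix[(False, True)]) # → 1
```

Splitting the disagreements by direction matters: a judge that is systematically too lenient inflates your pass rate, while one that is too harsh erodes the team's trust in the evals.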

"If the judge says it's wrong, don't just accept it as the gospel... When people lose trust in your evals, they lose trust in you."

#binary #judge #execution
The Benevolent Dictator Protocol

by Hamel Husain & Shreya Shankar

👥 Team & Culture

Instead of design-by-committee, appoint one domain expert (often the Product Manager) to define the 'ground truth' for evaluations. Their taste becomes the initial standard the model is aligned against.

Core Principles

  1. Single Source of Truth: One person's judgment defines the 'Gold Label' for the initial dataset.
  2. Domain Expertise: The dictator must understand the user's goal deeply (e.g., a leasing expert for a real estate bot).
  3. Speed over Consensus: Prioritize getting a signal to iterate over getting everyone to agree on the nuance of language.

"You don't want to make this process so expensive that you can't do it. You can appoint one person whose taste that you trust."

#benevolent #dictator #protocol