Evals-Driven Development Cycle
by Kevin Weil • Chief Product Officer at OpenAI
Kevin Weil is the Chief Product Officer at OpenAI. Previously, he served as Head of Product at Instagram and Twitter, and was the co-creator of the Libra cryptocurrency at Facebook. He also serves on the boards of Planet and Strava.
🎙️ Episode Context
Kevin Weil discusses the unique challenges of building product at OpenAI, emphasizing 'Model Maximalism' and the necessity of iterative deployment in a rapidly evolving AI landscape. He explores how the role of Product Managers is shifting towards defining evaluations ('evals') and maintaining high agency amidst ambiguity. The conversation also covers the integration of research and product teams, the future of AI-assisted creativity, and the strategic importance of treating AI interactions like human collaborations.
Problem It Solves
Managing the non-deterministic nature of LLMs where inputs are fuzzy and outputs vary, making traditional QA insufficient.
Framework Overview
Product Managers must define 'hero use cases' and translate them into specific evaluations (evals)—essentially quizzes for the model. Development becomes a process of hill-climbing on these eval scores, often using fine-tuning to improve performance on specific tasks.
🧠 Framework Structure
Define Hero Use Cases: Identify the specific, high-value tasks the product must handle reliably.
Create Custom Evals: Build a dataset of representative inputs with expected outputs, effectively a quiz the model must pass.
Fine-tune & Hill Climb: Use the dataset to fine-tune the model, then iterate until eval scores improve on the target tasks.
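The three steps above can be sketched as a simple eval harness. This is a minimal illustration, not OpenAI's actual tooling: the eval set, the exact-match grader, and the stub `model_fn` candidates are all hypothetical stand-ins. In practice the dataset would cover the product's hero use cases, the grader might be fuzzy or model-based, and the candidates would be real fine-tuned model versions.

```python
# Hypothetical eval set: prompts paired with expected answers.
EVAL_SET = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def grade(output: str, expected: str) -> bool:
    # Simplest possible grader: check the expected answer appears
    # in the output. Real evals often use a model-based grader.
    return expected.lower() in output.lower()

def run_evals(model_fn, eval_set) -> float:
    """Return the fraction of eval cases the model passes."""
    passed = sum(grade(model_fn(case["prompt"]), case["expected"])
                 for case in eval_set)
    return passed / len(eval_set)

def best_model(candidates, eval_set):
    # Hill-climbing in its simplest form: score each candidate
    # (e.g. each fine-tune) and keep the highest-scoring one.
    return max(candidates, key=lambda m: run_evals(m, eval_set))
```

The key design choice is that `run_evals` turns fuzzy, non-deterministic model behavior into a single comparable score, which is what makes systematic iteration possible where manual spot-checking is not.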
When to Use
When building any AI feature where accuracy and reliability are critical, specifically for B2B or complex consumer queries.
Common Mistakes
Relying on 'vibes' or manual spot-checking instead of rigorous, data-driven evaluations.
Real World Example
Building OpenAI's 'Deep Research' product required creating evals for complex research tasks that would normally take humans hours to complete.
Writing evals is quickly becoming a core skill for product builders.
— Kevin Weil