AI Engineer World's Fair 2026

Evals in AI: A Deep Dive

Workshop, Advanced

“Our evals pass and our velocity is up, so it works.” It’s the most reassuring sentence in AI engineering and also the most dangerous. Teams are shipping more code than ever while incidents per PR and change-failure rates climb, and the instruments meant to catch this are quietly broken. This talk takes apart both halves of that false comfort.

First, why velocity lies: the same AI-driven throughput that lights up your dashboard is what’s eroding quality underneath it. Then we explore four ways offline evals deceive you: LLM-as-judge bias (your grader rewards confident, wordy, wrong answers over terse correct ones), staleness, distribution shift between your golden set and real traffic, and single-score evals that hide which step of an agent actually failed.

The centerpiece is a live demo. We’ll wire up an LLM judge on stage and watch it crown a confident, friendly, factually wrong answer, then fix it on the spot with a three-line rubric change. Same model, different instrument.

From there we’ll build up what to measure instead: traces and spans, production observability, probe-based evaluation, error budgets, and quality leading indicators that sit beside every velocity number. Attendees will leave with a five-line checklist they can apply Monday. No prior eval tooling required. If you’ve ever shipped something agentic and had a nagging feeling the dashboards were too kind, this is for you.

About the Workshops Day 1 Track

Workshops Day 1 sessions at AI Engineer World's Fair 2026 in San Francisco.
