“Our evals pass and our velocity is up, so it works.” It’s the most reassuring sentence in AI engineering and also the most dangerous. Teams are shipping more code than ever while incidents per PR and change-failure rates climb, and the instruments meant to catch this are quietly broken. This talk takes apart both halves of that false comfort.

First, why velocity lies: the same AI-driven throughput that lights up your dashboard is what’s eroding quality underneath it. Then we explore four ways offline evals deceive you: LLM-as-judge bias (your grader rewards confident, wordy, wrong answers over terse correct ones), staleness, distribution shift between your golden set and real traffic, and single-score evals that hide which step of an agent actually failed.

The centerpiece is a live demo. We’ll wire up an LLM judge on stage and watch it crown a confident, friendly, factually wrong answer. Then we’ll fix it with a three-line rubric change. Same model, different instrument. From there we’ll build up what to measure instead: traces and spans, production observability, probe-based evaluation, error budgets, and quality leading indicators that sit beside every velocity number.

Attendees will leave with a five-line checklist they can apply Monday. No prior eval tooling required. If you’ve ever shipped something agentic and had a nagging feeling the dashboards were too kind, this is for you.
Workshops Day 1 sessions at AI Engineer World's Fair 2026 in San Francisco.
Monday, June 29, 2026
12:10 PM - 1:10 PM · 1h
Track 1 · Room 2010
Capacity: 250 attendees

Tejas Kumar
International Keynote Speaker
IBM
@tejaskumar_
Tejas Kumar is an international keynote speaker with over 20 years of engineering experience. Today, he speaks at conferences worldwide, aiming to equip, empower, and encourage developers with the best ways to build software.