Every team building browser agents has the same problem: you can't trust your own evals. Browser tasks are too open-ended for deterministic checks, so teams use LLM verifiers as judges, and the judges are wrong constantly. WebVoyager misses 45% of failures. WebJudge misses 22%. Use a judge like that as an RL reward and you're not training a better agent, you're training a more confident liar. This talk walks through the Universal Verifier, open-sourced with Microsoft Research: false positive rate near zero, Cohen's kappa matching human-human agreement. Four design principles, one open benchmark, and an honest account of where auto-research worked and where it plateaued. Speakers: Miguel González Fernández (Browserbase); Corby Rosset (Microsoft Research).
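For readers unfamiliar with the two metrics named in the abstract, here is a minimal sketch of how they are commonly computed from paired binary labels. It is not the Universal Verifier's code. It assumes that "misses a failure" means the verifier marks a human-labeled failure as a success (a false positive), and the function names and toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def false_positive_rate(human, verifier):
    """Share of human-labeled failures that the verifier marked as success."""
    # Keep only the verifier's verdicts on tasks the human labeled as failed.
    verdicts_on_failures = [v for h, v in zip(human, verifier) if not h]
    if not verdicts_on_failures:
        raise ValueError("no human-labeled failures to evaluate")
    return sum(verdicts_on_failures) / len(verdicts_on_failures)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels (True = task succeeded), not data from the talk.
human = [True, True, False, False, False, True, False, False]
verifier = [True, True, True, False, False, True, False, True]

print(f"false positive rate: {false_positive_rate(human, verifier):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human, verifier):.2f}")
```

On these toy labels the verifier passes 2 of the 5 human-labeled failures, so a reward built on it would pay out for failed runs 40% of the time. Kappa is reported alongside raw agreement because raw agreement can look high purely by chance when one label dominates.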
Expo Stage 1 sessions at AI Engineer World's Fair 2026 in San Francisco.
Thursday, July 2, 2026
11:40 AM - 12:00 PM · 20 min
Expo Stage 1
Capacity: 250 attendees

Miguel González Fernández
Browserbase
Miguel González Fernández is speaking at AI Engineer World's Fair 2026.

Corby Rosset
Microsoft Research
Corby Rosset is speaking at AI Engineer World's Fair 2026.