Most voice interfaces today are built as a three-stage cascade: automatic speech recognition (ASR), a large language model (LLM), and text-to-speech (TTS). While functional, this cascaded approach introduces latency bottlenecks, strips away non-verbal nuance, and limits emotion-aware, multi-turn dialogue. We are now witnessing a shift toward native speech-to-speech models that process audio end to end. In this session, we’ll explore Google DeepMind's approach to training speech-to-speech models for real-time voice agents. We will cover the high-level product and research challenges of building voice agents that feel truly conversational, optimizing for fluid turn-taking and low latency while maintaining enterprise-grade intelligence. Speakers: Valeria Wu, Google DeepMind; Tom Ouyang, Google DeepMind.
Voice & Realtime AI sessions at AI Engineer World's Fair 2026 in San Francisco.
Tuesday, June 30, 2026
11:10 AM - 11:30 AM · 20m
Track 6 · Room 2014
Capacity: 250 attendees

Valeria Wu
Product Manager
Google DeepMind
Valeria Wu is a Product Manager at Google DeepMind for the Gemini Live model, driving the development of real-time, speech-to-speech AI agents. A graduate of the Google APM program, she previously worked on the Pixel AI team and was part of the founding team at Cometa, an ed-tech startup in Latin America. Valeria studied Symbolic Systems (CS, Neuroscience, and Philosophy) at Stanford University, where she focused on human-centered AI. Originally from Lima, Peru, she is a single-digit golfer and a dedicated foodie.

Tom Ouyang
Google DeepMind
Tom Ouyang is speaking at AI Engineer World's Fair 2026.