Amit Desai

Director of Voice AI

Roku

Amit Desai is a domain expert and a technical and UX leader in voice and multimodal AI assistants. His career spans pioneering Voice AI products across diverse consumer markets, including key advancements at Amazon Alexa, successful conversational AI startup exits, and creating and leading the TV AI assistant program at Roku. Amit combines deep user intuition for voice modalities with technical acumen for fast-changing speech, dialog, and LLM orchestration stacks. His work focuses on establishing the forward-looking vision and robust engineering frameworks required to bring safe, highly autonomous, and ambient voice interactions to emerging product verticals, wearables, smart environments, and robotics. He has a B.Sc. in Computer Science from MIT and a background in the creative and language arts.

Sessions (1)

The Goldilocks problem: when your Robot asks too much — or acts too soon.

3:45 PM · Track 6 · Room 2014

Embodied agents are crossing from answering questions to taking physical actions — moving a box, turning a wheel — and people will command them by voice, because voice is the fastest, most natural interface we have. But voice is also the most error-prone, and when a misheard command drives a physical action, the failure isn't a wrong answer; it's human harm, damage, or an expensive, irreversible mistake. The field has never needed a serious way to handle voice-command errors, because informational agents made them cheap. Embodiment ends that. This talk replaces the usual hand-waving — "don't ask too much, don't get it wrong too much" — with a single number you can optimize. The core idea: both confirming and erring cost the user. A confirmation is friction — attention, time, a delayed action; a wrong action is a mistake cost, often higher given physical harm or expense. Put them on one ledger and you can measure a voice interface as average user cost per command, and make minimizing it the system's objective. From that falls a non-obvious rule — you confirm or not based on both cost and uncertainty: an expected value. I'll frame confirmation as just one option alongside acting, disambiguation (choices), and deferring; reason at the level of goals rather than low-level motion; walk the architecture (task hypotheses → user-cost model → confirmation policy); and show eval results from a simulated environment measuring regret against oracle behavior. I'll close with what worked applying this to voice in smart TVs, speakers, and navigation — and a challenge to bring this metric to robots, cars, and wearables before the errors do.
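The expected-cost rule described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the speaker's implementation: the `Costs` fields, the `expected_costs` and `choose` helpers, the numeric values, and the simplification that a confirmation or a multiple-choice question always catches a wrong hypothesis and then costs the user one restatement of the command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical user costs, all in the same arbitrary unit."""
    mistake: float       # executing the wrong goal
    confirm: float       # friction of a yes/no confirmation
    disambiguate: float  # friction of a multiple-choice question
    repeat: float        # user has to restate the command
    defer: float         # agent declines or hands off


def expected_costs(probs, costs, k=2):
    """Expected user cost of each option, given goal-hypothesis probabilities.

    Assumes a confirmation or choice question always catches a wrong
    hypothesis, after which the user must repeat the command.
    """
    ranked = sorted(probs.values(), reverse=True)
    p_top = ranked[0]          # probability the best hypothesis is right
    p_topk = sum(ranked[:k])   # probability the right goal is among the choices
    return {
        "act": (1 - p_top) * costs.mistake,
        "confirm": costs.confirm + (1 - p_top) * costs.repeat,
        "disambiguate": costs.disambiguate + (1 - p_topk) * costs.repeat,
        "defer": costs.defer,
    }


def choose(probs, costs, k=2):
    """Pick the option with the lowest expected user cost."""
    table = expected_costs(probs, costs, k)
    return min(table, key=table.get)


# Same uncertainty, different stakes (illustrative numbers only).
confident = {"move box A": 0.9, "move box B": 0.1}
low_stakes = Costs(mistake=2, confirm=1, disambiguate=1.5, repeat=3, defer=5)
high_stakes = Costs(mistake=100, confirm=1, disambiguate=1.5, repeat=3, defer=5)

low_stakes_choice = choose(confident, low_stakes)
high_stakes_choice = choose(confident, high_stakes)

# Two near-tied hypotheses with a costly mistake.
ambiguous = {"move box A": 0.5, "move box B": 0.45, "move box C": 0.05}
ambiguous_choice = choose(ambiguous, high_stakes)
```

With these made-up numbers, the 90%-confident command is executed directly when a mistake is cheap and confirmed when a mistake is expensive, while the near-tie between two goals favors a choice question over a yes/no confirmation. Under this model, acting beats confirming only when (1 - p_top) * mistake is smaller than confirm + (1 - p_top) * repeat, so neither confidence nor stakes settles the decision alone.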

Voice & Realtime AI · Intermediate