3:20 PM · Track 6 · Room 2014
Every voice AI engineer has heard it: a caller repeating their name three times, getting more frustrated with each attempt. The logs look clean. Confidence scores look fine. The system looks like it's working, but it isn't.
Building a voice agent today means chasing answers across a dozen scattered sources (ASR, TTS, turn-taking, prompts, LLMs) without a comprehensive map of the thing you're actually building: a conversation. That map exists, and it's called linguistics. With it, the scattered pieces fall into place and into order, and you stop patching components and start conducting the orchestra.
The map starts with a simple model. Every conversation runs on two channels: the form (sounds, words, syntax, turns) and the meaning (the task, the situation, the feeling). Users keep both channels aligned, with each other and with their partner's, continuously and without thinking. Your job is to build an agent that does the same: keeping its form and meaning channels aligned, with each other, and with the user's, constantly and seamlessly.
The map survives every architecture shift, cascaded or speech-to-speech, because it describes the conversation, not the implementation. From it, you get both halves of the job: design questions for build time, and a matrix that turns "the agent just didn't get it" into concrete, debuggable failure modes. You'll leave with the map, the questions, and an open-source evaluation framework to run them with.
Who this is for: voice AI engineers, ML practitioners on voice pipelines, and anyone who's watched clean logs while their agent quietly fails real users.
Voice & Realtime AI · intermediate · talk