Every voice AI engineer has heard it: a caller repeating their name three times, growing more frustrated with each attempt. The logs look clean. Confidence scores look fine. The system looks like it's working, but it isn't.
Building a voice agent today means chasing answers across a dozen scattered sources (ASR, TTS, turn-taking, prompts, LLMs) without a comprehensive map of the thing you're actually building: a conversation. That map exists, and it's called linguistics. With it, the scattered pieces fall into place, and you stop patching components and start conducting the orchestra.
The map starts with a simple model. Every conversation runs on two channels: form (sounds, words, syntax, turns) and meaning (the task, the situation, the feeling). Users keep both channels aligned, with each other and with their partner's, continuously and without thinking. Your job is to build an agent that does the same: one that keeps its form and meaning channels aligned with each other and with the user's, constantly and seamlessly.
The map survives every architecture shift, cascaded or speech-to-speech, because it describes the conversation, not the implementation. From it, you get both halves of the job: design questions for build time, and a matrix that turns "the agent just didn't get it" into concrete, debuggable failure modes. You'll leave with the map, the questions, and an open-source evaluation framework to run them with.
Who this is for: voice AI engineers, ML practitioners on voice pipelines, and anyone who's watched clean logs while their agent quietly fails real users.
Voice & Realtime AI sessions at AI Engineer World's Fair 2026 in San Francisco.
Tuesday, June 30, 2026
3:20 PM - 3:40 PM · 20m
Track 6 · Room 2014
Capacity: 250 attendees

Midam Kim
Senior Linguist and ML Engineer
ServiceNow
Midam Kim is a Senior Linguist and ML Engineer at ServiceNow, where she builds and evaluates a multilingual voice AI platform spanning a dozen languages. She holds a PhD in Linguistics from Northwestern University and teaches at Fisher College of Business, Ohio State University. Her work sits at the rare intersection of production ML engineering and speech science, translating decades of linguistic research into the engineering decisions voice AI teams are making right now.