Most AI teams focus on model quality, but production success often comes down to inference performance. In this session, FriendliAI will explore the optimization techniques behind high-performance LLM serving, including continuous batching, speculative decoding, smart caching, and efficient GPU utilization. Learn how leading AI teams reduce infrastructure costs, improve latency, and scale inference workloads without sacrificing performance. We'll share practical insights and deployment strategies that separate experimental AI projects from production-grade systems.

Whether you're an ML engineer, platform engineer, MLOps practitioner, or technical founder, you'll leave with a better understanding of how inference optimization can become a competitive advantage for your AI applications.

Speakers: Alex Campos; Yunmo Koo.
Expo Stage 1 sessions at AI Engineer World's Fair 2026 in San Francisco.
Wednesday, July 1, 2026
2:50 PM - 3:10 PM · 20 min
Expo Stage 1
Capacity: 250 attendees
Alex Campos
Alex Campos is speaking at AI Engineer World's Fair 2026.
Yunmo Koo
Yunmo Koo is speaking at AI Engineer World's Fair 2026.