Production LLM apps need more than a fast model: they need an inference routing layer that can choose where each request should run as engines, capacity, latency, and geography cost change. This talk presents a generalized Inference Load Balancer (ILB) architecture built from a proxy and a controller. The low-latency proxy applies routing weights and request-path signals, while the controller computes source-cluster-to-engine weights from demand, capacity and performance profiles, replica state, and geography cost. We will cover the practical debugging patterns AI engineers need: reading engine signals, explaining why a request went to one backend instead of another, handling retries and load shedding, and keeping routing behavior observable, all without exposing OpenAI-specific internals or non-public metrics.
Speakers: Qianru Lao (OpenAI); Lu Zhang (OpenAI).
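To make the proxy/controller split concrete, here is a minimal Python sketch. It is not the speakers' implementation: the function names, the weighting formula (capacity discounted by geography cost), and the health-set input are all assumptions chosen for illustration only.

```python
import random
from typing import Dict, Optional, Set


def compute_weights(capacity: Dict[str, float],
                    geo_cost: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical controller step: turn per-engine capacity and geography
    cost into normalized routing weights. The formula is an assumption."""
    raw = {
        engine: cap / (1.0 + geo_cost.get(engine, 0.0))
        for engine, cap in capacity.items()
        if cap > 0
    }
    total = sum(raw.values())
    if total <= 0:
        return {}
    return {engine: w / total for engine, w in raw.items()}


def pick_engine(weights: Dict[str, float],
                healthy: Set[str],
                rng: random.Random) -> Optional[str]:
    """Hypothetical proxy step: weighted random choice among engines that a
    request-path signal (here, a health set) still allows."""
    candidates = {e: w for e, w in weights.items() if e in healthy and w > 0}
    if not candidates:
        return None  # caller would shed load or retry elsewhere
    engines = list(candidates)
    return rng.choices(engines, weights=[candidates[e] for e in engines], k=1)[0]


if __name__ == "__main__":
    weights = compute_weights(
        capacity={"engine-a": 100.0, "engine-b": 100.0},
        geo_cost={"engine-a": 0.0, "engine-b": 1.0},
    )
    rng = random.Random(0)
    print(weights)
    print(pick_engine(weights, healthy={"engine-a", "engine-b"}, rng=rng))
```

In this sketch the controller runs off the request path and publishes weights, while the proxy does only a cheap filtered weighted choice per request. Returning `None` when no engine qualifies leaves the retry or load-shedding decision to the caller, which also gives a natural place to log why a request did or did not reach a given backend.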
Inference sessions at AI Engineer World's Fair 2026 in San Francisco.
Thursday, July 2, 2026
11:10 AM - 11:30 AM · 20m
Track 9 · Room 2016
Capacity: 250 attendees

Qianru Lao
Member of Technical Staff, Inference
OpenAI
Qianru Lao is a Member of Technical Staff on the Inference team at OpenAI, where she works on infrastructure for large-scale model serving. Previously, she contributed to the open-source Delta Lake project at Databricks and worked on distributed storage systems at Alibaba Cloud and infrastructure tooling at Google. She holds degrees in Computational Science and Engineering from Harvard and Computer Science from Sun Yat-sen University.

Lu Zhang
Member of Technical Staff
OpenAI
Lu Zhang is an engineer at OpenAI focused on large-scale AI infrastructure. He currently works on inference systems and previously helped build and operate GPU clusters on OpenAI's Fleet team. His interests span Kubernetes, cloud-native platforms, distributed systems, reliability engineering, and machine learning infrastructure. He is passionate about scaling infrastructure for AI workloads and enabling reliable, efficient operation of GPU-accelerated clusters in production.