Inference

Routing LLM Inference in Production: From Engine Signals to Policy

Talk · Intermediate

Production LLM apps need more than a fast model: they need an inference routing layer that can choose where each request should run as engines, capacity, latency, and geography costs change. This talk presents a generalized Inference Load Balancer (ILB) architecture split into a proxy and a controller. The low-latency proxy applies routing weights and request-path signals, while the controller computes source-cluster-to-engine weights from demand, capacity and performance profiles, replica state, and geography cost. We will cover the practical debugging patterns AI engineers need: reading engine signals, explaining why a request went to one backend instead of another, handling retries and load shedding, and keeping routing behavior observable without exposing OpenAI-specific internals or non-public metrics. Speakers: Qianru Lao (OpenAI) and Lu Zhang (OpenAI).
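The proxy/controller split described in the abstract can be illustrated with a minimal sketch. This is not the speakers' implementation: the `EngineState` fields, the scoring formula (usable capacity discounted by geography cost), and the function names `compute_weights` and `pick_engine` are all hypothetical, chosen only to show a controller producing weights and a proxy consuming them. Demand and performance profiles are left out to keep the example short.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class EngineState:
    """Hypothetical per-engine snapshot a controller might consume."""
    name: str
    capacity_rps: float    # assumed sustainable requests per second
    healthy_replicas: int  # replicas currently able to serve
    total_replicas: int
    geo_cost: float        # assumed relative cost from the source cluster (>= 0)


def compute_weights(engines: list[EngineState]) -> dict[str, float]:
    """Controller side: turn engine state into normalized routing weights.

    Illustrative scoring only: usable capacity divided by (1 + geo_cost).
    """
    scores: dict[str, float] = {}
    for e in engines:
        if e.total_replicas == 0 or e.healthy_replicas == 0:
            scores[e.name] = 0.0  # no healthy replicas: receive no traffic
            continue
        usable = e.capacity_rps * e.healthy_replicas / e.total_replicas
        scores[e.name] = usable / (1.0 + e.geo_cost)
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: s / total for name, s in scores.items()}


def pick_engine(weights: dict[str, float], rng: random.Random) -> str | None:
    """Proxy side: weighted random choice; None means shed the request."""
    candidates = [(n, w) for n, w in weights.items() if w > 0]
    if not candidates:
        return None
    names, ws = zip(*candidates)
    return rng.choices(names, weights=ws, k=1)[0]


engines = [
    EngineState("engine-a", 100.0, 4, 4, 0.0),  # local, fully healthy
    EngineState("engine-b", 100.0, 2, 4, 1.0),  # remote, half healthy
    EngineState("engine-c", 50.0, 0, 2, 0.0),   # no healthy replicas
]
weights = compute_weights(engines)
chosen = pick_engine(weights, random.Random(0))
```

Keeping the weight computation off the request path means the proxy only performs a cheap weighted pick per request, and returning `None` when no engine has positive weight gives the caller an explicit point at which to shed load or retry elsewhere.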

About the Inference Track

Inference sessions at AI Engineer World's Fair 2026 in San Francisco.

When

Thursday, July 2, 2026

11:10 AM - 11:30 AM · 20m

Where

Track 9 · Room 2016

Capacity: 250 attendees

Speakers (2)

Qianru Lao

Member of Technical Staff, Inference

OpenAI

Qianru Lao is a Member of Technical Staff on the Inference team at OpenAI, where she works on infrastructure for large-scale model serving. Previously, she contributed to the open-source Delta Lake project at Databricks and worked on distributed storage systems at Alibaba Cloud and infrastructure tooling at Google. She holds degrees in Computational Science and Engineering from Harvard and Computer Science from Sun Yat-sen University.

Lu Zhang

Member of Technical Staff

OpenAI

Lu Zhang is an engineer at OpenAI focused on large-scale AI infrastructure. He currently works on inference systems and previously helped build and operate GPU clusters on OpenAI's Fleet team. His interests span Kubernetes, cloud-native platforms, distributed systems, reliability engineering, and machine learning infrastructure. He is passionate about scaling infrastructure for AI workloads and enabling reliable, efficient operation of GPU-accelerated clusters in production.
