Lu Zhang

Member of Technical Staff

OpenAI

Lu Zhang is an engineer at OpenAI focused on large-scale AI infrastructure. He currently works on inference systems and previously helped build and operate GPU clusters on OpenAI's Fleet team. His interests span Kubernetes, cloud-native platforms, distributed systems, reliability engineering, and machine learning infrastructure. He is passionate about scaling infrastructure for AI workloads and enabling reliable, efficient operation of GPU-accelerated clusters in production.

Sessions (1)

Routing LLM Inference in Production: From Engine Signals to Policy

11:10 AM · Track 9 · Room 2016

Production LLM apps need more than a fast model: they need an inference routing layer that can choose where each request should run as engines, capacity, latency, and geography cost change. This talk shares a generalized Inference Load Balancer (ILB) proxy/controller architecture. A low-latency proxy applies routing weights and request-path signals, while a controller computes source-cluster-to-engine weights from demand, capacity/performance profiles, replica state, and geography cost. We will cover the practical debugging patterns AI engineers need: reading engine signals, explaining why a request went to one backend instead of another, handling retries and load shedding, and keeping routing behavior observable without exposing OpenAI-specific internals or non-public metrics.

Speakers: Qianru Lao (OpenAI); Lu Zhang (OpenAI).
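To make the proxy/controller split in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the architecture presented in the talk: the `Engine` fields, the scoring formula (capacity discounted by geography cost), and the names `compute_weights` and `pick_engine` are all hypothetical. Demand and request-path signals are left out to keep it short.

```python
import random
from dataclasses import dataclass


@dataclass
class Engine:
    # All fields are hypothetical stand-ins for the signals named in the abstract.
    name: str
    healthy_replicas: int        # replica state
    capacity_per_replica: float  # assumed requests/sec one replica can serve
    geo_cost: float              # >= 0, relative cost of routing from the source cluster


def compute_weights(engines, geo_penalty=1.0):
    """Controller side: turn capacity, replica state, and geography cost into
    normalized routing weights for one source cluster (assumed formula)."""
    scores = {}
    for e in engines:
        capacity = e.healthy_replicas * e.capacity_per_replica
        # Discount capacity by geography cost; engines with no healthy replicas score 0.
        scores[e.name] = capacity / (1.0 + geo_penalty * e.geo_cost)
    total = sum(scores.values())
    if total <= 0:
        return {}  # nothing routable; the caller should shed load
    return {name: s / total for name, s in scores.items() if s > 0}


def pick_engine(weights, rng=random):
    """Proxy side: weighted random choice over the controller's weights.
    Returns None when there is no routable engine (load shedding)."""
    if not weights:
        return None
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


engines = [
    Engine("engine-a", healthy_replicas=2, capacity_per_replica=10.0, geo_cost=0.0),
    Engine("engine-b", healthy_replicas=1, capacity_per_replica=10.0, geo_cost=1.0),
    Engine("engine-c", healthy_replicas=0, capacity_per_replica=10.0, geo_cost=0.0),
]
weights = compute_weights(engines)
chosen = pick_engine(weights, random.Random(0))
```

In this sketch the controller runs off the request path and publishes `weights`, so the proxy only performs a cheap weighted pick per request. Logging the weights alongside each choice is one simple way to explain afterwards why a request went to a particular backend.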

Inference · intermediate · talk