Asaf Gardin

Inference Engineer

AI21

Asaf Gardin is a Senior Software Engineer on the inference team at AI21 Labs, where he works on high-performance LLM inference and the production deployment of the Jamba hybrid SSM-Transformer models. He's an active vLLM committer, contributing to quantization, scheduling, and support for Mamba-based architectures. His talk covers two production bugs in vLLM's Mamba support: a scheduler edge case that corrupted SSM state under memory pressure, and a 32-bit integer overflow in a CUDA kernel that surfaced as RL training instability. Both were root-caused at AI21 and fixed upstream. He also built Kernel Academy, a browser-based tutorial for learning Triton GPU programming. He previously worked at IBM.

Sessions (1)

Two Bugs That Hid in Plain Sight: A vLLM Debugging Detective Story

2:50 PM · Track 9 · Room 2016

Your model generates gibberish. Once every thousand prompts. High confidence scores. No crashes. No warnings. We hit this twice while building Jamba models. First: a request gets misclassified during scheduling, loads stale state from a previous prompt cache slot, and confidently generates nonsense. Second: logprob spikes during RL training that looked like training instability, until we noticed they tracked with rollout count, then with cache size. In this talk, we'll walk through both debugging journeys: the false starts, how we instrumented vLLM to thread request IDs through the forward pass, and the search for variables that change failure structure rather than magnitude. The lesson both share: distributed inference systems fail silently. No stack trace. No sanitizer warning. Just wrong answers with perfect confidence. You'll learn how to build comparison scripts that expose logprob divergence, force memory pressure to surface rare bugs, and shrink a distributed RL training mystery into a reproducible single-script failure. Walk away knowing how to debug vLLM when it lies to you quietly.

Speakers: Asaf Gardin, AI21; Yuval Belfer, AI21 Labs.
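As a rough illustration of the "comparison script" idea in the abstract, here is a minimal sketch, not the speakers' actual tooling. It assumes per-token logprobs for the same prompt and the same sampled tokens have already been collected from two runs (for example, a clean run and a run under memory pressure). The function names, the tolerance, and the numbers are hypothetical.

```python
from typing import Optional, Sequence


def first_divergence(
    ref: Sequence[float], test: Sequence[float], tol: float = 1e-2
) -> Optional[int]:
    """Return the index of the first token whose logprob differs by more
    than `tol` between the two runs, or None if they agree throughout."""
    for i, (a, b) in enumerate(zip(ref, test)):
        if abs(a - b) > tol:
            return i
    return None


def max_abs_diff(ref: Sequence[float], test: Sequence[float]) -> float:
    """Largest absolute per-token logprob gap between the two runs."""
    return max((abs(a - b) for a, b in zip(ref, test)), default=0.0)


# Made-up logprobs for illustration only; real values would come from the
# inference engine's per-token logprob output for identical token sequences.
reference = [-0.11, -0.52, -1.30, -0.07]
suspect = [-0.11, -0.52, -4.85, -2.10]

idx = first_divergence(reference, suspect)
if idx is None:
    print("runs agree within tolerance")
else:
    print(f"first divergence at token {idx}, "
          f"max gap {max_abs_diff(reference, suspect):.2f}")
```

Reporting the position of the first divergence, rather than only an aggregate gap, is one way to look at the structure of a failure instead of its magnitude.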

Inference · intermediate · talk