Your model generates gibberish. Once every thousand prompts. High confidence scores. No crashes. No warnings. We hit this twice while building Jamba models. First: a request gets misclassified during scheduling, loads stale state from a previous prompt cache slot, and confidently generates nonsense. Second: logprob spikes during RL training that looked like training instability, until we noticed they tracked with rollout count, then with cache size.

In this talk, we'll walk through both debugging journeys: the false starts, how we instrumented vLLM to thread request IDs through the forward pass, and the search for variables that change the structure of a failure rather than its magnitude. Both share one lesson: distributed inference systems fail silently. No stack trace. No sanitizer warning. Just wrong answers with perfect confidence.

You'll learn how to build comparison scripts that expose logprob divergence, force memory pressure to surface rare bugs, and shrink a distributed RL training mystery into a reproducible single-script failure. Walk away knowing how to debug vLLM when it lies to you quietly.

Speakers: Asaf Gardin (AI21); Yuval Belfer (AI21 Labs).
Inference sessions at AI Engineer World's Fair 2026 in San Francisco.
Thursday, July 2, 2026
2:50 PM - 3:10 PM·20m
Track 9 · Room 2016
Capacity: 250 attendees

Asaf Gardin
Inference Engineer
AI21
Asaf Gardin is a Senior Software Engineer on the inference team at AI21 Labs, where he works on high-performance LLM inference and the production deployment of the Jamba hybrid SSM-Transformer models. He's an active vLLM committer, contributing to quantization, scheduling, and support for Mamba-based architectures. His talk covers two production bugs in vLLM's Mamba support: a scheduler edge case that corrupted SSM state under memory pressure, and a 32-bit integer overflow in a CUDA kernel that surfaced as RL training instability. Both were root-caused at AI21 and fixed upstream. He also built Kernel Academy, a browser-based tutorial for learning Triton GPU programming. Previously at IBM.
Yuval Belfer
AI21 Labs
Senior Developer Advocate at AI21 Labs; also involved with AI Tinkerers and YAAP (Yet Another AI Podcast).