AI Engineer World's Fair 2026

All the Things We Have to Do to Satisfy Your Insatiable Need for Tokens

Talk · Intermediate

Every time the industry figures out how to serve tokens faster and cheaper, the appetite grows to match. Models get bigger, contexts get longer, agents start chaining thousands of calls together. The finish line keeps moving. This talk is a technical tour through everything the industry has done to keep up, led by two experts in high-performance inference. We'll start with the optimizations that made hardware work harder without changing the underlying architecture. Then we'll go up a level with techniques that work smarter across requests and across the model itself. And finally, a peek into the future with heterogeneous disaggregated inference, the architectural shift that splits prefill and decode across specialized hardware, and even more advanced forms of hardware specialization coming your way soon. Token demand is about to get a lot more insatiable. Let's see what the future has in store for us!

Speakers: Daniel Kim (Cerebras Systems); Natalie Serrino (Gimlet Labs)

About the Inference Track

Inference sessions at AI Engineer World's Fair 2026 in San Francisco.
