AI Engineer World's Fair 2026

Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs

Sponsor SessionIntermediate

Every production agent today is renting its intelligence. You're paying per token, sending your customer's data to someone else's servers, and hoping the provider doesn't rate-limit you during your launch. For most teams, that's fine. But for a growing number of teams in regulated industries, with high-volume products, latency-sensitive workloads, or rising token bills, it's starting to look like a liability. In this 120-minute hands-on workshop you'll get a dedicated GPU and build an agent that runs on infrastructure you control. You'll stand up vLLM, point your agent at it, and drive concurrent load through the stack until you can see batching, KV cache pressure, and throughput limits in the metrics. Then you'll optimize the deployment to improve throughput while keeping per-request latency in line. The focus isn't agent frameworks. It's the inference layer underneath them. You'll leave with working code and a real understanding of continuous batching under real concurrency, KV cache tradeoffs, vLLM's metrics, and the bottlenecks that only show up when you operate the inference server yourself.

About the Track 7 Track

Track 7 sessions at AI Engineer World's Fair 2026 in San Francisco.

Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs

About the Track 7 Track

When

Where

Speaker

Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs

About the Track 7 Track

When

Where

Speaker