Every production agent today is renting its intelligence. You're paying per token, sending your customers' data to someone else's servers, and hoping the provider doesn't rate-limit you during your launch. For most teams, that's fine. But for a growing number of teams in regulated industries, with high-volume products, latency-sensitive workloads, or rising token bills, it's starting to look like a liability.

In this 120-minute hands-on workshop you'll get a dedicated GPU and build an agent that runs on infrastructure you control. You'll stand up vLLM, point your agent at it, and drive concurrent load through the stack until you can see batching, KV cache pressure, and throughput limits in the metrics. Then you'll tune the deployment to raise throughput while keeping per-request latency in line.

The focus isn't agent frameworks. It's the inference layer underneath them. You'll leave with working code and a practical understanding of continuous batching under real concurrency, KV cache tradeoffs, vLLM's metrics, and the bottlenecks that only show up when you operate the inference server yourself.
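To give a feel for the "drive concurrent load and read the metrics" step, here is a minimal sketch, not the workshop's actual code. It assumes a vLLM OpenAI-compatible server is already listening on `http://localhost:8000` (started separately, for example with `vllm serve`), that it serves `/v1/completions` and a Prometheus-style `/metrics` route, and that `MODEL` is a hypothetical placeholder for whatever model name you served. The helper names (`post_completion`, `run_load`, `vllm_metric_lines`, `main`) are invented for illustration.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: default local port and a placeholder model name.
BASE_URL = "http://localhost:8000"
MODEL = "my-served-model"  # hypothetical; replace with the model you served


def post_completion(prompt, max_tokens=64):
    """Blocking POST to /v1/completions; returns the completion token count."""
    body = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def run_load(send, prompts, concurrency):
    """Call send(prompt) with at most `concurrency` requests in flight."""

    def timed(prompt):
        start = time.perf_counter()
        n_tokens = send(prompt)
        return n_tokens, time.perf_counter() - start

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, prompts))
    wall = time.perf_counter() - t0

    latencies = sorted(lat for _, lat in results)
    tokens = sum(n for n, _ in results)
    return {
        "requests": len(results),
        "tokens": tokens,
        "tokens_per_s": tokens / wall if wall > 0 else 0.0,
        "p50_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }


def vllm_metric_lines(text):
    """Keep only the `vllm:`-prefixed samples from a Prometheus text payload."""
    return [ln for ln in text.splitlines() if ln.startswith("vllm:")]


def main():
    """Sweep a few concurrency levels, then dump the server's vLLM metrics."""
    prompts = [f"Summarize support ticket {i}." for i in range(64)]
    for concurrency in (1, 8, 32):
        print(concurrency, run_load(post_completion, prompts, concurrency))
    with urllib.request.urlopen(f"{BASE_URL}/metrics", timeout=30) as resp:
        print("\n".join(vllm_metric_lines(resp.read().decode())))
```

Calling `main()` once the server is up prints one stats dictionary per concurrency level. If batching is working, total tokens per second should climb as concurrency rises while per-request latency grows more slowly, until the GPU or the KV cache becomes the limit. The latency here is measured from the moment a worker thread picks up a request, so it excludes time spent queued on the client side.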
Track 7 sessions at AI Engineer World's Fair 2026 in San Francisco.
Monday, June 29, 2026
9:00 AM - 11:00 AM · 2h
Track 7 · Room 2024
Capacity: 250 attendees

Du'an Lightfoot
Senior AI Engineer
Akamai Technologies
Du'an Lightfoot is a Senior AI Engineer at Akamai Technologies specializing in artificial intelligence and network engineering. Previously a Senior Developer Advocate at AWS, Lightfoot is also the founder of LabEveryDay.