Perception agents only learn as fast as we can feed them. Multimodal SFT is deceptively expensive on the data side, and at million-sample scale, naive pipelines leave a fleet of GPUs waiting on Python and data preprocessing.

This talk walks through the SFT data pipeline we built to train vision-language models for perception agents. We rebuilt the data path so that image fetching, vision preprocessing, tokenization, and loss-mask generation all happen off the trainer's critical path, and only the artifacts the trainer actually consumes ever cross the boundary into the training loop. We pair this with a blended multi-dataset sampler designed for resumable streaming over very large mixes, and an I/O layer tuned for the realities of fetching multimodal data from object storage.

The result: on large-scale VLM SFT runs, the trainer went from spending most of each step blocked on data to spending most of it training, a major improvement in useful GPU time. We'll share the architecture at a conceptual level, the gotchas at million-datapoint scale, and a mental model engineers can take home for the data side of any perception-agent stack.
Expo Stage 4 sessions at AI Engineer World's Fair 2026 in San Francisco.
Wednesday, July 1, 2026
3:45 PM - 4:05 PM · 20m
Expo Stage 4
Capacity: 250 attendees

Tarun Sunkaraneni
Amazon AGI
Tarun Sunkaraneni is speaking at AI Engineer World's Fair 2026.