Physical AI had its “Attention Is All You Need” moment with the rise of Vision-Language-Action (VLA) models. The next bottleneck is data: not just more video, but the ability to find the exact real-world moments that teach models how the world works, including gravity, motion, causality, human behavior, and object interactions. This session explores a new approach: discovering specific scenes within the vast body of video on the web. We’ll show how teams can search for moments like objects falling, people interacting with environments, or actions unfolding over time, then collect and structure only the relevant clips for training and evaluation. Attendees will learn how scene-level discovery changes multimodal data pipelines, cutting wasted effort in collection, processing, storage, and review, while making it easier to build targeted datasets for VLA systems, robotics, physical AI, and agentic world models.
Expo Stage 2 sessions at AI Engineer World's Fair 2026 in San Francisco.
Tuesday, June 30, 2026
2:50 PM - 3:10 PM · 20 min
Expo Stage 2
Capacity: 250 attendees
Speaker
Speaker to be announced.