Most AI models start as text systems and have vision, audio, and other modalities added later. That ordering shows up in the work: handoffs between modalities, brittle understanding of mixed inputs, and gaps that surface exactly when real tasks demand reading a chart, a document, and code together. This session looks at a different approach — models trained as multimodal from step zero, where text, image, audio, and video share the same foundation rather than being stitched together. We'll look at why that matters for the kind of work organizations actually want from AI: understanding messy, mixed real-world inputs, holding context across them, and carrying complex tasks end to end. The throughline is what this unlocks for teams deciding where AI can take real work today — and how MiniMax is building toward that frontier.
Expo Stage 4 sessions at AI Engineer World's Fair 2026 in San Francisco.
Wednesday, July 1, 2026
1:55 PM - 2:15 PM · 20 min
Expo Stage 4
Capacity: 250 attendees
Speaker
To be announced.