AI Engineer World's Fair 2026

Natively Multimodal from Step Zero

Talk · Intermediate

Most AI models start as text systems and have vision, audio, and other modalities added later. That ordering shows up in the work: handoffs between modalities, brittle understanding of mixed inputs, and gaps that surface exactly when real tasks demand reading a chart, a document, and code together. This session looks at a different approach — models trained as multimodal from step zero, where text, image, audio, and video share the same foundation rather than being stitched together. We'll look at why that matters for the kind of work organizations actually want from AI: understanding messy, mixed real-world inputs, holding context across them, and carrying complex tasks end to end. The throughline is what this unlocks for teams deciding where AI can take real work today — and how MiniMax is building toward that frontier.

About the Expo Stage 4 Track

Expo Stage 4 sessions at AI Engineer World's Fair 2026 in San Francisco.

