AI Engineer World's Fair 2026

Fault-Tolerant Training at Scale: Making Hardware Failures a Non-Event

Talk · Intermediate

Hardware failures in large-scale distributed training are inevitable: when you're running thousands of GPUs, they happen multiple times a day. The standard response is manual intervention: an engineer gets paged, SSHs into the cluster, and spends an hour fixing something the infrastructure should have handled automatically. That lost time compounds directly into wasted compute and delayed research.

This session walks through the self-healing platform Crusoe built to eliminate that manual loop: a managed Slurm environment running on Kubernetes, with automated node failure remediation and real-time cluster observability. We'll show how these components work together so that hardware failures become a non-event. We'll cover the architecture end to end: how running Slurm on Kubernetes unlocks infrastructure resilience that traditional GPU clusters don't have, how automated hardware monitoring and node remediation can eliminate manual intervention, and how full observability into every remediation event keeps engineering teams informed without keeping them on call. For teams that want deeper control, we'll also discuss open-loop remediation, which gives them full control over the node replacement process for application-specific workflows.

About the Expo Stage 1 Track

Expo Stage 1 sessions at AI Engineer World's Fair 2026 in San Francisco.
