Hardware failures in large-scale distributed training are inevitable. When you're running thousands of GPUs, they happen multiple times a day. The standard response is manual intervention: an engineer gets paged, SSHs into the cluster, and spends an hour fixing something the infrastructure should have handled automatically. That lost time compounds directly into wasted compute and delayed research. This session walks through the self-healing platform Crusoe built to eliminate that manual loop entirely: a managed Slurm environment running on Kubernetes, with automated node failure remediation and real-time cluster observability. It also shows how these components work together so that hardware failures become a non-event. We'll cover the architecture end-to-end: how running Slurm on Kubernetes unlocks infrastructure resilience that traditional GPU clusters don't have, how automated hardware monitoring and node remediation can eliminate manual intervention entirely, and how full observability into every remediation event keeps engineering teams informed without keeping them on-call. For teams that want deeper control, we'll also discuss open-loop remediation, which puts the node replacement process in their hands for application-specific workflows.
Expo Stage 1 sessions at AI Engineer World's Fair 2026 in San Francisco.
Wednesday, July 1, 2026
11:40 AM - 12:00 PM · 20 min
Expo Stage 1
Capacity: 250 attendees
Speaker
Speaker to be announced.