Webinar On Demand

Handling Hardware Failures During Training: A Comparative Analysis of Fault Tolerant Training Frameworks

View a Complimentary Live Webinar Presented by Clockwork.io

At scale, hardware failures become a statistical certainty in distributed training. Mean Time Between Failures (MTBF) falls rapidly as cluster size grows, from 7.9 hours at 1,024 GPUs to just 1.8 hours at 16,384 GPUs (Meta FAIR Research¹). At the same time, each failure is costly: even a single network link flap or GPU fault can cause stalls and timeouts and eventually crash an entire job, leaving expensive clusters idle.

This webinar presents a technical comparison of three runtime resiliency strategies for distributed training. The first, checkpoint/restart, periodically saves training state to persistent storage and recovers from failures by restoring the last checkpoint and recomputing lost work. The second, live GPU migration, intercepts failures and transfers training state to spare accelerators, resuming at the same step after a short pause. The third reduces the active world size by dropping the impacted replica group, allowing training to continue immediately with altered training semantics.
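To make the checkpoint/restart trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the webinar. It uses Young's first-order approximation for the checkpoint interval and assumes independent failures and a checkpoint cost that is small relative to MTBF. The MTBF figures are the ones quoted above; the 5-minute checkpoint write time and the function names are hypothetical, chosen only for illustration.

```python
import math


def young_interval_hours(checkpoint_cost_h: float, mtbf_h: float) -> float:
    """Young's first-order approximation: interval = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


def overhead_fraction(interval_h: float, checkpoint_cost_h: float,
                      mtbf_h: float, restart_cost_h: float = 0.0) -> float:
    """Rough share of wall-clock time not spent on new training work.

    First-order estimate only: time spent writing checkpoints, plus the
    work lost per failure (on average half an interval) and any restart
    time, amortized over the MTBF.
    """
    checkpoint_share = checkpoint_cost_h / interval_h
    failure_share = (interval_h / 2.0 + restart_cost_h) / mtbf_h
    return checkpoint_share + failure_share


if __name__ == "__main__":
    checkpoint_cost_h = 5.0 / 60.0  # hypothetical: 5 minutes per checkpoint
    for gpus, mtbf_h in [(1024, 7.9), (16384, 1.8)]:  # MTBF values quoted above
        interval = young_interval_hours(checkpoint_cost_h, mtbf_h)
        overhead = overhead_fraction(interval, checkpoint_cost_h, mtbf_h)
        print(f"{gpus} GPUs: checkpoint every {interval:.2f} h, "
              f"estimated overhead {overhead:.1%}")
```

Under these assumptions, the estimated overhead at the chosen interval reduces to sqrt(2 * C / MTBF), so it grows as MTBF shrinks. That is one way to see why checkpoint/restart alone becomes more expensive on larger clusters, and why the other two strategies avoid recomputation in exchange for added complexity or altered training semantics.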

The session examines the design trade-offs between these approaches across performance, training semantics, implementation complexity, and operational reliability. Attendees will come away with a clearer understanding of how each mechanism works in practice and how to evaluate them against the specific constraints of their own training infrastructure.

¹ Revisiting Reliability in Large-Scale Machine Learning Research Clusters - https://arxiv.org/html/2410.21680v2

Speakers

Suresh Vasudevan

Suresh Vasudevan, CEO and Chief Product Officer, Clockwork.io

Suresh Vasudevan brings a proven track record of leadership and strong product innovation from his transformative CEO role at Sysdig. Previously, he was CEO of Nimble Storage, leading it from startup to IPO and acquisition by HPE. He also served as CEO of Omneon (acquired by Harmonic) and held product and leadership roles at NetApp. Suresh holds a B.E. in Electrical Engineering from BITS Pilani and an M.B.A. from IIM Calcutta.

Gavin Cohen

Gavin Cohen, VP of Product, Clockwork.io

Gavin brings over 20 years of product management and go-to-market leadership experience in the U.S. and Australia. He most recently served as VP of AI Product Management at ScienceLogic and has also held senior leadership roles at Zebrium, HPE, Nimble Storage and NetApp. Gavin holds a Bachelor of Computer Science from UNSW and an MBA from UTS in Sydney, Australia.

Jordan Nanos

Jordan Nanos, Member of Technical Staff and Lead Author of ClusterMAX, SemiAnalysis

Jordan Nanos is a Member of Technical Staff at SemiAnalysis focused on Neoclouds. He leads ClusterMAX and related projects, bringing hands-on experience to the firm's research on AI system performance. Previously, Jordan was a Distinguished Technologist at HPE, where he worked on the design and implementation of some of the largest supercomputers in the world. Jordan holds a Bachelor’s degree in Electrical Engineering from Queen’s University.