In-Cluster Distributed Checkpointing: Optimizing Training Goodput Through GPU Node Locality
Lightning Talk: In-Cluster Distributed Checkpointing: Optimizing Training Goodput Through GPU Node Locality - Gerson Kroiz, Google & Saurabh Mishra, Meta

As AI model training scales to thousands of GPUs, resiliency becomes essential. Failures from preemption, crashes, or infrastructure issues can cause major training inefficiency and delay time-to-market. Checkpointing enables recovery, but traditional methods that rely on remote storage often introduce latency and scalability challenges.

This talk presents In-Cluster Checkpointing, a new feature built on PyTorch's Distributed Checkpointing (DCP) APIs that leverages node-local storage for faster, more scalable checkpointing. Each GPU node saves and restores its local training state, enabling frequent, low-overhead checkpoints that reduce both lost training progress and restart latency. To support node replacement (e.g., after failure or preemption), local checkpoints are automatically replicated and transferred to new nodes during recovery.

Co-developed by Google Cloud's GPU Resiliency team and Meta's Distributed Checkpointing team, this solution has improved training goodput by up to 5% in large-scale deployments, saving many thousands of GPU hours over multi-week runs. Attendees will gain practical insights on integrating this technique to improve goodput in their own training jobs.
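
The sketch below is a minimal illustration, not the talk's implementation: it shows how a training job might write and restore per-rank checkpoint shards on node-local storage using PyTorch's public DCP APIs (torch.distributed.checkpoint). The mount point /mnt/local-ssd/ckpt, the helper names, and the save/load structure are assumptions made for illustration; the automatic replication and transfer of local checkpoints to replacement nodes described in the talk is not shown here.

```python
# Minimal sketch, assuming torch.distributed is already initialized and a
# node-local SSD is mounted at /mnt/local-ssd/ckpt (hypothetical path).
import os

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict


def save_local_checkpoint(model, optimizer, step, local_dir="/mnt/local-ssd/ckpt"):
    """Collectively save sharded model/optimizer state. Each rank writes only
    its own shards, so every GPU node persists just its local training state."""
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.save(
        {"model": model_sd, "optim": optim_sd},
        checkpoint_id=os.path.join(local_dir, f"step-{step}"),
    )


def load_local_checkpoint(model, optimizer, step, local_dir="/mnt/local-ssd/ckpt"):
    """Restore this rank's shards from node-local storage after a restart."""
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, checkpoint_id=os.path.join(local_dir, f"step-{step}"))
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```

Because writes land on local disk rather than remote object storage, checkpoints like these can be taken far more frequently at low overhead; the replication of local shards to peer nodes and their transfer to replacement nodes during recovery, which the talk covers, would sit on top of this basic save/load path.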