Accelerating PyTorch FSDP Via Overlapping Collectives With In-Network Computing and Multicast
Lightning Talk: Accelerating PyTorch FSDP Via Overlapping Collectives With In-Network Computing and Multicast - Nariaki Tateiwa, Nippon Telegraph and Telephone

The rapid growth of LLM scale has exceeded accelerator memory limits; as a result, memory-efficient distributed training has become essential. In PyTorch, Fully Sharded Data Parallel (FSDP) has become a critical solution to address this challenge. However, FSDP suffers from significant inter-accelerator communication overhead, particularly with the ReduceScatter and Allgather collectives in the backward pass. Recent research by Khalilov et al. (SC '24) proposes a scheme for overlapping these collectives using in-network computing and multicast technologies, potentially halving the communication volume in the backward pass. However, no practical collective communication library enables both technologies, and thus the current PyTorch FSDP (v1/v2) implementations do not support overlapping these collectives. In this talk, we present (1) an extension to the UCC library that leverages in-network computing and InfiniBand multicast; (2) an implementation that enables overlapping ReduceScatter and Allgather in PyTorch FSDP; and (3) numerical results demonstrating up to 2x faster communication and significant improvements in training throughput for LLM workloads.
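
For context, the sketch below shows where the collectives named in the abstract arise in a stock PyTorch FSDP training step: parameters are gathered with Allgather and gradients are reduced with ReduceScatter during backward. It is a minimal illustration, not the speakers' implementation; it assumes a multi-GPU launch (e.g. via torchrun) and that the "ucc" process-group backend is available in the local PyTorch build. The UCC extension with in-network computing and InfiniBand multicast described in the talk is not part of stock PyTorch.

```python
# Minimal FSDP sketch (illustrative only). Assumes a PyTorch build with the
# experimental "ucc" backend; "nccl" is a drop-in alternative.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main() -> None:
    # Process group over which FSDP issues its collectives.
    dist.init_process_group(backend="ucc")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # FSDP shards parameters across ranks. Each backward step issues
    # ReduceScatter (to shard gradients) and Allgather (to re-materialize
    # parameters) -- the two collectives the talk proposes to overlap.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```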