Enabling Lightweight, High-Performance FSDP With NVIDIA GPUs
Jianbin Chang, Cory Ye, Xuwen Chen & Sangkug Lym, NVIDIA

Fully sharded data parallel (FSDP) is quickly becoming the dominant distributed training technique for large models thanks to its strong performance and ease of adoption. However, existing FSDP implementations often suffer from communication and memory inefficiencies at scale. Here, we introduce Megatron-FSDP, which further improves FSDP performance while retaining native PyTorch compatibility. The key contributions of Megatron-FSDP include:

- Persistent communication buffers that enable user-buffer registration in low-resource collectives and mitigate memory fragmentation.
- NVLink/InfiniBand SHARP offload of collective operations from GPU SMs.
- Native TransformerEngine FP8 support.
- Scalable MoE support through tight Megatron-Core integration with expert parallelism.
- Zero-copy sharding that avoids parameter/gradient copy overhead.

Our implementation is fully compatible with PyTorch APIs and deeply integrates with Megatron-Core and TransformerEngine. We share deployment tips (NVLink/IB SHARP offloading, operator fusions, FP8 workflows) and demonstrate how Megatron-FSDP streamlines large-scale training.
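For context, the sketch below shows the stock PyTorch FSDP wrap-and-train pattern that the abstract's "native PyTorch compatibility" claim refers to. It uses only standard `torch.distributed.fsdp` APIs and a toy model; the Megatron-FSDP entry point itself is not specified here, so this should be read as an illustration of the usage style rather than the library's actual interface.

```python
# Minimal sketch of the PyTorch-native FSDP usage pattern referenced above.
# Stock torch.distributed.fsdp only; not the Megatron-FSDP entry point.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision


def main():
    # One process per GPU (e.g. launched with torchrun); NCCL backs the
    # collectives that Megatron-FSDP further optimizes.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Sequential(
        nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks.
    # BF16 mixed precision stands in here for the FP8 workflow the talk covers.
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
    )

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).float().pow(2).mean()
    loss.backward()
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with, for example, `torchrun --nproc_per_node=8 train.py`. The abstract's claim is that this same wrap-and-train flow carries over to Megatron-FSDP, with persistent communication buffers, NVLink/IB SHARP offload, and FP8 handled underneath.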