Lifecycle of a Parameter
Philip Bontrager, Meta

Long gone are the days when the entire lifecycle of a parameter consisted of being initialized on a GPU, updated on the same GPU, and finally saved to your local SSD. As models and training pipelines scale in size and complexity, a single parameter will get sharded, resharded, streamed across multiple machines, downloaded, possibly quantized, and renamed multiple times. In this talk we'll follow a parameter through a large-scale LLM RL post-training job to understand everything that needs to happen behind the scenes in PyTorch for this to work. From reading in a checkpoint that is hundreds of gigabytes, to distributing the parameter across multiple dimensions, quantizing the weights, syncing them with an inference server that is laid out and optimized completely differently, updating all of the shards together, and finally saving the parameter split across multiple devices, every step has to work together and happen in almost no time at all. This talk will outline the challenges and some of the current solutions for training parameters at scale.
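To make a couple of the abstract's steps concrete, here is a minimal sketch (not code from the talk) of sharding a single parameter across a 2-D device mesh with DTensor and saving it with distributed checkpointing. It assumes a recent PyTorch (2.5+, where DTensor lives under torch.distributed.tensor) and an 8-rank launch; the mesh dimension names and checkpoint_id are illustrative, not from the talk.

```python
# Illustrative sketch: shard one parameter over a 2-D device mesh, then save it
# with torch.distributed.checkpoint (DCP).
# Assumed launch: torchrun --nproc_per_node=8 sketch.py
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

device_type = "cuda" if torch.cuda.is_available() else "cpu"
if device_type == "cuda":
    # Bind each rank to its local GPU (LOCAL_RANK is set by torchrun).
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2-D mesh: 2 "dp" replicas x 4 "tp" shards (8 ranks total); the dim names are
# just labels chosen for this example.
mesh = init_device_mesh(device_type, (2, 4), mesh_dim_names=("dp", "tp"))

# A single "parameter": a full weight matrix.
weight = torch.randn(4096, 4096)

# Shard dim 0 across the dp mesh dimension and dim 1 across the tp mesh
# dimension, so each rank holds a 2048 x 1024 local shard wrapped in a DTensor.
dweight = distribute_tensor(weight, mesh, placements=[Shard(0), Shard(1)])
print(f"rank {dist.get_rank()}: local shard {tuple(dweight.to_local().shape)}")

# Distributed checkpointing writes each rank's shard in parallel; on load, the
# checkpoint can be resharded to a different mesh or world size than the one
# it was saved from.
dcp.save({"weight": dweight}, checkpoint_id="checkpoint_step_0")

dist.destroy_process_group()
```

The same DTensor carries enough placement metadata that later stages of the pipeline (resharding for an inference engine, or reloading on a different number of devices) can reason about where each shard lives without hand-written bookkeeping.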