AI Oct 22, 2025

No GPU Left Behind: Scaling Online LLM Training With Co-located vLLM in TRL

Mert Toslali & Yu Chin Fabian Lim, IBM Research

Training LLMs with online RL methods such as GRPO presents a unique challenge: inference is required at every training step. In the standard Hugging Face TRL setup, inference is handled by vLLM running as a separate server on dedicated GPUs, communicating with the trainer over HTTP. This creates a "ping-pong" inefficiency: training GPUs sit idle during generation, and inference GPUs sit idle during training, leading to poor GPU utilization and high cost. Our talk introduces co-located vLLM, a key optimization that enables training and inference to run on the same GPUs. Built on vLLM's external_launcher, it allows in-process, torch-compatible execution, and our now-merged PR to TRL eliminates the need for HTTP calls or separate inference servers. The setup supports torchrun and tensor/data parallelism (TP/DP) and scales to training large models (such as 72B). It improves training throughput by up to 1.7x, reduces the number of GPUs needed, and is now part of the official TRL repo.
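As a rough illustration of the building block the co-located setup relies on, the sketch below shows vLLM's external_launcher executor backend, which lets each torchrun-spawned rank create the engine in-process so generation shares the process (and GPU) with torch training code. The model name, memory fraction, and sampling settings here are illustrative assumptions, not the talk's exact configuration.

```python
# Launch with: torchrun --nproc-per-node=2 infer_inprocess.py
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=2,                             # TP across the torchrun ranks
    distributed_executor_backend="external_launcher",   # in-process engine, no separate server
    gpu_memory_utilization=0.3,                         # leave headroom for training tensors
)

outputs = llm.generate(
    ["Summarize: reinforcement learning with verifiable rewards ..."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```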
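On the TRL side, co-located generation is exposed through the trainer configuration. A minimal sketch follows, assuming a recent TRL release where GRPOConfig accepts use_vllm and vllm_mode="colocate"; the dataset, model, reward function, and memory fraction are placeholders, and exact argument names and defaults may differ by version.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any dataset with a "prompt" column works; this one is used in the TRL docs.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

args = GRPOConfig(
    output_dir="grpo-colocate",
    use_vllm=True,                    # generate with vLLM instead of model.generate
    vllm_mode="colocate",             # run vLLM in-process on the training GPUs (no HTTP server)
    vllm_gpu_memory_utilization=0.3,  # leave most GPU memory for training
    vllm_tensor_parallel_size=1,      # TP degree for the co-located engine
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

Launched under torchrun or accelerate, each training rank hosts its own co-located engine, which is what removes the idle "ping-pong" between dedicated training and inference GPUs.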