Oct 22, 2025

Serving PyTorch LLMs at Scale: Disaggregated Inference With Kubernetes and llm-d

Maroon Ayoub, IBM Research & Cong Liu, Google

As PyTorch-based LLMs scale in complexity and user concurrency, their inference demands diverge across stages: prefill is compute-heavy, while decode is latency-sensitive. In this talk, we introduce a disaggregated serving pattern for PyTorch LLMs using llm-d, a Kubernetes-native, open-source framework co-developed by IBM Research, Google, and Red Hat. We’ll walk through how llm-d separates prefill and decode into orchestrated sidecars, improving GPU utilization and QoS alignment. You’ll learn how the Gateway API Inference Extension (GIE) enables routing based on load, cache locality, and session affinity. The talk includes real-world benchmarks and a visual demo of llm-d serving PyTorch models with vLLM across heterogeneous hardware on Kubernetes.
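To make the routing idea concrete, here is a minimal, hypothetical Python sketch of scoring candidate endpoints along the three signals mentioned above (load, prefix-cache locality, and session affinity). The names, fields, and weights are illustrative assumptions for this post, not the actual GIE or llm-d scheduler API.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    """Hypothetical view of a decode replica as seen by a router."""
    name: str
    queue_depth: int           # outstanding requests (proxy for load)
    cached_prefix_tokens: int  # tokens of this request's prefix already in the KV cache
    session_id: str | None     # session currently pinned to this endpoint, if any

def score(ep: Endpoint, prompt_tokens: int, session_id: str) -> float:
    """Illustrative scoring: prefer low load, high prefix-cache reuse,
    and the endpoint already serving this session."""
    load_penalty = float(ep.queue_depth)
    cache_bonus = ep.cached_prefix_tokens / max(prompt_tokens, 1)
    affinity_bonus = 1.0 if ep.session_id == session_id else 0.0
    # Weights are arbitrary placeholders; a real scheduler would tune or learn them.
    return 2.0 * cache_bonus + 1.0 * affinity_bonus - 0.5 * load_penalty

def pick_endpoint(endpoints: list[Endpoint], prompt_tokens: int, session_id: str) -> Endpoint:
    """Route the request to the highest-scoring endpoint."""
    return max(endpoints, key=lambda ep: score(ep, prompt_tokens, session_id))

if __name__ == "__main__":
    candidates = [
        Endpoint("decode-0", queue_depth=4, cached_prefix_tokens=0, session_id=None),
        Endpoint("decode-1", queue_depth=9, cached_prefix_tokens=1800, session_id="user-42"),
    ]
    chosen = pick_endpoint(candidates, prompt_tokens=2048, session_id="user-42")
    print(f"route to {chosen.name}")  # cache locality and affinity outweigh the higher load here
```

The point of the sketch is only that routing is a multi-signal decision rather than plain round-robin; how GIE actually combines these signals is covered in the talk.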