AI Oct 22, 2025

Multi-Accelerator PyTorch Serving With NxD Inference and vLLM

Yahav Biran & Liangfu Chen, Amazon

Learn how the open-source NxD Inference (NxDI) library delivers high-performance PyTorch model serving on AWS Trainium and Inferentia. We'll show how NxDI features such as continuous batching, speculative decoding, and distributed parallelism can run alongside TorchInductor-compiled CUDA kernels in a single vLLM-based Kubernetes cluster, enabling real-time traffic shifting between accelerator pools.
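
The traffic-shifting idea described above can be sketched with the Kubernetes Gateway API, which supports weighted routing across backends. The sketch below is illustrative only and is not from the talk: the resource names (`inference-gateway`, `vllm-neuron`, `vllm-cuda`) and the 70/30 split are assumptions, standing in for two vLLM Services, one backed by a Trainium/Inferentia (NxDI) pool and one by a CUDA (TorchInductor) pool.

```yaml
# Hypothetical weighted split between two vLLM serving pools.
# Adjusting the weights shifts live traffic between accelerator pools.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-traffic-split
spec:
  parentRefs:
    - name: inference-gateway   # assumed Gateway resource
  rules:
    - backendRefs:
        # 70% of requests to the Trainium/Inferentia (NxDI) pool
        - name: vllm-neuron     # assumed Service name
          port: 8000
          weight: 70
        # 30% of requests to the CUDA (TorchInductor) pool
        - name: vllm-cuda       # assumed Service name
          port: 8000
          weight: 30
```

Because both pools expose the same OpenAI-compatible vLLM API, clients need no changes when the weights are rebalanced.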