AI Oct 22, 2025

Multi-Accelerator PyTorch Serving With NxD Inference and vLLM

Yahav Biran & Liangfu Chen, Amazon

Learn how the open-source NxD Inference (NxDI) library delivers high-performance PyTorch model serving on AWS Trainium and Inferentia. We'll show how NxDI features such as continuous batching, speculative decoding, and distributed parallelism can run alongside TorchInductor-compiled CUDA kernels in a single vLLM-based Kubernetes cluster, enabling real-time traffic shifting between accelerator pools.
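
The traffic-shifting idea described above can be sketched with the Kubernetes Gateway API, which supports weighted routing across backends. The sketch below is illustrative only and is not from the talk: the resource names (`inference-gateway`, `vllm-neuron`, `vllm-cuda`) and the 70/30 split are assumptions, standing in for two vLLM Services, one backed by a Trainium/Inferentia (NxDI) pool and one by a CUDA (TorchInductor) pool.

```yaml
# Hypothetical weighted split between two vLLM serving pools.
# Adjusting the weights shifts live traffic between accelerator pools.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-traffic-split
spec:
  parentRefs:
    - name: inference-gateway   # assumed Gateway resource
  rules:
    - backendRefs:
        # 70% of requests to the Trainium/Inferentia (NxDI) pool
        - name: vllm-neuron     # assumed Service name
          port: 8000
          weight: 70
        # 30% of requests to the CUDA (TorchInductor) pool
        - name: vllm-cuda       # assumed Service name
          port: 8000
          weight: 30
```

Because both pools expose the same OpenAI-compatible vLLM API, clients need no changes when the weights are rebalanced.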