Amazingly Fast and Incredibly Scalable Inference with NVIDIA’s Dynamo and TensorRT-LLM
Sponsored Session: Amazingly Fast and Incredibly Scalable Inference with NVIDIA’s Dynamo and TensorRT-LLM - Harry Kim & Laikh Tewari, NVIDIA

The explosive growth of large language models demands new techniques for handling inference efficiently at scale. In this talk we’ll show how you can easily extend your PyTorch programs with NVIDIA’s Dynamo and TensorRT-LLM to run faster and scale wider. Dynamo is a low-latency inference framework that works with any inference engine, including TensorRT-LLM, vLLM, and SGLang. TensorRT-LLM is an open-source library for state-of-the-art LLM inference, recently rebuilt on native PyTorch. We’ll describe how Dynamo and TensorRT-LLM use the latest inference techniques, such as disaggregated serving, large-scale expert parallelism, and helix parallelism, and show their performance on the latest NVIDIA hardware.
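To ground the abstract’s claim that TensorRT-LLM is now built on native PyTorch, here is a minimal sketch of what driving it from Python looks like using TensorRT-LLM’s high-level LLM API. The model checkpoint, prompt, and sampling settings are placeholder assumptions for illustration, not values from the talk, and the talk itself covers far more (Dynamo, disaggregated serving, expert and helix parallelism) than this sketch shows.

# Minimal sketch of TensorRT-LLM’s PyTorch-native LLM API.
# The checkpoint and sampling settings below are placeholder
# assumptions, not values from the session.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Load a Hugging Face checkpoint; TensorRT-LLM handles batching
    # and kernel selection under the hood.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["What makes LLM inference hard to scale?"]
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # generate() runs batched inference and returns one result per prompt.
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()

This API deliberately mirrors the style popularized by vLLM, which is part of why Dynamo can sit in front of either engine (or SGLang) as a low-latency serving layer.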