LLM Inference: A Comparative Guide to Modern Open-Source Runtimes
🎥 From the MLOps World | GenAI Summit 2025, Virtual Session (October 6, 2025)

Session Title: LLM Inference: A Comparative Guide to Modern Open-Source Runtimes
Speaker: Aleksandr Shirokov, Team Lead MLOps Engineer, Wildberries
Talk Track: LLMs on Kubernetes

Abstract: Deploying large language models at scale isn't one-size-fits-all. In this technical deep dive, Aleksandr Shirokov shares how the Wildberries AI team built and battle-tested a production-grade LLM serving platform using vLLM, Triton TensorRT-LLM, Text Generation Inference (TGI), and SGLang. You'll get a detailed look at their custom benchmarking setup, the trade-offs across runtimes, and when each framework makes sense depending on model size, latency targets, and workload patterns.

The talk also covers:
• Implementing HPA for vLLM and reducing cold start times with Tensorizer
• Co-locating multiple vLLM models per pod to save GPU memory
• Using SAQ-based queue wrappers for fair and efficient request handling (a simplified sketch of the fairness idea appears below)
• Wrapping endpoints with Kong for per-user rate limits, token quotas, and observability

Finally, Aleksandr shares insights from running DeepSeek R1-0528 in production, maintaining flexibility while keeping cost and complexity under control.

What you'll learn:
• Why there is no single best LLM serving stack
• How to benchmark, deploy, and optimize multiple runtimes effectively
• The trade-offs between frameworks such as vLLM, TGI, Triton, and SGLang
• How to design an LLM inference setup that fits your use case
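For context on the queue-wrapper item above, the sketch below shows one way to impose per-user fairness and bounded concurrency in front of an inference endpoint using plain asyncio. It is an illustrative assumption only, not the SAQ-based implementation described in the talk: the call_vllm helper, the user IDs, and the concurrency limit are hypothetical placeholders.

```python
import asyncio
from collections import defaultdict

# Hypothetical stand-in for a real call to an OpenAI-compatible vLLM endpoint.
# The talk's implementation is SAQ-based; this sketch only illustrates the fairness idea.
async def call_vllm(prompt: str) -> str:
    await asyncio.sleep(0.1)  # simulate inference latency
    return f"completion for: {prompt!r}"

class FairDispatcher:
    """Round-robin over per-user queues so one heavy user cannot starve others."""

    def __init__(self, max_concurrency: int = 4):
        self.queues: dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def submit(self, user_id: str, prompt: str) -> str:
        # Each request waits on its own future until a worker completes it.
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        await self.queues[user_id].put((prompt, future))
        return await future

    async def run(self) -> None:
        # Visit users round-robin; snapshot the dict so new users can join mid-loop.
        while True:
            for queue in list(self.queues.values()):
                if queue.empty():
                    continue
                prompt, future = await queue.get()
                asyncio.create_task(self._handle(prompt, future))
            await asyncio.sleep(0.01)  # yield so submitters can enqueue

    async def _handle(self, prompt: str, future: asyncio.Future) -> None:
        async with self.semaphore:  # bound concurrent requests to the runtime
            future.set_result(await call_vllm(prompt))

async def main() -> None:
    dispatcher = FairDispatcher(max_concurrency=2)
    worker = asyncio.create_task(dispatcher.run())
    results = await asyncio.gather(
        dispatcher.submit("alice", "hello"),
        dispatcher.submit("bob", "hi"),
        dispatcher.submit("alice", "another request"),
    )
    print(results)
    worker.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```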
