AI Oct 22, 2025

Efficient Inference Serving with Kubernetes Gateway API Inference Extension

Sponsored Session: Lightning Talk: Efficient Inference Serving with Kubernetes Gateway API Inference Extension - Lin Sun, Solo.io & CNCF

Large Language Models (LLMs) have become the backbone of modern AI applications, but serving them efficiently and reliably at scale remains a major challenge. Traditional approaches to API serving, built around HTTP or gRPC traffic, do not adequately address the unique requirements of LLM inference workloads, such as variable prompt lengths, dynamic compute needs, and differences in model size and efficiency. In this session, we introduce an efficient and scalable stack for inference serving built on vLLM and the Kubernetes Gateway API Inference Extension. By combining vLLM’s optimized serving runtime with the smart routing of the Gateway API, we demonstrate how to achieve scalable and resource-efficient serving across heterogeneous LLMs in Kubernetes.
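To make the abstract's stack concrete, the sketch below shows how a set of vLLM pods might be exposed through the Gateway API Inference Extension's CRDs. It is an illustrative config fragment, not an official manifest from the talk: the `vllm-llama-*` names, labels, and model name are assumptions, and field names should be checked against the version of the extension you have installed.

```yaml
# InferencePool groups a set of vLLM model-server pods behind one routable backend.
# Names and label values here are hypothetical; verify field names against the
# gateway-api-inference-extension release you deploy.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool          # hypothetical pool name
spec:
  selector:
    app: vllm-llama              # matches the labels on the vLLM Deployment's pods
  targetPortNumber: 8000         # vLLM's default OpenAI-compatible serving port
  extensionRef:
    name: vllm-llama-epp         # endpoint-picker extension that performs the smart routing
---
# InferenceModel maps a requested model name onto the pool, with a criticality hint
# the router can use when balancing load across heterogeneous backends.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-chat
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct   # assumed model; any served model name works
  criticality: Critical
  poolRef:
    name: vllm-llama-pool
```

An HTTPRoute can then reference the InferencePool as a backend, so the gateway selects a replica per request based on load rather than plain round-robin.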