Arctic Inference: Breaking the Speed-Cost Tradeoff in LLM Serving
Sponsored Session - Aurick Qiao, Snowflake

Arctic Inference is an open-source extension to vLLM, developed by Snowflake AI Research, that delivers state-of-the-art optimizations for large language model (LLM) inference. It introduces techniques such as SwiftKV, Ulysses, Shift Parallelism, Suffix Decoding, and Arctic Speculator to significantly improve inference speed and efficiency. Across practical workloads, Arctic Inference achieves up to 3.4x faster time-to-first-token (TTFT), 1.7x higher throughput, and 1.75x faster time-per-output-token (TPOT) compared to baseline open-source solutions. Its modular, pluggable design means developers using vLLM can adopt it without code changes and immediately benefit from the performance gains. This talk will delve into the technical innovations behind Arctic Inference and show you how to unlock these open-source optimizations in your own vLLM workflows.
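
To make the "no code changes" claim concrete, here is a minimal sketch of what adoption could look like. It assumes arctic-inference has been installed alongside vLLM and registers its optimizations through vLLM's plugin mechanism, so the standard vLLM API below is untouched; the model name, prompt, and sampling settings are illustrative placeholders, not values from the talk.

```python
# Sketch: adopting Arctic Inference with an unchanged vLLM workflow.
# Assumption: arctic-inference is installed next to vLLM and hooks in
# through vLLM's plugin system, so no application code needs to change.
from vllm import LLM, SamplingParams

# Standard vLLM offline-inference entry point; under the stated
# assumption, Arctic Inference's optimizations apply behind this API.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(
    ["Explain the speed-cost tradeoff in LLM serving."], params
)
print(outputs[0].outputs[0].text)
```

The point of the sketch is that the serving code stays plain vLLM; check the Arctic Inference documentation for the exact installation steps and for any flags that enable individual features such as Suffix Decoding or Shift Parallelism.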