AI Oct 22, 2025

Sponsor Session: Low-Precision Inference without Quality Loss: Selective Quantization and Microscaling
Pankaj Gupta & Philip Kiely, Baseten

Everyone wants faster inference, but no one wants to compromise the quality of their model outputs. FP8 quantization offers 30-50% lower latency for inference on large models, but it must be applied carefully to maintain output quality. Recently, NVIDIA Blackwell GPUs introduced new microscaling number formats (MXFP8, MXFP4, NVFP4) and new kernel options for low-precision inference. In this talk, Baseten inference engineers will cover practical applications of quantization to quality-sensitive inference tasks, focusing on selecting which parts of the inference system to quantize (weights, activations, KV cache, attention) and on how microscaling number formats help preserve dynamic range.
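The dynamic-range argument behind microscaling is easy to see in miniature. Below is a minimal NumPy sketch, not Baseten's implementation and a deliberately simplified rounding model, contrasting a single per-tensor FP4 scale with MX-style block scaling. Per the OCP MX spec, MXFP4 shares one power-of-two scale across each 32-element block (NVFP4 instead uses 16-element blocks with an FP8 scale); the synthetic data and error metric here are illustrative assumptions.

```python
import numpy as np

# FP4 (E2M1) magnitudes; MXFP4 and NVFP4 both store values on this grid.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_fp4(x):
    """Round each element to the nearest representable FP4 value."""
    sign = np.sign(x)
    mag = np.minimum(np.abs(x), FP4_GRID[-1])
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    return sign * FP4_GRID[idx]

def quant_per_tensor(x):
    """One FP32 scale for the whole tensor: the outliers set the scale."""
    scale = np.abs(x).max() / FP4_GRID[-1]
    return round_to_fp4(x / scale) * scale

def quant_mx(x, block=32):
    """MX-style microscaling: a shared power-of-two (E8M0-like) scale
    per 32-element block, so each block uses the full FP4 grid."""
    xb = x.reshape(-1, block)
    bmax = np.maximum(np.abs(xb).max(axis=1, keepdims=True), 1e-30)
    scale = 2.0 ** np.ceil(np.log2(bmax / FP4_GRID[-1]))
    return (round_to_fp4(xb / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 4096)
x[::512] *= 100.0  # a few activation-style outliers

for name, q in [("per-tensor FP4", quant_per_tensor(x)),
                ("MXFP4-style", quant_mx(x))]:
    rel_err = np.abs(q - x).mean() / np.abs(x).mean()
    print(f"{name:<15} mean relative error {rel_err:.3f}")
```

With a single per-tensor scale, the handful of outliers stretches the scale so far that most small values round to zero, and the sketch should show a correspondingly large error; the per-block scales keep each block on the usable part of the FP4 grid. The same reasoning carries over to MXFP8 and to the talk's broader question of which tensors (weights, activations, KV cache, attention) can tolerate which precision.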