The Future Is Tiled: Using CuTile and TileIR To Write Portable, High-performance GPU Kernels - Jared Roesch, NVIDIA

While tools like OpenAI Triton and PyTorch Inductor now let Python programmers write high-performance kernels with ease, it remains a challenge to balance algorithmic abstraction against evolving hardware and performance optimization. In this talk we’ll describe cuTile, a Python DSL that lets you author fast, portable CUDA kernels, and TileIR, a new virtual ISA for NVIDIA GPUs that cuTile targets via a TileIR MLIR dialect. cuTile brings CUDA kernel programming in Python closer to PyTorch or NumPy, building on TileIR to make it easier to innovate on new high-performance programming abstractions. TileIR is an array-based sibling abstraction to PTX that provides both forward compatibility across hardware generations and performance from architecture-specific features such as tensor cores, without time-consuming rewrites. We’ll show you how to write and use cuTile kernels and target TileIR, present performance results, and present contributions such as our TileIR backend for Torch Inductor. cuTile and TileIR were introduced at GTC 2025. cuTile will be open-sourced and open to community contributions, and TileIR will be released as part of the CUDA Toolkit.
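
For readers new to the tile-based style the abstract refers to, here is a minimal sketch in plain NumPy. It is not cuTile code and does not use any cuTile or TileIR API; the function name `tiled_matmul` and the `tile` parameter are hypothetical, chosen only for illustration. It shows the general idea of expressing a kernel as operations on whole array tiles rather than on individual elements or threads.

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Illustrative tile-level matmul in plain NumPy (NOT the cuTile API)."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    # Each (i, j) pair plays the role of one independent block of work.
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Per-tile accumulator; edge tiles may be smaller than `tile`.
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=c.dtype)
            for p in range(0, k, tile):
                # Load one tile of each operand and multiply-accumulate.
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            # Store the finished output tile.
            c[i:i + tile, j:j + tile] = acc
    return c
```

In a tile-oriented GPU DSL, the two outer loops would typically map onto a grid of parallel blocks, and the compiler, rather than the programmer, would decide how each tile operation uses threads and hardware units such as tensor cores.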
