Oct 22, 2025

Mojo + PyTorch: A Simpler, Faster Path To Custom Kernels

Spenser Bauman, Modular

PyTorch offers multiple ways to integrate custom kernel implementations, especially when working with CUDA. While this flexibility is powerful, it leads to fragmentation across tools, build systems, and APIs. Developers often run into long compile times, complex toolchain requirements, and ABI challenges that make custom ops harder to maintain and distribute.

This talk explores a new workflow for writing custom ops in PyTorch using Mojo, a high-performance systems language for AI. It shows how Mojo can be used to define custom kernels that integrate with PyTorch through Python, avoiding the need for C++, CUDA, or complex build tools. The approach offers a straightforward and portable way to develop high-performance custom ops.

We'll walk through:

- A look at the current landscape of PyTorch custom kernel integration
- How Mojo improves ergonomics and speeds up development
- Using Mojo-based ops in eager mode and with torch.compile (see the sketches after this abstract)
- Examples of accelerating inference with Mojo
- Extending to training by implementing the backward pass in Mojo (also sketched below)

This talk is for PyTorch developers who want more control over performance without the overhead of traditional CUDA development.
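To make the Python-integration point concrete, here is a minimal sketch of how a custom op can be registered from pure Python using PyTorch's documented torch.library.custom_op API (PyTorch 2.4+), with no C++ extension or build step. The mojo_demo namespace and the grayscale kernel are illustrative assumptions, not the talk's actual code; the kernel body is a pure-PyTorch stand-in for the point where a Mojo-compiled kernel would be invoked.

```python
import torch

# Sketch: register a Python-callable kernel as a first-class PyTorch op.
# In the workflow the talk describes, the body below would dispatch to a
# Mojo-compiled kernel; the pure-PyTorch math here is a hypothetical stand-in.

@torch.library.custom_op("mojo_demo::grayscale", mutates_args=())
def grayscale(pic: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for a call into a Mojo kernel. Here: standard
    # luma weights applied to an (..., 3, H, W) RGB tensor.
    r, g, b = pic.unbind(dim=-3)
    return (0.2989 * r + 0.5870 * g + 0.1140 * b).contiguous()

# A "fake" (meta) implementation tells the compiler stack the output
# shape/dtype without running the kernel, so the op can be traced.
@grayscale.register_fake
def _(pic):
    return pic.new_empty(pic.shape[:-3] + pic.shape[-2:])
```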
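Continuing the sketch above, an op registered this way is callable in eager mode like any built-in, and the register_fake entry is what lets torch.compile trace through it rather than graph-breaking. This usage snippet assumes the grayscale op from the previous block.

```python
import torch

x = torch.rand(3, 64, 64)

# Eager mode: call the op directly, like any built-in.
y = grayscale(x)  # also reachable as torch.ops.mojo_demo.grayscale(x)

# torch.compile: the fake implementation registered earlier lets Dynamo
# trace the custom op and compile the surrounding graph around it.
compiled = torch.compile(lambda t: grayscale(t).mean())
loss = compiled(x)
```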
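The training bullet maps onto PyTorch's register_autograd hook for custom ops. In the talk's workflow the backward kernel itself would also be written in Mojo; in this sketch the backward function is simply the analytic gradient of the stand-in grayscale math above, which is all assumption beyond the documented PyTorch API.

```python
import torch

def setup_context(ctx, inputs, output):
    # grayscale needs no saved tensors; its gradient is input-independent.
    pass

def backward(ctx, grad_out):
    # d(out)/d(pic[..., c, :, :]) is the per-channel luma weight, so the
    # input gradient is grad_out broadcast across a weighted channel axis.
    weights = grad_out.new_tensor([0.2989, 0.5870, 0.1140])
    return grad_out.unsqueeze(-3) * weights.view(3, 1, 1)

grayscale.register_autograd(backward, setup_context=setup_context)

# Training-style usage: gradients now flow through the custom op.
x = torch.rand(3, 8, 8, requires_grad=True)
grayscale(x).sum().backward()
```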