Why Is ML on Kubernetes Hard? Defining How ML and Software Diverge
🎥 Recorded live at the MLOps World | GenAI Summit 2025 — Austin, TX (October 8, 2025)

Session Title: Why Is ML on Kubernetes Hard? Defining How ML and Software Diverge

Speakers:
• Donny Greenberg, Co-Founder & CEO, Runhouse
• Paul Yang, Member of Technical Staff, Runhouse

Talk Track: ML Training Lifecycle

Abstract:
Why is machine learning on Kubernetes so much harder than software deployment? In this session, Donny Greenberg and Paul Yang from Runhouse dissect the reasons ML engineers still face friction where software engineers have streamlined workflows. They trace the evolution of early ML platforms—from Facebook’s FBLearner to modern orchestration tools—and explain how these reference implementations shaped today’s infrastructure pain points. You’ll learn the key ways ML diverges from software engineering, including GPU dependencies, the lack of local testing, distributed framework heterogeneity (Ray, Spark, PyTorch, TensorFlow, Dask), and the challenges of observability at scale. Finally, they introduce Kubetorch — a new Kubernetes-native compute platform that bridges the gap between iterative, debuggable Python APIs and Kubernetes-first scalable execution, bringing the ergonomics of platform engineering to ML teams.

What you’ll learn:
• Why ML on Kubernetes remains complex despite mature tooling
• How ML workflows differ fundamentally from traditional software engineering
• Lessons from the evolution of ML platforms like FBLearner and Kubeflow
• How Kubetorch provides a clean abstraction for scalable, iterative ML development
• Why ML teams need better platform engineering, not just DevOps
