DevOps Jun 24, 2019

Minimizing GPU Cost for Your Deep Learning on Kubernetes

Minimizing GPU Cost for Your Deep Learning on Kubernetes - Kai Zhang & Yang Che, Alibaba

More and more data scientists run their NVIDIA GPU-based deep learning tasks on Kubernetes. Meanwhile, it has been found that over 40% of GPU cost is wasted on idle GPUs in the cluster, so an important challenge is how Kubernetes can help improve GPU usage efficiency. In this talk we will introduce a GPU sharing solution on native Kubernetes and discuss its design and implementation details. Key topics include:

- How to define a GPU sharing API
- How to schedule shared GPUs in a Kubernetes cluster without changing the scheduler's core code
- How to integrate a GPU isolation solution with Kubernetes

A demo will show how TensorFlow users can run different jobs on the same GPU device in a Kubernetes cluster. In practice, the solution remarkably improves overall GPU usage, especially for AI model development, debugging, and inference services.
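The abstract itself contains no code, but a minimal sketch may help illustrate the idea behind the demo: when several TensorFlow jobs share one physical GPU, each process caps its own GPU memory so they can coexist on the same device, while the Kubernetes-level scheduling and isolation described in the talk decide which pods land on which GPU. The memory fraction and function name below are illustrative assumptions, not taken from the talk; the snippet uses the TensorFlow 1.x session API that was current in 2019.

```python
# Sketch (assumption, not from the talk): a TensorFlow 1.x job that claims
# only a fraction of the GPU's memory so another job can share the device.
import tensorflow as tf

def run_small_job(memory_fraction=0.4):
    """Run a tiny computation while using only part of the GPU's memory."""
    config = tf.ConfigProto(
        gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=memory_fraction)
    )
    with tf.Session(config=config) as sess:
        x = tf.random_normal([1024, 1024])
        y = tf.matmul(x, x)              # stand-in for a real training step
        print(sess.run(tf.reduce_mean(y)))

if __name__ == "__main__":
    run_small_job()
```

Launching two such processes with complementary memory fractions lets them share a single GPU; the scheduler extension discussed in the talk is what makes Kubernetes aware of that sharing so it can place pods accordingly.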

https://sched.co/Nrnk