DevOps Jun 24, 2019

Managing Large-Scale Kubernetes Clusters Effectively and Reliably

Managing Large-Scale Kubernetes Clusters Effectively and Reliably - Yong Zhang & Zhixian Lin, Ant Financial

As the business grows, we need to deploy Kubernetes in several data centers around the world, with more than ten thousand Nodes in a single data center. The critical challenge we face is how to manage several large-scale Kubernetes clusters across data centers efficiently and reliably. In this talk, we will share our experience and practices in automating large-scale cluster management. First, we will introduce fully automated Node lifecycle management, and how to automatically discover and recover from Node failures using NPD (Node Problem Detector), Autoscalers, and a customized Operator. Then we will share our experience and solutions for Kubernetes cluster deployment and upgrades. Finally, we will present our risk prevention and control system based on Prometheus and an Operator, which is the cornerstone of reliability thanks to its automatic fault detection and isolation.

https://sched.co/NroZ