Improving a Distributed System Post-Incident
This talk was presented at Failover Conf on April 21, 2020.
In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems because of their complexity and scale, and incidents are a fact of life with them.
This year, my team faced a week-long incident in our IP address management system that impacted our customers. From this incident, we had had …