
Improving a Distributed System Post-Incident

This talk was presented at Failover Conf on April 21, 2020.

In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems because of their complexity and scale, and incidents are a fact of life with them.

This year, my team faced a week-long incident in our IP address management system, which impacted our customers. From this incident, we had …
