The Sounds of Silence: Lessons from an 18 hour API outage
by Paul Zaich
Sometimes an application behaves “normally” by the strict definition of its HTTP status codes while, under the surface, something is terribly wrong. In 2017, Checkr’s most important API endpoint went down for 12 hours without detection. In this talk I’ll walk through the incident and how we responded (what went well and what could have gone better), then explore how we’ve since hardened our systems with simple monitoring patterns.