Outage Post-Mortem – May 30, 2013
On PagerDuty’s realtime engineering team, one of our top concerns is designing and implementing our systems for high availability and reliability. On May 30, 2013 we had a brief outage that degraded our alerting reliability. This post summarizes what happened and what we are doing to ensure it doesn’t happen again.
On May 30, 2013 at 22:50 UTC, our on-call engineers were paged due to an issue in the Linode Fremont datacenter. This particular datacenter was experiencing network latency issues, as verified by Linode on their status page about 6 minutes later.
As a result of this issue, some of our backup worker processes started automatically. The backup worker processes handle sending notifications from our various notification queues, specifically to pick up the slack from workers that are offline.
Unfortunately, these processes had poor error handling. Error rates were, of course, higher than normal during the datacenter outage, and the mishandled errors delayed the processing of some notifications. Over the outage window, 7% of outgoing alerts were delayed an unacceptable amount of time. All notifications were ultimately delivered, and none were lost.
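To illustrate the class of bug involved: a queue worker that lets an unhandled send error escape its processing loop can stall the entire queue, while one that retries with backoff and defers persistently failing messages keeps draining it. This is a minimal sketch with hypothetical names (`process_queue`, `send`), not PagerDuty’s actual worker code.

```python
import time


def process_queue(queue, send, max_retries=3):
    """Drain a notification queue, retrying failed sends with backoff
    and deferring persistent failures, so one bad message (or a burst
    of errors during an outage) cannot stall the whole queue.
    Illustrative sketch only; names are hypothetical."""
    delivered, deferred = [], []
    for message in queue:
        for attempt in range(max_retries):
            try:
                send(message)
                delivered.append(message)
                break
            except IOError:
                # Back off briefly, then retry. An unhandled error here
                # is the kind of thing that can stall a worker entirely.
                time.sleep(0.01 * 2 ** attempt)
        else:
            # Defer rather than block: re-enqueue for a later pass.
            deferred.append(message)
    return delivered, deferred
```

The key design point is that a failure affects only the message being processed, never the loop itself.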
- At 22:50 UTC, our on-call engineers are alerted to network connectivity problems in the Linode Fremont datacenter.
- At 22:56 UTC, Linode confirms network connectivity issues in their Fremont datacenter.
- At 23:07 UTC, our engineers notice that backup worker processing on one of our notification queues has stalled, so this process is restarted manually.
- At 23:14 UTC, processing of notifications is back to normal.
- At 23:30 UTC, Linode confirms that the network connectivity problems are fixed.
How we’re fixing it
The bug that we encountered during this particular outage has been fixed. While we do test all of our code extensively, this particular bug was missed. Because this code path only becomes critical in the event of a datacenter outage, we weren’t able to catch the problem until it revealed itself in our production environment.
We are going to do a better job at testing code that runs in exceptional situations. Designing systems to handle datacenter failures isn’t enough on its own: we have to continuously test that they’re functioning as designed.
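One way to exercise a failure-only code path without waiting for a real outage is to inject the failure in a test. The sketch below (hypothetical names; `fetch_with_fallback` is not our actual code) uses a mock that raises a connection error to force the backup path to run:

```python
from unittest import mock


def fetch_with_fallback(primary, backup):
    """Illustrative only: try a primary source, fall back to a backup
    when the primary raises. The point is that the fallback path can be
    exercised directly in tests, not just during a real outage."""
    try:
        return primary()
    except ConnectionError:
        return backup()


def test_backup_path_runs_when_primary_fails():
    # Simulate the datacenter outage: the primary dependency raises.
    primary = mock.Mock(side_effect=ConnectionError("dc down"))
    backup = mock.Mock(return_value="queued via backup")
    assert fetch_with_fallback(primary, backup) == "queued via backup"
    backup.assert_called_once()
```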
While we do perform controlled failure testing in production, we don’t currently do it often enough nor do we test enough failure cases. We will very soon institute a regular “Failure Friday”, where we actively try to instigate an extensive set of controlled failures. Over time, we hope to transition to using our own Chaos Monkey that will create these conditions continuously and randomly.
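A controlled-failure exercise of this kind can start as small as a script that picks one target at random and reports (or performs) the disruption. This is a hypothetical sketch of the idea, not our tooling; the target list, service names, and stop command are all illustrative, and `dry_run` reflects how a first rehearsal might only print what it would do.

```python
import random
import subprocess


def pick_failure(targets, seed=None):
    """Choose one process/host to disrupt during a controlled-failure
    exercise. Seeding makes a rehearsal reproducible."""
    rng = random.Random(seed)
    return rng.choice(targets)


def inject_failure(target, dry_run=True):
    """Stop a worker on a remote host. With dry_run=True, only return
    the command that would run, for review before a live exercise.
    Hypothetical sketch; hosts and commands are illustrative."""
    cmd = ["ssh", target["host"], "sudo", "service", target["service"], "stop"]
    if dry_run:
        return " ".join(cmd)
    subprocess.check_call(cmd)
```

Running the same selection continuously and without a seed, on a schedule, is essentially what a Chaos Monkey does.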