Outage Post Mortem – Jan 23, 2014
At PagerDuty, our customers rely on us to be highly available and reliable when their infrastructure may not be. Unfortunately, bugs sometimes surface in our software. When these incidents occur, we make sure to be transparent with the customers who may have been negatively affected. We apologize for any lapse in service and are committed to preventing these issues from recurring.
In the early hours of Thursday, January 23rd, 2014, we experienced an outage related to new functionality that was released to support our mobile apps. The condition was caused by a specific type of slow database query, whose effects were compounded by a server configuration that caused each of the slow queries to retry multiple times.
The slow queries created a high load on one of our database servers, which caused some users to experience a delay in receiving their notifications, as well as some loss of incoming events. This occurred over an 18-minute window between 1:02am and 1:20am.
How We Responded
We quickly identified the slow queries that were causing the issue and responded by manually terminating them. We followed up by rolling back the version of the mobile app in the Google Play Store for Android customers. Because the publishing process for the iOS App Store is slower, we also removed the backend functionality we had introduced the previous day. Since the 23rd, we have refactored the offending code, and the resulting new query is nearly three orders of magnitude faster than the original. We have also enabled a slow query killer that proactively identifies poorly performing queries and terminates them.
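For readers unfamiliar with the idea, the core of a slow query killer can be sketched in a few lines. This is an illustration only, not the tool described above: the post does not name the database or the killer used, so the sketch assumes a MySQL-style process list, and the names RunningQuery, select_queries_to_kill, and kill_statements, as well as the threshold value, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RunningQuery:
    """One row of a process list (hypothetical shape, modeled on MySQL's)."""
    query_id: int
    seconds_running: float
    sql: Optional[str]  # None for idle connections that are not running a statement


def select_queries_to_kill(
    queries: Iterable[RunningQuery], max_seconds: float
) -> List[RunningQuery]:
    """Return the active queries that have run longer than max_seconds."""
    return [
        q for q in queries
        if q.sql is not None and q.seconds_running > max_seconds
    ]


def kill_statements(queries: Iterable[RunningQuery], max_seconds: float) -> List[str]:
    """Build the statements a killer would issue (assumes MySQL's KILL QUERY syntax)."""
    return [
        f"KILL QUERY {q.query_id}"
        for q in select_queries_to_kill(queries, max_seconds)
    ]
```

In practice a loop would refresh the process list on a short interval, run the selection, and execute the resulting statements. Idle connections are skipped so that only running statements are terminated.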
Measures We’re Taking Moving Forward
We will be putting new processes in place for auditing database queries, both new and old. We generally test our code extensively, but we can improve our coverage of edge cases and of different permutations of parameters. The new processes will help formalize that by introducing query design reviews, more rigorous performance testing, and regular slow query analysis.
We will also be reviewing our server configurations to ensure that any long-running queries get automatically stopped, and that they won’t be continually retried across different servers.
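The retry behavior described here can be illustrated with a small sketch. It is a generic example under stated assumptions, not our actual server configuration: the exception types QueryTimeout and ServerUnavailable, the function run_with_bounded_retries, and the max_attempts default are all hypothetical. The point it shows is that a query that timed out is not sent to another server, while a connection failure may be retried a bounded number of times.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class QueryTimeout(Exception):
    """Hypothetical error: the query exceeded its time limit on a server."""


class ServerUnavailable(Exception):
    """Hypothetical error: the server could not be reached at all."""


def run_with_bounded_retries(
    run_query: Callable[[str], T],
    servers: Sequence[str],
    max_attempts: int = 2,
) -> T:
    """Try the query on at most max_attempts servers, never retrying a timeout."""
    last_error: Optional[Exception] = None
    for server in servers[:max_attempts]:
        try:
            return run_query(server)
        except QueryTimeout:
            # A slow query is likely to be slow everywhere, so retrying it
            # on another server only spreads the load. Fail immediately.
            raise
        except ServerUnavailable as err:
            # The query never ran here, so trying the next server is safe.
            last_error = err
    raise ServerUnavailable("no server accepted the query") from last_error
```

Separating the two failure modes is the design choice that matters: retries stay useful for unreachable servers without multiplying the cost of an expensive query.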
Lastly, we will add more monitoring to catch these types of issues sooner and will continue to refactor and modularize our backend infrastructure so that performance issues in a single system won’t affect unrelated systems.