Outage Post Mortem – June 3rd & 4th, 2014

by John Laban June 12, 2014 | 5 min read

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage caused a period of poor performance that delayed some notifications. The outage on the 4th was more severe: to recover from it, we purged in-flight data from the system, which resulted in failed notification deliveries, failure to accept incoming events through our integration API, and a significant number of delayed notifications.

We would like to apologize to all of our customers who were affected by this outage. This was a very serious issue, and we are taking steps to prevent an outage of this scale from happening again.

What Happened?

Our notification pipeline relies on a NoSQL datastore called Cassandra. This open source, distributed, highly available and decentralized datastore powers some of the most popular sites/services on the internet. Cassandra has also proven to be a complicated system that is difficult to manage and tune correctly.

On June 3rd, a repair process began running on one of the nodes in our Cassandra cluster as part of normal operation. This background repair process, which keeps stored data consistent across the cluster, puts substantial strain on Cassandra and degraded the performance of our datastore. The repair, combined with an unusually high workload being applied at the time, put the Cassandra cluster into a heavily degraded state.
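
For readers unfamiliar with why repair is so expensive: Cassandra's anti-entropy repair builds Merkle trees over each replica's data and compares them to locate inconsistent ranges, which costs both CPU (hashing) and network (tree exchange and streaming). The toy sketch below is our own illustration of the comparison idea, not Cassandra's or PagerDuty's actual code:

```python
import hashlib

def merkle_root(values):
    """Hash a list of leaf values pairwise up to a single root digest."""
    level = [hashlib.sha256(repr(v).encode()).digest() for v in values]
    while len(level) > 1:
        if len(level) % 2:               # duplicate last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def diverged_ranges(replica_a, replica_b, keys):
    """Recursively narrow down which key sub-ranges differ between replicas."""
    if merkle_root([replica_a.get(k) for k in keys]) == \
       merkle_root([replica_b.get(k) for k in keys]):
        return []                        # range already consistent: skip it
    if len(keys) == 1:
        return [keys]                    # narrowed to a single divergent key
    mid = len(keys) // 2
    return (diverged_ranges(replica_a, replica_b, keys[:mid]) +
            diverged_ranges(replica_a, replica_b, keys[mid:]))

a = {1: "x", 2: "y", 3: "z"}
b = {1: "x", 2: "STALE", 3: "z"}
print(diverged_ranges(a, b, [1, 2, 3]))  # → [[2]]: only key 2 needs repair
```

Only the divergent ranges then need data streamed between replicas, but computing the trees in the first place touches every row, which is why a repair can strain an already-busy cluster.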

To remedy the situation, our team decreased the load on the cluster, stopping the repair process as part of this effort. While this temporarily resolved the incident, the cluster oscillated between periods of stability and instability for the next six hours. We then cut communication between some of the nodes in an attempt to stabilize the cluster, and eventually normal operations resumed.

During this outage, PagerDuty’s Notification Pipeline was degraded to a point where approximately 3% of events sent to our integration API could not be received, and a small number of notifications (a fraction of 1%) experienced delayed delivery.

On June 4th, our team manually restarted the repair process that had been postponed on the 3rd. Despite disabling a substantial amount of optional system load, the repair process eventually reintroduced the previous day’s outage. Unfortunately, this second outage was much more damaging: during its course we were unable to receive 14.9% of events sent to our integration API, 27.5% of notifications were not delivered at all, and 60.9% of notifications were delayed by more than 5 minutes.

At first we attempted to repeat the previous day’s process to stabilize Cassandra, but these efforts did not have the same result. After several additional attempts to restore notification pipeline performance, we decided to take a drastic measure to regain control of the pipeline: a “factory reset” that deleted all in-flight data in the notification pipeline. This allowed the team to gradually restore service, stabilizing the pipeline and returning it to regular operation. Cassandra recovered immediately after the reset, although some of our downstream systems required manual intervention to bring their data consistent with the new “blank slate”.

Though our systems are now fully operational, we are still conducting our root cause analysis, as we need to understand why our stabilization approaches didn’t work. Fundamentally, however, we know two things: the cluster was underscaled, and it was shared among too many different services with disparate load and usage patterns.

What are we doing?

Moving forward our top priority is to make sure an outage like this does not affect our customers again. We take reliability incredibly seriously at PagerDuty and will be prioritizing projects that will help make our system more stable in the future. Here are a few of the changes we will be undertaking to prevent this type of outage from occurring again:

  • Vertically scaling the existing Cassandra nodes (bigger, faster servers)

  • Setting up multiple Cassandra clusters and distributing our workloads across them

  • Establishing system load thresholds at which we will proactively scale our existing Cassandra clusters horizontally

  • Upgrading the current and new clusters to a more recent version of Cassandra

  • Implementing further load shedding techniques to help us control Cassandra at high loads

  • Bringing additional Cassandra expertise in-house
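
On the load shedding point: the idea is to deliberately reject a fraction of incoming work before the datastore saturates, so the system degrades gradually instead of collapsing outright. A minimal sketch of one common policy follows; the class, thresholds, and names are our own hypothetical illustration, not PagerDuty’s implementation:

```python
import random

class LoadShedder:
    """Probabilistically reject work as queue depth approaches a hard limit.

    Illustrative sketch only: between the soft and hard limits, an
    increasing fraction of incoming work is shed; at the hard limit,
    everything is rejected.
    """

    def __init__(self, soft_limit, hard_limit):
        self.soft_limit = soft_limit   # start shedding above this depth
        self.hard_limit = hard_limit   # reject everything at or above this
        self.depth = 0                 # current in-flight work items

    def try_accept(self):
        if self.depth >= self.hard_limit:
            return False               # shed: the system is saturated
        if self.depth > self.soft_limit:
            # shed a growing fraction of traffic as depth nears the hard limit
            overload = ((self.depth - self.soft_limit) /
                        (self.hard_limit - self.soft_limit))
            if random.random() < overload:
                return False
        self.depth += 1
        return True

    def done(self):
        self.depth -= 1                # a work item finished

shedder = LoadShedder(soft_limit=100, hard_limit=200)
print(shedder.try_accept())  # → True while the queue is nearly empty
```

Rejecting an event at the API edge is recoverable (the sender can retry), whereas letting the datastore fall over is not, which is the trade-off this technique makes.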

One last thing needs to be mentioned: we had already decided to take some of the above actions, having noticed similar issues recently. Some of these planned improvements were already in place, but we had scheduled the rest in an order based mostly on efficiency: the Cassandra version upgrade was to happen before the vertical and horizontal scaling. Unfortunately, we ran out of time. It is now evident that the scaling had to happen first. Since the 4th, we have completed the vertical scaling and are partway through splitting our Cassandra clusters by workload and usage pattern.

If you have any questions or concerns, shoot us an email at