Outage Post-Mortem – April 13, 2013

by Baskar Puvanathasan April 24, 2013 | 4 min read

We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it. Most of this work is invisible, hidden behind the API and the user interface our customers interact with. However, when those systems fail, the failure becomes very noticeable as delayed notifications and 500s on our API endpoints. That’s what happened on Saturday, April 13, at around 8:00am Pacific Time. PagerDuty suffered an outage triggered by degradation in a peering point used by two AWS regions.

We are writing this post to let our customers know what happened, what we have learned, and what we’ll do to fix the issues uncovered by this outage.

Background

PagerDuty’s infrastructure is hosted in three different datacenters (two in AWS and another in Linode). For the past year, we’ve been rearchitecting our software with the goal of surviving the outage of an entire datacenter (including one being partitioned from the network). What we did not specifically build into our design was the ability to survive the failure of two datacenters at once. However unlikely, that is what happened on Saturday morning. We treat each AWS region as a datacenter, and with both of them failing at the same time, we weren’t able to remain available on our one remaining datacenter.

We picked our three datacenters to have no dependency amongst them, and made sure that they are physically separated. However, we have since learned that two of the datacenters shared a common peering point. This peering point experienced an outage that resulted in both of our datacenters going offline.

The Outage

Note: All times referenced below are in Pacific Time.

At 7:57am, according to AWS, a connectivity issue begins due to a degrading peering point in Northern California
At 8:11am, the PagerDuty on-call engineer is paged about an issue with one of the nodes in our notification dispatch system
At 8:13am, an attempt is made to bring back the failed node, without success
At 8:18am, our monitoring system detects a multiple-provider failure for notifications (caused by the connectivity issue). At this time, most notifications are still going through, but with increased latencies and error rates
At 8:31am, a Sev-2 is declared and more engineers are paged to help out
At 8:35am, PagerDuty completely loses its ability to dispatch notifications, as it cannot establish quorum due to high network latency. A Sev-1 is declared
At 8:53am, the PagerDuty notification dispatch system reaches quorum again and starts to process all queued notifications
At 9:23am, according to AWS, the connectivity issue at the Northern California peering point ends

During the post-mortem analysis, our engineers also determined that a misconfiguration on our coordinator service prevented us from recovering quickly. In all, PagerDuty wasn’t able to dispatch notifications for 18 minutes between 8:35am and 8:53am; however, during this time, our events API was still able to accept events.
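To make the loss of quorum at 8:35am more concrete, here is a minimal sketch of the arithmetic behind a majority quorum. It assumes a simple strict-majority rule across sites, which is how ZooKeeper ensembles decide whether they can make progress; the function names are hypothetical and this is not our actual coordinator code.

```python
def majority(n_sites: int) -> int:
    """Smallest number of sites that forms a strict majority of n_sites."""
    return n_sites // 2 + 1


def has_quorum(reachable_sites: int, n_sites: int) -> bool:
    """True if the sites that can still talk to each other form a majority."""
    return reachable_sites >= majority(n_sites)


def tolerated_failures(n_sites: int) -> int:
    """How many sites can be lost while the rest still form a majority."""
    return n_sites - majority(n_sites)
```

Under this assumption, three sites tolerate the loss of one (`tolerated_failures(3)` is 1), and a single remaining site cannot form a majority on its own (`has_quorum(1, 3)` is `False`). A site that is too slow to respond within the timeout counts as unreachable for this purpose, which is why high latency alone can be enough to lose quorum.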

What we’re going to do

As always with major outages, we learn something new about deficiencies in our software. These are some of our plans to rectify the discovered issues.

Short term

During our analysis, we found that we didn’t have adequate logging to debug issues within some of our systems. We have now added more logging and started to aggregate the logs into a single source for better searchability.
During the outage, most of the failed coordinator processes were restarted manually. We are going to add a process watcher to restart such processes automatically.
We also found that we didn’t have good visibility into the inter-host connectivity. We’ll be building a dashboard that shows this.
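As an illustration of the process watcher idea in the list above, here is a minimal sketch of a supervisor loop that restarts a command whenever it exits with a failure. The function name, parameters, and backoff policy are assumptions made for this example, not a description of the tool we will deploy; in practice an existing supervisor such as systemd, supervisord, or monit would likely do this job.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, backoff_seconds=1.0):
    """Run cmd and restart it each time it exits with a non-zero status.

    Returns the number of restarts performed. The loop stops when the
    process exits cleanly (status 0) or when max_restarts is exhausted.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)  # blocks until the process exits
        if exit_code == 0:
            return restarts  # clean exit: nothing more to do
        if restarts >= max_restarts:
            return restarts  # give up so a human can be paged instead
        restarts += 1
        # Linear backoff: wait a little longer after each failure so a
        # crash-looping process does not consume the whole host.
        time.sleep(backoff_seconds * restarts)
```

A call such as `supervise(["/path/to/coordinator"], max_restarts=5)` (the path is a placeholder) would keep the process running through a handful of crashes. Capping the restarts is a deliberate choice in this sketch: a process that keeps failing usually points to a problem that automatic restarts will not fix.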

Long term

We also found that not all of our engineering staff are up to date with Cassandra and ZooKeeper. We’ll be investing time to train our staff on both of these technologies.
We will investigate moving off one of the AWS regions. We’ll need to do our homework when picking a new hosting provider and datacenter to avoid introducing another single point of failure.
