PagerDuty Blog

Outage Post Mortem – April 14th, 2014

On April 14th, PagerDuty suffered an outage that affected customers on both the mobile and web applications. During the outage, customers may have had trouble managing their accounts, and some alerts were delayed. When incidents like this occur, we aim to be transparent with the customers who may have been affected. We apologize for any lapse in service and are committed to preventing these issues from recurring.

What Happened?

An increase in workload on our event processing system degraded the performance of its work queue. Although a workload of this size is slightly unusual, it is not unexpected and should only cause a delay in processing. However, the decreased performance led to timeouts in an upstream system with a retry-on-failure policy. The retries ultimately placed significant load on our systems, leading to availability issues for approximately 30 minutes. Although no events were lost and all alerts were sent, 39% of events were delayed beyond our 5-minute SLA during the outage.
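To illustrate how a retry-on-failure policy can turn a temporary slowdown into sustained overload, here is a toy simulation. It is only a sketch and does not model our actual systems: the function name, the tick-based FIFO queue, and every number in it are assumptions chosen for illustration. The upstream client re-sends any item that has waited a fixed number of ticks, while the original item stays in the queue.

```python
from collections import deque


def simulate(burst, arrivals, capacity, timeout, ticks, retry):
    """Return the queue depth after each tick of a toy FIFO work queue.

    All parameters are hypothetical: `burst` extra items arrive at tick 0,
    `arrivals` items arrive every tick, and the worker drains `capacity`
    items per tick. With `retry` enabled, any item that has waited exactly
    `timeout` ticks is re-sent while the original stays queued.
    """
    queue = deque()  # each entry is the tick at which the item was enqueued
    depths = []
    for now in range(ticks):
        new_items = arrivals + (burst if now == 0 else 0)
        queue.extend([now] * new_items)

        if retry:
            # Upstream times out and re-sends; the work is now duplicated.
            timed_out = sum(1 for enqueued in queue if now - enqueued == timeout)
            queue.extend([now] * timed_out)

        for _ in range(min(capacity, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


without_retries = simulate(30, 8, 10, 3, 20, retry=False)
with_retries = simulate(30, 8, 10, 3, 20, retry=True)
print("final depth without retries:", without_retries[-1])
print("final depth with retries:   ", with_retries[-1])
```

In this model the burst drains on its own when nothing is retried, because steady arrivals stay below capacity. With retries enabled, every item that waits past the timeout is duplicated, so the queue keeps growing after the burst has ended.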

How Did We Respond?

Shortly after the availability issues began, our operations and engineering teams started working to alleviate the problem. They reduced the stress on the system by removing the duplicate queued events caused by the retries, which brought our systems back to normal operation.
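As a rough illustration of what removing duplicate queued events can look like, the sketch below keeps the first copy of each event and drops later copies. It assumes each event carries a unique identifier that a retry re-sends unchanged; the `event_id` field and the function name are hypothetical and are not taken from our production code.

```python
def drop_duplicate_events(events, key=lambda event: event["event_id"]):
    """Return the events in their original order, keeping the first copy of each key.

    `event_id` is a hypothetical field; any stable identifier that a retry
    re-sends unchanged would serve the same purpose.
    """
    seen = set()
    unique = []
    for event in events:
        identifier = key(event)
        if identifier in seen:
            continue  # a retried copy of an event that is already queued
        seen.add(identifier)
        unique.append(event)
    return unique


queued = [
    {"event_id": "a1", "summary": "disk full"},
    {"event_id": "b2", "summary": "high latency"},
    {"event_id": "a1", "summary": "disk full"},  # duplicate created by a retry
]
deduplicated = drop_duplicate_events(queued)
print([event["event_id"] for event in deduplicated])
```

Keeping the first copy preserves the original ordering of events, so nothing is processed later than it otherwise would have been.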

What Are We Doing About This?

In the short term, we immediately adjusted the retry policy in the upstream system so that an expected slowdown no longer triggers a series of unwanted retries. In the long term, we have two initiatives underway that will prevent this from recurring. The first is rebalancing timeout and retry policies across the board, along with related additions such as idempotent request handling (where appropriate). The second is separating event processing from our customer-facing applications, which provides greater isolation and allows us to better manage reliability and performance.
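For readers who want a concrete picture of these two ideas, here is a minimal sketch of a bounded retry policy and an idempotent handler. It is not our actual policy: capped exponential backoff with jitter is one common way to space out retries, and the function names, class name, attempt limit, and delay values below are all assumptions made for illustration.

```python
import random
import time


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield the wait before each retry: capped exponential backoff with full jitter."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(send, max_attempts=4, sleep=time.sleep, **backoff):
    """Call send() until it succeeds or the attempt budget is spent."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return send()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                raise  # budget spent: surface the failure instead of retrying forever
            sleep(delay)


class IdempotentHandler:
    """Process each idempotency key at most once; repeats return the stored result."""

    def __init__(self, process):
        self._process = process
        self._results = {}  # in-memory only; a real service would need shared storage

    def handle(self, key, payload):
        if key not in self._results:
            self._results[key] = self._process(payload)
        return self._results[key]
```

The attempt limit bounds how much extra load a slow downstream system can receive from a single caller, and the jitter spreads retries out so they do not arrive in synchronized waves. The idempotent handler means that a retry which does get through produces no additional work.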

We apologize if this outage affected your team’s ability to receive alerts in a timely manner. As always, if you have any questions or concerns you may contact us at