As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems very seriously, and are writing this to give you full disclosure on what happened, what we did wrong, what we did right, and what we’re doing to help prevent this in the future.
We also want to let you know that we are very sorry this outage happened. We have been working hard over the past 6 months on re-engineering our systems to be fully fault tolerant. We are tantalizingly close, but not quite there yet. Read on for the full details and steps we are taking to make sure this never happens again.
PagerDuty’s main systems are hosted on Amazon Web Services’ EC2. AWS has the concept of “Availability Zones” (AZs), in which hosts are intended to fail independently of hosts in other availability zones within the same EC2 region.
PagerDuty takes advantage of these availability zones and spreads its hosts and datastores across multiple AZs. If a single AZ fails, PagerDuty can recover quickly by redirecting traffic to a surviving AZ.
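To make the idea concrete, here is a minimal sketch of picking a surviving AZ. It is an illustration only, not our actual failover code: the function name `pick_zone`, the `is_healthy` callback, and the zone names are all assumptions made for the example.

```python
def pick_zone(zones, is_healthy, preferred):
    """Return the zone that should receive traffic, or None if none is healthy.

    Hypothetical sketch: `is_healthy` stands in for whatever health check
    is really used to decide whether a zone can serve traffic.
    """
    # Stay on the preferred zone while it is healthy.
    if is_healthy(preferred):
        return preferred
    # Otherwise redirect to the first other zone that passes the check.
    for zone in zones:
        if zone != preferred and is_healthy(zone):
            return zone
    # No zone in the region is healthy: redirecting within the region cannot help.
    return None


# Illustrative zone names only.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

single_az_failure = {"us-east-1a"}
region_wide_failure = set(ZONES)

survivor = pick_zone(ZONES, lambda z: z not in single_az_failure, "us-east-1a")
nothing_left = pick_zone(ZONES, lambda z: z not in region_wide_failure, "us-east-1a")
```

In the first case the sketch returns another zone in the same region. In the second it returns `None`, which is the case that AZ-level redundancy alone cannot cover.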
However, there are situations in which all Availability Zones in a given EC2 region fail at once. In our experience, this happens roughly every 6 months. One such region-wide failure occurred early this morning, when AWS suffered internet connectivity issues across its entire US-East-1 region.
PagerDuty became inaccessible at 2:27am this morning.
Knowing that fallbacks within other AZs aren’t enough, PagerDuty has another fully functional replica of its entire stack running in another (completely separately owned and operated) datacenter. We began the procedure to flip to this replica once we had been notified of the problem with EC2 and it became obvious that EC2 was having a region-wide outage.
At 2:42am (15 minutes after the start of the outage), EC2’s US-East-1 region re-appeared, and our systems started to quickly process the backlog of incoming API and email-based events, creating a large number of outgoing notifications to our customers. At this point we aborted the flip to our fallback external notifications stack.
Fifteen minutes seems like a long time between when our outage began and when we perform our flip. And it is.
We use multiple external monitoring systems to monitor PagerDuty and alert all of us when there are issues (we can’t use PagerDuty ourselves, alas!). After careful examination, we found that the alerts from these systems were delayed by a few minutes. As a result, we responded to the outage a few minutes late.
This is obviously an action item on us to remedy as soon as possible. These minutes count. We know they are very important to you. We will look at switching or augmenting our monitoring systems as soon as possible.
Another miss on our part was not notifying all of you immediately of our outage via our emergency mass-broadcast system (see http://support.pagerduty.com/entries/21059657-what-if-pagerduty-goes-down). This was due to an internal miscommunication about when it is appropriate to use this system. We will publish another blog post shortly that details exactly how we will use this system going forward, along with a reminder of how you can register yourself for it.
We’ve previously taken steps to be able to mitigate these large-scale EC2 events when they happen.
One such step is the very existence of our externally-hosted fallback PagerDuty environment. This is an (expensive) solution to this rare problem. We regularly run internal fire drills where we test and practice the procedure to flip to this environment. We will continue these drills.
Another step we’ve taken to mitigate these large-scale EC2 events is to make sure our systems can handle the very high traffic we see when a third of our customers (the ones hosted on EC2) go down at the same time. We’ve made many improvements to our systems over the past 6 months: our system now queues events quickly, intelligently sheds load under high-traffic scenarios in order to continue operating, and makes absolutely sure not to fail to page any of our customers. These systems performed very well this morning, preventing further alerting delays.
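As a rough illustration of load shedding that never drops a page, consider the sketch below. It is a simplified assumption of how such a queue could behave, not a description of our production system: the class name `SheddingQueue`, the `soft_limit` parameter, and the split into critical and non-critical work are all hypothetical.

```python
from collections import deque


class SheddingQueue:
    """Queue that rejects non-critical work once a soft limit is reached.

    Hypothetical sketch: critical items (for example, incoming events that
    must lead to a page) are always accepted; only non-critical work is shed.
    """

    def __init__(self, soft_limit):
        self.soft_limit = soft_limit
        self._items = deque()
        self.shed_count = 0

    def offer(self, item, critical):
        """Try to enqueue an item; return True if it was accepted."""
        # Shed non-critical work when the backlog is at or past the soft limit.
        if not critical and len(self._items) >= self.soft_limit:
            self.shed_count += 1
            return False
        # Critical work is accepted regardless of the backlog size.
        self._items.append(item)
        return True

    def take(self):
        """Return the oldest queued item, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


queue = SheddingQueue(soft_limit=2)
queue.offer("event-1", critical=True)
queue.offer("report-1", critical=False)
shed = not queue.offer("report-2", critical=False)  # backlog full, so this is shed
queue.offer("event-2", critical=True)               # still accepted
```

The design choice in this sketch is that the limit is soft: it bounds the optional work that piles up during a traffic spike, while anything marked critical is always queued.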
A flip, no matter how quick, involves some downtime. This leaves a sour taste in our mouths. We are working (hard!) on our internal re-architecture to fully move to a notification processing system that involves NO temporary single points of failure, even when that SPOF is “all of EC2 east”.
Our new system will use a clustered multi-node datastore deployed on multiple hosts located in multiple independent data centers with different hosting providers. The new system will be able to survive a data center outage without any flips whatsoever. That’s right, we’re going flip-less (because the word “flip” is synonymous with “outage”). We are working full-steam on building this new system and deploying it as soon as possible, while making sure we stay stable during the changeover. This re-engineering effort is fairly substantial, so stand by for a few shorter-term solutions.
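One common way for a clustered datastore to survive the loss of a data center is majority quorum. The sketch below assumes such a scheme purely for illustration; this post does not commit to a particular datastore or replication protocol, and the function name `has_quorum` and the data center labels are made up.

```python
def has_quorum(nodes_per_dc, failed_dcs):
    """Return True if the surviving nodes still form a strict majority.

    Assumes a majority-quorum cluster (an assumption for this example):
    the cluster keeps operating only while more than half of all nodes
    are reachable.
    """
    total = sum(nodes_per_dc.values())
    alive = sum(n for dc, n in nodes_per_dc.items() if dc not in failed_dcs)
    return alive >= total // 2 + 1


# Hypothetical layouts: five nodes spread over three or two data centers.
three_sites = {"dc-a": 2, "dc-b": 2, "dc-c": 1}
two_sites = {"dc-a": 3, "dc-b": 2}

survives_one_loss = has_quorum(three_sites, {"dc-a"})
survives_two_losses = has_quorum(three_sites, {"dc-a", "dc-b"})
two_site_survives = has_quorum(two_sites, {"dc-a"})
```

Under this assumption, five nodes across three sites keep a majority when any single site is lost, while the two-site layout does not survive losing its larger site. That is one reason a design like this calls for several independent data centers rather than a primary and a single backup.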
During our internal post-mortem this morning, we identified a few places where we can immediately improve the availability of our external event endpoints. These include building better redundancy into our email endpoint as well as our API endpoint. We are moving these changes to the top of the heap.
We are also taking a closer look at moving our primary systems off of AWS US-East. In the short term, we will continue to use US-East in some capacity (perhaps as a secondary provider). Longer term, we will switch all of our critical systems off of AWS altogether.
Finally, as mentioned above, we will improve our own monitoring systems. Our external web monitoring delivered alerts too slowly, and we will fix this as soon as possible. We will also improve our Twitter-based emergency broadcast procedure, which helps us announce to you when we are experiencing internal problems. Stay tuned for another blog post about this in the next few days.