On January 24, 25 and 26, 2013, PagerDuty suffered several outages. The events API, used by our customers to submit monitoring events into PagerDuty from monitoring tools, was down during the outages. Our web application, used to access and configure customer accounts, was also affected and may have been unavailable during the outages.
We’ve written this post-mortem to let you know what happened and what we’re doing to ensure it never happens again. Last but not least, we would like to apologize for this outage. While we didn’t have any single prolonged outage during this period, we strongly believe that even 2 minutes of downtime is unacceptable, and we’re working hard on improving our availability in both the short term and the long term.
The PagerDuty infrastructure is hosted in multiple data centers (DCs). The notification dispatch component of PagerDuty is fully redundant across 3 DCs and can survive a DC outage without any downtime. We’ve designed the system to use a distributed data store which doesn’t require any sort of failover or flip when an entire DC goes offline.
However, the events API, which is backed by a queuing system, still relies on our legacy database system, based on a traditional RDBMS. This system has a primary database which is synchronously replicated to a secondary host. The system also has a tertiary database which is asynchronously replicated (just in case both the primary and secondary have problems). If the primary host goes down, our standard operating procedure is to flip to the secondary host. The downside is that the flip process requires a few minutes of downtime.
Note: All times referenced below are in Pacific time.
Later on that day, we had several blips:
Throughout the day, we investigated the issue and worked on the post-mortem. As part of the investigation, we noticed a large number of invocations of a particular slow query on the database. We modified the code to stop invoking the offending query. At this point, we thought the outages were caused by this single slow query, so we believed the underlying problem was fixed as well.
We investigated the new outage and found another problematic slow query, which we fixed immediately.
At this point, we concluded that the best thing to do was to upgrade the db machine to a larger host. Engineers worked through the night to build all new db machines (primary, secondary and tertiary) on better hardware.
Around 6am, we believed the building of the new machines was complete. From 6:15am to 9:14am, we attempted to flip the database to a new primary machine a couple of times, each time unsuccessfully. Each of these attempts caused a few minutes of downtime.
At this point, we gave up on flipping to the new machine. The flip did not work because the data snapshot on the new machine had not been uploaded correctly, a mistake made by engineers who were extremely tired and burned out after working through the night on the upgrades.
After getting rest for about 12 hours, the engineers started from scratch building new db machines. The freshly rested engineers put a new primary database in place. A few hours afterwards, they also put in an upgraded secondary database and an upgraded tertiary database.
We will set up rigorous monitoring for slow queries on our data store [already done]. We will also automate the building of a new database server via Chef. The db server was one of the last components to be chef’ed in our infrastructure, and on 1/26 and 1/27, we re-built db machines by hand instead of using Chef, which was a time-consuming and error-prone process.
We have also instituted a more rigorous development process, whereby new features and changes to the code base must be vetted for database performance as part of the regular code review process [already done].
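To illustrate the kind of slow-query check described above, here is a minimal sketch. It is not our actual monitoring; the log format (pairs of query text and duration in seconds), the `find_slow_queries` name and the one-second threshold are all assumptions made for the example.

```python
from collections import defaultdict


def find_slow_queries(log_entries, threshold_seconds=1.0):
    """Group slow queries by text and rank them by total time spent.

    log_entries is a hypothetical list of (query_text, duration_seconds)
    pairs; a real slow-query log would need to be parsed into this shape.
    """
    stats = defaultdict(lambda: {"count": 0, "total_seconds": 0.0})
    for query_text, duration in log_entries:
        if duration >= threshold_seconds:
            entry = stats[query_text]
            entry["count"] += 1
            entry["total_seconds"] += duration
    # Worst offenders first: the query consuming the most total time.
    return sorted(
        stats.items(),
        key=lambda item: item[1]["total_seconds"],
        reverse=True,
    )


sample_log = [
    ("SELECT a", 0.1),
    ("SELECT b", 2.0),
    ("SELECT b", 3.0),
    ("SELECT c", 1.5),
]
slow = find_slow_queries(sample_log)
```

Ranking by total time rather than by single worst duration surfaces a query that is only moderately slow but invoked very often, which is the pattern we saw during this incident.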
We will also set up better host metrics for the database server so we can detect early on if and when we are approaching capacity and upgrade the server in an orderly way.
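As a rough sketch of what "detecting early that we are approaching capacity" could look like, the snippet below fits a straight line to daily utilization samples and estimates how many days remain before a threshold is crossed. The metric (a daily utilization percentage), the 80% threshold and the function name are hypothetical; real capacity planning would likely need more than a linear trend.

```python
def days_until_threshold(daily_usage_percent, threshold=80.0):
    """Estimate days until usage reaches threshold, using a least-squares slope.

    Returns None if there are too few samples or usage is not growing,
    and 0.0 if the latest sample is already at or above the threshold.
    """
    n = len(daily_usage_percent)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_percent) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_percent)
    ) / denom
    if slope <= 0:
        return None
    latest = daily_usage_percent[-1]
    if latest >= threshold:
        return 0.0
    return (threshold - latest) / slope


# Usage grows 2 points per day from 50% to 58%.
estimate = days_until_threshold([50, 52, 54, 56, 58])
```

An alert on this estimate dropping below some lead time (for example, the time needed to build and verify a new host) would allow an upgrade to be scheduled rather than done overnight.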
We will remove the dependency of our events API on our main RDBMS database. To give a bit more context, our events API is backed by a queue: incoming events are enqueued, and background workers process queued events. We do this so that we can properly handle and process large volumes of event traffic.
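The enqueue-then-process pattern described above can be sketched in a few lines. This is only an illustration using an in-memory queue and threads from the Python standard library; it is not our events API, and the event shape, the `run_pipeline` name and the worker count are assumptions.

```python
import queue
import threading


def run_pipeline(events, num_workers=3):
    """Enqueue events, then let background workers drain the queue."""
    work_queue = queue.Queue()
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            event = work_queue.get()
            if event is None:  # sentinel: no more work for this worker
                work_queue.task_done()
                break
            with lock:
                processed.append(event["id"])  # stand-in for real processing
            work_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    # The API side only enqueues, so it stays fast under heavy traffic.
    for event in events:
        work_queue.put(event)
    # One sentinel per worker tells each of them to stop.
    for _ in threads:
        work_queue.put(None)
    work_queue.join()
    for thread in threads:
        thread.join()
    return processed


handled = run_pipeline([{"id": i} for i in range(10)])
```

Because accepting an event is decoupled from processing it, a burst of traffic lengthens the queue instead of slowing down the submitter. The trade-off is that the queue itself must be highly available: if its backing store goes down, events cannot be accepted at all.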
Currently, this queue is reliant on our main SQL database. As explained above, this DB is fully redundant with 2 backups across 2 data centers, but requires a failover when the main (primary) db goes down.
As a result of this post-mortem, we will fast-track a project to re-architect the events API queue and workers to use our newer distributed data store. This data store is distributed across 5 nodes and 3 independent data centers, and it’s designed to survive the outage of an entire data center without requiring any failover process and without any downtime whatsoever.
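To make the "survives a data center outage" property concrete, here is a small sketch of the arithmetic. It assumes the data store needs a majority of nodes (3 of 5) to keep operating, and that the 5 nodes are placed 2, 2 and 1 across the 3 data centers; both the quorum rule and the placement are assumptions for illustration, as are the function and data center names.

```python
def survives_dc_outage(nodes_per_dc):
    """For each data center, report whether a majority remains if it is lost."""
    total = sum(nodes_per_dc.values())
    quorum = total // 2 + 1  # majority quorum: 3 of 5
    return {
        dc: (total - count) >= quorum
        for dc, count in nodes_per_dc.items()
    }


# Assumed placement: no data center holds more than 2 of the 5 nodes.
balanced = survives_dc_outage({"dc1": 2, "dc2": 2, "dc3": 1})

# Counter-example: 3 nodes in one data center means losing it loses quorum.
lopsided = survives_dc_outage({"dc1": 3, "dc2": 1, "dc3": 1})
```

Under these assumptions, every entry in `balanced` is true, while `lopsided` reports that losing `dc1` would leave only 2 of 5 nodes. The point is that no single data center may hold enough nodes to break the majority.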