On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage caused a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. To recover from it, we purged inflight data from the system, which resulted in failed notification delivery, failure to accept incoming events from our integration API, and a significant number of delayed notifications.
We would like to apologize to all of our customers who were affected by this outage. This was a very serious issue, and we are taking steps to prevent an outage of this scale from happening again in the future.
Our notification pipeline relies on a NoSQL datastore called Cassandra. This open source, distributed, highly available and decentralized datastore powers some of the most popular sites/services on the internet. Cassandra has also proven to be a complicated system that is difficult to manage and tune correctly.
On June 3rd, the repair process on one of the nodes in our Cassandra cluster started normal operation. This Cassandra background repair process, used to keep stored data consistent across the cluster, puts substantial strain on Cassandra. This impacted how well our datastore performed. The repair process, in combination with additional high workload being applied at the time, put the Cassandra cluster into a heavily degraded state.
To remedy the situation, our team decreased the load on the cluster. As part of this, the repair process was stopped. While this temporarily resolved the incident, the cluster experienced six hours of oscillating between periods of stability and instability. We then eliminated communication between some of the nodes in an attempt to stabilize the cluster, and eventually normal operations resumed.
During this outage, PagerDuty’s Notification Pipeline was degraded to a point where approximately 3% of events sent to our integration API could not be received, and a small number of notifications (a fraction of 1%) experienced delayed delivery.
On June 4th, our team manually restarted the repair process that had been postponed on the 3rd. Despite disabling a substantial amount of optional system load, the repair process eventually reintroduced the previous day’s outage to our system. Unfortunately, this subsequent outage was much more damaging: during the course of this outage we were unable to receive 14.9% of events sent to our integration API, while 27.5% of notifications were not delivered at all, and 60.9% of notifications were delayed more than 5 minutes.
At first we attempted to reproduce our process from the previous day to get Cassandra stabilized, but these efforts did not have the same result. After several additional attempts to stabilize the notification pipeline performance, it was decided to take a drastic measure to regain control of the pipeline: a “factory reset”, deleting all inflight data in the notification pipeline. This allowed the team to gradually restore service, leading to stabilization of the pipeline and a return to regular operation. Cassandra immediately recovered after the “reset”, although some of our downstream systems required manual intervention to get their data consistent with the new “blank slate”.
Though our systems are now fully operational, we are still in the process of conducting our root cause analysis, as we need to understand why our stabilization approaches didn’t work. Fundamentally, however, we know that we were underscaled, and we know that we were sharing the cluster amongst too many different services with disparate load and usage patterns.
Moving forward, our top priority is to make sure an outage like this does not affect our customers again. We take reliability incredibly seriously at PagerDuty and will be prioritizing projects that will help make our system more stable in the future. Here are a few of the changes we will be undertaking to prevent this type of outage from occurring again:
Vertically scaling the existing Cassandra nodes (bigger, faster servers)
Setting up multiple Cassandra clusters and distributing our workloads across them
Establishing system load thresholds at which we will proactively scale our existing Cassandra clusters horizontally
Upgrading the current and new clusters to a more recent version of Cassandra
Implementing further load shedding techniques to help us control Cassandra at high loads
Bringing additional Cassandra expertise in-house
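To make the load threshold and load shedding items concrete, here is a minimal sketch in Python. It is not PagerDuty's implementation: the class and function names, the priority tiers, and the threshold values are all hypothetical, and it assumes cluster load is available as a single number between 0 and 1.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical workload tiers; lower value means more important."""
    CRITICAL = 0   # e.g. notification delivery
    NORMAL = 1     # e.g. regular service traffic
    OPTIONAL = 2   # e.g. background or reporting work


@dataclass
class LoadShedder:
    """Rejects lower-priority work as load rises (thresholds are assumptions)."""
    shed_optional_at: float = 0.70
    shed_normal_at: float = 0.90

    def admit(self, priority: Priority, load: float) -> bool:
        # Critical work is always admitted, whatever the load.
        if priority == Priority.CRITICAL:
            return True
        # Normal work is shed only when load is very high.
        if priority == Priority.NORMAL:
            return load < self.shed_normal_at
        # Optional work is the first thing to be shed.
        return load < self.shed_optional_at


def needs_scale_out(load: float, threshold: float = 0.60) -> bool:
    """True once sustained load crosses the (assumed) proactive scaling threshold."""
    return load >= threshold


shedder = LoadShedder()
print(shedder.admit(Priority.OPTIONAL, load=0.75))  # optional work is shed
print(shedder.admit(Priority.CRITICAL, load=0.99))  # critical work still runs
print(needs_scale_out(0.65))                        # time to add nodes
```

The scaling threshold sits below both shedding thresholds on purpose, so that capacity is added before any work has to be dropped.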
One last thing needs to be mentioned: we had already decided to take some of the above actions, as we had noticed similar issues recently. We had made some of the planned improvements already, but we had ordered the rest mostly by efficiency, putting the Cassandra version upgrade before the vertical and horizontal scaling. Unfortunately, we ran out of time. It’s now evident that the scaling had to happen first. Since the 4th, we have completed the vertical scaling and are partway through splitting our Cassandra clusters based on workload and usage.
If you have any questions or concerns, shoot us an email at firstname.lastname@example.org.