As a member of PagerDuty’s realtime engineering team, one of my top concerns is designing and implementing our systems for high availability and reliability. On May 30, 2013, we had a brief outage that degraded our alerting reliability. This post summarizes what happened and what we are doing to ensure it doesn’t happen again.
On May 30, 2013 at 22:50 UTC, our on-call engineers were paged due to an issue in the Linode Fremont datacenter. This particular datacenter was experiencing network latency issues, as verified by Linode on their status page about 6 minutes later.
As a result of this issue, some of our backup worker processes started automatically. These backup workers send notifications from our various notification queues, picking up the slack from workers that are offline.
Unfortunately, these processes had some poor error handling. Because of the datacenter outage, error rates were higher than normal, and the way the backup workers handled those errors delayed the processing of some notifications. Over the outage window, 7% of the total outgoing alerts were delayed by an unacceptable amount of time. All notifications were ultimately delivered, and none were lost.
The bug that we encountered during this particular outage has been fixed. While we do test all of our code extensively, this particular bug was missed. Because this code path only becomes critical in the event of a datacenter outage, we weren’t able to catch the problem until it revealed itself in our production environment.
We are going to do a better job of testing code that runs in exceptional situations. Designing systems to handle datacenter failures isn’t enough on its own: we have to continuously test that they’re functioning as designed.
While we do perform controlled failure testing in production, we don’t currently do it often enough, nor do we test enough failure cases. We will very soon institute a regular “Failure Friday”, where we actively try to instigate an extensive set of controlled failures. Over time, we hope to transition to using our own Chaos Monkey that will create these conditions continuously and randomly.
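To make the idea of a controlled failure test concrete, here is a minimal sketch in Python. It is not our production code: `FlakySender`, `drain`, and `run_failure_drill` are hypothetical names invented for this illustration, and the in-memory queue stands in for a real notification queue. The sketch injects send failures at a chosen rate and checks that a worker which requeues failed items still delivers everything without letting one failure block the items behind it.

```python
import random
from collections import deque


class FlakySender:
    """Hypothetical sender that fails a configurable fraction of calls."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a drill is repeatable
        self.delivered = []

    def send(self, notification):
        # Inject a failure to simulate a degraded datacenter.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        self.delivered.append(notification)


def drain(queue, sender, max_passes=50):
    """Process the queue, requeueing failed items at the back.

    A failed item is retried on a later pass, so it never blocks
    the items queued behind it. Returns the number of passes made.
    """
    passes = 0
    while queue and passes < max_passes:
        for _ in range(len(queue)):
            item = queue.popleft()
            try:
                sender.send(item)
            except ConnectionError:
                queue.append(item)
        passes += 1
    return passes


def run_failure_drill(failure_rate, count=100):
    """Run one drill and return (sorted delivered items, passes used)."""
    queue = deque(range(count))
    sender = FlakySender(failure_rate)
    passes = drain(queue, sender)
    return sorted(sender.delivered), passes
```

A drill such as `run_failure_drill(0.5)` can then assert that every notification was delivered within the pass budget. Running the same drill at several failure rates is one way to exercise an error-handling path that otherwise only runs during a real outage.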