PagerDuty Blog


Gain Greater Context with Rich Incidents

The site is down. Alarms are going off. Before you can fix anything, you first have to understand what’s going on. And gaining context can be hard as you look across multiple systems and metrics. We’re pleased to announce Rich Incidents, a new feature for PagerDuty that helps incident responders gain additional context. Now, responders can go straight from an alert to a conference bridge, chat room, or runbook, giving them instantaneous access to each other and to any documentation they might need. Additionally, embedded graphs give more context into an incident, helping you respond faster and maintain a dependable product for your customers.


In Alerting, Announcements, Features


The Discovery of Apache ZooKeeper’s Poison Packet

ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.


In Alerting, Community, Operations Performance, Reliability


Boundary Integrates with PagerDuty

When it comes to monitoring the health of your IT system, the team at Boundary lives by the philosophy that every second counts. Rather than letting data slip through the cracks at five- or even one-minute intervals, Boundary provides real-time monitoring of servers, platforms and apps for IT and DevOps teams with one-second resolution. We’re excited to announce an integration with PagerDuty to help teams resolve infrastructure incidents even faster.


In Announcements, Partnerships


Report from ServiceNow Knowledge 15

Last week, we sponsored a booth and participated with all of ServiceNow’s awesome partners at Knowledge 15 in Las Vegas! ServiceNow is a powerful platform-as-a-service for IT teams. We heard success stories from customers who use PagerDuty to enhance their ServiceNow experience.


In Events


Best Practices in Outage Communication: Internal Stakeholders

When you’re in the middle of an outage, the last thing you want is people from all over the company constantly asking you when it’s going to be fixed. Your job is busy enough without having to play translator and communication whiz when you have more important things to be worried about. But at the same time, your outage affects people outside of your team. You can’t neglect communicating with internal stakeholders like your manager, or your CTO, or your CEO, or your marketing department, or your customer support team. You see where I’m going with this. So how do you keep your internal stakeholders informed in a timely, efficient fashion?


In Alerting, Best Practices & Insights


Introducing PagerDuty Integration for Threat Stack

We love PagerDuty and are big users ourselves. We love the ease of integration with our other platforms. We love the scheduling and overrides. We love the per-service escalation groups. We love the sound of our default alert setting, the sad trombone (Though, the more we think about it, “love” isn’t the right word on that last one. That infernal trombone wakes up our team to let us know there is trouble in the Cloud. We dread that trombone).


In Announcements, Partnerships


London Conference Wrap-Up

Last week our team went on an overseas adventure, sponsoring AWS Summit London and Puppet Camp UK. We heard over and over at AWS Summit that our international customers love our reliable multi-provider SMS, phone, push, and email alerting to over 175 countries (and growing!). Our international SMS alerts all come from local numbers in the countries we alert, so when engineers ack, they don’t incur international fees. International customers are also big fans of UTF-8 support throughout our incident pipeline, so messages in non-Western character sets render correctly.


In Alerting


Cut Your Resolution Time with AppDynamics and PagerDuty

Application Performance Monitoring (APM) systems like AppDynamics can provide incredibly rich information about what’s happening with your IT infrastructure, and can identify performance issues before they create big problems. However, this information is only as good as your ability to respond to it. PagerDuty can extend the capabilities of AppDynamics Alert & Respond policies to ensure incidents are noticed, responded to, and fixed quickly.


In Alerting, Announcements, Community, Events, Partnerships


All, None, and One: The SQL Left Join Trick

PagerDuty released Multi-User Alerting in early 2014, which allowed notifying and assigning multiple people when an incident is triggered. In addition to assigning multiple users to an incident, multi-user alerting also makes it possible for an incident to have multiple acknowledgers. This post will demonstrate the changes we made to our data model to implement multi-user alerting and the resulting sophistication added to our SQL queries to maintain their performance.


In Tech Talk


PagerDuty Recap: AWS Summit – San Francisco

Last week, PagerDuty had the pleasure of attending AWS Summit on our home turf in San Francisco. It was nothing short of epic. The Amazon Web Services crowd is definitely our niche. AWS is a critical component of the PagerDuty platform, and our three founders actually came up with the idea for PagerDuty while working at Amazon! So, naturally, we feel totally in our element surrounded by other AWS fans at these types of events.


In Events


PagerDuty Introduces Team Organization Feature

No matter what team you’re on, PagerDuty helps you resolve incidents faster. DevOps involves collaboration across multiple teams for better reliability and quality assurance. Having a central, shared tool like PagerDuty to manage incidents across the company makes that collaboration a heck of a lot simpler. Our new team organization feature makes it even easier for different teams like Operations, Development, and Customer Support to work together. Here’s how.


In Features, On-Call Life, Reliability


Webmon Joins the PagerDuty Partner Ecosystem

Today we’re announcing the integration of PagerDuty with Webmon, a website monitoring and escalation service that lets you be the first to know when an online service goes down.


In Alerting, Announcements, Community, Distributed Systems, Partnerships


PagerDuty Sponsoring AWS London, Puppet Camp London, and a Beer-Fueled Get-Together

PagerDuty is delighted to announce it’s heading to London for its first international conferences, ever. We’re proud to sponsor AWS Summit in London on Wednesday, April 15 and Puppet Camp London on Monday, April 13. We have customers in over 110 countries and we’re very excited about meeting with some of our 350+ UK customers.


In Alerting


Best Practices in Outage Communication: Incident Team

You’ve just realized that something has gone critically wrong, and you can’t fix it yourself. Particularly if you work within a collaborative DevOps environment, it’s better to get by with a little help from your friends. Effectively coordinating the incident response across subject matter experts and front-line responders is a secret to operational success that differentiates top teams. So it’s important that you have an effective and efficient way to sound the alarm, and make sure that your conversations are recorded and actionable.


In Alerting, Best Practices & Insights


PagerDuty User Group

We hosted our first user group last week at PagerDuty HQ! Not only did we gather our awesome customers and enjoy the taco bar and cervezas, but we got to learn a lot from them, share our roadmap – and our customers learned from each other, too. We really value user feedback as part of how and why we build our product. We wanted to share some key takeaways from our sessions during the event.


In Community, Events, On-Call Life


Not Enough Cat Photos? Introducing OkCats

PagerDuty alerts. Feeding a newborn gremlin. FOMO. These are the things that keep us up at night. Here at PagerDuty, we know that nothing settles the nerves like eye cuddling a fluffy, adorable cat. That’s why we’re proud to announce OkCats.


In Alerting


Flowdock and PagerDuty Integration Update

When your service goes down, there’s no time to waste. With sweaty palms and an elevated heart rate, you need to figure out what’s wrong, all while communicating your status to your users. Coordinating with your team is complex enough – there’s no room for unnecessary actions. This is where Flowdock’s new and greatly improved PagerDuty integration comes into play.


In Alerting, Announcements, Partnerships


Best Practices in Outage Communication: Customers

Outages are chaotic, and it can be difficult to figure out the best way to let your customers know what is going on. One of the first big decisions you’ll need to make is whether you’re going to respond only to people who inquire about the issue, or if you’re going to be more proactive and post updates publicly. Many of the leading technology companies have begun to transparently discuss outages with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.


In Alerting, Best Practices & Insights, Reliability