Category
Reliability

PagerDuty Customer Support & Advocacy Team Wins Stevie® Award

We are delighted to announce that our Customer Support and Advocacy team won the Silver Stevie® Award in the Customer Service Department of the Year category in the 2015 International Business Awards. The award demonstrates PagerDuty’s commitment to its customers, as evidenced by a satisfaction rating that averaged 98.3 percent throughout 2014.

The Discovery of Apache ZooKeeper’s Poison Packet

ZooKeeper, for those who are unaware, is a well-known open source project that enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.

PagerDuty Introduces Team Organization Feature

No matter what team you’re on, PagerDuty helps you resolve incidents faster. DevOps involves collaboration across multiple teams for better reliability and quality assurance. Having a central, shared tool like PagerDuty to manage incidents across the company makes that collaboration a heck of a lot simpler. Our new team organization feature makes it even easier for different teams like Operations, Development, and Customer Support to work together. Here’s how.

Will Automated Alerting Replace the NOC?

If you have a Network Operations Center (or NOC, as the kids call it), you have a skilled set of eyes monitoring your system and alerting your engineers when things go wrong. (If you have something like a NOC, such as a first-tier team that processes tickets, we’re looking at you, too.) You also probably have strict SLAs and a need for high availability at all times. You can’t waste a second when things go down. Solutions like PagerDuty that help you identify and resolve incidents faster can improve your NOC’s performance. These solutions can shave minutes off your time to detect incidents (one of our customers took 8 minutes off theirs) and can make it easier for NOC personnel to escalate to experts when needed. We’ve found five ways that our customers use PagerDuty to enhance their NOCs.

Best Practices in Outage Communication: Customers

Outages are chaotic, and it can be difficult to figure out the best way to let your customers know what is going on. One of the first big decisions you’ll need to make is whether to respond only to people who inquire about the issue or to be more proactive and post updates publicly. Many of the leading technology companies have begun to discuss outages transparently with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.

How to Ditch Scheduled Maintenance

You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled maintenance altogether. That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense. Scheduled maintenance tends to take place late […]

Who watches the watchmen?

How we drink our own champagne (and do monitoring at PagerDuty). We deliver over 4 million alerts each month, and companies count on us to let them know when they have outages. So, who watches the watchmen? Arup Chakrabarti, PagerDuty’s engineering manager, spoke about how we monitor our own systems at DevOps Days Chicago earlier […]

Blameless post mortems – strategies for success

When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for running blameless post mortems. Failure is inevitable in complex systems. While it’s tempting to find a single person to blame, according to Sidney Dekker, these failures are usually the results […]

A Disunity of Data: The Case For Alerting on What You See

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack. The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an […]

Outage Post Mortem – June 3rd & 4th, 2014

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. To recover from the outage, in-flight data from the system was purged and resulted […]