ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs laid in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.
No matter what team you’re on, PagerDuty helps you resolve incidents faster. DevOps involves collaboration across multiple teams for better reliability and quality assurance. Having a central, shared tool like PagerDuty to manage incidents across the company makes that collaboration a heck of a lot simpler. Our new team organization feature makes it even easier for different teams like Operations, Development, and Customer Support to work together. Here’s how
If you have a Network Operations Center (or NOC, as the kids call it), you have a skilled set of eyes monitoring your system and alerting your engineers when things go wrong. (If you have something like a NOC, such as a first tier team that processes tickets, we’re looking at you, too). You also probably have strict SLAs and a need for high availability at all times. You can’t waste a second when things go down. Solutions like PagerDuty that help you identify and resolve incidents faster can help you improve your Network Operations Center performance. These solutions can shave minutes off your time to detect incidents (one of our customers took 8 minutes off theirs) and can make it easier for NOC personnel to escalate to experts when needed. We’ve found five ways that our customers use PagerDuty to enhance their NOCs.
Outages are chaotic, and it can be difficult to figure out the best way to let your customers know what is going on. One of the first big decisions you’ll need to make is whether you’re going to respond only to people who inquire about the issue, or if you’re going to be more proactive and post updates publicly. Many of the leading technology companies have begun to transparently discuss outages with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.
You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled maintenance altogether. That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense. Scheduled maintenance tends to take place late […]
How we drink our own champagne (and do monitoring at PagerDuty) We deliver over 4 Million alerts each month, and companies count on us to let them know when they have outages. So, who watches the watchmen? Arup Chakrabarti, PagerDuty’s engineering manager, spoke about how we monitor our own systems at DevOps Days Chicago earlier […]
When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for running blameless post mortems. Failure is inevitable in complex systems. While it’s tempting to find a single person to blame, according to Sidney Dekker, these failures are usually the results […]
Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack. The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an […]
On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. In order to recover from the outage, inflight data from the system was purged and resulted […]
This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app performance, allowing Developers and IT Operations to deliver high performing, highly reliable, highly available mobile apps. Mobile apps are now critical for all types of businesses. Whether your company builds […]