Voices wield power. Staying silent is not an option. We must speak up and honor those who do. October is National Domestic Violence Awareness Month,…

| In DevOps, PagerDuty Life, Reliability

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in…

| In Reliability, Technology

We’re excited to share that we’re open-sourcing the tool we use to gather and transform the metrics from our managed DNS providers. We use DNSmetrics…

Incident management is a key facet of supporting applications. When working on an application, we spend the vast majority of time on its release to…

| In Announcements, Community, Reliability

We are delighted to announce that our Customer Support and Advocacy team won the Silver Stevie® Award in the Customer Service Department of the Year category in the 2015 International Business Awards. The award demonstrates PagerDuty’s commitment to its customers, as evidenced by a satisfaction rating that averaged 98.3 percent throughout 2014.

ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs laid in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.

| In Features, On-Call Life, Reliability

No matter what team you’re on, PagerDuty helps you resolve incidents faster. DevOps involves collaboration across multiple teams for better reliability and quality assurance. Having a central, shared tool like PagerDuty to manage incidents across the company makes that collaboration a heck of a lot simpler. Our new team organization feature makes it even easier for different teams like Operations, Development, and Customer Support to work together. Here’s how

| In Alerting, Reliability

If you have a Network Operations Center (or NOC, as the kids call it), you have a skilled set of eyes monitoring your system and alerting your engineers when things go wrong. (If you have something like a NOC, such as a first tier team that processes tickets, we’re looking at you, too). You also probably have strict SLAs and a need for high availability at all times. You can’t waste a second when things go down. Solutions like PagerDuty that help you identify and resolve incidents faster can help you improve your Network Operations Center performance. These solutions can shave minutes off your time to detect incidents (one of our customers took 8 minutes off theirs) and can make it easier for NOC personnel to escalate to experts when needed. We’ve found five ways that our customers use PagerDuty to enhance their NOCs.

Outages are chaotic, and it can be difficult to figure out the best way to let your  customers know what is going on. One of the first big decisions you’ll need to make is whether you’re going to respond only to people who inquire about the issue, or if you’re going to be more proactive and post updates publicly. Many of the leading technology companies have begun to transparently discuss outages with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.

| In Operations Performance, Reliability

You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled…

| In Events, Operations Performance, Reliability

How we drink our own champagne (and do monitoring at PagerDuty) We deliver over 4 Million alerts each month, and companies count on us to…

| In Operations Performance, Reliability

When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for…

| In Partnerships, Reliability

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business…

| In Reliability

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance…

| In Partnerships, Reliability

This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app…

| In Partnerships, Reliability

This is a guest blog post from Erik Näslund, Director of Disrapt. Erik is a back-end developer and operations guy. He created his first game…

| In Reliability

PagerDuty engineers are obsessed with reliability. Letting down customers when they’ve been paged is the worst. With that in mind, we’re always designing and thinking…

| In Reliability

Reliability is important to us. We even inject failure into our systems every Friday to prove it. But when it comes to sending alerts, reliability goes…

Search