Reliability

End to End (E2E) Testing Best Practices

When it comes to the applications, websites, and services we build, the end user ultimately determines whether or not the end product is successful. Even...

PagerDuty

6 min read

End-to-End Provider Testing

Product, Reliability

PagerDuty: We Are Always On

With the rapid spread of COVID-19, many companies are shifting to an entirely remote workforce. During this time, being online and available to customers, vendors,...

Tim Armandpour

3 min read

COVID-19, failure Fridays, reliability, remote work, SLA

HumanOps, Reliability, Use Cases & Solutions

Using Real-Time Operations to Save Lives

Voices wield power. Staying silent is not an option. We must speak up and honor those who do. October is National Domestic Violence Awareness Month,...

PagerDuty

5 min read

community, Impact Pricing, PD Social Impact, pd.org

DevOps, PagerDuty Life, Reliability

ChaosCat: Automating Fault Injection at PagerDuty

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in...

Eric Sigler

5 min read

automating failure, chaos cat, distributed systems, failure friday, fault injection, injecting failure, reliability

Reliability, Technology

DNSmetrics: Unified Metrics Collection From Multiple DNS Providers

We’re excited to share that we’re open-sourcing the tool we use to gather and transform the metrics from our managed DNS providers. We use DNSmetrics...

Max Timchenko

4 min read

DNS, DNSmetrics, Open Source

Features, On-Call Life, Operations Performance, Reliability

7 Benefits of Incident Management in Supporting Applications

Incident management is a key facet of supporting applications. When working on an application, we spend the vast majority of time on its release to...

Eric Jeanes

4 min read

Announcements, Community, Reliability

PagerDuty Customer Support & Advocacy Team Wins Stevie® Award

We are delighted to announce that our Customer Support and Advocacy team won the Silver Stevie® Award in the Customer Service Department of the Year category in the 2015 International Business Awards. The award demonstrates PagerDuty’s commitment to its customers, as evidenced by a satisfaction rating that averaged 98.3 percent throughout 2014.

Sam Lewis

3 min read

Alerting, Community, Operations Performance, Reliability

The Discovery of Apache ZooKeeper’s Poison Packet

ZooKeeper, for those who are unaware, is a well-known open source project that enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum. The leader election and failure detection mechanisms are fairly mature, and typically just work... until they don't. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.

Evan Gilman

15 min read

Features, On-Call Life, Reliability

PagerDuty Introduces Team Organization Feature

No matter what team you’re on, PagerDuty helps you resolve incidents faster. DevOps involves collaboration across multiple teams for better reliability and quality assurance. Having a central, shared tool like PagerDuty to manage incidents across the company makes that collaboration a heck of a lot simpler. Our new team organization feature makes it even easier for different teams like Operations, Development, and Customer Support to work together. Here’s how.

Sam Lewis

2 min read

Alerting, Best Practices & Insights, Reliability

Best Practices in Outage Communication: Customers

Outages are chaotic, and it can be difficult to figure out the best way to let your customers know what is going on. One of the first big decisions you’ll need to make is whether you’re going to respond only to people who inquire about the issue, or if you’re going to be more proactive and post updates publicly. Many of the leading technology companies have begun to transparently discuss outages with their customers, and there are a number of good business reasons for doing so. Regardless of your approach, here are 6 things you can do to ensure successful customer communication during outages.

Sam Lewis

8 min read

Operations Performance, Reliability

How to Ditch Scheduled Maintenance

You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled...

Julie Arsenault

3 min read

reliability

Events, Operations Performance, Reliability

Who watches the watchmen?

How we drink our own champagne (and do monitoring at PagerDuty). We deliver over 4 million alerts each month, and companies count on us to...

Julie Arsenault

4 min read

reliability

Operations Performance, Reliability

Blameless post mortems – strategies for success

When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for...

Julie Arsenault

5 min read

reliability

Partnerships, Reliability

A Disunity of Data: The Case For Alerting on What You See

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business...

Vivian Au

6 min read

librato, monitoring alert, monitoring analytics, monitoring signal, reliability

Reliability

Outage Post Mortem – June 3rd & 4th, 2014

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance...

John Laban

5 min read

outage, post mortem, reliability

Partnerships, Reliability

Mobile Monitoring Metrics that Matter for Reliability

This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app...

Vivian Au

3 min read

crittercism, mobile monitoring, reliability

Partnerships, Reliability

Developers Need Monitoring Too

This is a guest blog post from Erik Näslund, Director of Disrapt. Erik is a back-end developer and operations guy. He created his first game...

Vivian Au

8 min read

Alert Notifications, Monitoring, reliability

Reliability

Lessons Learned from Creating a Reliable Mobile Build

PagerDuty engineers are obsessed with reliability. Letting down customers when they’ve been paged is the worst. With that in mind, we’re always designing and thinking...

Clay Smith

9 min read

application, empire.js, javascript, mobile app, reliability
