PagerDuty: We Are Always On
With the rapid spread of COVID-19, many companies are shifting to an entirely remote workforce. During this time, being online and available to customers, vendors,...
Why Are Availability and Reliability Crucial?
In today’s digitally connected world, people expect the consumer and enterprise applications and services at their fingertips to operate seamlessly in real time, all the...
Using Postmortems to Understand Service Reliability
2017 was a year of many major outages—some took down the Internet for hours while others disrupted business workflows and communication at companies large and...
Failure Fridays: Four Years On
On June 28th, 2017, we marked four years of performing “Failure Fridays” at PagerDuty. As a quick recap, Failure Fridays are a practice we conduct...
ChaosCat: Automating Fault Injection at PagerDuty
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in...
How to Ditch Scheduled Maintenance
You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled...
Who watches the watchmen?
How we drink our own champagne (and do monitoring at PagerDuty). We deliver over 4 million alerts each month, and companies count on us to...
Blameless post mortems – strategies for success
When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for...
A Disunity of Data: The Case For Alerting on What You See
Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business...
Outage Post Mortem – June 3rd & 4th, 2014
On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance...
Mobile Monitoring Metrics that Matter for Reliability
This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app...
Developers Need Monitoring Too
This is a guest blog post from Erik Näslund, Director of Disrapt. Erik is a back-end developer and operations guy. He created his first game...
Lessons Learned from Creating a Reliable Mobile Build
PagerDuty engineers are obsessed with reliability. Letting down customers when they’ve been paged is the worst. With that in mind, we’re always designing and thinking...
End-to-End SMS Provider Testing, It's How We Ensure SMS Alerts are Delivered
Reliability is important to us. We even inject failure into our systems every Friday to prove it. But when it comes to sending alerts, reliability goes...
Outage Post Mortem – April 14th, 2014
On April 14th, PagerDuty suffered an outage that affected customers on both the mobile and web applications. During the period of the outage, customers may...
Keep Your Website Available with the Right Monitoring Practices
In its simplest form, website monitoring is the process of testing and verifying that end-users can actually use your service. There are several great...