Outage Post-Mortem
As you may already know, PagerDuty suffered an outage of 30 minutes yesterday, followed by a period of increased alert delivery times. We’re taking the downtime...
Updated on 9/21: We have replaced Twitter with our status page as a communication method. At PagerDuty we strive for 100% uptime, and it is a...
Today, at around 1am Pacific Time, Amazon began having major problems with some of its cloud infrastructure, specifically its EC2, EBS, and RDS offerings. We'd like to share some statistics on the alerts we sent out, via phone or SMS, during the outage.
This post is a quick introduction to some concepts of system availability, so that subsequent posts in this series make sense. I'll go over concepts such as availability, SLAs, mean time between failures (MTBF), and mean time to recovery (MTTR).
We've added deep linking to the incidents table. The browser will now remember all your interactions with the table as you move throughout your account or recall your bookmarks.
We’ve been hosting PagerDuty on AWS for about the last year. One of the biggest draws to the platform for us was the promise of ready-built components...