Outage Post-Mortem – May 30, 2013
As a member of PagerDuty’s realtime engineering team, one of my top concerns is designing and implementing our systems with high availability and reliability. On May 30,...
We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it. Most of this work is invisible, hidden...
On January 24, 25 and 26, 2013, PagerDuty suffered several outages. The events API, used by our customers to submit monitoring events into PagerDuty from...
You’re a techie working for one of the multitude of startups that rushed to market, where the founders hastily glued a Rails app together with candy-bar wrappers and...
A few weeks ago I had the privilege of speaking at Surge 2012 in Baltimore, MD. The audience was made up of those whose focus was on better...
This is a guest post by Connie Quach, Sr. Product Manager, responsible for the web performance products at Neustar. In today’s competitive environment, website performance...
Sometimes you just have to tinker. Experimentation, trial and error are all part and parcel of the learning experience, and the gateway to bigger and...
At PagerDuty, we usually get a front seat to anything that’s wrong with the internet. Last weekend, a derecho storm took out 7% of AWS...
On the evening of Friday, June 29th, Amazon Web Services (AWS) experienced a major outage at its Northern Virginia location due to a loss of...
We have some very exciting news for all of our customers who are running mission-critical systems on AWS in the US-East region: we have migrated...
On Thursday, June 14, starting at 8:44pm Pacific time, PagerDuty suffered a serious outage. The application experienced 30 minutes of downtime, followed by a period...
As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems very...
As a general rule, whatever percentage you think your test coverage is, it isn’t. Whatever amount of the known surface area you’re covering, there’s going...
This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known...
We support any monitoring tool that can send an email or make a JSON call, but we support tighter integration with some than others. We...
This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we...
Like pretty much everything else in Rails, optimistic locking is nice and easy to set up: you simply add a “lock_version” column to your ActiveRecord model...
This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series,...