A UTC Leap Second vs. Derecho
As we mentioned in our last post, we survived both. A few people pointed out that my graph actually made the leap second look worse than the AWS outage.
Pretty much everyone at PagerDuty is in the on-call rotation. I’m lucky: I’m unimportant enough that I only get called in when all the bells and whistles are going off, so I didn’t get called in for the leap second or for the second, minor AWS outage. From my perspective, the AWS outage was worse, but the graph makes the leap second look worse, so I’m tempted to investigate.
Incidents are a good measure of how much is breaking on the internet, but they aren’t the best measure of load on our system. Since we do a lot of work at the account level (de-duping and escalation), I took a look at how many accounts were performing an action at a time.
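The post doesn't show the actual query, so here is only a rough sketch of the idea in Python: bucket events by time and count the distinct accounts seen in each bucket. The function name, the `(timestamp, account_id)` event shape, the five-minute bucket, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def active_accounts_per_bucket(events, bucket=timedelta(minutes=5)):
    """Count distinct accounts that performed any action in each time bucket.

    `events` is an iterable of (timestamp, account_id) pairs, where the
    timestamp is a naive datetime. This event shape is hypothetical.
    """
    accounts = defaultdict(set)
    for ts, account_id in events:
        # Floor the timestamp to the start of its bucket.
        bucket_start = ts - (ts - datetime.min) % bucket
        accounts[bucket_start].add(account_id)
    return {start: len(ids) for start, ids in sorted(accounts.items())}


# Made-up sample data, not real PagerDuty traffic.
sample_events = [
    (datetime(2012, 6, 30, 23, 58), "acct-a"),  # trigger
    (datetime(2012, 6, 30, 23, 59), "acct-a"),  # acknowledgement, same account
    (datetime(2012, 6, 30, 23, 59), "acct-b"),
    (datetime(2012, 7, 1, 0, 1), "acct-c"),
]
per_bucket = active_accounts_per_bucket(sample_events)
```

Because an account is counted once per bucket no matter how many actions it performs, this measures breadth of activity rather than raw incident volume.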
These curves are a little less abrupt, since they include acknowledgements and resolutions. One clue, though: I ran the first graph with hourly resolution and the second with a finer resolution, and the curves changed a little between the two. So I ran the first query again with a finer grain and included the different alert types.
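To see why the bucket width matters, here is a small sketch (with invented data, not the numbers behind the graphs) that bins the same timestamps hourly and in five-minute buckets. A short burst gets averaged away in the hourly view but stands out in the finer one. The helper name and the synthetic burst are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_per_minute(timestamps, width):
    """Bin naive datetimes into fixed-width buckets; return {bucket_start: rate}."""
    counts = Counter(ts - (ts - datetime.min) % width for ts in timestamps)
    minutes = width.total_seconds() / 60
    return {start: n / minutes for start, n in sorted(counts.items())}


# Made-up traffic: one event per minute for two hours...
start = datetime(2012, 6, 30, 23, 0)
timestamps = [start + timedelta(minutes=m) for m in range(120)]
# ...plus a five-minute burst of 20 extra events per minute.
timestamps += [
    start + timedelta(minutes=60 + m) for m in range(5) for _ in range(20)
]

hourly = events_per_minute(timestamps, timedelta(hours=1))
fine = events_per_minute(timestamps, timedelta(minutes=5))

hourly_peak = max(hourly.values())  # burst diluted across the whole hour
fine_peak = max(fine.values())      # burst fills its entire bucket
```

With this synthetic data the five-minute view peaks at 21 events per minute, while the hourly view peaks at under 3, even though both are built from the same events.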
Bingo! Counterintuitively, the AWS spike hit faster than the leap second. This might make some sense if the leap second hit different machines at different times, as each tried to schedule events in the future, whereas the AWS outage was unexpected.
At the peak of the storm, the AWS spike was 30 times as high as the average amount of traffic, whereas the AWS echo outage and the leap second peaked at only 21 and 18 times the average, respectively. The averages are reversed: the AWS outage averaged 7 times higher over 2 hours, but the leap second spike averaged 9 times as high. Keep in mind that the “average” I’m comparing to is the average for the weekend in question, which was hardly an average weekend itself.
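The two comparisons above, peak versus baseline and sustained average versus baseline, can be sketched in a few lines of Python. The function name and the bucket counts are invented; the counts are chosen only to mirror the shape described (a tall, narrow spike versus a shorter, wider one) and are not the real data.

```python
from statistics import mean


def spike_ratios(counts, baseline_mean, window):
    """Return (peak ratio, best sustained ratio) relative to a baseline.

    `counts` are per-bucket event counts around the spike, `baseline_mean`
    is the average count per bucket over the comparison period, and
    `window` is how many consecutive buckets to average over.
    """
    peak_ratio = max(counts) / baseline_mean
    # Highest mean over any run of `window` consecutive buckets.
    best_window_mean = max(
        mean(counts[i:i + window]) for i in range(len(counts) - window + 1)
    )
    return peak_ratio, best_window_mean / baseline_mean


# Made-up counts with a hypothetical baseline of 10 events per bucket.
tall_narrow = [300, 50, 30, 20, 10, 10]   # sharp spike that fades quickly
short_wide = [180, 120, 90, 70, 50, 30]   # lower peak, slower decay

tall_result = spike_ratios(tall_narrow, baseline_mean=10, window=6)
wide_result = spike_ratios(short_wide, baseline_mean=10, window=6)
```

Here the tall, narrow series has the higher peak ratio but the lower sustained ratio, which is the same reversal the numbers above describe.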