Failure Fridays: Four Years On

On June 28th, 2017, we marked four years of performing “Failure Fridays” at PagerDuty. As a quick recap, Failure Fridays are a weekly practice at PagerDuty in which we inject faults into our production environment in a controlled way, without customer impact. They’ve been foundational in verifying our resiliency engineering efforts.

Over the years, our process has evolved, sometimes following Chaos Engineering principles, sometimes not. But the constant goal of Failure Fridays has always been to help us identify and fix problems before they impact our customers. Here are a few milestones from our journey, and some of the lessons we learned:

Timeline

2013

  • June: The first Failure Friday!

2014

  • February: Our first security-focused Failure Friday. Instead of testing a single service’s fault resiliency through isolation, reboots, and the like, we tested a variety of edge cases, such as APIs being sent invalid data, and firewall misconfiguration. This practice is still in use today, with certain Failure Fridays reserved not for injecting faults into a single service, but for testing infrastructure-wide anti-patterns.
  • April: Less than a year after we started Failure Fridays, we simulated the full failure of one of the seven Availability Zones our infrastructure operated in at the time. The first time we did this, we went a little overboard with our paranoia and spread the simulation meticulously over four separate sessions. Now we usually complete it in a single session.

2015

  • January: After 18 months and 33 sessions, we finally automated many of the manual commands from the original Failure Friday post into ChatOps-based tooling. By doing the steps manually at first, we validated and learned from them without having to spend a lot of time up front. As we grew as a company, it became more and more difficult to ramp up new folks, so we enlisted the help of our company bot.


  • January: Once we’d gotten comfortable with the idea of losing a single Availability Zone, we stepped up our game by taking out an entire Region. These still usually take us a few sessions to complete, as they always generate new learnings for us.
  • March: We realized that Failure Fridays were a great opportunity to exercise our Incident Response process, so we started using them as a training ground for our newest Incident Commanders before they graduated.
  • May: As we started to scale up the number of services and the teams maintaining them, we started keeping more formal documentation on planned faults, checklists for future sessions, outcomes of fault injections, and so on. “It’s not science unless you write it down.”
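As a rough sketch of the ChatOps idea from the January entry — a bot mapping chat commands to the steps we previously ran by hand — something like the following works; the command names and fault actions here are hypothetical illustrations, not PagerDuty’s actual bot:

```python
# Hypothetical ChatOps-style fault injection: the bot parses a chat
# message of the form "<command> <host>" and dispatches it to a handler
# that performs the fault. Command names and actions are invented here.

def isolate_host(host: str) -> str:
    # In real tooling, this might apply firewall rules to cut the host off.
    return f"isolated {host}"

def reboot_host(host: str) -> str:
    # In real tooling, this might issue a reboot via SSH or a cloud API.
    return f"rebooted {host}"

COMMANDS = {
    "isolate": isolate_host,
    "reboot": reboot_host,
}

def handle_chat_message(message: str) -> str:
    """Parse '<command> <host>' from a chat message and run the action."""
    command, _, host = message.partition(" ")
    action = COMMANDS.get(command)
    if action is None or not host:
        return f"unknown command: {message!r}"
    return action(host)
```

The point of the dispatcher shape is that each new fault type is one more entry in the table, so the bot’s vocabulary grows alongside the tooling.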


2016

  • April: Another year on, and another large-scale set of Failure Friday fault tests: we began simulating failover to our Disaster Recovery infrastructure. During normal operations, we validate our DR tooling with a small percentage of live traffic, but during these scenarios we ramp up that percentage, taking care not to impact our customers.
  • June: We introduced “Reboot Roulette” to our suite of automation: it randomly selects hosts (with weighting for different categories of hosts) to be injected with a fault (rebooting was the first of several faults we added; alliteration, of course).


  • September: At a Hackday, Chaos Cat was introduced, using all of our existing tooling to automate fault injection (at a separate time from our normal Failure Friday window).
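The weighted random selection behind something like Reboot Roulette — pick a category by weight, then a host within it — can be sketched as follows; the categories, weights, and host names are invented for illustration, not our actual configuration:

```python
import random

# Hypothetical host inventory, grouped by category. The categories,
# hosts, and weights below are illustrative, not PagerDuty's real setup.
hosts_by_category = {
    "stateless-web": ["web-1", "web-2", "web-3"],
    "worker": ["worker-1", "worker-2"],
    "database": ["db-1", "db-2"],
}

# Relative weight per category: stateless hosts get picked more often,
# stateful ones less often.
category_weights = {
    "stateless-web": 5,
    "worker": 3,
    "database": 1,
}

def pick_host(rng=random):
    """Pick a category by weight, then a host uniformly within it."""
    categories = list(hosts_by_category)
    weights = [category_weights[c] for c in categories]
    category = rng.choices(categories, weights=weights, k=1)[0]
    return rng.choice(hosts_by_category[category])

# The chosen host would then have a fault (e.g. a reboot) injected.
victim = pick_host()
```

Weighting by category rather than picking uniformly over all hosts keeps the roulette biased toward hosts that are cheap to lose, while still occasionally exercising the scarier stateful ones.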


2017

  • July: We formed an internal guild of engineers across multiple PagerDuty teams, all interested in Chaos Engineering.

Stats

Going back through our Failure Friday records, here are a few metrics from June 28th, 2013 to June 28th, 2017:

  • Failure Friday sessions: 121
  • Tickets created to fix issues identified in Failure Friday: over 200
  • Faults injected: 644
  • Fault injections that resulted in a public postmortem: 3
  • Simulated full AZ failures (disable all services in a given AZ): 4
  • Simulated full Region failures (disable all services in a given region): 3
  • Simulated partial Disaster Recovery (send all traffic to another region): 2
  • Distinct services within PagerDuty that have had faults injected: 47

Conclusions

Injecting failure and continuously improving our infrastructure has not only helped us deliver better software, but also build internal trust and empathy. Stress testing our systems and processes helps us understand how to improve our operations — and you can do it too.

