
ChaosCat: Automating Fault Injection at PagerDuty

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” — Principles of Chaos Engineering

Netflix, Dropbox, and Twilio are all examples of companies that practice this kind of engineering, and it's essential for building confidence in large, distributed systems. At PagerDuty, we've been performing controlled fault injection in our production infrastructure for several years. As time has passed and our infrastructure has grown, our Chaos Engineering practices have evolved as well. One relatively recent addition is an automated fault injector we call ChaosCat.

Background

In the beginning, the SRE team at PagerDuty deliberately chose to inject failures into our infrastructure manually, by SSHing into hosts and executing commands by hand. This gave us precise control over each fault, let us quickly investigate and learn from any issues that arose, and avoided a heavy upfront investment in tooling. It worked well for a while and allowed us to build up a library of well-understood, repeatable chaos attacks such as high network latency, high CPU usage, and host restarts.

We knew the manual approach wouldn't scale, so as time went on we automated portions of the process: first the individual commands became scripts, then the scripts were dispatched to hosts automatically instead of over SSH, and so on. Once individual teams started owning their own services at PagerDuty, this tooling enabled them to do their own fault injection without needing a central SRE team.

However, early on we had chosen to announce fault injection to individual service owners ahead of time. This meant that every Friday, those owners were at least somewhat aware of what to look for, which gave them a head start on fixing any problems.

The real world rarely gives advance notice of failure, so we wanted to introduce an element of chance by allowing a subset of attacks to be performed at random against any host. We started adding tooling to pick random hosts and run chaos attacks on them. The last piece of the puzzle was putting it all together on an automated schedule. Enter ChaosCat.

Implementation

ChaosCat is a Scala-based Slack chat bot. It builds on top of several other components of our infrastructure, such as our distributed task execution engine. It’s heavily inspired by Chaos Monkey, but more service-implementation-agnostic, as we have a variety of service types in our infrastructure.

First, it’s running as an always-on service. This means it can be used for one-off runs (@chaoscat run-once) at any time by any authorized team. In the meantime, during idle periods a schedule is checked every minute — we only want randomized failures injected during a subset of business hours when there are certain to be awake and ready on-call engineers.

Second, once inside business hours, it checks that the system status is all-clear. We don't want to inject a failure if the overall health of our service isn't 100%.
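
Conceptually, this is a small gate in front of the injector that fails closed. Here's a sketch of what such a check might look like, assuming a hypothetical status endpoint (the URL and response format are made up for illustration):

```scala
import scala.io.Source
import scala.util.Try

object HealthGate {
  // Assumption: a status endpoint whose JSON summary includes an overall "operational" flag.
  def systemAllClear(statusUrl: String = "https://status.example.com/api/summary"): Boolean =
    Try {
      val summary = Source.fromURL(statusUrl).mkString
      summary.contains("\"status\":\"operational\"")
    }.getOrElse(false) // fail closed: if health can't be confirmed, don't inject faults
}
```

Failing closed here means a flaky status check pauses chaos rather than risking an injection during an incident.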

Third, it fires off a randomly chosen chaos attack (with different attacks having different selection probabilities) at a random host within our infrastructure. No host is exempt, because in the real world all hosts are equally vulnerable to these issues. The attack itself is dispatched as a task through Blender, the distributed task execution framework mentioned above, using our in-house job runner.
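
The selection step boils down to a weighted random choice over attack types plus a uniform random choice over hosts. The sketch below shows one way to implement that; the attack names, weights, and helper names are assumptions for illustration, not ChaosCat's actual catalog.

```scala
import scala.util.Random

// Attack names and weights are illustrative; the real attacks run as Blender tasks.
final case class ChaosAttack(name: String, weight: Double)

object AttackPicker {
  private val attacks = Seq(
    ChaosAttack("high-network-latency", 0.4),
    ChaosAttack("high-cpu-usage", 0.3),
    ChaosAttack("host-restart", 0.3)
  )

  /** Pick an attack with probability proportional to its weight. */
  def pickAttack(rng: Random = new Random): ChaosAttack = {
    var roll = rng.nextDouble() * attacks.map(_.weight).sum
    attacks.find { a => roll -= a.weight; roll <= 0 }.getOrElse(attacks.last)
  }

  /** Pick a target host uniformly at random; no host is exempt. */
  def pickHost(hosts: Seq[String], rng: Random = new Random): String =
    hosts(rng.nextInt(hosts.length))
}
```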

Fourth, it waits 10 minutes, then runs steps two and three again, over and over for the rest of the scheduled window. If issues arise, anyone can stop the attacks at any time by sending @chaoscat stop.
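
Putting the steps together, the core loop is roughly: check the schedule, check system health, inject one randomized attack, sleep 10 minutes, repeat until stopped. A simplified sketch, with the gates and injection passed in as functions and a flag that an @chaoscat stop handler would flip:

```scala
import java.util.concurrent.atomic.AtomicBoolean

object ChaosLoop {
  // Flipped to true by a (hypothetical) handler for the "@chaoscat stop" command.
  val stopped = new AtomicBoolean(false)

  def runLoop(inChaosWindow: () => Boolean,
              systemAllClear: () => Boolean,
              injectRandomAttack: () => Unit): Unit =
    while (!stopped.get()) {
      if (inChaosWindow() && systemAllClear()) injectRandomAttack()
      Thread.sleep(10 * 60 * 1000) // wait 10 minutes before the next check-and-attack cycle
    }
}
```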

Learnings

Some teams quickly learned that there’s a world of difference between sitting at the ready with all of your dashboards and logs pulled up, and having something go wrong while you’re getting your morning coffee. These teams identified gaps in their run books and on-call rotations and fixed them. Success!

Another interesting finding: once teams got over their initial discomfort, they automated fixes that had previously been applied manually and properly prioritized technical debt items that had languished in their backlogs because the failures behind them had been so infrequent. This, in turn, gave those teams more confidence in their services' reliability.

Unfortunately, ChaosCat is significantly tied into our internal infrastructure tooling. For the moment this means we won’t be open-sourcing it. However, we’d love to get your feedback and questions about it, so ask away in the PagerDuty Community forums or in the comments below!

We hope that more companies start to practice this kind of reliability engineering (or, as some like to say, chaos engineering); it's a fantastic way to verify the robustness and behavior of increasingly complex and diverse infrastructure.