“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” — Principles of Chaos Engineering
Netflix, Dropbox, and Twilio are all examples of companies that perform this kind of engineering. It’s essential to have confidence in large, robust, distributed systems. At PagerDuty, we’ve been performing controlled fault injection into our production infrastructure for several years. As time has passed, and our infrastructure has grown, our Chaos Engineering practices have evolved as well. One somewhat recent addition is an automated fault injector, which we call ChaosCat.
In the beginning, the SRE team at PagerDuty specifically chose to inject failures into our infrastructure manually, via SSH’ing and executing commands per-host. This allowed us to have precise control over the fault, quickly learn and investigate issues that arose, and avoid heavy upfront investment in tooling. This worked well for a while and allowed us to build up a library of well-understood and repeatable chaos attacks such as high network latency, high CPU usage, host restarts, etc.
We knew doing things manually wouldn’t scale up, so as time went on we began to automate portions of the process. First, the individual commands were turned into scripts; then we automated dispatching them to hosts instead of SSH’ing in; and so on. Once individual teams started to own their own services at PagerDuty, this tooling enabled them to do their own fault injection without needing a central SRE team.
However, early on we had chosen to announce fault injections ahead of time to individual service owners. This meant that every Friday, those owners would be at least somewhat aware of what to look for, which gave them a head start on fixing any problems.
The real world rarely gives advance notice of failure, so we wanted to introduce the element of chance into the infrastructure, by allowing a subset of attacks to be performed at random across any host. So we started adding additional tooling to pick random hosts and perform chaos attacks on them. The last piece of the puzzle was putting it all together on an automated schedule. Enter ChaosCat.
ChaosCat is a Scala-based Slack chat bot. It builds on top of several other components of our infrastructure, such as our distributed task execution engine. It’s heavily inspired by Chaos Monkey, but more service-implementation-agnostic, as we have a variety of service types in our infrastructure.
First, it runs as an always-on service, so any authorized team can trigger a one-off run (@chaoscat run-once) at any time. Otherwise, while idle, it checks a schedule every minute: we only want randomized failures injected during a subset of business hours, when on-call engineers are certain to be awake and ready.
Second, if the current time falls within those business hours, it checks whether the system status is all-clear. We don’t want to inject a failure if the overall health of our service isn’t 100%.
Third, it fires off a randomly chosen chaos attack (with different attacks having different selection probabilities) at a random host within our infrastructure (no exemptions allowed, as all hosts are equally vulnerable to these issues in the real world). It sends a task to run the chaos attack via the Blender execution framework, using our in-house job runner.
Fourth, it waits 10 minutes and then runs steps two and three again, over and over during a subset of scheduled business hours. If issues arise, attacks can always be stopped by anyone by sending @chaoscat stop.
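The decision made on each pass through steps one to three can be sketched roughly as follows. This is a minimal illustration in Python, not ChaosCat's actual code (which is written in Scala and tied to internal tooling). The attack names, weights, attack window, and the function names in_attack_window, pick_attack, and plan_attack are all assumptions made for the example.

```python
import random
from datetime import datetime, time

# Hypothetical attack catalogue: name -> selection weight (illustrative values).
ATTACK_WEIGHTS = {
    "network-latency": 5,
    "high-cpu": 3,
    "host-restart": 1,
}

# Hypothetical attack window: weekdays, 10:00 to 16:00 local time.
WINDOW_START = time(10, 0)
WINDOW_END = time(16, 0)

# The post states a 10-minute wait between passes.
WAIT_SECONDS = 10 * 60


def in_attack_window(now: datetime) -> bool:
    """Step one: only act during the scheduled subset of business hours."""
    return now.weekday() < 5 and WINDOW_START <= now.time() < WINDOW_END


def pick_attack(rng: random.Random) -> str:
    """Step three: weighted random choice among the known attacks."""
    names = list(ATTACK_WEIGHTS)
    weights = [ATTACK_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def plan_attack(now, all_clear, stopped, hosts, rng):
    """Return (attack, host) for this pass, or None if nothing should run.

    all_clear stands in for the system-status check (step two) and
    stopped for a prior "@chaoscat stop" command.
    """
    if stopped or not in_attack_window(now):
        return None
    if not all_clear:
        return None
    if not hosts:
        return None
    # Every host is eligible: no exemptions.
    return pick_attack(rng), rng.choice(hosts)
```

A scheduler built on this would call plan_attack, hand any resulting (attack, host) pair to the task runner, sleep for WAIT_SECONDS, and repeat. Keeping the decision in a function with no side effects makes it easy to test without touching real hosts.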
Some teams quickly learned that there’s a world of difference between sitting at the ready with all of your dashboards and logs pulled up, and having something go wrong while you’re getting your morning coffee. These teams identified gaps in their run books and on-call rotations and fixed them. Success!
Another interesting thing: we found that after teams got over their initial discomfort, they automated fixes that had previously been done manually and correctly prioritized technical debt items in their backlog, items that had been neglected because the failures behind them had been so infrequent beforehand. This, in turn, gave those teams more confidence in their services’ reliability.
Unfortunately, ChaosCat is significantly tied into our internal infrastructure tooling. For the moment this means we won’t be open-sourcing it. However, we’d love to get your feedback and questions about it, so ask away in the PagerDuty Community forums or in the comments below!
We hope that more companies start to practice this kind of reliability engineering (or, as some like to say, chaos engineering). It’s a fantastic way to verify the robustness and behavior of increasingly complex and diverse infrastructure.