What is Chaos Testing?

Chaos testing was created just over ten years ago thanks to the same company that gave us Tiger King and The Queen’s Gambit—Netflix.

In 2010, development and operations teams at Netflix started the process of moving their entire infrastructure over to AWS (Amazon Web Services). At the time, the team at Netflix quickly realized their existing infrastructure would not allow for the scalability that they’d eventually need, so they made the intimidating decision to migrate everything to Amazon’s cloud-based AWS in a monolith-to-microservice transition.

During this time, Netflix established two principles learned from the process of moving over their entire infrastructure while minimizing the impact to its millions of users:

1. No system should ever have a single point of failure. A single point of failure is any one component whose error or failure could bring down the whole system, potentially leading to hundreds of hours of unplanned downtime.
2. Never be 100% confident that number one is true. Your team needs an effective way to consistently test and monitor your system to verify that point number one still holds (Netflix created chaos monkeys to help with this; more on that later).
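
To make the two principles concrete, here is a minimal Python sketch that checks a service inventory for single points of failure by removing one instance of each component in turn. The inventory, the component names, and the rule that a component needs at least one running instance are all assumptions made for illustration; this is not Netflix's tooling.

```python
from typing import Dict, List


def survives_loss_of(component: str, replicas: Dict[str, int]) -> bool:
    """Remove one instance of `component` and report whether every
    component still has at least one running instance."""
    remaining = dict(replicas)
    remaining[component] -= 1
    return all(count >= 1 for count in remaining.values())


def single_points_of_failure(replicas: Dict[str, int]) -> List[str]:
    """Return the components whose loss of a single instance breaks the system."""
    return sorted(name for name in replicas if not survives_loss_of(name, replicas))


# Hypothetical inventory: component name -> number of running instances.
inventory = {"api": 3, "database": 1, "cache": 2, "auth": 1}
print(single_points_of_failure(inventory))  # prints ['auth', 'database']
```

Running a check like this on a schedule, rather than once, is one way to act on the second principle: the answer can change every time the system does.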

This methodology was called chaos testing. Chaos testing relies on the proactive identification of errors within a system in order to prevent outages and negative impacts on the user. Netflix understood the importance of this all too well, as they had experienced a catastrophic failure just a few years prior to making the switch to AWS.

Today, many DevOps and IT teams in all industries are joining Netflix and Amazon in adopting chaos testing and engineering. In this article, we will take a closer look at the core principles of chaos engineering, its advantages and disadvantages, chaos monkeys, and whether chaos testing is a good fit for your team.

What is Chaos Engineering?

Chaos testing, or chaos engineering, is the highly disciplined approach to testing a system’s integrity by proactively simulating and identifying failures in a given environment before they lead to unplanned downtime or a negative user experience. DevOps and IT teams that utilize chaos engineering will need to set up a system of monitoring tools and actively run chaos testing in a production environment. This way, teams are able to see real-life simulations of how their application or service responds to different pressures and stresses.

Chaos engineering is made up of five main principles:

1. Ensure your system works and define a steady state. To do this, define a “steady state,” or control, as a measurable system output that indicates normal working behavior (for example, an error rate well below one percent).
2. Hypothesize the system’s steady state will hold. Once a steady state has been determined, hypothesize that it will continue in both control and experimental conditions.
3. Ensure minimal impact to your users. During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimizes the blast radius and any negative impact to your users. Your team is responsible for keeping every test focused on a specific area and should be ready for incident response as needed.
4. Introduce chaos. Once you are confident that your system is working, your team is prepared, and the blast radius is contained, you can start running your chaos tests. Introduce different variables that simulate real-world scenarios, including everything from a server crash to malfunctioning hardware and severed network connections. It’s best to test in a production environment, with the blast radius contained, so you can observe how your service or application actually reacts to these events while limiting the effect on active users.
5. Monitor and repeat. With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. The goal of chaos engineering is to try to disprove your hypothesis from number two, building a more reliable system in the process.
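
The five principles can be sketched as a small experiment loop. The Python example below is a simulation under stated assumptions, not a real chaos tool: `make_service`, `run_experiment`, the three-replica service, and the one percent threshold are all hypothetical names and values chosen for illustration. It measures the steady state first, refuses to inject chaos if the system is already unhealthy, then measures again with a fault in place and reports whether the hypothesis held.

```python
import random
from dataclasses import dataclass
from typing import Callable

Handler = Callable[[random.Random], bool]


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_held: bool


def make_service(replicas_up: int, replicas_total: int, retries: int) -> Handler:
    """Simulated service: each attempt is routed to a random replica and
    succeeds only if that replica is up."""
    def handle(rng: random.Random) -> bool:
        for _ in range(retries + 1):
            if rng.randrange(replicas_total) < replicas_up:
                return True
        return False
    return handle


def error_rate(handler: Handler, requests: int, rng: random.Random) -> float:
    """Fraction of simulated requests that fail."""
    failures = sum(1 for _ in range(requests) if not handler(rng))
    return failures / requests


def run_experiment(control: Handler, experiment: Handler,
                   threshold: float = 0.01, requests: int = 1000,
                   seed: int = 0) -> ExperimentResult:
    rng = random.Random(seed)
    # Principle 1: confirm the steady state before breaking anything.
    baseline = error_rate(control, requests, rng)
    if baseline >= threshold:
        raise RuntimeError("system is not in a steady state; do not inject chaos")
    # Principles 2 and 4: hypothesize the steady state holds, then inject the fault.
    chaos = error_rate(experiment, requests, rng)
    return ExperimentResult(baseline, chaos, hypothesis_held=chaos < threshold)


if __name__ == "__main__":
    healthy = make_service(replicas_up=3, replicas_total=3, retries=0)
    one_replica_down = make_service(replicas_up=2, replicas_total=3, retries=0)
    print(run_experiment(healthy, one_replica_down))
```

In this simulation, taking one of three replicas down without retries pushes the error rate far above the threshold, so the hypothesis is disproved. That is the useful outcome: it points at a weakness (no retry or failover) that can be fixed and then retested, which is the "monitor and repeat" step.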

What is Chaos Monkey and How Does it Work?

When Netflix started chaos testing their system during their move to AWS, they created different “chaos monkeys” to help meet the need for continuous and consistent testing. These chaos monkeys were deployed into a system to introduce specific issues (network delays, instance failures, missing data segments, etc.) and simulate different real-world scenarios.

Each chaos monkey had its own name and job, including:

Latency Monkey: Induces artificial delays
Conformity and Security Monkeys: Hunt and kill instances that don’t adhere to best practices
Janitor Monkey: Cleans up and removes unused resources
Chaos Gorilla: Simulates an entire Amazon availability zone outage

Collectively, these and other chaos monkeys are now known as the Simian Army.
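
To give a feel for what a tool like Latency Monkey does, here is a minimal Python sketch of the idea: a decorator that delays a fraction of calls. It is an illustration only and does not reflect how Netflix's tools are implemented; `latency_monkey`, its parameters, and `fetch_profile` are hypothetical names.

```python
import functools
import random
import time
from typing import Callable, Optional


def latency_monkey(probability: float, delay_seconds: float,
                   rng: Optional[random.Random] = None,
                   sleep: Callable[[float], None] = time.sleep):
    """Decorator that delays a call by `delay_seconds` with the given probability.

    `rng` and `sleep` are injectable so the behavior can be tested
    without real randomness or real waiting.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_seconds)  # the injected fault: artificial delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call: roughly one in five calls is slowed by 50 ms.
@latency_monkey(probability=0.2, delay_seconds=0.05)
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id}
```

Wrapping a call this way lets a team observe whether callers time out, retry, or degrade gracefully when a dependency becomes slow.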

The Advantages and Disadvantages of Chaos Testing

Chaos engineering is gaining popularity with some of the industry’s largest IT and DevOps teams. However, it’s not always the right choice for every team and situation.

The advantages of chaos testing are:

IT and DevOps teams can more quickly identify and resolve issues that other testing might not capture
Unplanned downtime and outages are far less likely to occur thanks to proactive and constant testing
System integrity is strengthened
Chaos testing is well suited to large, complex systems (e.g., cloud-based applications and services) and to scaling up

However, chaos testing may not be right for:

Smaller systems or desktop software
Applications and services that are not mission-critical to the success of the business
Application environments that don’t require 24×7 uptime via customer SLAs
Systems in which failures are acceptable if resolved by the end of the day

How Does Chaos Testing Work in DevOps?

Chaos engineering fits well within a DevOps structure. Typically, chaos engineering falls on the shoulders of a DevOps engineer such as the XA (Experience Assurance Professional). This person is in charge of defining the different testing scenarios, executing the tests, and tracking the outcome and results. They are also responsible for ensuring minimal impact to the customer.

While testing, there’s a very fine line that the DevOps engineer must walk. On one side, there’s testing the system’s integrity by introducing chaos and trying to get it to crash (which is why this is best done in a production environment). On the other, there’s conducting unplanned or undisciplined tests that actually cause the system to crash and affect the user experience.
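
One way to stay on the right side of that line is to wrap every experiment in guardrails: a cap on how much of the fleet may be touched, an abort condition, and a guaranteed rollback. The Python sketch below shows that pattern under stated assumptions. `run_guarded` and its parameters are hypothetical, and the `inject`, `rollback`, and `error_rate` callables stand in for whatever fault-injection and monitoring tools a team actually uses.

```python
from typing import Callable, List


def run_guarded(targets: List[str], fleet_size: int, max_blast_radius: float,
                inject: Callable[[str], None], rollback: Callable[[str], None],
                error_rate: Callable[[], float], abort_threshold: float) -> str:
    """Inject a fault into `targets` one host at a time.

    Refuses to start if the targets exceed the allowed share of the fleet,
    stops as soon as the observed error rate passes `abort_threshold`,
    and always rolls back every host that was touched.
    """
    if len(targets) / fleet_size > max_blast_radius:
        raise ValueError("experiment exceeds the allowed blast radius")

    injected: List[str] = []
    try:
        for host in targets:
            inject(host)
            injected.append(host)
            if error_rate() > abort_threshold:
                return "aborted"
        return "completed"
    finally:
        # Undo the faults in reverse order, whether the run finished or aborted.
        for host in reversed(injected):
            rollback(host)
```

The `finally` block is the important design choice here: the rollback runs even if the experiment aborts or an injection step raises an exception, so a failed test does not leave faults behind in production.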

How to Get Started with Chaos Testing

Curious to get started with chaos testing of your own system? Before rushing out an army of your own chaos monkeys, it’s important to first determine whether chaos testing and engineering is right for your team and company. Chaos engineering has proven to be extremely effective at improving the integrity of very large and complex systems, offering benefits such as faster incident response times, less unplanned downtime, and ultimate flexibility in terms of scaling up and out. However, chaos testing may not be necessary for smaller systems or desktop software.

