What is Chaos Engineering?

Chaos engineering and resiliency are tightly coupled. While chaos engineering may sound like an oxymoron, its concepts underpin the invisible reliability and swift usability of many modern technology architectures; it is how the tech industry puts resiliency into practice. With the rising risk of failure that comes with distributed cloud architectures and microservices, chaos engineering is now frequently touted as “preventative medicine” for those failures. Put simply, chaos engineering is the crash test before the new model year goes on the lot, or the “freedom to fail” before customers get a chance to see the failure (though in this case, only jobs are at risk).

Let’s get into the process of a chaos engineering experiment. In this article we’ll outline how to implement an informed chaos engineering process, discuss a few specifics around tools and automation, and hopefully ease some of the hesitation around implementing chaos engineering within your team.

Benefits of Chaos Engineering

The primary benefit of chaos engineering is that its process and results address not only planned failures but unplanned failures, too. Unplanned outages are usually a product of system complexity: it is no longer feasible to map out all possible combinations and outcomes of a complex system, so it’s more efficient to map out the initial knowns and then use chaos experiments to reveal further potential causes of outages, latency, silent failures, and so on. In fact, if an unexpected failure does take place during a chaos experiment, it’s considered a big win. As the maintainers of Netflix’s Chaos Monkey note on GitHub, “Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment.” By surfacing the unexpected before customers do, chaos engineers gain confidence in their systems while preventing revenue loss, system failures, and customer dissatisfaction.

An ancillary benefit of chaos engineering is simply new knowledge. While we want to see new failures for the sake of preventing them, these experiments and trials will also show us what is working, and what could work in the future.

Getting Started

Chaos engineering does not have to be scary, as Chaos Engineer Ana Medina at Gremlin emphasizes. In fact, embracing failure while anticipating uncertainties creates a much more productive and growth-oriented environment. Further, the principles of chaos engineering indirectly promote a blameless culture rather than one that invokes fear around failure. In chaos engineering, not only are you encouraged to fail in order to succeed again, but there simply are no repercussions: it’s a test run!

Now that we’ve covered the mindset, we can address the more tangible starting necessities: get together with your team and take inventory of your services, dependencies, and data stores.

Blast Radius

If it’s possible to start your experiment in a non-production environment, that is your safest bet. Often, though, teams need to test these systems in real time, in production. As such, minimizing your blast radius (the zone of systems, and therefore users, affected) is paramount.

For example, instead of running a chaos experiment against all of your infrastructure, identify areas that pose minimal risk to the business but are still similar enough to the rest of your infrastructure for a direct comparison. In short, the blast radius should be minimal: while you are hunting for holes in your system, your business operations should remain whole and intact.
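To make this concrete, here is a minimal sketch in Python of one way a team might carve out a small, low-risk target set before injecting any faults. The host names, tier tags, and 10% cap are hypothetical placeholders, not part of any particular chaos tool.

```python
import random

# Hypothetical inventory; in practice this comes from your CMDB, cloud provider
# API, or service registry.
HOSTS = [
    {"name": "checkout-7", "tier": "critical"},
    {"name": "recs-3", "tier": "low-risk"},
    {"name": "recs-4", "tier": "low-risk"},
    {"name": "thumbs-1", "tier": "low-risk"},
]

def pick_blast_radius(hosts, max_fraction=0.1):
    """Select a small, low-risk subset of hosts to target in the experiment."""
    candidates = [h for h in hosts if h["tier"] == "low-risk"]
    # Cap the target list at a fraction of the fleet, but keep at least one host.
    limit = max(1, int(len(hosts) * max_fraction))
    return random.sample(candidates, k=min(limit, len(candidates)))

targets = pick_blast_radius(HOSTS)
print("Experiment targets:", [h["name"] for h in targets])
```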

Time to Experiment & Implement

Next, consider what type of experiment you will be doing. Below are essential questions to ask before proceeding, adapted from John Welsh, who works on Cloud Infrastructure and SRE Disaster Recovery at Google; a sketch of one way to record the answers follows the list.

  • What does our system look like right now?
  • What are our current challenges and system failures?
  • What is the goal of the test?
  • What is the risk? What could go wrong?
  • What is the impact? What could go really wrong?
  • What are the mitigations?
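One lightweight way to keep these answers attached to the experiment itself is to record them in a structured plan. The Python sketch below is a hypothetical illustration; the field names simply mirror the questions above and are not part of any chaos tooling.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentPlan:
    """Answers to the pre-experiment questions, kept alongside the experiment."""
    system_snapshot: str          # What does the system look like right now?
    known_challenges: list[str]   # Current challenges and system failures
    goal: str                     # What is the goal of the test?
    risks: list[str]              # What could go wrong?
    worst_case_impact: str        # What could go really wrong?
    mitigations: list[str] = field(default_factory=list)

plan = ExperimentPlan(
    system_snapshot="3 services behind a load balancer, single region",
    known_challenges=["intermittent timeouts from the recommendations service"],
    goal="Verify checkout stays available when recommendations is degraded",
    risks=["elevated latency on product pages"],
    worst_case_impact="checkout errors for a small fraction of users",
    mitigations=["feature flag to disable recommendations", "abort script"],
)
```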

Define a Steady State

It’s pivotal to understand and define your steady state using metrics that indicate your systems are operating as they should, in line with business goals and standards.

Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event.
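As a concrete illustration, here is a minimal Python sketch of one way to express a steady state as metric bands; the metric names and thresholds are hypothetical and would come from your own monitoring and business requirements.

```python
# Hypothetical steady-state definition: metric name -> (lower bound, upper bound).
STEADY_STATE = {
    "requests_per_second": (900, 1500),
    "error_rate": (0.0, 0.01),   # at most 1% of requests may fail
    "p99_latency_ms": (0, 350),
}

def within_steady_state(metrics: dict) -> bool:
    """Return True if every observed metric falls inside its steady-state band."""
    return all(
        low <= metrics.get(name, float("inf")) <= high
        for name, (low, high) in STEADY_STATE.items()
    )

# Example reading pulled from a dashboard or metrics API.
print(within_steady_state({"requests_per_second": 1100, "error_rate": 0.004, "p99_latency_ms": 240}))
```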

Develop your Hypothesis

Base your hypothesis on output measurements that represent, or serve as proxies for, steady-state behavior, such as throughput, error rates, and latency percentiles. It’s important to note that hypotheses aim to detect real unknowns rather than error occurrences you already know to be predictable. An example hypothesis might be: “When X occurs, the steady state of our system remains stable.” It is okay if this hypothesis fails early in the chaos experiment; the goal is to resolve the resulting errors, alerts, and so on until it becomes true.
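Continuing the hypothetical sketch above, the hypothesis can then be written down as a simple predicate over that same steady-state check, so it is unambiguous what “remains stable” means for your system.

```python
def hypothesis_holds(metrics_during_fault: dict) -> bool:
    """Hypothesis: when the fault is injected, the system stays within its steady state."""
    return within_steady_state(metrics_during_fault)  # helper from the steady-state sketch

# Metrics sampled while a hypothetical fault is active; the elevated error rate
# here would falsify the hypothesis and point at the next piece of work.
print(hypothesis_holds({"requests_per_second": 1050, "error_rate": 0.02, "p99_latency_ms": 500}))
```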

Introduce Realistic Stressors

It’s time to introduce your ‘chaos.’ Your aim is to simulate realistic stressors, or variables, that push your steady state toward latency, degraded performance, or failure. Consider events that could occur within hardware and software, like server crashes or malformed responses, or external events such as traffic spikes or severed connections. These stressors will then provide the information you need about the resilience of your systems.
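As one minimal illustration, not tied to any particular tool, the Python sketch below injects artificial latency and occasional connection failures into a hypothetical downstream call; dedicated chaos tools apply the same idea at the network and infrastructure level.

```python
import random
import time

def inject_chaos(func, latency_s=0.5, failure_rate=0.1):
    """Wrap a callable so a fraction of calls are delayed or fail outright."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("chaos: simulated dropped connection")
        time.sleep(latency_s * random.random())  # simulated network latency
        return func(*args, **kwargs)
    return wrapper

# Hypothetical downstream dependency.
def fetch_recommendations(user_id: str) -> list:
    return ["item-1", "item-2"]

chaotic_fetch = inject_chaos(fetch_recommendations)
try:
    print(chaotic_fetch("user-42"))
except ConnectionError as err:
    print("Fallback path exercised:", err)
```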

Validate your Hypothesis

If you find differences between your steady-state metrics and those recorded after you initiate your stressors, your chaos experiment was successful: you can now identify where the system needs strengthening before larger-scale scenarios arise. Alternatively, you may also consider no statistically significant difference between the metrics a ‘win’, since you can walk away with confidence in your existing architecture.
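Rounding out the hypothetical sketches above, validation can be as simple as comparing a baseline sample to a sample taken while the stressor was active. The 10% tolerance below is an assumption you would replace with your own service-level objectives.

```python
def validate(baseline: dict, during_fault: dict, tolerance=0.10) -> dict:
    """Flag any metric that drifted more than `tolerance` (10%) from its baseline."""
    drifted = {}
    for name, base_value in baseline.items():
        observed = during_fault.get(name, 0.0)
        if base_value and abs(observed - base_value) / base_value > tolerance:
            drifted[name] = (base_value, observed)
    return drifted

baseline = {"requests_per_second": 1100, "error_rate": 0.004, "p99_latency_ms": 240}
during_fault = {"requests_per_second": 1080, "error_rate": 0.019, "p99_latency_ms": 410}
print(validate(baseline, during_fault))  # the metrics that broke the hypothesis
```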

Automation in Chaos Engineering

Both at the start of chaos experimentation and when faced with complex, potentially expensive system upgrades, manual failure testing is typically the route to take. For advanced chaos engineers, however, automated failure testing can provide benefits that manual implementation cannot.

Automated experiments increase confidence in systems by providing a consistent source of data at regular intervals and at scale. Automation also saves teams money and time, freeing engineers to experiment with more complex systems manually.
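Here is a heavily simplified, self-contained sketch of what such automation might look like in Python, with stand-in functions where a real setup would call its monitoring system, chaos tool, and scheduler.

```python
import contextlib
import random
import time

def sample_metrics() -> dict:
    """Stand-in for a call to your monitoring system."""
    return {"error_rate": random.uniform(0.0, 0.02), "p99_latency_ms": random.uniform(200, 400)}

@contextlib.contextmanager
def fault_injected(name: str):
    """Stand-in for starting and reverting a fault via your chaos tool of choice."""
    print(f"injecting fault: {name}")
    try:
        yield
    finally:
        print(f"reverting fault: {name}")

def validate(baseline: dict, during: dict, tolerance=0.10) -> dict:
    """Same idea as the earlier validation sketch: flag metrics that drifted."""
    return {k: (v, during[k]) for k, v in baseline.items()
            if v and abs(during[k] - v) / v > tolerance}

def run_experiment() -> bool:
    """One automated run: sample a baseline, inject a fault, then validate."""
    baseline = sample_metrics()
    with fault_injected("recommendations-outage"):
        time.sleep(1)  # in reality, long enough for the fault to take effect
        during = sample_metrics()
    drifted = validate(baseline, during)
    if drifted:
        print("steady state violated:", drifted)  # in practice, alert the team
        return False
    return True

# A real setup would run this from cron, CI, or a chaos platform's scheduler.
if __name__ == "__main__":
    run_experiment()
```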

Chaos Engineering Tools

Below is a brief list outlining the most common tools available, each with its own benefits and limitations. It’s important to carefully vet your tools and align them with your system and business goals before commencing an experiment.

Gremlin

Gremlin is a major player in the chaos engineering space, providing what is likely the most user-friendly and transparent tool available. It helps software engineers create safe and resilient systems with highly configurable infrastructure attack models.

Application Specialty: Most user-friendly option for public cloud or private data centers.

Chaos Monkey

The first and most widely known chaos tool, Chaos Monkey was released to the world by Netflix in 2012 and today has the longest development history of the tools listed here. It requires Spinnaker to manage applications, however, and some say it offers limited control over the blast radius.

Application Specialty: Cloud system failures.

Chaos Blade

Created by Alibaba, Chaos Blade runs on a wide range of platforms, provides dozens of attack targets, and supports application-level fault injection.

Application Specialty: Based on 10 years of failure experience at Alibaba; distributed-system fault tolerance.

Chaos Mesh

Chaos Mesh, a Cloud Native Computing Foundation (CNCF) sandbox project, is a Kubernetes-native tool that lets you deploy and manage your experiments as Kubernetes resources (see the sketch below). It supports 17 unique attacks and allows you to fine-tune your blast radius, down to disrupting pod-to-pod communication.

Application Specialty: Designed to easily kill Kubernetes pods and simulate latencies.
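As a brief, hedged example of the “experiments as Kubernetes resources” idea, the Python sketch below applies a PodChaos resource with the official kubernetes client. It assumes Chaos Mesh is already installed in the cluster, and the namespace and app=recommendations label are placeholders for your own.

```python
# Minimal sketch: apply a Chaos Mesh PodChaos resource with the official
# kubernetes Python client. Assumes Chaos Mesh is installed in the cluster;
# the namespace and the app=recommendations label are placeholders.
from kubernetes import client, config

pod_kill = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "kill-one-recommendations-pod", "namespace": "default"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",  # keep the blast radius to a single pod
        "selector": {
            "namespaces": ["default"],
            "labelSelectors": {"app": "recommendations"},
        },
    },
}

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="default",
    plural="podchaos",
    body=pod_kill,
)
```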

Litmus

Litmus is another Kubernetes-native tool and CNCF sandbox project; it offers easy plug-and-play chaos through a chaos Operator and CRDs (CustomResourceDefinitions).

Application Specialty: A complete framework within the Kubernetes ecosystem; ideal for Site Reliability Engineers.