Chaos Engineering & Testing

Modern software systems are more complex and interconnected than ever. While this creates limitless possibilities for innovation, it also introduces more potential failure points, many of which can go undetected until something breaks in production.

That’s where chaos engineering comes in. By deliberately injecting failure into systems in a controlled way, teams can identify vulnerabilities, improve system resilience, and ensure a better user experience before real issues arise.

Chaos testing originated at Netflix over a decade ago. As the company began migrating its infrastructure to Amazon Web Services (AWS), engineers faced a critical challenge: ensuring reliability at scale during a massive shift from monolith to microservices.

To minimize disruption for its users, the Netflix team embraced a bold new mindset: intentionally injecting failure to better understand and strengthen its system.

Key Takeaways

  • Chaos engineering proactively tests system resilience by injecting controlled failures to uncover hidden weaknesses.
  • Chaos testing, a core tactic, simulates real-world incidents like server crashes or latency spikes to validate system reliability.
  • The process follows key principles: define steady state, hypothesize stability, minimize user impact, introduce chaos, and monitor results.
  • Tools like Netflix’s Chaos Monkey automate fault injection, helping teams improve reliability in complex, distributed systems.
  • Chaos engineering is most valuable for large-scale or cloud-native environments and requires mature DevOps or SRE practices.

What is chaos engineering?

Chaos engineering, a term often used interchangeably with chaos testing, is a disciplined approach to testing a system’s integrity. Teams proactively simulate failures in a given environment to find weaknesses before they lead to unplanned downtime or a negative user experience. DevOps and IT teams can use chaos engineering tools to inject faults and observe how the system behaves, including in a production environment.

By intentionally breaking things, chaos engineers can validate monitoring tools, stress-test infrastructure, and enhance system resilience while preserving the user experience.

Chaos testing vs. chaos engineering

Chaos testing is an integral part of chaos engineering. It involves simulating real-world incidents, like server crashes or latency spikes, to see how your system performs under stress.

While the terms are sometimes used interchangeably, it’s helpful to understand how they relate:

Similarities

  • Both aim to identify weaknesses in complex, distributed systems before those issues reach end users.
  • Both involve deliberate experimentation with faults and failure scenarios.
  • Both emphasize building a resilient system that can recover quickly and operate reliably under disruption.

Differences

  • Chaos engineering is a broader discipline. It includes defining hypotheses, setting guardrails, monitoring, and learning from experiments to improve system design.
  • Chaos testing refers to the execution phase—running controlled experiments that simulate failures.

In short, chaos engineering is the strategy, and chaos testing is one of the key tactics used to carry it out.

Principles of chaos engineering

Chaos engineering is made up of five core principles:

  • Ensure your system works and define a steady state. Pick a measurable system output that indicates normal working behavior and use it as your “steady state,” or control (for example, an error rate that stays well below one percent).
  • Hypothesize that the system’s steady state will hold. Once you’ve established a steady state, hypothesize that it will continue in both control and experimental conditions.
  • Ensure minimal impact to your users. During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimizes the blast radius and any negative impact to your users. Your team is responsible for ensuring all tests are focused on specific areas and should be ready for incident response as needed.
  • Introduce chaos. Once you are confident that your system is working, your team is prepared, and the blast radius is contained, you can start running your chaos experiments. You’ll want to introduce variables that simulate real-world scenarios, including everything from a server crash to malfunctioning hardware and severed network connections. Testing in a production environment gives the most realistic picture of how your service or application reacts to these events, so limit each experiment to a small slice of traffic or infrastructure to avoid disrupting most active users.
  • Monitor and repeat. With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. Each experiment tries to disprove the hypothesis from the second principle. Every time it succeeds, you have found a weakness to fix, and the system becomes more reliable in the process.
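The five principles can be sketched as a small experiment loop. The Python below is a minimal illustration, not a real chaos tool: `FlakyDependency`, `run_experiment`, and the one percent threshold are assumptions made up for this example, and the "system" is an in-memory stand-in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_held: bool


class FlakyDependency:
    """Hypothetical dependency. With fail_every=N, every N-th call fails."""

    def __init__(self, fail_every: int = 0) -> None:
        self.fail_every = fail_every  # 0 means no fault is injected
        self.calls = 0

    def call(self) -> bool:
        self.calls += 1
        return not (self.fail_every and self.calls % self.fail_every == 0)


def error_rate(handler: Callable[[], bool], requests: int) -> float:
    """Fraction of requests for which the handler reports failure."""
    failures = sum(1 for _ in range(requests) if not handler())
    return failures / requests


def run_experiment(
    healthy: Callable[[], bool],
    under_fault: Callable[[], bool],
    threshold: float = 0.01,  # steady state: error rate below one percent
    requests: int = 1000,     # a bounded sample keeps the blast radius small
) -> ExperimentResult:
    # Principle 1: confirm the steady state before injecting anything.
    baseline = error_rate(healthy, requests)
    if baseline >= threshold:
        raise RuntimeError("Not in steady state; do not inject chaos.")
    # Principles 2-4: hypothesize the error rate stays below the threshold,
    # then measure it while the fault is active.
    chaos = error_rate(under_fault, requests)
    # Principle 5: record whether the hypothesis survived.
    return ExperimentResult(baseline, chaos, chaos < threshold)
```

In this sketch, a service that calls a dependency failing every tenth request sees a 10 percent error rate, so the hypothesis is disproved. Wrapping the same call in a single retry (`dep.call() or dep.call()`) keeps the error rate at zero, and the hypothesis holds.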

Examples of chaos tests

Chaos testing has real-world applications across industries where reliability is critical. By simulating failure scenarios in a controlled environment, teams can identify weak spots before they lead to service disruptions, compliance issues, or frustrated customers.

Chaos testing in finance

Simulating third-party API outages: Financial institutions often rely on external services for payments, market data, or identity verification. A chaos experiment might involve disabling a payment gateway or slowing a data feed to observe how the system responds. 

Can transactions be queued? Are customers notified? These tests support system reliability under unpredictable conditions.
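As a rough illustration, the experiment could look like the Python sketch below. `PaymentGateway` and `PaymentService` are hypothetical in-memory stand-ins, not a real payment API, and the `available` flag plays the role of the injected outage.

```python
from collections import deque


class GatewayUnavailable(Exception):
    """Raised when the (simulated) third-party gateway is down."""


class PaymentGateway:
    """Stand-in for an external gateway; `available` is the fault switch."""

    def __init__(self) -> None:
        self.available = True
        self.processed: list[str] = []

    def charge(self, tx_id: str) -> None:
        if not self.available:
            raise GatewayUnavailable(tx_id)
        self.processed.append(tx_id)


class PaymentService:
    def __init__(self, gateway: PaymentGateway) -> None:
        self.gateway = gateway
        self.pending: deque[str] = deque()
        self.notifications: list[str] = []

    def pay(self, tx_id: str) -> str:
        try:
            self.gateway.charge(tx_id)
            return "processed"
        except GatewayUnavailable:
            # Degrade gracefully: queue the transaction and tell the customer.
            self.pending.append(tx_id)
            self.notifications.append(f"Payment {tx_id} is delayed")
            return "queued"

    def drain(self) -> int:
        """Retry queued transactions in order; stop at the first failure."""
        drained = 0
        while self.pending:
            try:
                self.gateway.charge(self.pending[0])
            except GatewayUnavailable:
                break
            self.pending.popleft()
            drained += 1
        return drained
```

A test would set `gateway.available = False`, submit a payment, and check that it was queued and a notification was recorded. After restoring the gateway, `drain()` should process the backlog in order.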

Chaos testing in healthcare

Testing EHR access delays: Rapid access to electronic health records (EHRs) is essential in healthcare. A chaos test might inject latency or failure to confirm that providers can still access critical data via backups or caching. 

This type of testing can help to ensure system resiliency during emergencies.
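One way to picture a latency experiment is the Python sketch below. It assumes a hypothetical `RecordStore` whose `injected_delay` attribute is the chaos knob and a `RecordClient` that falls back to its cache when the store misses a timeout. The names and timeout values are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class RecordStore:
    """Stand-in for a primary EHR store with injectable latency."""

    def __init__(self, records: dict) -> None:
        self.records = records
        self.injected_delay = 0.0  # seconds of artificial latency

    def fetch(self, patient_id: str) -> dict:
        time.sleep(self.injected_delay)
        return self.records[patient_id]


class RecordClient:
    def __init__(self, store: RecordStore, timeout: float = 0.2) -> None:
        self.store = store
        self.timeout = timeout
        self.cache: dict = {}
        self.pool = ThreadPoolExecutor(max_workers=4)

    def get(self, patient_id: str):
        """Return (record, source); use the cache if the store is too slow."""
        future = self.pool.submit(self.store.fetch, patient_id)
        try:
            record = future.result(timeout=self.timeout)
        except FutureTimeout:
            if patient_id in self.cache:
                return self.cache[patient_id], "cache"
            raise  # no cached copy: surface the failure to the caller
        self.cache[patient_id] = record
        return record, "primary"
```

The experiment reads a record once under normal conditions, raises `injected_delay` above the timeout, and reads again. The hypothesis holds if the second read still returns the record, this time from the cache.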

Chaos testing in the public sector

Disrupting DNS or authentication services: Government platforms must stay online during critical events. A chaos test might simulate a DNS failure or compromised authentication system. By analyzing system behavior during these events, chaos engineers can improve uptime and access.
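At the level of a single process, a DNS failure can be simulated without touching real infrastructure. The Python sketch below patches the standard library's `socket.getaddrinfo` with `unittest.mock.patch` so that every lookup fails. The `resolve_with_fallback` helper, the hostname, and the fallback address are assumptions made for this example.

```python
import socket
from unittest import mock


def resolve_with_fallback(hostname: str, fallback_ips: list[str]):
    """Resolve a hostname; on DNS failure, return last-known-good addresses."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos}), "dns"
    except socket.gaierror:
        return list(fallback_ips), "fallback"


def simulate_dns_outage():
    """Context manager that makes every DNS lookup in this process fail."""
    return mock.patch(
        "socket.getaddrinfo",
        side_effect=socket.gaierror("injected DNS failure"),
    )


# Usage: the lookup fails inside the block, so the fallback list is returned.
with simulate_dns_outage():
    ips, source = resolve_with_fallback("portal.example", ["192.0.2.10"])
```

This only affects code running in the same Python process. A fuller experiment would break resolution at the network or resolver level and watch the platform's own dashboards.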

What is Chaos Monkey, and how does it work?

When Netflix began chaos testing its system during the AWS migration, it created Chaos Monkey, a tool that randomly disables production instances to test whether services can recover. Chaos Monkey grew into a series of tools, known as the Simian Army, that introduce continuous, controlled disruption by simulating different kinds of unexpected failure.

Netflix’s “Simian Army” includes:

  • Latency monkey: Induces artificial delays
  • Conformity and security monkeys: Shut down instances that don’t meet standards
  • Janitor monkey: Cleans up unused resources
  • Chaos gorilla: Simulates an outage of an entire AWS Availability Zone
  • Chaos kong: Takes out entire AWS regions to test global resilience
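The core idea behind Chaos Monkey, randomly terminating an instance and checking that the service keeps running, can be sketched in a few lines of Python. This is an illustration of the concept only, not Netflix's implementation or API. `InstanceGroup` and `unleash_monkey` are hypothetical names, and the group is an in-memory stand-in for real servers.

```python
import random


class InstanceGroup:
    """In-memory stand-in for an auto-scaled group of servers."""

    def __init__(self, name: str, size: int, min_healthy: int) -> None:
        self.name = name
        self.min_healthy = min_healthy
        self.instances = {f"{name}-{i}" for i in range(size)}

    def terminate(self, instance_id: str) -> None:
        self.instances.discard(instance_id)

    def is_serving(self) -> bool:
        return len(self.instances) >= self.min_healthy


def unleash_monkey(group: InstanceGroup, rng: random.Random, enabled: bool = True):
    """Terminate one random instance; return its ID, or None if skipped."""
    # Guardrail: never push the group below its minimum healthy capacity.
    if not enabled or len(group.instances) <= group.min_healthy:
        return None
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim
```

After each termination, the experiment checks `group.is_serving()`. The guardrail in `unleash_monkey` is one simple way to contain the blast radius, since the monkey refuses to act once the group is at its minimum size.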

Pros and cons of chaos engineering

Chaos engineering has become a go-to strategy for DevOps teams managing complex, distributed systems. But it’s not for everyone.

Pros of chaos testing

  • Uncovers hidden weaknesses that traditional tests may miss
  • Reduces unplanned downtime with proactive fault injection
  • Builds system resilience through repeated controlled failures
  • Ideal for large-scale or cloud-native environments with 24/7 uptime requirements

Cons of chaos testing

  • May be excessive for smaller systems or non-critical apps
  • Requires a mature DevOps or site reliability engineering (SRE) culture
  • Needs proper observability and safeguards to avoid user-facing incidents

How chaos testing fits into DevOps workflows

Chaos engineering naturally fits into DevOps and SRE workflows, where system reliability and automation are key. In many organizations, chaos engineering work falls to a DevOps engineer, site reliability engineer, or Experience Assurance (XA) Professional.

These roles are responsible for:

  • Defining failure scenarios
  • Running controlled experiments
  • Monitoring system behavior and response
  • Ensuring minimal impact on users

Rather than causing random outages, chaos engineers use controlled experiments to test system design and resilience. Most teams start with staging environments before carefully introducing chaos into production systems, using dashboards and alerts to track results.

How to get started with chaos testing

Curious about getting started with chaos testing? Before rushing out an army of chaos monkeys, it’s essential to determine whether chaos testing and engineering are right for your team and company. Chaos engineering has proven effective at improving the integrity of large and complex systems, offering benefits such as faster incident response times, less unplanned downtime, and more confidence when scaling up and out. However, chaos testing may not be necessary for smaller systems or desktop software.

To learn more about chaos engineering and how to begin implementing it within your organization, contact us online or start your 14-day free trial today.