In 2010 Netflix announced the existence and success of their custom resiliency tool called Chaos Monkey .

What is Chaos Monkey?

In 2010, Netflix decided to move their systems to the cloud. In this new environment, hosts could be terminated and replaced at any time, which meant their services needed to prepare for this constraint. By pseudo-randomly rebooting their own hosts, they could suss out any weaknesses and validate that their automated remediation worked correctly. This also helped find “stateful” services, which relied on host resources (such as a local cache and database), as opposed to stateless services, which store such things on a remote host. Netflix designed Chaos Monkey to test system stability by enforcing failures via the pseudo-random termination of instances and services within Netflix’s architecture. Following their migration to the cloud, Netflix’s service was newly reliant upon Amazon Web Services and needed a technology that could show them how their system responded when critical components of their production service infrastructure were taken down. Intentionally causing this single failure would suss out any weaknesses in their systems and guide them towards automated solutions that gracefully handle future failures of this sort.

Chaos Engineering Is

"the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production."

Chaos Monkey helped jumpstart Chaos Engineering as a new engineering practice. Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds to failure conditions, you can identify and fix failures before they become public facing outages. Chaos Engineering lets you validate what you think will happen with what is actually happening in your systems. By performing the smallest possible experiments you can measure, you’re able to “break things on purpose” in order to learn how to build more resilient systems.

In 2011, Netflix announced the evolution of Chaos Monkey with a series of additional tools known as The Simian Army. Inspired by the success of their original Chaos Monkey tool aimed at randomly disabling production instances and services, the engineering team developed additional “simians” built to cause other types of failure and induce abnormal system conditions. For example, the Latency Monkey tool introduces artificial delays in RESTful client-server communication, allowing the team at Netflix to simulate service unavailability without actually taking down said service. This guide will cover all the details of these tools in The Simian Army chapter.

What Is This Guide?

The Chaos Monkey Guide for Engineers is a full how-to of Chaos Monkey, including what it is, its origin story, its pros and cons, its relation to the broader topic of Chaos Engineering, and much more. We’ve also included step-by-step technical tutorials for getting started with Chaos Monkey, along with advanced engineering tips and guides for those looking to go beyond the basics. The Simian Army section explores all the additional tools created after Chaos Monkey.

This guide also includes resources, tutorials, and downloads for engineers seeking to improve their own Chaos Engineering practices. In fact, our alternative technologies chapter goes above and beyond by examining a curated list of the best alternatives to Chaos Monkey — we dig into everything from Azure and Docker to Kubernetes and VMware!

Who Is This Guide For?

We’ve created this guide primarily for engineers who are looking for an in-depth resource on Chaos Monkey, as a way to get started with Chaos Engineering. We want to help readers see how Chaos Monkey fits into the practice of Chaos Engineering.

Why Did We Create This Guide?

Gremlin’s goal is to empower engineering teams to build more resilient systems through thoughtful Chaos Engineering. We’re on a constant quest to promote the Chaos Community through frequent conferences & meetups, in-depth talks, detailed tutorials, and the ever-growing list of Chaos Engineering Slack channels.

While Chaos Engineering extends well beyond the scope of one single technique or idea, Chaos Monkey is the most well-known tool for running Chaos Experiments and is a common starting place for engineers getting started with the discipline.