Based on their experience with arbitrarily shutting down servers or simulating the shutdown of an entire data center in production, Netflix has proposed a number of principles of chaos engineering.

Netflix defines Chaos Engineering as the “discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Netflix affirms the need to find weaknesses in a production system, by observing its behavior in controlled experiments, before they manifest in undesired ways. They outline a number of possible systemic weaknesses: “improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes, etc.”

The four principles of Chaos Engineering, according to Netflix, are:

1. Build a Hypothesis around Steady State Behavior: Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.

2. Vary Real-world Events: Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.

3. Run Experiments in Production: Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic.

4. Automate Experiments to Run Continuously: Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
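To illustrate the first principle, the following Python sketch summarizes a short window of observed requests into the kind of output metrics mentioned above: throughput, error rate and latency percentiles. It is not Netflix's tooling. The names Request, SteadyState and summarize are hypothetical, and the nearest-rank percentile method is an assumption chosen for simplicity.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record format)."""
    latency_ms: float
    ok: bool


@dataclass(frozen=True)
class SteadyState:
    """Output metrics used as a proxy for steady state."""
    throughput_rps: float
    error_rate: float
    p50_ms: float
    p99_ms: float


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests: Sequence[Request], window_seconds: float) -> SteadyState:
    """Summarize the requests seen during one measurement window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return SteadyState(
        throughput_rps=len(requests) / window_seconds,
        error_rate=errors / len(requests),
        p50_ms=percentile(latencies, 50),
        p99_ms=percentile(latencies, 99),
    )


if __name__ == "__main__":
    # Synthetic window: 100 requests over 10 seconds, the 5 slowest failed.
    window = [Request(latency_ms=float(i), ok=i <= 95) for i in range(1, 101)]
    print(summarize(window, window_seconds=10.0))
```

The sketch only looks at externally visible output, in line with the principle's advice to measure what the system does rather than how it does it.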

In short, Netflix suggests the following practical steps:

1. Define the normal behavior of a system, its “steady state”.
2. Build a control system and an experimental one.
3. Force disruptions on the experimental system, simulating real-life events such as server crashes, hard drive malfunctions, network failures, etc.
4. Compare the steady state of the experimental system against that of the control one.

The less the experimental system deviates from normal, the more confidence one can have in the system’s resilience. If problems appear during these experiments, one can learn from what happened and take appropriate measures.
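The comparison between the control and experimental systems can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of Netflix's tools: the function names, the metric names and the 10% tolerance are all hypothetical, and relative deviation is just one possible way to quantify how far the experimental system has drifted from normal.

```python
from typing import Dict


def relative_deviation(control: float, experiment: float) -> float:
    """Relative difference of the experimental value from the control value."""
    if control == 0:
        return 0.0 if experiment == 0 else float("inf")
    return abs(experiment - control) / abs(control)


def find_violations(
    control: Dict[str, float],
    experiment: Dict[str, float],
    tolerance: float,
) -> Dict[str, float]:
    """Return the metrics whose deviation from the control exceeds tolerance."""
    violations = {}
    for name, baseline in control.items():
        deviation = relative_deviation(baseline, experiment[name])
        if deviation > tolerance:
            violations[name] = deviation
    return violations


if __name__ == "__main__":
    # Hypothetical steady-state metrics gathered over the same time window.
    control = {"throughput_rps": 1000.0, "error_rate": 0.01, "p99_ms": 200.0}
    experiment = {"throughput_rps": 980.0, "error_rate": 0.03, "p99_ms": 210.0}

    violations = find_violations(control, experiment, tolerance=0.10)
    if violations:
        print("Steady state disturbed:", sorted(violations))
    else:
        print("Experimental system stayed within tolerance")
```

An empty result means the experiment did not find a difference, which raises confidence in the system. A non-empty result points at the metrics to investigate.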

The Principles of Chaos Engineering is meant to be a living document, and Netflix invites other organizations to contribute to it.

Netflix has long experience building and using tools such as Chaos Monkey, Gorilla and Kong to test how its services behave when individual systems, a zone or an entire region are taken down. Zone outages are very unlikely, according to Nir Alfasi, a Netflix engineer, so Gorilla is not really used, but the company practices region outages using Kong almost every month. Chaos Monkey was open sourced some time ago as part of the Simian Army project. Other tools in the Simian Army include Janitor Monkey and Conformity Monkey.