Failure is the last thing you want when running a huge network, particularly one that supports a multibillion-dollar business. But preventing failure requires practice and good planning, which is why Netflix developed software that attacks its own network more than 1,000 times a week.

By forcing Netflix engineers to recover from small failures that customers won't notice, the company hopes to prevent major outages in its video streaming service. Netflix calls the software it built to automate the process of causing failure a "Chaos Monkey," and today announced the release of Chaos Monkey's source code onto GitHub under the Apache License.

"We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient," Netflix engineer Cory Bennett and executive Ariel Tseitlin wrote in the Netflix tech blog today.

Like many businesses, Netflix hosts its infrastructure on the Amazon Web Services cloud. This allows companies to build out huge clusters of servers and storage without operating their own data centers, but it doesn't insulate them from failure. Businesses that run infrastructure on Amazon have to think about what happens both when Amazon services suffer outages and when their own software causes downtime.

Netflix's Chaos Monkey is "a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact," Netflix explained. "The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables—all the while we continue serving our customers without interruption."

Specifically, the Chaos Monkey randomly terminates virtual machine instances that Netflix runs in Amazon Auto Scaling groups. In the past year, Netflix says its Chaos Monkey "has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."

The Auto Scaling technology on Amazon's cloud should detect the termination of an instance and automatically configure a new, identical one to replace it. But the Chaos Monkey's random attacks can still suss out problems, like a patch gone wrong or a traffic load balancer that's failing to route requests around offline instances. While Netflix uses the Chaos Monkey on Amazon, it's flexible enough that it can be installed on other public cloud networks. By default, it only runs during business hours, so people are around to clean up the Chaos Monkey's mess when it identifies a serious problem.
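To make the selection logic concrete, here is a minimal Python sketch of the behavior described above: pick at most one random instance per group, and only during business hours. It is an illustration, not Netflix's implementation. The function names, the 9-to-3 weekday window, the group names, and the instance IDs are all assumptions made up for the example, and nothing here calls Amazon's API.

```python
import random
from datetime import datetime


def is_business_hours(now, start_hour=9, end_hour=15):
    """Return True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The 9-to-15 window is an assumed default for this sketch.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups, now, rng, probability=1.0):
    """Choose at most one instance per group to terminate.

    groups maps a (hypothetical) group name to a list of instance IDs.
    Returns an empty dict outside business hours, so that people are
    around to respond when a termination exposes a real problem.
    """
    if not is_business_hours(now):
        return {}
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


# Hypothetical groups and instance IDs, for illustration only.
example_groups = {
    "api": ["i-a1", "i-a2", "i-a3"],
    "web": ["i-b1", "i-b2"],
    "empty": [],
}
print(pick_victims(example_groups, datetime(2012, 7, 30, 10, 0), random.Random()))
```

A real tool would hand each chosen ID to the cloud provider's terminate call and then rely on Auto Scaling to launch a replacement. Passing in the clock and the random number generator keeps the selection logic testable without touching any infrastructure.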

Amazon's cloud infrastructure is divided into data center regions (like the East Coast or West Coast), which in turn are divided into availability zones. Customers are more likely to survive Amazon failures if they build systems that can fail over across availability zones or regions. Building across regions is the most expensive option, but also the most resilient, as failures have occurred across multiple availability zones on numerous occasions.

Last year, customers like reddit, Foursquare, and Quora experienced first-hand what can happen when multiple availability zones are hit with the same problem. Just last month, a power outage followed by the failure of Amazon's primary, backup, and secondary backup power systems took down many virtual machines and storage volumes in Amazon's East Coast region. And yes, even Netflix was taken offline by another outage at the end of June.

For that reason, Netflix's error detection efforts have to go beyond the scale of individual virtual machines. Netflix detailed its Chaos Monkey one year ago in a blog post that also revealed plans for various other chaos-inducing "monkeys." There's a Latency Monkey that introduces artificial delays into Netflix's RESTful client-server communication layer to simulate service degradation, and a Conformity Monkey that shuts down instances that don't adhere to best practices. There's even a Chaos Gorilla that acts like a Chaos Monkey but simulates an outage of an entire Amazon availability zone.
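Netflix has not released code for these other monkeys, so the following Python sketch only illustrates the two ideas in general terms. The decorator adds a random artificial delay before a call, in the spirit of the Latency Monkey, and the helper lists every instance in one zone, the set a Chaos Gorilla-style test would take down together. All names, delay values, zone labels, and instance IDs are hypothetical.

```python
import random
import time
from functools import wraps


def inject_latency(min_delay, max_delay, rng=random, sleep=time.sleep):
    """Decorator that waits a random time (in seconds) before each call.

    rng and sleep are injectable so the delay can be controlled in tests.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            sleep(rng.uniform(min_delay, max_delay))  # simulated degradation
            return func(*args, **kwargs)
        return wrapper
    return decorator


def zone_outage_targets(instances, zone):
    """Return the IDs of all instances placed in the given zone.

    instances maps an instance ID to its availability zone name.
    """
    return [instance_id for instance_id, z in instances.items() if z == zone]


# Hypothetical service call, slowed by 10 to 50 milliseconds.
@inject_latency(0.01, 0.05)
def fetch_title(title_id):
    return {"id": title_id, "name": "example"}


print(fetch_title(42))
print(zone_outage_targets(
    {"i-1": "us-east-1a", "i-2": "us-east-1b", "i-3": "us-east-1a"},
    "us-east-1a",
))
```

The point of both exercises is the same as with the Chaos Monkey: callers should time out, retry, or fall back gracefully rather than assume every dependency answers quickly or that every zone is up.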

While the Chaos Monkey is available to anyone who wants it today, there's no word yet on when or whether any of Netflix's other monkeys will be released into the wild. A posting on GitHub describes the Chaos Monkey as the "first member" of Netflix's Simian Army.