Building Your Own Chaos Monkey

By David Mytton,

CEO & Founder of Server Density.

Published on the 3rd September, 2015.

In 2012 Netflix introduced one of the coolest-sounding names into the cloud vernacular: Chaos Monkey.

What Chaos Monkey does is simple. It runs on Amazon Web Services and its sole purpose is to randomly terminate production instances.

The rationale behind those deliberate failures is a solid one.

Setting Chaos Monkey loose on your infrastructure—and dealing with the aftermath—helps strengthen your app. As you recover, learn and improve on a regular basis, you’re better equipped to face real failures with little, if any, customer impact.

Our monkey

Since we don’t use AWS or Java, we decided to build our own lightweight simian in the form of a simple Python script. The end result is the same. We set it loose on our systems and watch as it randomly seeks and destroys production instances.
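The core idea fits in a few lines of Python. Here is a minimal sketch, with a hypothetical hard-coded host list and a stubbed-out kill function; a real script would read its inventory from a server database and call the hosting provider’s API instead.

```python
import random

# Hypothetical inventory for illustration -- in practice this would
# come from a server inventory database, not a hard-coded list.
PRODUCTION_HOSTS = ["web1", "web2", "lb1", "mongo1", "mongo2"]


def pick_victim(hosts):
    """Randomly select one production instance to kill."""
    return random.choice(hosts)


def kill(host):
    """Stub for the provider API call that actually destroys the host."""
    # A real implementation would power the instance off here.
    return "powering off %s" % host


if __name__ == "__main__":
    victim = pick_victim(PRODUCTION_HOSTS)
    print(kill(victim))
```

Everything interesting about a chaos monkey lives in the constraints around this loop—when it may fire, what it announces, and which failure it injects—which the design considerations below cover.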

What follows are our observations from those self-inflicted incidents, along with some notes on what to consider when using a Chaos Monkey on your infrastructure.

Design Considerations

1. Trigger chaos events during business hours

It’s never nice to wake up your engineers with unnecessary on-call events in the middle of the night. Real failures can and do happen 24/7. When it comes to Chaos Monkey, however, it’s best to trigger failures when people are around to respond and fix them.
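A simple guard is enough to enforce this. The sketch below (window hours are illustrative, not from the original script) only lets a chaos event fire on weekdays within a fixed local-time window:

```python
from datetime import datetime


def within_business_hours(now=None, start=10, end=16):
    """Allow chaos events only Monday-Friday between start and end hours.

    The 10:00-16:00 window is an example; pick hours when your whole
    team is actually around to respond.
    """
    now = now or datetime.now()
    return now.weekday() < 5 and start <= now.hour < end
```

The chaos script simply exits early whenever this check returns False, so scheduled runs outside the window are harmless no-ops.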

2. Decide how much mystery you want

When our Python script triggers a chaos event, we get a message in our HipChat room and everyone is on the lookout for strange things.

The message doesn’t specify what the failure is. We still need to triage the alerts and determine where the failures lie, just as we would in the event of a real outage. All this “soft” warning does is lessen the chance of failures going unnoticed.
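The important property is that the announcement leaks nothing about the failure itself. A sketch (the HipChat posting is omitted; this just builds the message):

```python
def soft_warning(host, failure_mode):
    """Build the chat announcement for a chaos event.

    `host` and `failure_mode` are what the script secretly chose; they
    are deliberately left out of the message so the team still has to
    triage the alerts to find the failure, just like a real outage.
    """
    return "Chaos event triggered -- keep an eye out for strange things."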

3. Have several failure modes

Killing instances is a good way to simulate failures, but it doesn’t cover all possible contingencies. At Server Density we use the SoftLayer API to trigger full and partial failures alike.

A server power-down, for example, causes a full failure. Disabling network interfaces, on the other hand, causes partial failures where the host may continue to run (and perhaps even send reports to our monitoring service).
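Structurally, the two modes are just interchangeable functions the script picks between. In this sketch the provider calls are stubbed out; the real versions would be SoftLayer API calls against the affected guest (see the SoftLayer API documentation for the exact methods):

```python
import random


def full_failure(host):
    # Full failure: power the guest down via the provider API.
    # (Stubbed here; the real call goes to the SoftLayer API.)
    return "%s: powered off" % host


def partial_failure(host):
    # Partial failure: take the network interfaces down while the host
    # keeps running -- it may even keep reporting to monitoring.
    return "%s: network disabled" % host


FAILURE_MODES = [full_failure, partial_failure]


def inject_failure(host):
    """Pick a full or partial failure at random and apply it."""
    mode = random.choice(FAILURE_MODES)
    return mode(host)
```

Adding a new failure mode (filling a disk, killing a single process, adding network latency) is then just a matter of appending another function to the list.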

4. Don’t trigger sequential events

If there’s ever a bad time to set your Chaos Monkey loose, it’s during the aftermath of a previous chaos event—especially if the bugs you discovered have yet to be fixed.

We recommend you wait a few hours before introducing the next failure. Unless you want your team firefighting all day long.
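A cooldown check makes this automatic. This is a minimal sketch; the four-hour figure is an illustrative choice, not a prescription:

```python
import time

COOLDOWN_SECONDS = 4 * 3600  # wait at least a few hours between events


def may_trigger(last_event_ts, now=None):
    """Refuse to start a new chaos event until the previous one is old enough.

    `last_event_ts` is the Unix timestamp of the last event, persisted
    somewhere between runs (a file, a database row, etc.).
    """
    now = time.time() if now is None else now
    return (now - last_event_ts) >= COOLDOWN_SECONDS
```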

5. Play around with event probability

Real world incidents have a tendency to transpire when you least expect them. So should your chaos events. Make them infrequent. Make them random. Space them out, by days even. That’s the best way to test your on-call readiness.
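An easy way to get that spacing is to run the script frequently (say, from an hourly cron job) but only fire with a small probability on each run. The probability below is illustrative:

```python
import random


def should_fire(probability=0.05, rng=random):
    """Decide whether this scheduled run triggers a chaos event.

    With hourly runs and probability 0.05, an event fires on average
    once every 20 eligible runs -- combine that with a business-hours
    window and events end up days apart, at unpredictable times.
    """
    return rng.random() < probability
```

Tuning the probability up or down is how you trade off on-call practice against disruption.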

Initial findings

We’ve been triggering chaos events for some time now. None of the issues we’ve discovered so far were caused by the server software. In fact, scenarios like failovers in load balancers (Nginx) and databases (MongoDB) worked very well.

Every single bug we found was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written.

In our most recent Chaos run we experienced some inexplicable performance delays during two consecutive MongoDB server failovers. Rebooting the servers was not a viable long-term fix, as it resulted in a long downtime (>5 minutes).

It took us several days of investigation to realise we were not invoking the MongoDB drivers properly.

The app delays caused by the Chaos run happened during work hours. We were able to look at the issue immediately, rather than waiting for an on-call engineer to be notified and respond, in which case the investigation would have been harder.

Such discoveries help us report bugs and improve the resiliency of our software. Of course, they also mean additional engineering hours and effort to get things right.

Summary

The Chaos Monkey is an excellent tool for testing how your infrastructure behaves under unknown failure conditions. By triggering and dealing with random system failures, you harden your product and make your service more resilient. This has obvious benefits for your uptime metrics and overall quality of service.

And if the whole exercise has such a cool name attached to it, then all the better.

Editor’s note: This post was originally published on 21st November, 2013 and has been completely revamped for accuracy and comprehensiveness.