Chaos Monkey is a tool that Netflix created to turn the volume on system testing up to 11. If it is vital for you to be fault tolerant and flexible, you need to prove that this is the case, and the only real way to do this is to break things in production and see what happens.

There is obviously a cost to making something fault tolerant, and even more so to running these tests and continually tweaking the entire system, but this cost is a hedge against the risk of not doing so.

When Adam Solove tweeted about “Chaos Monkey People” it was funny, but it hit a nerve. It may seem crazy to slow down teams in this way. We have so much that we have to do for our users! Features! Bugs! However, it can also be argued that this is short-term thinking.

If you are focused on the long-term stability of your company, or team, then doing something like this may not be so crazy.

It also has interesting side effects. Beyond having random people just leave (although more vacation would actually be good imo!) there is also the notion of switching roles.

By switching within a discipline you can learn new skills (e.g. front-end / server side engineering) and that also holds across disciplines. What can be truly magical is the effect on empathy. Have you ever seen people who think another role is easier? Or another team is slower than they should be? By switching things up you get to see things from the other side and I would very much bet that your mind will change in some way (it may solidify a thought, but at least there is more data behind it!)

If a team has more empathy, they will work better together. Trust and respect have more of a chance of kicking in, and this will have a huge effect.

By creating these opportunities not only do you spread out knowledge, but you may also find hidden talent.

Even if you don’t introduce the mini bus factor of Chaos Monkeying your team, it is well worth thinking about how your team can be set up for long-term success.