Hardware failure is a fact of life when it comes to computers, which is at odds with trying to keep a service running 24/7. No one can guarantee absolute perfect uptime, but it’s possible to get pretty darned close. If you design things solidly and are willing to throw money at the problem, you can make fantastically reliable systems that suffer very little downtime due to hardware failures.

Engineers talk about this in terms of “nines” – two nines is 99% uptime, four nines is 99.99% uptime, and so on for any other number of nines. Anchor, for example, regularly achieves better than three nines (99.9%) per month, approaching four nines.

This sort of reliability doesn’t come cheap. For each extra nine that you add, you’re reducing the downtime by a factor of ten. As a very rough rule of thumb, each extra nine costs you about ten times as much.
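To put numbers on that, here is a minimal sketch that converts a count of nines into the downtime it allows. It assumes a 30-day month, and the function name `downtime_seconds` is made up for illustration.

```python
def downtime_seconds(nines: int, days: int = 30) -> float:
    """Downtime allowed, in seconds, over `days` days at the given number of nines."""
    # Fraction of time the service may be down: 2 nines -> 0.01, 3 nines -> 0.001, ...
    unavailable_fraction = 10 ** -nines
    seconds_in_period = days * 24 * 60 * 60
    return seconds_in_period * unavailable_fraction


if __name__ == "__main__":
    # Each extra nine cuts the allowed downtime by a factor of ten.
    for n in range(2, 6):
        print(f"{n} nines: {downtime_seconds(n):,.1f} seconds per 30-day month")
```

With those assumptions, three nines works out to roughly 43 minutes of downtime in a month and four nines to a little over 4 minutes.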

Clearly it doesn’t make sense to pay for five nines of uptime (about 26 seconds of downtime in a 30-day month!) when it’ll cost more than your business earns in an entire year. What we’re really talking about is risk management.

Risk management is complex, but the concept is simple: a risk can be mitigated for some amount of money, and bigger risks cost more to mitigate. Some risks are worth worrying about – it’s a good idea to keep extra copies of important documents because losing them is a big hassle and duplicates are really cheap. Some risks aren’t worth worrying about – if you just need to get to work on time, owning a second car in case the first one breaks down probably isn’t worthwhile.

One day during a support call, an Anchor sysadmin thought to explain the risks thusly:

“Your webserver sometimes gets hit by a meteorite, which results in downtime – that’s risk.

How badly does it affect your business?

It depends how big your meteorite is.”

It’s cute and was well received so we held onto it, but we also felt it could be a little more illustrative when it comes to risk mitigation. So we came up with this:

Small meteorites are easy to comprehend, and they’re a no-brainer to guard against. The big ones are complex and might need serious analysis to fully understand the mitigation (but really, if you’re worried about your website when the whole city is taken offline by a tsunami, you’ve probably already done this).

Was this clear? Did you find it helpful? Tell us what you think. We’d love to hear from you!