I'd hazard a guess that more than half of the money we spend on IT infrastructure goes to ensuring uptime. Whether that takes the form of redundant power supplies, redundant disks, multicore network architectures, high-availability server clusters, or fully replicated backup data centers, we build our infrastructures not "just in case" of failure, but because we know failure will eventually occur.

And we're usually not disappointed. I've lost count of how many failed SAN disks and controllers, servers, and pieces of network equipment I've seen -- probably thousands over the years. Doesn't matter who made the equipment -- it could be the biggest, most respected name in storage or networking -- it all breaks at some point or another.


We're now at a point, however, where failure has become much more complicated. As storage has moved from simple, striped disk volumes to virtualized storage platforms and deduplication, the old standby of throwing more redundant hardware at the reliability problem doesn't always cut it. Every time you add a new feature to optimize or more effectively manage storage, you need more software to drive all the hardware. Often, that software is where the worst problems arise.

The case of the disappearing cartridge

Recently, I worked with a client to design and implement a fairly complex new backup architecture. One key component of that configuration was a fairly large, high-performance VTL (virtual tape library). The VTL was chosen because it integrated easily with the client's existing backup infrastructure and provided excellent deduplication capabilities -- essentially allowing them to keep months of backups in a nearline state, where they could be restored easily and quickly, rather than the few days that their previous, non-deduplicated disk-to-disk-to-tape solution had allowed.

Last week, the VTL decided for some reason that one of the emulated tape cartridges was corrupt. There was no real indication as to why; the hardware seemed to be working fine. Obviously, that didn't inspire confidence, but hey, things break. I've seen a few corrupted physical tape cartridges in my time. You learn to have a backup plan for your backup plan and roll with it.

The manufacturer's first-line support suggested that the client delete that cartridge and then reboot the VTL (fourth-level support and engineering would later say this was the last thing you should do, but that's an entirely different topic). Once the client had done that, the VTL couldn't bring the affected virtual library online at all. That's because the cartridge -- really a massive collection of deduplicated data blocks on a disk array -- wasn't deleted cleanly and took with it all of the blocks it had in common with many of the other virtual tape media. That one poorly conceived troubleshooting step had essentially rendered 20TB of backups useless in one fell swoop.

Oops.
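To see why one bad delete can poison so much, it helps to sketch how a deduplicating store shares blocks. The toy model below is only an illustration under stated assumptions, not a description of this vendor's internals: the `DedupStore` class, its method names, and the sample cartridge contents are all hypothetical. It stores each unique block once, tracks how many cartridges reference it, and compares a clean delete (which respects those reference counts) with an unclean one (which doesn't).

```python
import hashlib


class DedupStore:
    """Toy content-addressed block store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}      # digest -> block bytes, stored once
        self.refcount = {}    # digest -> number of references from cartridges
        self.cartridges = {}  # cartridge name -> ordered list of digests

    def write(self, name, chunks):
        """Store a cartridge, keeping only one copy of each unique chunk."""
        digests = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(digest, chunk)
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        self.cartridges[name] = digests

    def readable(self, name):
        """A cartridge is restorable only if every block it references exists."""
        return all(d in self.blocks for d in self.cartridges[name])

    def delete_clean(self, name):
        """Drop a cartridge; free a block only when nothing else references it."""
        for digest in self.cartridges.pop(name):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.blocks[digest]
                del self.refcount[digest]

    def delete_unclean(self, name):
        """Drop a cartridge and all of its blocks, ignoring shared references."""
        for digest in self.cartridges.pop(name):
            self.blocks.pop(digest, None)
            self.refcount.pop(digest, None)


def demo():
    """Delete one cartridge each way and check whether the other survives."""
    results = {}
    for mode in ("clean", "unclean"):
        store = DedupStore()
        store.write("monday", [b"os-image", b"mail-db-v1"])
        store.write("tuesday", [b"os-image", b"mail-db-v2"])
        getattr(store, "delete_" + mode)("monday")
        results[mode] = store.readable("tuesday")
    return results


print(demo())  # {'clean': True, 'unclean': False}
```

In this model, the two cartridges share the `os-image` block. The clean delete leaves it in place because "tuesday" still references it; the unclean delete removes it, and "tuesday" can no longer be restored even though nobody touched it. Scale that up to months of backups that all share common blocks and the 20TB figure stops being surprising.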