Smashing all known records by a factor of 10, IBM Research Almaden, California, has developed hardware and software technologies that will allow it to strap together 200,000 hard drives to create a single storage cluster of 120 petabytes — or 120 million gigabytes. The drive collective, when it is complete, is expected to store one trillion files — or to put it in Apple terms, two billion hours of MP3 music.

The data repository, which currently has no name, is being developed for an unnamed customer. With a capacity of 120PB, it will most likely serve as the storage device for a government-owned or federally funded supercomputer or another high-performance computing (HPC) application; 120PB is the kind of capacity you need to store global weather models or extremely detailed weapon system simulations, both of which are rarely carried out by commercial interests. Alternatively, it could be used to store a large portion of the internet (or data about its users) for Google, Facebook, or another client with very deep pockets. The largest systems currently in existence are generally around 15 petabytes, though, as of 2010, Facebook had a 21PB Hadoop cluster, and by now it's probably significantly larger.

IBM hasn't given exact details about the software and hardware, but we do know that the system features a new, updated version of IBM's General Parallel File System (GPFS). GPFS is a volume-spanning file system that stores individual files across multiple disks: instead of reading a multi-terabyte high-resolution model at 100MB/sec from a single drive, the same file can be read in a massively parallel fashion from multiple disks. The end result is read/write speeds in the region of several terabytes per second, and, as a corollary, the ability to create more than 30,000 files per second. GPFS also supports redundancy and fault tolerance: when a drive dies, its contents are automatically rebuilt on a replacement drive by the governing computer.
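To make the striping idea concrete, here is a toy sketch in Python of how a volume-spanning file system can scatter a file's blocks round-robin across many disks and then read them all back in parallel. This is our own illustration of the general technique, not IBM's actual GPFS implementation; the block size, disk count, and in-memory "disks" are all stand-ins.

```python
# Toy sketch of volume-spanning striping (illustrative only, not GPFS code):
# a file is split into fixed-size blocks scattered round-robin across disks,
# so a large file can be read from all drives concurrently.
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # bytes per stripe block; real systems use far larger blocks
NUM_DISKS = 8   # stand-in for the array's physical drives

def stripe(data: bytes, num_disks: int = NUM_DISKS) -> list[list[bytes]]:
    """Distribute file blocks round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), BLOCK_SIZE):
        block_index = offset // BLOCK_SIZE
        disks[block_index % num_disks].append(data[offset:offset + BLOCK_SIZE])
    return disks

def parallel_read(disks: list[list[bytes]]) -> bytes:
    """Read every disk concurrently, then reassemble the blocks in order."""
    def read_disk(disk):  # in reality this would be I/O against one drive
        return disk
    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        per_disk = list(pool.map(read_disk, disks))
    total_blocks = sum(len(d) for d in per_disk)
    # Block i lives on disk i % N, at position i // N within that disk.
    blocks = [per_disk[i % len(disks)][i // len(disks)]
              for i in range(total_blocks)]
    return b"".join(blocks)

data = b"a multi-terabyte model, in miniature"
assert parallel_read(stripe(data)) == data
```

With real drives, each `read_disk` call would hit separate spindles, which is where the aggregate multi-terabyte-per-second bandwidth comes from.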

On the hard drive side of things, if we divide 120PB by 200,000 drives we get 600GB apiece, and once you factor in redundancy, it's fairly safe to assume that the drives are all 1TB in size. We also know that every single one of the 200,000 drives will be water-cooled, via what is presumably the largest and most complicated bit of plumbing ever attempted. Considering IBM's penchant for water-cooling its top-end servers, though, that's hardly surprising (we still hope to post a photo of the system once it's complete).
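The back-of-the-envelope math above works out like this (our own arithmetic, using the decimal units drive makers quote, not figures from IBM's spec sheet):

```python
# Usable capacity per drive, and the overhead implied if drives are 1TB.
GB_PER_PB = 10**6                 # decimal units, as drive makers use

usable_gb = 120 * GB_PER_PB       # 120PB expressed in GB
drives = 200_000
per_drive_gb = usable_gb / drives
print(per_drive_gb)               # 600.0 GB of usable space per drive

# If each drive is 1TB (1000 GB), the remaining capacity is free for
# redundancy, parity, and spares.
overhead = 1 - per_drive_gb / 1000
print(f"{overhead:.0%}")          # 40%
```

A 40% redundancy budget is generous, which fits the automatic rebuild-on-failure behavior described above.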

As it stands, supercomputers and large-scale science experiments like the LHC can produce (and compute) far more data than can feasibly be stored. IBM's system, it is hoped, will be a data repository that goes some way toward bridging the gulf between silicon, Moore's-law-governed compute and electro-mechanical storage. On the other hand, perhaps it's time to stop playing around with hard drives and start building mass storage arrays out of flash memory…

Read more at Technology Review, or read more about IBM Research Almaden or GPFS.