On Wednesday, at an event in Santa Clara, Sun Microsystems and the Internet Archive announced a joint effort to move the Archive's growing three-petabyte data store (about 150 Libraries of Congress) into one of Sun's Modular Datacenters, the famous datacenter in a shipping container, which we've covered previously.

The Archive, which also hosts the ever-popular Wayback Machine, currently runs on a custom storage architecture. But, in keeping with the group's mission of open access to information, it opted to move the data store to a Sun MD that's based on Sun Fire X4500 servers and ZFS.

"The Internet Archive offers long term digital preservation to the ephemeral Internet," said Brewster Kahle, founder of the Internet Archive organization. "As more of the world's most valuable information moves online and data grows exponentially, the Internet Archive will serve as a living history to ensure future generations can access and continue to preserve these important documents over time."

Preserving data over time is a beast of a problem, much more challenging than most people realize. The problem of data rot is great for Sun, though, because it means that they'll get to keep selling even more upgrades to the Archive. Let me elaborate.

If you want to preserve info, write it down

For the past year, I've been trying to either write or commission an article on the problem of data rot, which is pretty much what it sounds like: no matter what medium you use to store your digital data, it probably won't be there in readable form a few years later unless you keep copying it to ever newer media.

Fortuitously, David Pogue has a great interview on exactly this topic in today's New York Times. Pogue talked to Dag Spicer, curator of the Computer History Museum in Mountain View, CA. (I was just there for an event two weeks ago, and it's fantastic. I look forward to bringing my daughter when she's old enough.)

DP: Hasn’t anyone tried to create a truly permanent storage medium?

DS: One of the technologies for really long-term preservation was developed at Lawrence Livermore National Laboratory. It was, I think, a titanium disk about the size of a long-playing record, and it was supposed to last 10,000 years. But then they realized that there were some assumptions that weren’t right, and that it would not last 1,000 years, it might only last 20. Otherwise, as far as I know, no one is working on this problem. It’s really in no one’s interest, no manufacturer’s interest; they want to keep selling you more hard drives every two to five years, or more blank CDs, and what have you. And that’s why it’s almost like your retirement, it’s something you have to take responsibility for yourself. No one is going to do it for you.

If you think about Spicer's comments in light of the ongoing collapse of print and the inevitability of most publications' transition to an online-only format, it's clear that future historians could look back at the first part of the 21st century and see a giant lacuna. I know that I personally have little confidence that my children and grandchildren will be able to access and read the million or so words I've written for Ars over the years, and I'll admit that this concerns me.

If the only way to preserve data is to keep copying it to new media, then all that has to happen for my work to disappear is for someone at some point to just not copy it. And it seems certain that this will happen, eventually—there will be an upgrade, and a data migration, and someone just won't copy the Ars back catalog, and then poof.

I hate to end this article on a down note, but I need to shop for acid-free printer paper.
