A field trip to the Internet Archive

SAN FRANCISCO -- Many people think of the Internet Archive only as the home of the Wayback Machine, the site that lets you see what pages looked like years ago.

But the archive also exists in the real world, as a 501(c)(3) nonprofit organization that makes its home in a former church in the Richmond neighborhood here. Archive founder Brewster Kahle took an hour to show me around the place and talk about its work -- an increasing amount of which has little to do with old Web pages.

The archive moved into this building, an old Christian Science church, last November, and as a result its lobby still features a large collection of boxes. (Kahle noted that the building dates to 1923, "the last year of the public domain"; most works created since then remain under copyright.) The main hall still looks like a church, down to the pews, but Kahle aims to eventually rebuild it into a library of sorts. Kahle and other staffers have their offices on the lower level.

Next door, the old Christian Science reading room has been turned into a scanning center, as part of the archive's mission to preserve print as well as pixels. On each side, staffers were working specialized scanners -- operated by pedals, like old sewing machines -- that photograph two pages of a book at a time. In the center, other employees were running computer-driven microfilm scanners. "That looks like the 1900 census," Kahle said as he peered over one staffer's shoulder at a screenful of handwritten documents.

Poring over page after page in a room made hot by that accumulation of computing machinery seemed like it could get a tad repetitive. I asked Kahle if there was a risk of burnout. Yes, he said, pointing to himself as an example of the wrong sort of person for that work: "I would get fired!" But some employees, he said, have been there three years.

One of the archive's newer projects is a site called Open Library, which both catalogues books and provides access to electronic copies of them. Anyone can download public-domain works, while visually impaired users can access text-to-speech versions of works through a program set up by the Library of Congress. The archive is also working to set up a system for direct downloads of e-book loans.

Much of the archive's work with books might seem to duplicate what Google is already doing with its Google Books site. But Kahle (who doesn't own an e-book reader) objected to the way some libraries have begun relying on Google's collections and "de-accessioning" their paper copies -- that is, trashing them, which seems the sort of thing that happens only in science-fiction novels. He'd rather see libraries keep their original source material while also using the Internet to make that content available to more people. "Let's not lose it all," he said.

Funding for the archive, director of administration Jacques Cressaty said, comes from foundation grants and donations (plus a subsidy from the city of San Francisco to underwrite some employees' salaries) and from fees earned by providing indexing and scanning services to other libraries. Last year, he said, about 40 percent of its income was contributed and 60 percent came from services.

I wrapped up our interview by asking Kahle for his preferred file formats for long-term storage, since I get that kind of question fairly often from readers. He said the archive uses FLAC (Free Lossless Audio Codec) for music, has adopted H.264 for video storage after trying five other formats, uses JPEG for photos and employs a related format, JPEG 2000, for text-heavy images. But he also said that for personal storage, PDF or nearly universally supported commercial formats -- even Microsoft Office -- would be fine, too.

Anything else you'd like to know about the archive or Kahle? Post your questions in the comments, and I'll try to get them answered.