by Brewster Kahle, June 2011 Press on this: NYtimes



Books are being thrown away, or sometimes packed away, as digitized versions become more available. This is an important time to plan carefully for there is much at stake.

Digital technologies are changing both how library materials are accessed and increasingly how library materials are preserved. After the Internet Archive digitizes a book from a library in order to provide free public access to people world-wide, the book goes back on the shelves of the library. We noticed an increasing number of these libraries moving books to “off site repositories” (1 2 3 4) to make space in central buildings for more meeting spaces and work spaces. These repositories have filled quickly and sometimes prompt the de-accessioning of books. A library that would prefer to not be named was found to be thinning its collections and throwing out books based on what had been digitized by Google. While we understand the need to manage physical holdings, we believe this should be done thoughtfully and well.

Two of the corporations involved in major book scanning have sawed off the bindings of modern books to speed the digitizing process. Many have a negative visceral reaction to the “butchering” of books, but is this a reasonable reaction?

A reason to preserve the physical book that has been digitized is that it is the authentic and original version that can be used as a reference in the future. If there is ever a controversy about the digital version, the original can be examined. A seed bank such as the Svalbard Global Seed Vault is seen as an authoritative and safe version of crops we are growing. Saving physical copies of digitized books might at least be seen in a similar light as an authoritative and safe copy that may be called upon in the future.

As the Internet Archive has digitized collections and placed them on our computer disks, we have found that the digital versions have more and more in common with physical versions. The computer hard disks, while holding digital data, are still physical objects. As such we archive them as they retire after their 3-5 year lifetime. Similarly, we also archive microfilm, which was a previous generation’s access format. So hard drives are just another physical format that stores information. This connection showed us that physical archiving is still an important function in a digital era.

There is also a connection between digitized collections and physical collections. The libraries we scan in rarely want more digital books than the digital versions that we scan from their collections. This struck us as strange until we better understood the craftsmanship required in putting together great collections of books, whether physical or digital. As we are archiving the books, we are carefully recording with the physical book what the identifier for the virtual version is, and attaching information to the digital version about where the physical version resides.

Therefore we have determined that we will keep a copy of the books we digitize if they are not returned to another library. Since we are interested in scanning one copy of every book ever published, we are starting to collect as many books as we can.

We hope that there will be many archives of physical books and other materials, as they will be used and preserved in different ways based on the organizations they reside in. Universities will have different access policies from national libraries, say, and most likely different access policies from the Internet Archive. With many copies in diverse organizations and locations we are more likely to serve different communities over time.

Physical Archive of the Internet Archive

Internet Archive is building a physical archive for the long term preservation of one copy of every book, record, and movie we are able to attract or acquire. Because we expect day-to-day access to these materials to occur through digital means, our physical archive is designed for long-term preservation of materials with only occasional, collection-scale retrieval. Because of this, we can create optimized environments for physical preservation and organizational structures that facilitate appropriate access. A seed bank might be conceptually closest to what we have in mind: storing important objects in safe ways to be used for redundancy, authority, and in case of catastrophe.

The goal is to preserve one copy of every published work. The universe of unique titles has been estimated at close to one hundred million items. Many of these are rare or unique, so we do not expect most of these to come to the Internet Archive; they will instead remain in their current libraries. But preserving over ten million items is possible, so we have designed a system that will expand to this level. Ten million books is approximately the size of a world-class university library or public library, so we see this as a worthwhile goal. If we are successful, then this set of cultural materials will last for centuries and could be beneficial in ways that we cannot predict.

To achieve a goal of long-term preservation we have assumed:

Infrequent access,

Manage millions of books, records, and movies,

Adapt to needs of different physical media and collection value,

Facilitate storage evolution by monitoring existing systems and introducing new ideas,

Adapt to multiple facilities in different environments, and

Sustainable from a financial and maintenance perspective.

To start this project, the Internet Archive solicited donations of several hundred thousand books in dozens of languages in subjects such as history, literature, science, and engineering. Working with donors of books has been rewarding because an alternative for many of these books was the used book market or being destroyed. We have found everyone involved has a visceral repulsion to destroying books. The Internet Archive staff helped some donors with packing and transportation, which sped projects and decreased wear and tear on the materials.

These books are digitized in Internet Archive scanning centers as funding allows.

To link the digital version of a book to the physical version, care is taken to catalog each book and note its physical location so that future access can be enabled. Most books are cataloged by finding a record in existing library catalogs for the same edition. If no such catalog record can be found, then the book is cataloged briefly in the Open Library. Links are made from the paper version to the digital version by printing identifying and catalog data on a slip of acid-free paper that is inserted in the book. Linking from the digital version to the paper version is done by encoding the location into the database records and identifiers into the resulting digital book versions. The digital versions have been replicated and the catalog data has been shared.

Most of these first books have been digitized with funding from stimulus money for jobs programs and funding from the Kahle/Austin Foundation. This served to build the core collection of modern books for the blind and dyslexic. Many of these digital books are also available to be digitally borrowed through the Open Library website.

This was a change from our previous mass digitization procedures, in which a library would deliver books to our scanning centers and retrieve them afterward. Where the libraries would have already done the sorting and de-duplication of books, we now need to do these functions ourselves. The process to identify titles that have not been preserved already is now in place, but is in active development to improve efficiency. The thorough work of libraries in cataloging materials is key in this process because we can reuse their catalog records for these books. Identifiers such as ISBN, LCCN, and OCLC ids have helped determine which books are duplicates.
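As a rough illustration of identifier-based de-duplication, here is a minimal Python sketch. It is not the Internet Archive's actual process: the record layout (dictionaries with optional "isbn", "lccn", and "oclc" keys), the function names, and the normalization rule are all assumptions made for the example. It treats two books as duplicates when any normalized identifier matches one seen earlier, and it does not reconcile ISBN-10 with ISBN-13.

```python
import re

ID_FIELDS = ("isbn", "lccn", "oclc")  # hypothetical field names for this sketch


def normalize(value):
    """Strip hyphens and whitespace and uppercase, so '0-306-40615-2' == '0306406152'."""
    return re.sub(r"[\s-]", "", str(value)).upper()


def find_duplicates(books):
    """Return (index, index_of_first_copy) pairs for books sharing any identifier.

    Simplification: identifiers are compared only after light normalization;
    ISBN-10 and ISBN-13 forms of the same book are not matched to each other.
    """
    seen = {}        # (field, normalized value) -> index of the first copy
    duplicates = []
    for index, book in enumerate(books):
        keys = [(f, normalize(book[f])) for f in ID_FIELDS if book.get(f)]
        first = next((seen[k] for k in keys if k in seen), None)
        if first is not None:
            duplicates.append((index, first))
        owner = index if first is None else first
        for key in keys:
            # Remember every identifier, so later copies can match on any of them.
            seen.setdefault(key, owner)
    return duplicates


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"title": "A", "isbn": "0-306-40615-2"},
        {"title": "A (second copy)", "isbn": "0306406152", "oclc": "123"},
        {"title": "A (third copy)", "oclc": "123"},
        {"title": "C", "lccn": "2001-000001"},
    ]
    for dup, first in find_duplicates(sample):
        print(sample[dup]["title"], "duplicates", sample[first]["title"])
```

Note that the third copy is caught only because the second copy carried both an ISBN and an OCLC id; this is one reason thorough library catalog records matter for the matching.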

In January of 2009, we started developing the physical preservation systems. Fortunately there is a wealth of literature on book preservation documenting studies on the fibers of paper as well as results from multi-year storage experiments. Based on this technical literature and specifications from depositories around the world, Tom McCarty, the engineer who designed the Internet Archive’s Scribe book-scanning system, began to design, build, and test a modular storage system in Oakland, California. This system uses the infrastructure developed around the most used storage design of the 20th century, the shipping container. Rows of stacked shipping containers are used like 40′ deep shelving units. In this configuration, a single shipping container can hold around 40,000 books, about the same as a standard branch library, and a small building can hold millions of books.

Based on this success and the increasing availability of physical materials, a production facility leveraging this design will be launched in June of 2011 in Richmond, California. The essence of the design from the book’s point of view is to have several layers of protection, each able to be monitored and periodically inspected:

Books are cataloged, and have acid free paper inserts with information about the book and its location,

Boxes store approximately 40 books with labeling on the outside,

Pallets hold 24 boxes each,

Modified 40′ shipping containers are used as secure and individually controllable environments of 50 or 60 degrees Fahrenheit and 30% relative humidity,

Buildings contain shipping containers and environmental systems,

Non-profit organizations own and protect the property and its contents.

This physical archive is designed to help resist insects and rodents, control temperature and humidity, slow acidification of the paper, protect against fire, water and intrusion, contain possible contamination, and endure possible uneven maintenance over time. For these reasons the books are stored in isolated environments with a regulated airflow that depends on few active components.
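The storage figures quoted above can be combined into a back-of-the-envelope capacity estimate. The short Python sketch below uses only the approximate numbers given in this article (about 40 books per box, 24 boxes per pallet, around 40,000 books per container, and the ten million item target). The function names are hypothetical, and the pallets-per-container value is merely implied by those rounded figures, not a stated specification.

```python
import math

# Approximate figures taken from the article.
BOOKS_PER_BOX = 40
BOXES_PER_PALLET = 24
BOOKS_PER_CONTAINER = 40_000


def books_per_pallet():
    """Books on one fully loaded pallet."""
    return BOOKS_PER_BOX * BOXES_PER_PALLET


def implied_pallets_per_container():
    """Pallets implied by the quoted container capacity (an estimate only)."""
    return BOOKS_PER_CONTAINER / books_per_pallet()


def containers_needed(total_books):
    """Whole containers required to hold total_books."""
    return math.ceil(total_books / BOOKS_PER_CONTAINER)


if __name__ == "__main__":
    print("Books per pallet:", books_per_pallet())
    print("Implied pallets per container:", round(implied_pallets_per_container(), 1))
    print("Containers for ten million books:", containers_needed(10_000_000))
```

Under these assumptions a pallet holds 960 books, and the ten million item target works out to 250 containers.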

The Internet Archive is now soliciting further donations of published materials from libraries, collectors, and individuals.

This collection and methodology have already helped in mass digitization and preservation, and we hope that they will offer a wealth of knowledge to future generations.

Thank you to Tom McCarty, Robert Miller, Sean Fagan, Internet Archive staff, San Francisco Public Library leadership, Alibris, HHS of the City of San Francisco, and the Kahle/Austin Foundation for being leaders on this project.