The following is a guest post by Kalev H. Leetaru, University of Illinois, who presented these ideas at the 2012 General Assembly of the IIPC. This post is the first in a three-part series.

Imagine a world in which libraries and archives had never existed. No institutions had ever systematically collected or preserved our collective cultural past: every book, letter, or document was created, read and then immediately thrown away. What would we know about our past? Yet, that is precisely what is happening with the web: more and more of our daily lives occur within the digital world, yet more than two decades after the birth of the modern web, the “libraries” and “archives” of this world are still just being formed.

We’ve reached an incredible point in society. Every single day a quarter-billion photographs are uploaded to Facebook, 300 billion emails are sent and 340 million tweets are posted to Twitter. There are more than 644 million websites, with 150,000 new ones added each day, and upwards of 156 million blogs. Even more incredibly, the growth rate of content creation in the digital world is exploding. The entire New York Times over the last 60 years contained around 3 billion words. More than 8 billion words are posted to Twitter every single day. That’s right: every 24 hours, more than 2.5 times as many words are posted to Twitter as appeared in every article of every issue of the paper of record of the United States over the last half-century.

By some estimates there have been 50 trillion words in all of the books published over the last half-millennium. At its current growth rate, Twitter will reach that milestone less than three years from now. Nearly a third of the planet’s population is now connected to the internet and there are as many cell phones as there are people on earth. Yet, for the most part we consume all of this information as it arrives and discard it just as quickly, giving little thought to posterity. That’s where web archives come in: to make sure that a few years, decades, centuries, and millennia from now we will still have at least a partial written record of human society at the dawn of the twenty-first century.

The Web Archive in Today’s World

The loss of the Library of Alexandria, once the greatest library on earth, created an enormous hole in our understanding of the ancient world. Imagine if that library had not only persisted to the present day, but had continued to collect materials through the millennia. Yet, in the web era, we are repeating this cycle of loss, not through a fire or other sudden catastrophe like the one that destroyed the Library of Alexandria, but rather through inaction: we are simply not collecting it.

The dawn of the digital world exists in the archives of just a few organizations. Many mailing lists and early services like Gopher have largely been lost, while organizations such as Google have invested considerable resources in resurrecting others like USENET. The earliest years of the “web” are gone forever, but beginning in 1996 the Internet Archive began capturing snapshots, giving us one of the few records of the early iterations of this world. Organizations like the International Internet Preservation Consortium are helping to bring web archivists from across the world and across disciplines together to share experiences and best practices and forge collaborations to help advance these critical efforts.

Unintended Uses

Archives exist to preserve a sample of the world for future generations. They accept that they cannot archive everything and don’t try to: they operate as “opportunistic” collectors. Traditional humanities and social sciences scholarship was designed around these limitations: the tradition of deep reading of a small number of works in the humanities was born of this model. Yet, a new generation of researchers is increasingly using archives in ways they were never intended for, and needs a greater array of information on how those archives are created in order to anticipate biases and their impact on findings.

The Library of Congress’ Chronicling America site, while technically a web-delivered digital library rather than a web archive, offers an example of why greater insight into the archiving process is critical for research. Using the site recently for a project, my search returned ten times as many hits for my topic in El Paso, Texas newspapers as it did for New York City. Further inspection showed this was because Chronicling America simply holds more content from El Paso newspapers for that period than from New York City papers, not because El Paso papers covered my topic in more detail. Part of this issue stems from the acquisition model of Chronicling America: each state determines the order in which it digitizes the newspapers printed within its borders. One state might begin with smaller papers while another begins with larger ones; one state might digitize a particular year from every paper, while another digitizes the entirety of each paper in turn. Chronicling America also excludes papers that have already been digitized by commercial vendors: thus New York City’s largest paper, the New York Times, is absent from the archive. This landscape introduces significant artifacts into searches, but normalization procedures can help address them. To do so, however, a bibliography is needed that lists every page of every paper included in the archive. Such a bibliography would have allowed me to convert my search results from a raw count of matching newspaper pages into a percentage of all pages from each city, accounting for there being more content from El Paso than from New York City.

This is even clearer when searching the historic New York Times. A search of the Times for almost any keyword over the period 1945–present will show its use declining by 50% over that period. This is not a reflection of the term falling out of use, but of the fact that the Times itself shrank by more than half over this period. Similarly, searches covering the year 1978 will show an 88-day period in which the term was never used: not because the term dropped out of favor during that period, but because a machinists’ strike halted the paper’s publication entirely. Having an index of the total number of articles published each day (and thus the possible universe of articles the term could have appeared in) allows the raw counts to be normalized to yield the true picture of the term’s usage. However, no web archive today offers such a master index of its holdings.
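The normalization described above is straightforward once per-collection totals exist. A minimal sketch, with entirely hypothetical hit counts and page totals standing in for figures a master bibliography would supply:

```python
def normalize_hits(raw_hits, total_pages):
    """Convert raw match counts into the share of all archived pages per city."""
    return {city: raw_hits[city] / total_pages[city] for city in raw_hits}

raw_hits = {"El Paso": 500, "New York City": 50}             # pages matching the query
total_pages = {"El Paso": 100_000, "New York City": 20_000}  # all archived pages per city

rates = normalize_hits(raw_hits, total_pages)
print(rates)  # {'El Paso': 0.005, 'New York City': 0.0025}
```

Even with ten times as many raw hits, El Paso’s normalized rate here is only twice New York City’s: the kind of correction that is impossible without an index of total holdings.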

One of the core optimizations used by web crawlers can have a significant impact on certain classes of research. Nearly every web archive uses crawlers that estimate the rate of change of a site (i.e., how often, on average, its pages change) in order to crawl frequently-changing sites more often than those that rarely change. This allows bandwidth and disk storage to be prioritized toward sites that change often, rather than storing a large number of identical snapshots of a site that never changes. Sometimes, however, it is precisely that rare change that is most interesting. For example, when studying how White House press releases had changed, I was examining pages that should never show any change whatsoever, and when there was a change, I needed to know the specific day on which it occurred in order to reconcile it with the political winds of the time. However, the slow rate of change on that portion of the site meant that snapshots were often months or sometimes years apart, making it impossible to narrow some changes down below the level of “several years.”
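The scheduling behavior described above can be sketched with a toy adaptive-recrawl rule. The doubling/halving policy and the interval bounds below are illustrative assumptions, not any particular crawler’s algorithm, but they show why a page that never changes ends up visited so rarely that a lone change can only be dated to within a year:

```python
MIN_INTERVAL = 1    # days between visits, lower bound
MAX_INTERVAL = 365  # days between visits, upper bound

def next_interval(current_interval, changed):
    """Shrink the recrawl interval when a change is seen, grow it otherwise."""
    if changed:
        return max(MIN_INTERVAL, current_interval // 2)
    return min(MAX_INTERVAL, current_interval * 2)

# A static page: after ten unchanged visits, the crawler only
# returns once a year, so a rare edit lands in a ~365-day window.
interval = 1
for _ in range(10):
    interval = next_interval(interval, changed=False)
print(interval)  # 365
```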

In other analyses, the dynamic alteration of the recrawl rate itself is a problem. For example, when studying the inner workings of the Drudge Report over the last half-decade, a key research question revolved around the rate at which various elements of that site changed. If the rate of snapshotting were being varied by a software algorithm based on the very phenomenon I was measuring, that would strongly bias my findings. In that particular case I was lucky enough to find a specialty archive that existed solely to archive the Drudge Report, and which had collected snapshots every two minutes, nonstop, for more than six years.

This is not an easy problem, as archives must balance their very limited resources between crawling for new pages and recrawling existing pages looking for changes. Within recrawling, they must balance the need to pinpoint changes to the narrowest timeframe possible against ensuring they capture as many changes as possible from high-velocity sites.

Finally, the very notion of what constitutes “change” varies dramatically among research projects. Has a page “changed” if it still looks the same, but an HTML tag was altered? What if the title changes, or the background color? Does a change in the navigation bar at the top count the same as a change to the body text? There are as many answers to these questions as there are research projects, and no single solution satisfies them all. In my study of changes to White House press releases, only a change to a page’s title or body text counted as “change,” while the Internet Archive counted all of the myriad edits and additions to the White House navigation bar as “changes.” This required downloading every single snapshot of each page and applying our own filters to extract and compare the body text ourselves. One possible solution might be the incorporation of hybrid hierarchical structural and semantic document models that allow a user to indicate which areas of a document he or she cares about and to return only those snapshots in which that section has changed.
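The kind of filtering we applied can be sketched simply: fingerprint only the body text of each snapshot, so edits to markup or navigation chrome don’t register as change. The extraction below is deliberately naive (regexes over hypothetical HTML); a real pipeline would use a proper HTML parser and site-specific templates:

```python
import hashlib
import re

def body_fingerprint(html):
    """Hash only the visible body text, ignoring markup and navigation."""
    match = re.search(r"<body.*?>(.*)</body>", html, re.S | re.I)
    text = match.group(1) if match else html
    text = re.sub(r"<nav.*?</nav>", " ", text, flags=re.S | re.I)  # drop navigation chrome
    text = re.sub(r"<[^>]+>", " ", text)                           # strip remaining tags
    text = re.sub(r"\s+", " ", text).strip()                       # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

old = "<html><body><nav>Home | News</nav><p>Press release text.</p></body></html>"
new = "<html><body><nav>Home | News | Blog</nav><p>Press release text.</p></body></html>"

print(body_fingerprint(old) == body_fingerprint(new))  # True: only the nav bar changed
```

Under this definition the two snapshots above are “the same,” whereas a byte-level comparison of the raw HTML would flag a change.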

What to Keep?

As noted in the introduction to this blog post, the digital world is experiencing explosive growth, producing more content in a few hours than was produced over the better part of a century in the print era. This growth is giving us an incredible view of global society and enabling communication, collaboration, and social research at scales unimaginable even a decade ago, yet the richer this archive becomes, the harder it is to archive. The very volume of material that makes the web so exciting as a communications platform means there is simply too much of it to keep. Even in the era of books there were too many for any one library to keep them all, but at least we could assume that some library somewhere was probably collecting the books we weren’t: an assumption that isn’t necessarily true in the digital world yet.

An age-old mechanism for dealing with overflow is to determine which works are the most “important” and which can be discarded. Yet, how do we decide what constitutes “noise” and what should be kept? Talk to a historian writing a biography of a historic figure and he or she will likely point to routine day-to-day letters and diary entries as a critical source of information on that person’s mood, feelings, and beliefs. Emerging research on using Twitter to forecast the stock market or measure public sentiment is finding that only when one considers the entirety of all 340 million daily tweets do the key patterns emerge. A tweet of “I’m outside hanging the laundry, such a beautiful day” might at first seem a prime candidate for discarding, but by its very nature it reflects an author feeling calm, secure, and relaxed: critical population-level dynamics of great interest to social scientists. Another mechanism is to discard highly similar works, such as multiple editions of the same work. Yet, an emerging area of research on the web is the tracing of “memes”: variations of a quote or story that evolve as they are forwarded across users and communities, much like a realtime version of the “telephone game.” It is critical for such research to be able to access every version of a story, not just the most recent.

The rise of dual electronic and print publishing pipelines has led to the need to collect two copies of a work, instead of a single authoritative print edition. Digital editions of books released as websites may include videos, photographs, multimedia and interactive features that provide a very different experience from the print copy. Even in subject domains where print is still the “official” record, digital has become the de facto record through its ease of access. How many citizens travel to their nearest Federal Depository Library and browse the latest edition of the Public Papers of the President to find press releases and statements by their government? Most instead turn to the White House’s website, yet a study I coauthored in 2008 found that official US government press releases on the White House website were being continually edited, with key information added and removed and dates changed over time to reflect changing political realities. In a world in which information is so easily changed, and even supposedly immutable government material changes with a click of a mouse, how do we as web archivists capture this world and make it available?

This brings up one very critical distinction between the print and digital eras: the concept of change. In the print era, an archive simply needed to collect an item as it was published. If a book was subsequently changed, the publisher would issue a new edition and notify the library of its availability. A book sitting on a shelf was static: if 20 libraries each held a copy of that book, they could be reasonably certain that all 20 copies were identical to each other. In the digital era, we must constantly scour for new pages to archive, but we also have a new role: checking our existing archive for change. Every single page ever saved by the archive must be rechecked on a regular basis to see if it has changed. Websites don’t make this easy. A study of the Chicago Tribune I conducted for the Center for Research Libraries in 2011 found there was no single master list of articles published on the Tribune’s site each day, and that its RSS feeds were sorted by popularity, not date. To ensure one archived every new article posted to the site, an archivist would have to monitor all 105 main topic pages on the Tribune’s site every few hours or risk losing new articles on a news-heavy day. At the level of the web as a whole, one can monitor the DNS domain registry to get a continually-updated list of every domain name in existence. However, even this provides only a list of websites like “cnn.com,” not a list of all of the pages on that site.
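The monitoring burden described above reduces to repeatedly fetching every topic page and diffing the links found against those already archived. A minimal sketch, with hypothetical URLs standing in for real HTTP fetching and link extraction:

```python
def new_articles(topic_page_links, already_archived):
    """Return links not yet archived, preserving discovery order."""
    return [url for url in topic_page_links if url not in already_archived]

# In practice these would be built by crawling each of the site's topic
# pages every few hours; the URLs here are purely illustrative.
archived = {"https://example.com/a1", "https://example.com/a2"}
latest_crawl = ["https://example.com/a2", "https://example.com/a3"]

print(new_articles(latest_crawl, archived))  # ['https://example.com/a3']
```

The expensive part is not the diff but the polling: with 105 topic pages and a crawl every few hours, a single paper requires hundreds of fetches per day just to avoid missing articles.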

In the era of books, a library needn’t purchase a work the day it was released, as most books remained in print and available for months, if not years, afterwards. A library could wait a year or two until it had sufficient budget or space to collect it. Web pages, on the other hand, may have half-lives measured in seconds to minutes. They can change constantly, with no notice, and the velocity of change can be extreme. In addition, more content is arriving in “streaming” format on the web. Archiving Twitter requires being able to collect and save over 4,000 messages per second in realtime, with no ability to “go back” for missed ones. A network outage of 10 minutes means roughly 2.4 million tweets lost forever. In the web world, content producers set the schedule for collection, and archivists must adhere to those schedules.

Myron Gutmann, Assistant Director of the National Science Foundation’s Directorate for Social, Behavioral & Economic Sciences, argued in a talk earlier this year that in the print era the high cost of producing information meant that whatever was published was “worth keeping,” because it had passed through so many layers of review. In contrast, the tremendously low cost of publication in the digital era means anyone can publish anything without any form of review. This raises the question, even in scholarly disciplines, of what is “worth” keeping. If an archive becomes too full, and a massive community of researchers is served by one set of content while just 10 users are served by another, whose voice matters most in deciding what is deleted? How do we make decisions about what to keep? Historically those decisions were made by librarians or archivists alone, but as researchers and data miners become increasingly heavy users of archives, the question arises of how to engage those communities in these critical decisions.

The Rise of the Parallel Web

When we speak of “archiving the web” we often think of the “web” as a single monolithic entity in which all content that is produced or consumed via a web browser is accessible for archiving. The original vision of the web was based on this ideal: an open unified platform in which all material was available to all users. For the most part this vision survived the early years of the web, as users strove to reach the greatest possible audience. Yet, a new trend has begun over the past half-decade, corresponding with the rise of social media: the creation of “parallel” versions of the web.

Every one of those quarter-billion photographs uploaded to Facebook each day is posted and consumed via the web, whether through a browser on a desktop or a mobile app on a smartphone. Yet, despite transiting the same physical telecommunications infrastructure as the rest of the web, those photos are stored in a parallel web, owned and controlled entirely by a commercial entity. They are not part of the “public” web and thus not available to web archives. In many ways this is no different from the libraries and archives of the print era. Libraries focused on collecting books and pamphlets, while a good deal of communication and culture occurred in letters, diaries, drawings, and artwork that have largely been lost. The difference in the digital era is that instead of being scattered across individual households, all of this material is already being centralized into commercially-owned “archives” and “libraries.”

Not everyone desires every conversation of theirs to be preserved for posterity, but in the print era one had a choice: a letter or diary or photograph was a physical object, held by its owner, that could be passed down to later generations. How many of us have come across a shoebox of old photographs or letters from a grandparent? In the digital era, a company holds that material on our behalf, and while most have terms of service affirming that we “own” our material, only one major social media platform today offers an “export” button that allows us to download a copy of the material we have given it over the years: Google Plus’ Google Takeout. Twitter has recognized the importance of the communications that occur via its service and has made a feed of its content available to the Library of Congress for archiving for posterity. Most others, like Facebook and international platforms such as Weibo or VK (formerly VKontakte), have not. Facebook has in effect become a parallel version of the web, hosted on the web but walled off from it, with no means for users to archive their material for the future.

Twitter offers a shining example of how such platforms can interact with the web archiving community and ensure that their material is archived for future generations. Self-archiving services like Google Takeout offer an intermediate step in which users at least retain the ability to make their own archival copy of their contributions to the web for future generations. As more of the web moves behind paywalls, password protection, and other mechanisms, creating more and more parallel versions of the web, there must be greater discussion within the web archiving community about how we reach out to these services to find ways of ensuring users of these communities may archive their material for the future.

Tomorrow, the second in this three-part series will cover using web archives for research.