The End of Term web archiving effort drew more media attention than it had during any previous U.S. administration turnover since the project began in 2008, during the change from the Bush administration to the Obama administration. The potential loss of public access to billions of bytes of government data made people realize just how fragile the current structure of the internet is.

It also put librarians and digital preservation in the spotlight. Some commented that the intensive archiving was unnecessary, and that those involved were “all hysterical” and “being overly political,” Janz says, while others saw these stereotypically “old, frumpy” librarians as vigilantes, activists, hackers, and superheroes, Rabina says.

Suddenly, as more people became excited about the archiving, the work librarians had been doing forever “became sort of sexy and trendy, and you could call it different things like ‘rescue’ and ‘capture’ and ‘save’ and ‘Wonder Woman to the rescue,’” says Rabina. “I’m very tickled that all these hipsters now think that this is sexy work, but you know this is all a little mundane work sometimes. It’s like writing a lot of citations and footnotes.”

A quiet book-lined room full of librarians performing the task of archiving websites—URL by URL—may not have been the most riveting scene to onlookers, as Rabina describes. However, the duty of ensuring the longevity of digital information places a heavy burden on a small community.

At various data rescuing events across the U.S., volunteer librarians nominated and identified URLs they thought were at risk or were worth saving, including agency sites and social media posts. Then, the websites would be backed up in digital repositories hosted by various partner institutions such as the Internet Archive, the University of North Texas Libraries, and even the Library of Congress.

“We try not to feel too guilty of missing something,” says Abigail Grotke, a web archivist at the Library of Congress who also collaborated on the 2016 End Of Term Harvest. “We do the best we can with the resources we have.”

Abigail Grotke, a web archivist at the Library of Congress. Credit: Shawn Miller/animation by Brandon Echter

At the Library of Congress, archivists often revisit all kinds of websites on the internet over time using crawlers, or “spider” bots that move through the web and copy or scrape pages. The bots systematically visit sites and download content like a search engine, explains Grotke.
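The crawling Grotke describes can be illustrated with a minimal sketch. This is not the Library of Congress's crawler. It is a toy breadth-first crawl that assumes a hypothetical `fetch` function returning a page's HTML, and it runs against a small in-memory site (the `agency.example` URLs are invented) so it needs no network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single host; returns {url: html} for each page copied.

    `fetch` is a hypothetical callable: it takes a URL and returns the page's
    HTML as a string, or None if the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        archive[url] = html  # keep a copy of the page
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Stay on the starting host and never queue a page twice.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# Tiny in-memory "website" (invented for illustration).
SITE = {
    "https://agency.example/": '<a href="/reports">Reports</a> <a href="https://other.example/">Elsewhere</a>',
    "https://agency.example/reports": '<a href="/">Home</a> <a href="/data.csv">Data</a>',
    "https://agency.example/data.csv": "year,value",
}

snapshot = crawl("https://agency.example/", SITE.get)
print(sorted(snapshot))
```

A production crawler would also respect crawl delays, record timestamps, and write captures to an archival format. The sketch only shows the visit, copy, and follow-links loop.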

Crawling the web can sometimes be slow and tedious work. However, archiving some types of digital content brings an air of urgency. Particular time-sensitive events require web archiving, Grotke explains, like the Olympics, government campaigns (whose sites are notorious for being quickly removed or changed shortly after the election), and the end of term.

“We know there's this sort of deadline over the inauguration when things are going to change,” Grotke says. “So there is a frenzy of activity to preserve in that time period.”

In the previous cycle, in 2012, the End of Term Web Archive captured 3,247 websites and 21 terabytes of data. By the end of the 2016 harvest, 11,382 websites were nominated to be saved. The official numbers of what has been collected have yet to be released, but within just the Library of Congress, volunteers, librarians, and partners collected about 155 terabytes of government web content and data, Grotke estimates, while the Internet Archive worked on a related project storing 100 terabytes.

It’s still virtually impossible to get every bit of data, Grotke says. For instance, government websites alone host a lot of data (Janz estimates on the petabyte level at the very least), which puts the archivists at technical odds.

The web crawlers can’t collect certain kinds of data. Since the bots are programmed to crawl the web like a search engine, they encounter issues with forms that must be filled out, interactive pages, or streaming media, for instance.
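As a rough illustration of why these pages are hard, the sketch below flags elements in a page's HTML that a simple crawler would likely miss. The tag-to-category mapping is an assumption made for this example, not a rule used by any archive named here, and real pages are much harder to classify.

```python
from html.parser import HTMLParser

# Assumed mapping from HTML tags to the obstacle categories named in the text.
HARD_TAGS = {
    "form": "form that must be filled out",
    "video": "streaming media",
    "audio": "streaming media",
    "script": "interactive page",
}


class ObstacleFinder(HTMLParser):
    """Records which kinds of hard-to-crawl content appear on a page."""

    def __init__(self):
        super().__init__()
        self.obstacles = set()

    def handle_starttag(self, tag, attrs):
        if tag in HARD_TAGS:
            self.obstacles.add(HARD_TAGS[tag])


def find_obstacles(html):
    """Return a sorted list of obstacle categories found in the HTML string."""
    finder = ObstacleFinder()
    finder.feed(html)
    finder.close()
    return sorted(finder.obstacles)


# Invented example page: a search form plus an embedded video.
page = '<form action="/search"><input name="q"></form><video src="hearing.mp4"></video>'
report = find_obstacles(page)
print(report)
```

A page flagged this way is the kind of item that, in the workflow described here, a person might nominate for manual capture instead of leaving it to the crawler.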

Independent data rescuing projects like Data Refuge, by Penn Libraries and the Penn Program in the Environmental Humanities, assisted the End of Term Web Archive’s efforts. During Data Refuge’s events, “data rescuers” were able to nominate pieces that the End of Term Web Archive wouldn’t be able to get to or its crawler couldn’t pick up, Janz says.

Skilled data rescuers would download or “scrape” the data, where “they have to pull some kind of script to pull the data out of the site,” she explains.

Digital librarians also encounter challenges with privatized or closed content. Some social media and services like Facebook can’t be crawled easily, says Bailey. The Internet Archive does collect social media, but Facebook, in particular, “seemingly intentionally makes it challenging,” he says. Similarly, apps create highly blocked-off environments.

Bailey calls these kinds of sanctioned communities and content “walled gardens.” From a business standpoint, the walls are there for your personal protection and privacy, but they make archiving difficult. Our digital history may end up being a disjointed record of our culture as a result.

“We’re going to have a very strange view of the early 21st century, and in some sense all sorts of details will be recorded, but some of the very important things will not be recorded,” says Kahle, who supports a more open access outlook on the web.

More and more data is born natively in a digital environment. If it isn’t saved by archivists, the data dies on the internet, where it is difficult, if not nearly impossible, to recover.

“It’s the wild, wild west in terms of what’s out there,” says Grotke. “Nothing is very static on the web. So in order to capture changes in websites, we have to constantly try to archive it.”

It’s been a little over a year since the data harvesting began. According to Toly Rinberg of EDGI, pages have so far been cut and altered, but no data sets have been removed from environmental government agency sites. Only a few reports on animal welfare that were scheduled to be taken down before President Trump’s inauguration have been deleted (and are slowly making their way back online after public outcry), as well as the Department of Energy staff directory and a climate modeling tool from the Department of Transportation (which luckily the Internet Archive saved), Janz says.

So far, to Janz’s knowledge, there have been few to no requests for the government data archived during Data Refuge’s rescuing events. She sees it as a good thing that nobody has needed the data that was saved. But the efforts weren’t fruitless.

“We’ve learned a lot about how government agencies are creating this data, how they back things up, how vulnerable they view it,” she says.

The upending of government data and websites is part of a larger systemic problem, one that librarians have been struggling with for decades.

On a daily basis, digital and web librarians sift through the deluge of data. When the web archiving program first started at the Library of Congress, the amount of data managed was small, Grotke says. Now, the archivists have over a petabyte (a million gigabytes) and bring in about 20 to 25 terabytes a month (a terabyte is a thousand gigabytes).
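To put those figures in proportion, here is a small worked calculation. It assumes decimal units (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes) and a steady intake rate, which is a simplification of how the collection actually grew.

```python
GB_PER_TB = 1_000  # decimal units, assumed
TB_PER_PB = 1_000


def months_to_collect(total_tb, tb_per_month):
    """Months needed to gather total_tb at a constant monthly intake."""
    return total_tb / tb_per_month


petabyte_in_gb = TB_PER_PB * GB_PER_TB       # 1,000,000 gigabytes
slow = months_to_collect(TB_PER_PB, 20)      # at 20 TB a month
fast = months_to_collect(TB_PER_PB, 25)      # at 25 TB a month

print(f"1 PB = {petabyte_in_gb:,} GB")
print(f"Months to add one more petabyte: {fast:.0f} to {slow:.0f}")
```

At the current pace, then, the archive would add another petabyte roughly every 40 to 50 months, or about three and a half to four years.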

And as the amount of online content and the size of websites continue to grow exponentially, they must make executive decisions on what to save and what to sacrifice.

Still, not every bit of data can be salvaged, and not every bit of data necessarily needs to be saved, Grotke explains. What’s kept are the most significant snapshots of what’s happening on the internet right now. And, hopefully, those snapshots will help people in the future: “Our collections will probably be really valuable in about 50 years when some of these websites are really long gone,” says Grotke.

While the internet is a much more unpredictable environment to build an archive out of, the principles of librarianship remain the same. Humans stockpile and organize information to conquer time, says Kari Kraus of the University of Maryland. And librarians will continuously work to come up with new ways to preserve our digital history.

“I like to think of collection as a service,” Rabina says. Just because information may be free now, it doesn’t mean it’ll be free forever, she says. “If you don't collect, it's going to disappear.”