The Web’s Past is Not Evenly Distributed

This is the 4th post in MITH’s Digital Stewardship Series. In this post Ed Summers discusses ways to align your content with the grain of the Web so that it can last (a bit) longer.

If I had to guess I would bet you found the document you are reading right now by following a hyperlink. Perhaps it was a link in a Twitter or Facebook status update that a friend shared? Or maybe it was a link from our homepage, or from some other blog? It’s even possible that you clicked on the URL for this page in an email you received. We’re not putting MITH URLs on the sides of buses, on signs, or in magazines (yet), but people have been known to do that that sort of thing. At any rate, we (MITH) are glad you arrived, because not all links on the Web lead somewhere. Some links lead into dead ends, to nowhere–to the HTTP 404. Here is how historian Jill Lepore describes the Web in her piece The Cobweb:

The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: Page Not Found. This is known as link rot, and it’s a drag, but it’s better than the alternative.

I like this description, because it hints at something fundamental about the origins of the Web: if we didn’t have a partially broken Web, where content is constantly in flux and sometimes breaking, it’s quite possible we wouldn’t have a Web at all.

Broken By Design

When Tim Berners-Lee created the Web at CERN in 1989 he had the insight to allow anyone to link to anywhere else. This was a significant departure from previous hypertext systems which featured link databases that ensured pages were interlinked properly. Even when the Web only existed on his NEXT workstation Berners-Lee was already thinking of a World Wide Web (WWW) where authors could create links without needing to ask for permission, just like they used words. Berners-Lee understood that in order for the Web to grow he had to cede control of the link database. Here is Berners-Lee writing about these early days of the Web:

For an international hypertext system to be worthwhile, of course, many people would have to post information. The physicist would not find much on quarks, nor the art student on Van Gogh, if many people and organizations did not make their information available in the first place. Not only that, but much information — from phone numbers to current ideas and today’s menu — is constantly changing, and is only as good as it is up-to-date. That meant that anyone (authorized) should be able to read it. There could be no central control. To publish information, it would be put on any server, a computer that shared its resources with other computers, and the person operating it defined who could contribute, modify, and access material on it. (Weaving the Web, p. 38-39).

So in a way the Web is necessarily broken by design–there is no central authority making sure all the links work. The Web’s decentralization is one of the primary factors that allowed it to germinate and grow. I don’t have to ask for permission to create this link to Wikipedia or to anywhere else on the Web. I simply put the URL in my HTML and publish it. Similarly, no one has to ask me for permission to link here. An obvious consequence of this freedom is that if the page I’ve linked to happens to be deleted or moved somewhere else, then my link will break.

Fast forward through 25 years of exponential growth and we have a Web where roughly 5% of the URLs collected using the social bookmarking site Pinboard break per year. URLs shared in social media are estimated to fare even worse with about 11% lost per year. In 2013 a group at Harvard University discovered that 50% of the links in Supreme Court opinions no longer link to the originally cited information. To paraphrase William Gibson, the past isn’t evenly distributed.

There is no magic incantation that will make your Web content permanent. At the moment we rely on the perseverance and care of Web archivists like the Internet Archive, or your own organization’s Web archiving team, to collect what they can. Perhaps efforts like IPFS will succeed in layering new, more resilient protocols over or underneath the Web. But for now you are most likely stuck with the Web we’ve got, and if you’re like MITH, lots of it.

Fortunately, over the last 25 years Web developers and information architects have developed some useful techniques for working with this fundamental brokenness of the Web. I’m not sure if these are best practices as much as they are practical recipes. If you tend a corner of the Web here are some very practical tools available to you you for helping mitigate some of the risks of link rot.

Names

There are only two problems in computer science: cache invalidation, naming things and off-by-one errors. (Phil Karlton by way of Martin Fowler)

Naming things is hard partly because of semantic drift. The things our names seem to refer to have a tendency to shift and change over time. For example if I create a blog at wordpress.example.org and a few years later decide to use a different blogging platform, I’m kind of stuck with a hostname that no longer fits my website. When creating content on the Web it’s important to be attentive to the hostnames you pick for your websites. Unfortunately most of us don’t have time machines to jump in to see what the future is going to look like. But we do have memories of the often tangled paths that lead to the present state of a project. Draw upon these histories when naming your websites. Turn naming into a cooperative and collaborative exercise.

Naming on the Web is ultimately about the Domain Name System (DNS). DNS is a distributed database that maps a hostname like mith.umd.edu to its IP address 174.129.6.250, which is a unique identifier for a computer that is connected to the Internet. Think of DNS as you would your address book, which lets you look up the phone number or mailing address for a person you know using their name. When your friend moves from one place to another, or changes their phone number you update your address book with the latest information. Establishing a DNS name for your website like mith.umd.edu lets you move the actual machine around, say from your campus IT department, to a cloud provider like Amazon Web Services, which changes its IP address, but without changing its name. DNS is there to provide stability to the Web and the underlying Internet.

DNS is distributed in that there are actually many, many address books or zones that are hierarchically arranged in a pyramid like structure, with one root zone at the top. In the case of mith.umd.edu the address book is managed by the University of Maryland. In order to change the hostname mith.umd.edu to point at another IP address I would need to contact an administrator at UMD and request that it be changed. Depending on my trust relationship with this administrator they might or might not fulfill my request. Sometimes you might want to purchase a new domain for your site like namingthingsishard.org from a DNS Registrar. After you paid the registrar an annual fee, you would be able to administer the address book yourself. But with great power comes great responsibility.

Given its role in making Internet names more stable it’s ironic that DNS is often blamed as a leading factor for link rot. It’s true, if you register a new domain and forget to pay your bill, a cybersquatter can snatch it up and hold it hostage. Alternative systems like Handles, Digital Object Identifiers and more recently NameCoin or IPFS have been developed partly out of an inherent distrust of DNS. These systems offer various improvements over DNS, but in some ways recapitulate the very problems that DNS itself was designed to solve. DNS has its warts, but it has been difficult to unseat because of its ubiquitous use on the Internet. Here are a few things you can do when working with DNS to keep your websites available over time:

Pick a hostname you and your team can collectively live with, at least for a bit. Think of it as a community decision.

If your new website fits logically in a domain that you already have access to use it instead of registering a new domain name. This way you have no new bills to pay, or management to do.

If you register a new domain keep your user registration account and contact information up to date. When a bill needs to be paid that’s who the registrar is going to get in touch with. When staff come and go make sure the contact information is changed appropriately.

Pick a registrar that lets you lock the domain to prevent unwanted updates and transfers.

If your registrar allows you to enable two-factor authentication do it. You’ll feel safer the next time you log in to make changes.

If the creator of a website has moved on to other organization, transfer the ownership of the domain to them. They are the ones making sure it stays online so it makes sense for them to own and administer the domain.

Redirects

A more practical solution than minting the perfect and unchanging name for your website is to accept that things change, and to let people know when these changes occur. An established way to announce a name change on the Web is to use the modest HTTP redirect. Think of an HTTP redirect as the forwarding address you give the post office when you move. The post office keeps your change of address on file, and forwards mail on to your new address, for a period of time (typically a year). The medium is different but the mechanics are quite similar, as your browser seamlessly follows a redirect from the old location of a document to its new location.

The Hypertext Transfer Protocol (HTTP) defines the rules for how Web browsers and servers to talk to each other. Web browsers make HTTP requests for Web pages using their URL, and Web servers send HTTP responses to those requests. Each type of response has a unique three digit code assigned to it. I already mentioned the 404 Not Found error above, which is a type of HTTP response that is used when the server cannot locate a resource with the URL that was requested. When the server returns a 200 OK response the browser know everything is fine and that it can display the page. One of the other class of HTTP responses servers can send are redirects, such as 301 Moved Permanently and 302 Found.

301 Moved Permanently is your friend in situations where a given website has moved to another location, as when a domain name is changed, or when the content of a website has been redesigned. Consider the example above when wordpress.example.org no longer works because another blogging platform is now being used. You can set up your webserver so that all requests for wordpress.example.org redirect permanently to blog.example.org.

The actual mechanisms for doing the redirect vary depending on the type of Web server you are running. For example, if you use the Apache Web server you can enable the mod_rewrite module and then put a file named .htaccess containing this in your document root: