
Read more: “Forever online: Your digital legacy”

The historians of 2061 will want to study the birth of the world wide web. How on earth will they know where to start?

Today, historians have to piece together the details of their subjects’ lives from tiny scraps of evidence. Their successors are more likely to be overwhelmed: the problem will be making sense of our vast digital legacies. What techniques will they use to cope with this deluge?


Many of us now generate more data than we can manage – think of all those holiday pictures you’ll never get round to organising into an album. The contents of our hard drives are jumbled messes; the web’s lack of structure, coupled with anonymity and the use of aliases, will make the online world an equally formidable challenge for future historians.

All the HTML, MP3 and JPEG files that make up today’s web are likely to remain readable for a very long time. But unpicking their original provenance and authenticity will be no mean feat, because data is often duplicated, edited, annotated and modified.

To safeguard our files, we tend to back them up, email documents to ourselves or post pictures online. Files also get passed between people. These actions often change the file, yet most of these changes are minor and usually invisible to a human being.

This is a mixed blessing for internet archaeologists. On one hand, the variations provide valuable insight into how information has spread. On the other, they make it difficult to establish where a piece of information first came from, as anyone who’s ever tried to track down the origins of an internet meme will appreciate.

Fuzzy filter

A brute-force way of sifting through all these files for provenance is “hashing”: a mathematical technique that summarises a large piece of data as a much smaller number – or “hash value” – making it easy to compare files. But because even a tiny change to the original data will result in a completely different hash value, it can be hard to see the relation between copies.
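To make the idea concrete, here is a minimal sketch in Python using SHA-256 from the standard library. The choice of SHA-256 and the two sample sentences are assumptions made for illustration; the article does not name a particular hash function.

```python
import hashlib


def hash_value(data: bytes) -> str:
    """Summarise arbitrary bytes as a fixed-length hexadecimal hash value."""
    return hashlib.sha256(data).hexdigest()


# Two illustrative "files" that differ by a single character (w vs W).
original = b"The historians of 2061 will want to study the birth of the web."
edited = b"The historians of 2061 will want to study the birth of the Web."

h1 = hash_value(original)
h2 = hash_value(edited)

# Count how many of the 64 hexadecimal digits differ between the two values.
differing_digits = sum(a != b for a, b in zip(h1, h2))

print(h1)
print(h2)
print("hex digits that differ:", differing_digits, "of", len(h1))
```

Identical files always give identical hash values, so exact copies are easy to spot. The one-character edit, however, yields a value that bears no obvious resemblance to the first.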

Breaking each file up into segments and creating a separate hash for each segment can reveal when two files are mostly composed of identical segments and are thus likely to be related.

Such “fuzzy hashes” can be used to find near-identical copies, or to identify incomplete or early drafts – information that a biographer might find helpful.
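The segment-by-segment comparison can be sketched roughly as follows. This is a deliberately simplified illustration, not the algorithm used by any particular fuzzy-hashing tool: it assumes fixed-size segments, SHA-256 as the per-segment hash, and a set-overlap score, and the names `segment_hashes` and `similarity` are hypothetical.

```python
import hashlib
from typing import List


def segment_hashes(data: bytes, segment_size: int = 16) -> List[str]:
    """Split data into fixed-size segments and hash each one separately."""
    return [
        hashlib.sha256(data[i:i + segment_size]).hexdigest()
        for i in range(0, len(data), segment_size)
    ]


def similarity(a: bytes, b: bytes, segment_size: int = 16) -> float:
    """Share of distinct segment hashes that two files have in common."""
    hashes_a = set(segment_hashes(a, segment_size))
    hashes_b = set(segment_hashes(b, segment_size))
    if not hashes_a and not hashes_b:
        return 1.0  # two empty files are trivially identical
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


# Illustrative data: a 256-byte "final" file and a "draft" with one byte changed.
final = bytes(range(256))
draft = bytearray(final)
draft[100] = 0
draft = bytes(draft)
unrelated = bytes(reversed(range(256)))

print(similarity(final, final))      # identical files
print(similarity(final, draft))      # one segment out of 16 differs
print(similarity(final, unrelated))  # no segments in common
```

One caveat with fixed-size segments: inserting or deleting a single byte shifts every later segment and so hides the relationship. Practical tools tend to work around this by choosing segment boundaries from the content itself.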

The technique is not perfect, though: its ability to spot similarities is, well, fuzzy, and it works better for some file types than others. Compressing a picture slightly, for example, doesn’t affect its appearance very much, but can change its hash values dramatically.

Write stuff

What about text? The internet is full of anonymous comments, status updates and blog posts. Historians may want to unmask the authors.

One way to do that is to look for their characteristic “writeprints”: their vocabulary, the length of the sentences they use, words and punctuation patterns they’re particularly fond of, and even habitual grammatical mistakes.
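A few of these features are simple enough to compute in a handful of lines. The sketch below is a toy illustration in Python, not a real authorship-attribution system: the function name `writeprint`, the particular features chosen and the crude sentence splitting are all assumptions.

```python
import re
from collections import Counter


def writeprint(text: str) -> dict:
    """Extract a few crude stylistic features from a piece of text."""
    # Treat runs of ., ! or ? as sentence boundaries (a rough approximation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    punctuation = Counter(ch for ch in text if ch in ",;:!?-")
    n_words = len(words)
    return {
        # Average number of words per sentence.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Distinct words divided by total words.
        "vocabulary_richness": len(set(words)) / n_words if n_words else 0.0,
        # How often each punctuation mark appears, per word written.
        "punctuation_per_word": (
            {mark: count / n_words for mark, count in punctuation.items()}
            if n_words else {}
        ),
        # The three most frequently used words.
        "favourite_words": [w for w, _ in Counter(words).most_common(3)],
    }


sample = "I came. I saw, I conquered!"
features = writeprint(sample)
print(features)
```

Comparing such feature sets across two texts gives a rough measure of how alike their styles are, though with samples this short the numbers mean very little.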

Normally this requires a substantial chunk of text to work on, but researchers at the National Institute for Computing and Automation Research in Grenoble, France, have designed a system that can link different aliases used by one person, using only the characters that make up their usernames.

You can try a simple version of this approach on the website I Write Like, which tells you which famous writer’s output your own deathless prose most resembles. But I Write Like also illustrates some of the difficulties of this approach, notoriously failing to identify some of the writers it actually uses as references.

More sophisticated approaches would undoubtedly do better, but changes in our writeprints over time again make it hard to be definitive about the author of a work. (Then again, such changes can be illuminating for literary sleuths: analysis of Agatha Christie’s later works has been used to support suspicions that she suffered from dementia.)

Finding meaning

Writeprints confine themselves to the structure of text, but semantic analysis tools go further – trying to identify relevant information in the meaning of the text. That could help future researchers work out what you were like without having to trawl through every one of your status updates.

Defuse, a system under development by Aaron Zinman at the Massachusetts Institute of Technology, represents individual commenters on a website as coloured blocks, based on the kind of language they use and how closely they conform to community norms. It’s an attempt to create a kind of “digital body”, he says – a pixel portrait that mimics our ability to size someone up at a glance in the physical world.

But Zinman cautions against interpreting the output of such systems too literally. “It’s important to understand how complex humans are,” he says. “A biography of someone important may be hundreds of pages long, but it’s still a condensed account of their life, written through a particular lens and with a particular objective. There are a million ways you can slice the data about a person, and they will look different in each one.”

That’s a point made more explicitly by Zinman’s earlier project, Personas, which purports to reveal how the web sees you by searching for “meaningful” statements.

Real messiness

When I tried Personas myself, it came up with “management, education, news”, which I’d say is more like a blurry telephoto picture of me than a finely detailed portrait. That’s the point: Zinman intended it to illustrate how poorly today’s machine learning captures the messiness of real people.

Viktor Mayer-Schönberger of the Oxford Internet Institute in the UK also strikes a cautionary note. “Digital memory only captures digital artefacts,” he says. “The more we depend on it, the more tempted we are to attribute qualities to it that it doesn’t actually have, like authenticity and comprehensiveness.”

So even if the tools of the trade improve immeasurably over the next half-century, they’ll still be limited by the records we leave behind us. While those records are becoming ever richer, with our locations and even our heartbeats now being recorded, the historians of 2061 may still get only a glimpse of what we were really like – or at least, who we considered ourselves to be.
