written by: Ran Levi

In the 1980’s, the BBC invested millions of pounds in what should have been a technological marvel: a modern version of the famous medieval Domesday Book. Less than 15 years later, its system was unusable. Compare that expensive failure to the longevity of the Domesday Book: a record written on parchment in Latin in the 11th century that is still readable today. What can these two case studies tell us about the challenges of, and potential solutions to, digital preservation?

How do you quantify all of the information that everyone on earth produces in a year?

IDC is an American research firm that specializes in telecommunications and information technology. Each year, it conducts a comprehensive survey to assess just that: it tries to quantify all the books, pictures, sound files, videos, articles – all the information produced on earth in one year. It’s difficult to know how reliable the numbers are – and IDC has been criticized in the past for inaccuracies – but at the very least, IDC’s research gives us a rough estimate of the volume of information that all of humanity produces each year.

For example: in 2005, IDC estimated that the total volume of human information was 130 exabytes. Just to put that into perspective, one byte, the basic unit of digital information, is equivalent to a single letter, and an exabyte is ten to the power of eighteen bytes. So if, for example, an average episode of CMPOD is about fifty megabytes of information and, say, runs for about an hour, then one exabyte is more than two million years of continuous listening to the show…
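The exabyte comparison is easy to check with a few lines of arithmetic. This is only a rough sketch: the fifty-megabyte episode size comes from the text, but the one-hour episode length is an assumption, and the final figure scales directly with it.

```python
BYTES_PER_EXABYTE = 10**18
EPISODE_BYTES = 50 * 10**6      # fifty megabytes, as stated in the text
EPISODE_HOURS = 1.0             # assumption: roughly one hour per episode
HOURS_PER_YEAR = 24 * 365.25


def listening_years(total_bytes, episode_bytes=EPISODE_BYTES,
                    episode_hours=EPISODE_HOURS):
    """Years of non-stop listening needed to play total_bytes of episodes."""
    episodes = total_bytes / episode_bytes
    return episodes * episode_hours / HOURS_PER_YEAR


episodes_per_exabyte = BYTES_PER_EXABYTE // EPISODE_BYTES
years = listening_years(BYTES_PER_EXABYTE)
print(f"{episodes_per_exabyte:,} episodes, about {years / 1e6:.1f} million years")
```

With these assumptions, one exabyte holds twenty billion episodes, which is a little over two million years of listening. Half-hour episodes would halve that figure.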

In 2012, the total volume of information was 2,800 exabytes. That’s more than twenty times the volume of information in 2005. This means that the total volume of human information grew by more than fifty percent each year, doubling roughly every year and a half. In the age of digital cameras, computerized word processing and personal blogs, it’s easier than ever to create vast amounts of new information. By 2020, IDC estimates, data volume will reach 40 zettabytes, or 40 thousand exabytes.
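The growth rate implied by the two IDC estimates can be worked out directly. The sketch below assumes steady compound growth between 2005 and 2012, which is a simplification, and the function names are made up for this illustration.

```python
import math


def annual_growth_factor(start_volume, end_volume, years):
    """Constant yearly multiplier that turns start_volume into end_volume."""
    return (end_volume / start_volume) ** (1 / years)


def doubling_time_years(growth_factor):
    """How many years it takes to double at the given yearly multiplier."""
    return math.log(2) / math.log(growth_factor)


# 130 exabytes in 2005 and 2,800 exabytes in 2012, both from the text.
factor = annual_growth_factor(130, 2800, 2012 - 2005)
doubling = doubling_time_years(factor)
print(f"x{factor:.2f} per year, doubling every {doubling:.1f} years")
```

Under that assumption the yearly multiplier comes out at about 1.55, so the volume doubles roughly every year and a half.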

Future Archaeology

Today’s archaeologists are excavating the ground, trying to reconstruct a reliable picture of what life was like in ancient times. It is not an easy job: the amount of information available to archaeologists is minimal. At first glance, the wealth of information we create today should be a boon to archaeologists and historians of the future: they should have no problem understanding who we were and what we thought. After all, we document our lives in countless ways – from movies to blogs to podcasts like this one.

But nothing is as simple as it seems. As we shall soon see, the wealth of digital information we produce – thousands upon thousands of exabytes – will create new and unique challenges to these future archaeologists.

The Domesday Book

In 1066, Duke William of Normandy invaded England, defeated the Anglo-Saxon King Harold Godwinson, and became William the Conqueror, King of England. But after taking the throne, King William had to deal with the financial fallout of the invasion: he had soldiers to pay and forts to build, and all of these cost money. The kingdom’s main source of income was then, as it is now, taxes paid by the citizens – but the post-war chaos meant that the Treasury officials were no longer sure how much each citizen should pay.

So in 1086 William decided to launch an ambitious operation of unprecedented scale. Surprisingly, it was not a military operation but rather a statistical one. He sent officials to roam the length and breadth of England and document as accurately as possible the land that each citizen owned, their income and their profession. All that data was bound into two volumes, which together were known as “The Domesday Book”. The name ‘Domesday’ is derived from the Christian Doomsday or Judgment Day. On Judgment Day, according to tradition, each person shall be judged by his good and bad deeds, and this judgment will be final and unchallengeable. Similarly, the information recorded in the Domesday Book, which determined the taxes you owed, was final and could not be challenged.

Amazingly, the Domesday Book has survived to this day: if you have contacts in the British National Archives and a solid knowledge of Latin, you can leaf through it and read the contents. For historians, the Domesday Book is a godsend: no other historical document of that time so systematically describes a moment in the life of an entire country.

The BBC’s Domesday Project

In 1986, the BBC decided to celebrate the 900th anniversary of this unique book and embarked on an ambitious operation of its own. The BBC’s “Domesday Project” was to be the modern equivalent of the ancient Domesday Book: a comprehensive attempt to capture a moment in the life of the British nation. A million people, mostly school children from across the UK, wrote about their daily lives, their communities, and their cities. All in all, The BBC collected 150,000 pages of text, 20,000 photos and hundreds of maps, statistics, and videos. All this information was saved in digital form to be browsed with relatively inexpensive Personal Computers, for the enjoyment and education of future school children.

After a comprehensive review of the available storage technology, the project leaders decided to use the LaserDisc to store the information. Back then, it was cutting edge technology: a golden disc, the size of a vinyl record. They stuffed all the articles, maps and videos onto two such discs: the first was called the “Community Disc” and it consisted mainly of articles and photos from around the UK. The second was the “National Disc” which contained things like maps, graphs, statistical data and photos taken by professional photographers.

A major goal of the project was to allow users to have easy access to the information stored on the discs, and for this purpose, the BBC developed special user interface software which enabled the user to locate a particular article or photo using keywords, a menu, the location on a map and more. Since PCs were only in their infancy, the BBC also developed a dedicated LaserDisc player that was able to read the two discs and display their content on a screen.

A Different Reality

The ambitious project was completed on time and within budget…but was ultimately a failure. The kit containing the LaserDisc player and the two discs sold for approximately 5000 pounds – a hefty sum that only very few schools and public organizations could afford to pay. As a result, only a few copies of the Domesday Project Kit were ever distributed to the public, and the project failed to recoup its investment.

Nevertheless, everyone who took part in the BBC’s Domesday Project was proud of it. They all felt that the data collected for the project would be as important and useful for future historians as the original Domesday Book is useful for historians of our time.

But an article published in 2002 in the British newspaper The Observer revealed a very different reality. Only fifteen years after the project was concluded, the information stored on the discs was practically inaccessible. Despite the two and a half million pound investment in the digital Domesday Project, no high school student or curious citizen could browse the articles or view the pictures.

The reason, as you may have guessed, is that the LaserDisc technology did not survive very long. It lost out to the smaller Compact Disc, and it didn’t catch on with the general public. Manufacturers stopped producing LaserDisc readers, and by 2002 only a handful of the BBC’s Domesday Project kits were usable, preserved mostly in museums and archives. The irony of the situation was perfectly captured in the Observer’s article:

“By contrast, the original Domesday Book – an inventory of eleventh-century England compiled in 1086 by Norman monks – is in fine condition in the Public Record Office […], and can be accessed by anyone who can read and has the right credentials. ‘It is ironic, but the 15-year-old version is unreadable, while the ancient one is still perfectly usable,’ said computer expert Paul Wheatley. ‘We’re lucky Shakespeare didn’t write on an old PC.’”

The Challenges of Information Preservation

The story of the Domesday Project represents the challenges we face as we try to preserve digital information. These challenges can be divided into three broad categories.

The first is the preservation of the digital storage device, like the LaserDiscs themselves. The second is the preservation of the systems that read the information — this is the LaserDisc reader in the Domesday Project Kit. The third challenge is the preservation of the software that reads the stored binary information and translates it into photographs, letters or sounds that we humans can understand.

Of course, these challenges are applicable to almost all existing information storage methods. In DVDs, for example, the preservation might include the DVD discs, the DVD players and the software used to decode the stored data. It’s also worth mentioning that most of those same challenges apply to the preservation of analog information – like music on vinyl or cassettes.

Preserving The Players And Readers

So, let’s begin with the challenge of preserving the players and readers that allow us to access the digital data. Paper books can only contain a small amount of information: tens of thousands to several hundred thousand words, and some pictures or diagrams. Discs, chips, and other modern storage devices can hold a lot more information: complete encyclopedias, movies, albums. But the ability to store such large amounts of digital information in such small spaces comes at a cost: you’ll always need a machine to read that information and translate the binary code. To read the text written on a piece of paper, all you need is eyes – but to read text stored on a USB drive, for example, you need a silicon chip to access the memory cells and read the information they hold.

These chips and devices can be quite complex, but surprisingly, this potentially complicated technical issue is probably the easiest challenge we’re facing. Once you understand how zeros and ones are represented inside a USB drive, for instance, it’s relatively easy to build a machine able to access the content. Yes, it might take some hard work and serious expertise – but future archaeologists should be able to solve these kinds of technical problems. But the rapid obsolescence of readers and players still poses a threat to the actual preservation of the data itself. Why is that?

Natural Decay

Well, every storage medium we have is ultimately vulnerable to natural decay. Hard disks, for example, are extremely sensitive to mechanical wear: a typical hard disk lasts three to five years on average before its motor fails. CDs and DVDs, especially those burned at home and not in a factory, can hold onto their data for ten or fifteen years. We can extend the life of a disc by storing it in optimal conditions – low humidity, low temperature – but can we be certain that our children and grandchildren will do the same? Can we be sure that a fire or flood wouldn’t destroy it? And even if the disc somehow survives all these potential dangers, long-term processes like slow oxidation or the gradual dissipation of magnetic fields would eventually destroy the information on it.

This, of course, is not a new problem: even the best quality paper disintegrates eventually. The solution has always been to Replicate the data – backing it up by creating new copies of it. A lot of the books that survived from antiquity were copied by hand by dedicated monks. But our ability to copy and back up digital information depends largely on the availability of the devices that mediate between us and the storage media. The BBC fell victim to an unfortunate choice of storage technology: the LaserDisc, which didn’t last very long. But it’s hard to blame the BBC’s engineers: many technologies have emerged and then disappeared in the last thirty years. Many of us still have old cassettes, floppy disks, VHS tapes and other obsolete storage devices that are slowly decaying in the back of a drawer.

And if, for example, you don’t have a VHS player at home, backing up old tapes means you have to go to a special lab and pay to convert them to DVD. The sad truth is that many of us don’t bother to do that, and then the tapes just rot away. In other words, to make sure all this stuff survives, we need to replicate it – but without those readers and players, the backups will probably never be created in the first place.
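For digital files, replication usually goes hand in hand with a way of checking that a copy is still intact. Here is a minimal sketch of that idea using SHA-256 checksums from Python’s standard library. The file names and helper functions are made up for the illustration, and the "decay" is simulated by flipping a single byte.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Fingerprint a file's contents; any changed byte changes the result."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, destinations):
    """Copy source to each destination and return the copies that verified."""
    expected = sha256_of(source)
    verified = []
    for destination in destinations:
        shutil.copyfile(source, destination)
        if sha256_of(destination) == expected:
            verified.append(destination)
    return verified


def find_decayed(copies, expected):
    """Return the copies whose contents no longer match the fingerprint."""
    return [copy for copy in copies if sha256_of(copy) != expected]


with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "photo.bin"
    original.write_bytes(b"an old family photo")
    copies = replicate(original, [Path(workdir) / "copy_a.bin",
                                  Path(workdir) / "copy_b.bin"])

    # Simulate media decay by corrupting the first byte of one copy.
    damaged = bytearray(copies[0].read_bytes())
    damaged[0] ^= 0xFF
    copies[0].write_bytes(bytes(damaged))

    decayed = [copy.name for copy in find_decayed(copies, sha256_of(original))]
    print(decayed)
```

The point of the sketch is that a damaged copy can be spotted and replaced from a healthy one, but only as long as some machine can still read the media in the first place.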

Solving The Problem of Media Decay

So, how can we solve the problem of media decay?

One possible strategy is a better replication solution: backing up the information in a way that would make it less prone to accidental destruction or decay. An interesting backup solution which emerged in the early 2000’s is the cloud. Companies that offer those cloud backup services – like Google and Amazon – understand the importance of reliability, and invest billions in setting up huge data centers around the world, equipped with advanced air conditioning systems, generators for alternative power supply and so on. These companies are better equipped to deal with replicating and storing massive amounts of data like images and videos, than the average computer user at home.

Still, cloud storage is not a magic bullet. For example, one of the most dangerous and error-prone processes is a software upgrade in the data center. Such an upgrade is almost always performed during routine operations and without any downtime for the customers. A Google engineer once compared this delicate and complex process to replacing the tires of a car traveling on a highway at 90 mph…

And indeed, Google saw two such failures in 2009, when a few thousand Gmail mailboxes were accidentally deleted during a software upgrade. Fortunately, Google was prepared for such a possibility: all the users’ information was backed up on magnetic tapes, and was restored within a few hours. Some of Amazon’s customers, however, weren’t so lucky: in 2011, the company announced that a technical glitch caused the loss of 0.07 percent of all the information stored in one of its data centers. 0.07 percent does not sound like such a big number, but it’s actually hundreds of thousands of gigabytes. In other words, cloud backup is probably a good solution for data replication – but it’s not bullet-proof.

A different strategy for handling media decay is creating a more durable storage method. One such new and promising technology is actually quite an ancient one: DNA. This double helix molecule which exists in every living cell is the perfect medium for information storage, a patent perfected by nature over billions of years. One gram of DNA molecules can hold two terabytes of information – twice the capacity of an average hard disk – and it can do so for tens of thousands of years, under the right storage conditions. For example, scientists have been able to extract intact genetic information from remains of wooly mammoths that died during the last ice age. The fundamental technology for using DNA as the storage medium already exists: in 2011, a group of scientists demonstrated the successful storage, and later extraction, of several dozen text documents, images and audio files in a DNA molecule.

The fact that DNA is the medium on which all life forms keep their genetic information plays in our favor in another way: DNA is so universal that there is little doubt that any future human society with reasonable technological ability will have the means to read it.
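To get a feel for how digital data could be written in DNA at all: the molecule has four building blocks, usually written A, C, G and T, so each one can stand for two bits. The sketch below uses that simple mapping. It is purely an illustration and not the scheme used by the scientists mentioned above; practical schemes add error correction and avoid sequences that are hard to synthesize or read.

```python
BASES = "ACGT"  # hypothetical mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data):
    """Encode each byte as four bases, two bits per base, high bits first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(strand):
    """Decode a strand produced by bytes_to_dna back into the original bytes."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of four bases")
    decoded = bytearray()
    for start in range(0, len(strand), 4):
        byte = 0
        for base in strand[start:start + 4]:
            byte = (byte << 2) | BASES.index(base)
        decoded.append(byte)
    return bytes(decoded)


strand = bytes_to_dna(b"Hi")
print(strand, dna_to_bytes(strand))
```

Note that even here the reader must know the mapping between bases and bits: the storage medium may be universal, but the encoding is still a convention.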

The Software Problem of Information Preservation

So, let’s get back to the BBC’s Domesday Project. When the general public became aware of the dismal state of the project, several volunteers and academic researchers embarked on various preservation and rescue efforts.

As you might recall, in addition to the two LaserDiscs, the Domesday Project Kit also contained a LaserDisc reader. Some of these readers were still functional, and the information stored on the discs could be read with relative ease. But it was here that the restorers encountered the third, and maybe the most complicated challenge of digital preservation: the software problem. What is the software problem?

Let’s say I’ve got some tomatoes and an onion. I have all these ingredients here on the table – but no recipe book. There are many possible ways to use these ingredients and many possible dishes. How do I know which is the right one?

This is exactly the same problem we are facing with software preservation. When you read the binary data from a DVD or hard drive, for example, you end up with a long, long list of binary bits – ones and zeros. These bits have no inherent meaning by themselves: it is up to us, humans, to give them this meaning. We need to agree on the proper way to decode these bits: for example, we could agree that 1001 in binary is the decimal number ‘9’. Without this pre-agreed interpretation, 1001 might mean the letter A, or a pixel on the screen – or maybe something else entirely.
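A short sketch makes the point concrete: the very same four bytes can be read as text, as a number, or as a pixel, and nothing in the bytes themselves says which reading is the right one. The byte values and the dictionary labels below are arbitrary choices for the illustration.

```python
raw = bytes([0x41, 0x42, 0x43, 0x44])  # four bytes with no built-in meaning

interpretations = {
    "text (ASCII)": raw.decode("ascii"),
    "integer (big-endian)": int.from_bytes(raw, "big"),
    "integer (little-endian)": int.from_bytes(raw, "little"),
    "pixel (red, green, blue, alpha)": tuple(raw),
}

for meaning, value in interpretations.items():
    print(f"{meaning}: {value!r}")

# The example from the text: 1001 read as a binary number is nine.
nine = int("1001", 2)
print(nine)
```

All four readings are equally valid as far as the computer is concerned. Only the agreed-upon format tells us which one the author of the data had in mind.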

In many cases, this pre-agreed interpretation is standardized and well known. For example, an MP3 file has a defined structure, so any software which implements the MP3 standard can read and decode the bits in the files as sounds. But what happens if the encoding and decoding schemes are not standardized? Well, in that case, all we can hope for is that someone, somewhere, has kept a record of the decoding scheme.

Saving The Domesday Project

In the case of the Domesday Project, there were still a few working LaserDisc readers that could run the software and decode the data stored on the discs. The problem was that there were only very few such readers, mostly in museums, and this meant that the general public had no practical way to access the collected data. Running the same software on a regular, modern computer was not a viable option either. This software was written in a programming language called BCPL, which was an advanced programming language for its time – but it went the way of the LaserDisc, and today’s computers can’t run it.

A few volunteers took it upon themselves to try and retrieve the data stored on the two LaserDiscs. In 2004, an amateur programmer, Adrian Pearce, succeeded, after much effort, in reconstructing some of the decoding algorithms. He extracted a large part of the texts and images from the discs and uploaded the information to a website he created. Two other programmers, Eric Freeman and Simon Guerrero, with the help of the engineer Andy Finney, who was part of the Project’s original development team in the 80’s, located the original magnetic tapes used to store the Domesday data. These tapes held higher-quality versions of the data. Freeman and Guerrero collaborated with the BBC to create a website called “Domesday Reloaded”, which allows visitors to browse the project’s data.

However, Pearce, Freeman, and Guerrero were not able to fully restore the original software’s navigation system of menus and windows, so the website’s visitors still could not enjoy the same user experience they would have had with the original Domesday Project Kit.

The Importance of the User Interface

Now, there are those who will see the user interface as unimportant. After all, the articles, images, and graphs are the heart of the project, aren’t they?

They certainly are – but we shouldn’t underestimate the importance of the user interface either. Imagine our future digital archaeologist, a thousand years from now, discovering the ancient remains of a first generation iPhone. This would be, without a doubt, an important discovery that would shed light on the “smartphone revolution” of our generation. But what if the archaeologist manages to extract only the data on the iPhone, like the photos and messages, but not the phone’s user interface software? Most likely, he’d scratch his head, trying to figure out what the big deal was.

So, how can we solve the problem of software obsolescence? Well, there are two basic strategies we can employ.

Software Preservation

The first is Migration: periodically converting the information from the “old” and outdated format to a newer format, so that modern software tools can handle it. For example, an audio file can be converted from the now-obscure 3GP format to the more modern MP3 format. When the time comes for MP3 to be replaced, we’ll convert the files to the new format, and so on, ad infinitum. In essence, migration ‘bypasses’ the problem of software preservation by allowing us to use existing software instead of an old and problem-ridden one.

On the surface, migration looks like a good conservation strategy – but it also has its disadvantages. The most obvious one is that migration deals with data but not with user interfaces, and it sometimes omits what is called ‘metadata’. Metadata is ‘external’ data which describes the main data we are trying to preserve. An MP3 file, for example, also contains metadata such as the name of the song, artist, and album. Depending on the migration method used, this metadata may or may not survive – and without it, the data can lose much of its original meaning.
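Here is a small sketch of how metadata can quietly disappear. One common convention for MP3 metadata, ID3v1, keeps the song details in the last 128 bytes of the file, starting with the letters "TAG". In the code below, the audio is just placeholder bytes rather than real MP3 frames, the song details are invented, and the "migration" is a deliberately careless one that keeps only the audio portion.

```python
import struct

# ID3v1 layout: "TAG", title, artist, album, year, comment, genre = 128 bytes.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def _field(text, size):
    """Encode text into a fixed-size, zero-padded field."""
    return text.encode("latin-1")[:size].ljust(size, b"\x00")


def make_tag(title, artist, album, year):
    """Build a 128-byte ID3v1 tag (empty comment, genre 255 = unset)."""
    return ID3V1.pack(b"TAG", _field(title, 30), _field(artist, 30),
                      _field(album, 30), _field(year, 4), _field("", 30), 255)


def read_tag(file_bytes):
    """Return the ID3v1 metadata as a dict, or None if the file has no tag."""
    tail = file_bytes[-ID3V1.size:]
    if len(tail) < ID3V1.size or tail[:3] != b"TAG":
        return None
    _, title, artist, album, year, _comment, _genre = ID3V1.unpack(tail)

    def clean(raw):
        return raw.rstrip(b"\x00 ").decode("latin-1")

    return {"title": clean(title), "artist": clean(artist),
            "album": clean(album), "year": clean(year)}


audio = b"\x00" * 256  # placeholder bytes standing in for the audio itself
original = audio + make_tag("Example Song", "Example Artist",
                            "Example Album", "1986")

# A careless migration that copies the audio but ignores the trailing tag.
migrated = original[:-ID3V1.size]

print(read_tag(original))
print(read_tag(migrated))
```

The audio survives the careless migration untouched, but the file no longer says what it is: the second call finds no tag at all.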

A second, less obvious disadvantage is the danger that some of the original data will be lost during the migration process. MP3, for example, is a ‘lossy’ format – that is, it discards some of the sounds found in the original recording, sounds that the human ear can’t detect anyway. This means that some of the original recording data will be lost during the conversion from 3GP to MP3. After several migrations from one format to another, we might discover that a significant portion of the original data was lost forever – much like how a copy of a copy of a copy of a photo does not look as good as the original photo did.
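The "copy of a copy" effect can be imitated with a toy model. In the sketch below, each lossy "format" simply rounds every sample to the nearest multiple of some step size, and the data is migrated through three such formats in a row. This is a stand-in chosen for simplicity: real codecs such as MP3 rely on models of human hearing, not plain rounding, and the sample values and step sizes here are invented.

```python
def quantize(samples, step):
    """Toy lossy format: round each sample to the nearest multiple of step."""
    return [((sample + step // 2) // step) * step for sample in samples]


def total_error(original, current):
    """Sum of absolute differences between the original and current samples."""
    return sum(abs(a - b) for a, b in zip(original, current))


original = [10, 23, 37, 52, 64, 81]
format_steps = [4, 6, 10]  # three successive hypothetical formats

current = original
errors = []
for step in format_steps:
    current = quantize(current, step)
    errors.append(total_error(original, current))

direct = quantize(original, format_steps[-1])
print(errors, total_error(original, direct))
```

In this example the distance from the original grows with every migration, and the file that went through all three formats ends up further from the original than one converted straight to the last format.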

Software Emulation

A second, and possibly better, alternative is Emulation: a technology that creates a ‘virtual environment’ inside a real computer. Software running in the virtual environment will not be able to tell the difference: if your virtual environment fully mimics the original environment the software is supposed to run in, the software will function as it should. In the Domesday Project’s case, this means creating a virtual LaserDisc reader, so that when the software wants to read new data from a disc, it receives this data from the virtual reader.
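In code, the heart of the idea looks something like the sketch below. The "legacy" function stands in for the original software: all it knows is how to ask a reader for a frame. The virtual reader answers the same request from a disc image held in memory instead of from a spinning disc. Every name here, the frame size, and the disc contents are invented for the illustration, and a real emulator has to imitate far more of the original machine than this.

```python
class VirtualDiscReader:
    """Stands in for the physical player by serving frames from a disc image."""

    FRAME_SIZE = 16  # hypothetical frame size in bytes

    def __init__(self, image):
        self._image = image

    def frame_count(self):
        return len(self._image) // self.FRAME_SIZE

    def read_frame(self, index):
        """Return one frame, exactly as the old software would expect it."""
        if not 0 <= index < self.frame_count():
            raise IndexError("no such frame on this disc image")
        start = index * self.FRAME_SIZE
        return self._image[start:start + self.FRAME_SIZE]


def legacy_show_title(reader, frame_index):
    """Stands in for the original software: it only talks to 'a reader'."""
    return reader.read_frame(frame_index).rstrip(b" ").decode("ascii")


# A made-up two-frame disc image, each frame padded with spaces to 16 bytes.
image = b"Community Disc  " + b"National Disc   "
reader = VirtualDiscReader(image)

titles = [legacy_show_title(reader, index)
          for index in range(reader.frame_count())]
print(titles)
```

The old software is left untouched. Only the thing it talks to has been replaced, which is why emulation can preserve the original user experience as well as the data.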

This approach works especially well with old computer games. For example, you can easily find emulators on the web that will allow you to play very old games which were created for the Spectrum, Commodore, and other ancient computers – on your personal machine.

A third group that tried to save the Domesday Project was called CAMiLEON: a team of academics from the U.S. and the UK. The CAMiLEON group took the emulation approach: they created a virtual environment mimicking the old hardware that no longer exists in the real world. This way, the group managed to recreate the original experience of the Domesday Project, including using the navigation menus, browsing for photos, videos, and so forth.

Unfortunately, funding problems didn’t allow the CAMiLEON project to reach maturity, and it was aborted in 2004. Still, the emulation approach may have the best potential for long-term preservation of digital information. The big disadvantage of emulation is that the creation of a virtual environment is a complicated matter that requires a good measure of programming skills and intimate knowledge of the original device that you’re trying to emulate, so it also requires a relatively large investment of time and money. However, experience shows that emulation does work: all it takes is a few dedicated developers to create an emulation environment so that millions of users can enjoy it all over the world.

Digital Dark Ages

You might be surprised to learn that the final problem facing the Domesday Project is not a technical one – but a legal one. Almost all of the preservation efforts focused on only a single disc – the ‘Community Disc’, which holds the photos and articles sent by the general public. The second disc, the ‘National Disc’, contains maps, graphs, and photos taken by professionals. This content is protected by copyright laws, and there’s always the fear that someday someone will sue the conservationists for copying the contents without proper permission… This problem won’t bother our future historians: after all, the copyright will expire in ninety years or so. But if replication, migration, and emulation are not possible today due to the legal limitations – there’s a very real chance that the information won’t survive into the future at all…

So, digital preservation is not a simple matter. There are multiple challenges facing those who are trying to preserve the data we create for the sake of future generations: some are technical – such as the preservation of the storage media and software – and some are legal or financial. If we fail to face these preservation challenges, there is a risk that our current period will be regarded by future historians as the “Digital Dark Ages”, since there will be relatively few surviving texts, images, and videos from our time.

It’s also likely that future archaeologists will use tools that are very different from the ones used today: emulators and sophisticated computers will replace brushes and pickaxes. These future archaeologists might find themselves missing the good old days of working on a real excavation site, instead of sitting in front of a computer all day…

And what about us, the common people? Many of us enjoy browsing old photo albums, looking at old black and white pictures of grandpa and grandma when they were young and beautiful. So if you want your grandchildren and great-grandchildren to see how good you looked when you were younger – your best bet? Find a good printer, and some quality photo paper… the sooner the better.

Ran Levi is a science author and co-host of Curious Minds Podcast.