Next stop, clay tablets

Publication date: 19 June 2009

Originally published 2008 in Atomic: Maximum Power Computing

Last modified 03-Dec-2011.

I just spent a while playing with an unusual data storage option.

That unusual storage option is... paper.

But not paper with boring old plain text on it. Paper with arbitrary digital data on it.

Paper is a great format on which to store really important information. Thieves seldom bother to steal it. Magnetic fields or power surges don't damage it. Paper can also tolerate much higher temperatures than any digital storage system. And if those high temperatures are created by a house fire, paper in a simple wooden box, like the bottom shelf of a chest of drawers, is actually very likely to survive.

If your house burns to the ground then any paper not in a very fireproof safe (read the small print before you buy one of those, to see what "fireproof" means to this particular manufacturer...) will of course be gone along with everything else. And metal filing cabinets pass heat through, sacrificing their contents to save themselves. But even if the fire brigade take half an hour to turn up, paper in a wooden bottom-drawer will probably survive a house fire.

And most ordinary cheap printer paper today is acid-free, so fifty years from now it won't be brown and flaky like an old paperback book.

OK, if the roof leaks over your backup, then a flash-drive will probably come out better than paper would. The shelf life of modern flash memory ought to be at least a few decades, too - but there's no guarantee that the hyperconductive thinking aluminum computers of the year 2075 will have USB ports, or support for current filesystems. Paper could, therefore, actually be more compatible in the future than any of today's conventional data-storage options. Paper is really pretty awesome stuff.

You can fit something in the order of twelve thousand characters of small-but-legible eight-point text on one sheet of A4 paper. At one byte per character, that's 11.7 kilobytes (in the powers-of-two sense) of data per one-sided page, or more than 87 pages per megabyte. You can fit considerably more on if you print it all tiny and squinty, but no human-legible text gives you very good data capacity per page.
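That arithmetic is easy to check. Here's a minimal Python sketch, assuming a round 12,000 characters per page and one byte per character, as above. The function and constant names are made up for this illustration.

```python
# Rough capacity of a page of plain text, assuming one byte per character.
CHARS_PER_PAGE = 12_000   # estimate for eight-point text on one side of A4
KIB = 1024                # a kilobyte "in the powers-of-two sense"
MIB = 1024 * KIB

def text_page_stats(chars_per_page=CHARS_PER_PAGE):
    """Return (kilobytes per page, pages per megabyte) for printed text."""
    kib_per_page = chars_per_page / KIB
    pages_per_mib = MIB / chars_per_page
    return kib_per_page, pages_per_mib

kib, pages = text_page_stats()
print(f"{kib:.1f} KB per page, {pages:.1f} pages per MB")
# prints "11.7 KB per page, 87.4 pages per MB"
```

Change the characters-per-page estimate to suit your own eyesight and the rest follows.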

You can do a lot better than text, though.

"Matrix codes" are the two-dimensional "square barcodes" that're popping up all over the place these days.

Well, they're usually square. UPS's released-into-the-public-domain "MaxiCode" uses little circles with a bullseye in the middle, Microsoft have as usual invented their own standard, and there's also the very distinctive-looking Palo Alto Research Center "DataGlyphs", the standard version of which encodes data as a rectangle of little slashes and backslashes. The little lines can be printed in different weights and in different colours without changing the data they encode, so you can make a halftone image that contains "hidden" digital data. (For some reason, PARC seem to have abandoned dataglyphs.com and scrubbed all mention of the things from parc.com. If you're reading this a while after I wrote it, perhaps they'll have sorted themselves out.)

All of the matrix codes made for barcode sorts of jobs are, of course, only meant to be used to store bar-code-y sorts of data. This means they usually have hard format limits that make sure the matrix will fit on product packaging, and will be coarse-grained enough to be "scanned" with a low-res camera, like those in cheap mobile phones. The maximum capacity of an alphanumeric QR Code, for instance, is 4,296 characters. Data Matrix tops out at 3,116 characters, and Aztec Code can do 3,067 alphabetic characters with no numbers or punctuation, or 1,914 bytes of arbitrary data.

(For comparison, a standard IBM punched card, such as still survives here and there, has 80 columns of 12 punch locations. That gives a theoretical maximum capacity of 960 bits, or 120 of today's conventional 8-bit bytes, each of which more or less equals an alphanumeric character. In practice this full capacity was unattainable, though, partly because no encoding system supported using every location for user-data storage - 80 characters of user data was actually the most that anybody ever got from an 80-column card - and partly because a card with too many holes punched in it, also known as a "lace card", would jam in the reader. And if you've enjoyed this digression, see also "Rainbow Storage", a bold step forward for information theory into the realm of utter bollocks.)

Let's stick for the moment with the job of storing plain text. English words generally average about 5.5 characters each, plus one for a space or punctuation; that means about 660 words for QR Code, about 480 for Data Matrix, or about 300 for Aztec Code. (Here's a neat online encoder that lets you create a Data Matrix or QR Code.)
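Those word counts are just the character capacities divided by 6.5. The sketch below redoes the division in Python. It assumes the Aztec Code figure is based on the 1,914-byte mode (ordinary text needs punctuation), since that's the number that matches "about 300"; the names are illustrative only.

```python
# Approximate English word capacity of a few matrix-code formats.
AVG_WORD_CHARS = 5.5 + 1   # average word, plus one space or punctuation mark

# Capacities as quoted in the text. The Aztec entry is the byte-mode
# figure, which is an assumption about how "about 300" was reached.
CAPACITY = {
    "QR Code": 4296,
    "Data Matrix": 3116,
    "Aztec Code": 1914,
}

def words_that_fit(chars):
    """Whole words that fit in the given number of characters."""
    return int(chars / AVG_WORD_CHARS)

for name, chars in CAPACITY.items():
    print(f"{name}: about {words_that_fit(chars)} words")
# prints 660, 479 and 294 words respectively
```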

A capacity of a few hundred words is actually quite useful, for some kinds of everyday text. Newspaper stories, for instance, commonly come in at less than 400 words. The Sunday paper would be a lot smaller if we were all able to read stories encoded as blocks of dots.

And these capacity numbers are also very approximate. That's partly because of the variability of text, but also because smarter encoding systems - a widely-understood compression system, like the gzip used by some Web servers, for instance - can push capacity up considerably. And, at the same time, error-correction code can push capacity down, but make the data resistant to damage. Many data-matrix systems use Reed-Solomon error correction, and allow you to dial the error-correction content up to 90% or more of the total encoded data. That gives you a lot less space for user data, but makes the data extremely hard to destroy.
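To get a feel for the two opposing forces, here's a small Python sketch using the standard-library gzip module. The sample text is deliberately repetitive, so it compresses far better than real prose would, and the 3,000-byte code size is just a hypothetical round number.

```python
import gzip

def compressed_ratio(text):
    """Size of the gzip-compressed text as a fraction of the original."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

def usable_bytes(total_bytes, ecc_percent):
    """Bytes left for user data when ecc_percent of the code is error correction."""
    return total_bytes * (100 - ecc_percent) // 100

# Very repetitive sample, so this flatters the compressor.
sample = "It was the best of times, it was the worst of times, " * 40
print(f"compressed to {compressed_ratio(sample):.0%} of original size")

# A hypothetical 3,000-byte code with 90% error correction:
print(usable_bytes(3000, 90), "bytes of user data")   # prints "300 bytes of user data"
```

Compression and error correction stack, of course: compress first, then let the matrix code add its redundancy to the smaller result.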

Error-correction makes data matrices suitable for another job - "paper keys" for strong encryption.

You probably only need 128 bits of entropy for functionally unbreakable encryption. That's a tiny amount by computer standards, but makes for a fairly cumbersome password or passphrase. If it's OK to turn the key into a physical object, though, you can encode it as some kind of matrix code. You can easily fit 128 bits of key into the area of a postage stamp, and still have room for enough error-correction data to make the key highly resistant to folding, spindling or mutilation.
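To see why 128 bits is tiny for a computer but awkward for a human, here's a sketch that generates a random 128-bit key with Python's standard secrets module and writes it out as text. The function names are invented for the example, and turning the resulting string into an actual matrix code is left to whatever encoder you prefer.

```python
import base64
import secrets

def new_paper_key():
    """Return 16 random bytes (128 bits) from the operating system's CSPRNG."""
    return secrets.token_bytes(16)

def as_hex(key):
    """The key as 32 hexadecimal digits."""
    return key.hex()

def as_base32(key):
    """The key as 26 base32 characters, with the '=' padding stripped."""
    return base64.b32encode(key).decode("ascii").rstrip("=")

key = new_paper_key()
print(as_hex(key))      # 32 characters of random-looking hex
print(as_base32(key))   # 26 characters, still not something you'd memorise
```

Either string is short enough for even the smallest matrix code, with plenty of room left over for error correction.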

(You can even tattoo matrix codes on yourself. Persons of ordinary dimensions are likely to find it difficult, not to mention painful, to fit more than a very short message.)

But never mind all that. What about general-purpose backups?

Even if all you want to back up is a few megabytes of accounts data, a system that can only store a few kilobytes per data matrix is useless.

This is a great shame, though, when you realise that fitting two or three kilobytes into a one-inch square means a single sheet of A4 paper could hold at least a couple of hundred kilobytes, even if you include plenty of error-correction redundancy to minimise the chance of silverfish-related data loss.
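The back-of-envelope version of that estimate, as a Python sketch. The 10mm margin is an assumption, and so is the idea that you can tile the whole printable area at the same density as a single one-inch code.

```python
MM_PER_INCH = 25.4

def page_capacity_kb(kb_per_sq_inch, width_mm=210, height_mm=297, margin_mm=10):
    """Estimated kilobytes per page; defaults are A4 with an assumed 10mm margin."""
    usable_w = (width_mm - 2 * margin_mm) / MM_PER_INCH   # inches
    usable_h = (height_mm - 2 * margin_mm) / MM_PER_INCH  # inches
    return kb_per_sq_inch * usable_w * usable_h

for density in (2, 3):
    print(f"{density} KB per square inch: about {page_capacity_kb(density):.0f} KB per page")
# about 163 KB and about 245 KB per page
```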

There's no upper limit to the amount of data you can store as matrix codes, if you've got the space. Look at Dolby Digital and Sony Dynamic Digital Sound movie audio, for instance; they're encoded optically, just like any other matrix code, on the edge of the film.

200 kilobytes ain't much if you're backing up your whole hard drive. But it's actually pretty decent for a lot of really important files. Financial data, program source code. The novel you're writing. Your university thesis.

There are already at least two data backup utilities that use matrix codes, expanded to cover the whole of an arbitrary number of pages.

One of them is Twibright's "Optar" (OPTical ARchiver), which can reliably pack 200 kilobytes of data onto one laser-printed A4 page. Optar doesn't come as ready-to-go software, though; you have to compile the C source code yourself.

This can actually be a plus, though. If the computing world as we know it, with x86 CPUs and USB ports, still exists when you have to restore your Optar backup, you can just use the same software you compiled last time. And if you package a printout of the 20-odd pages of "unoptar" C source code with your Optar backup, people fifty or a hundred years from now will probably still be able to compile it. People are still working in Fortran and Lisp today (though not always by choice...), and the original versions of those languages are more than 50 years old; C isn't quite middle-aged yet, but I don't think it's a stretch to say that C will still be compilable in 2075, if we're not all busy fighting the rad-zombies for Soylent.

All this is, of course, a bit much for someone who just wants to play with the technology. Fortunately, there's also a ready-to-go free-software Windows paper-backup program, inventively named "PaperBack".

The PaperBack source is downloadable too (C++, this time), so PaperBack is another real option for long-term backups. And it can cram about half a megabyte of data onto a 600dpi A4 page, though I wouldn't trust my cheap laser printer with more than 180k per page.

PaperBack includes compression tuned to work very well with plain text, so it's an ideal solution for backing up written works, program source code, lists of passwords and exported data from your accounting program. I found that with compression turned on, a 746-kilobyte plain-text version of Charles Dickens' A Tale Of Two Cities only took up about one and a quarter PaperBack pages...

...even using my crummy laser printer. Printed as tight-packed eight-point text, it would have been more than sixty pages.

(General-purpose compression like Zip or 7-Zip will give the best results with most files, but PaperBack's compression is clearly better for plain text.)

PaperBack even has built-in encryption, though you can of course also encrypt your data in some other way before backing it up. However you encrypt any backup, you should of course make sure you remember the password, or separately back up the key certificates, or whatever the key for the encryption scheme you're using happens to be. If you don't, encryption can more accurately be called the "delayed Recycle Bin". If your data doesn't need "real" encryption, data-matrix encoding just by itself will stymie casual snoopers.

PaperBack also has error-correction, adjustable from enough for your data to survive the loss of one little square block of dots in every ten, to enough to tolerate the loss of one block in every two. I did my capacity tests with the default one-in-five redundancy, and also tested the correction with a bit of hole-punching and scribbling.
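The "one block in every N" idea can be illustrated with plain XOR parity: add one extra block per group, and any single missing block in that group can be rebuilt from the rest. This Python sketch is only an illustration of the principle under that assumption; it is not taken from PaperBack's source, and the function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def add_parity(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(group, lost_index):
    """Rebuild the block at lost_index from every other block in the group."""
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return xor_blocks(survivors)

# Five equal-sized data blocks plus one parity block: "one-in-five" redundancy.
data = [b"block one!", b"block two!", b"block 3333", b"block four", b"block five"]
group = add_parity(data)

# Pretend the hole punch went through block 2.
print(recover(group, 2))   # prints b'block 3333'
```

Lose two blocks from the same group and this simple scheme is stuck, which is why you'd dial the redundancy up for paper that's going to lead a hard life.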

At 180 kilobytes per page, you'll need about 5,825 pages of A4 copy paper to back up a gigabyte of data. And a few toner cartridges. And a paper-slave to keep feeding the printer.
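The page count comes from dividing a powers-of-two gigabyte by 180 powers-of-two kilobytes, which works out to a shade over 5,825, so strictly you'd need one more sheet for the leftovers. A sketch, with names made up for the purpose:

```python
import math

KIB = 1024
GIB = 1024 ** 3

def pages_needed(size_bytes, kib_per_page=180):
    """Whole pages needed to hold size_bytes at the given per-page capacity."""
    return math.ceil(size_bytes / (kib_per_page * KIB))

print(f"{GIB / (180 * KIB):.1f} pages' worth of data")   # prints "5825.4 pages' worth of data"
print(pages_needed(GIB), "sheets of paper")              # prints "5826 sheets of paper"
```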

All of my passwords and other login info are only 24,074 bytes, though. Even without compression, PaperBack can fit that on an A7 index card.

And ten years of my business accounts zip down to about 2.7MB. That's only fifteen cheap-laser pages.

Works for me!