A new method of data storage that converts information into DNA sequences allows you to store the contents of an entire computer hard drive on a gram's worth of E. coli bacteria...and perhaps considerably more than that.


The idea of storing data inside bacteria has been around for about a decade. Even very simple bacteria have long strands of DNA with tons of bases available for encoding data, and bacteria are by their nature far more resilient to damage than traditional electronic storage. Bacteria are nature's hardiest survivors, capable of surviving just about any disaster that would finish off a regular hard drive. Besides, bacteria's natural reproduction would create lots of redundant copies of the data, which would help preserve the integrity of the information and make retrieval easier.

Preparing traditional data for storage inside bacteria is simple enough. There are four DNA bases that can be used to make up the DNA strings: adenine, cytosine, guanine, and thymine. That basically means we're working with a base-4 number system, also known as quaternary numbers, in which every digit is 0, 1, 2, or 3.


In a presentation on their breakthrough, the Hong Kong researchers showed how to change the word "iGEM" into DNA-ready code. They used the ASCII table to convert each of the individual letters into a numerical value (i=105, G=71, etc.), which can then be changed from base-10 to base-4 (105=1221, 71=1013, etc.). Finally, those numbers can be changed into their DNA base equivalents, with 0, 1, 2, and 3 replaced with A, T, C, and G. Following those steps exactly, iGEM becomes TCCTTATGTATTTAGT.
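The conversion is easy to try for yourself. Here's a minimal Python sketch of the three steps described above (ASCII value, then base-4 digits, then DNA bases). The function names are made up for illustration, and it assumes every character fits in four base-4 digits, which holds for any ASCII letter. The output comes from the arithmetic, not from the researchers' slides.

```python
# Digit-to-base mapping as stated in the article: 0 -> A, 1 -> T, 2 -> C, 3 -> G.
BASES = "ATCG"

def char_to_quaternary(ch):
    """Return the 4-digit base-4 form of a character's ASCII value."""
    value = ord(ch)
    if value > 255:
        raise ValueError("only values up to 255 fit in four base-4 digits")
    digits = []
    for _ in range(4):
        digits.append(value % 4)   # peel off the lowest base-4 digit
        value //= 4
    return "".join(str(d) for d in reversed(digits))

def text_to_dna(text):
    """Encode text as a string of DNA bases, four bases per character."""
    return "".join(BASES[int(d)] for ch in text for d in char_to_quaternary(ch))

def dna_to_text(dna):
    """Reverse the encoding: read four bases at a time back into characters."""
    chars = []
    for i in range(0, len(dna), 4):
        value = 0
        for base in dna[i:i + 4]:
            value = value * 4 + BASES.index(base)
        chars.append(chr(value))
    return "".join(chars)

print(char_to_quaternary("i"))   # 1221
print(char_to_quaternary("G"))   # 1013
print(text_to_dna("iGEM"))       # TCCTTATGTATTTAGT
print(dna_to_text("TCCTTATGTATTTAGT"))  # iGEM
```

Because each character always takes exactly four bases, decoding is just a matter of reading the strand four letters at a time.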

Once the raw data is ready, the researchers say a few algorithms can be used to weed out redundant and repetitive information. That doesn't just save a ton of space - lots of repetition in the DNA sequence can actually be biologically harmful to the wellbeing of the DNA and bacteria, so this step rather neatly solves two problems at once.
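The article doesn't say which algorithms the researchers have in mind, so the sketch below is only a stand-in: it uses Python's standard zlib compression to squeeze out redundancy, plus a small hypothetical helper, longest_run, that measures the longest stretch of a single repeated base, the kind of repetition the paragraph above warns about.

```python
import zlib

def compress_for_storage(raw: bytes) -> bytes:
    """Stand-in compression step (zlib is an assumption, not the researchers' choice)."""
    return zlib.compress(raw, 9)

def restore(packed: bytes) -> bytes:
    """Reverse the compression step to get the raw data back."""
    return zlib.decompress(packed)

def longest_run(dna: str) -> int:
    """Length of the longest stretch of one repeated base, e.g. 4 for 'ATTTTC'."""
    best = run = 0
    previous = None
    for base in dna:
        run = run + 1 if base == previous else 1
        previous = base
        best = max(best, run)
    return best

message = b"the same phrase over and over. " * 50
packed = compress_for_storage(message)
print(len(message), "bytes before,", len(packed), "bytes after")
print(restore(packed) == message)   # True
print(longest_run("ATTTTC"))        # 4
```

Highly repetitive input shrinks dramatically, which also means fewer repeated stretches end up in the DNA.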

DNA strands aren't long enough to store complicated information like a photograph or a book, so the best available solution is to fragment the data into lots of little pieces and spread it among the different cells. To make that work, the researchers have to create a system that allows the fragments to be identified and ultimately put back in the right order. So they created a three-part structure for all the DNA: header, message, and checksum.


The header is an 8-base-long sequence that is divided into four levels of identifying information - zone, region, area and district - which allows each fragment to be put back in the right order. The message that follows carries the actual usable data, and the checksum at the end repeats the original header, which is useful in controlling for minor mutations to the bacteria.
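To make the header-message-checksum layout concrete, here's a rough Python sketch. Several details are assumptions made for illustration only: that each of the four levels gets two bases, that the header works as a simple fragment counter, and that each message holds 16 bases. The article specifies only the 8-base header, its four levels, and the fact that the checksum repeats the header.

```python
BASES = "ATCG"          # digit 0 -> A, 1 -> T, 2 -> C, 3 -> G
HEADER_LEN = 8          # from the article
MESSAGE_LEN = 16        # hypothetical payload size per fragment

def index_to_header(index):
    """Write a fragment number as 8 base-4 digits, spelled in DNA bases."""
    if not 0 <= index < 4 ** HEADER_LEN:
        raise ValueError("fragment number does not fit in the header")
    bases = []
    for _ in range(HEADER_LEN):
        bases.append(BASES[index % 4])
        index //= 4
    return "".join(reversed(bases))

def header_to_index(header):
    """Turn an 8-base header back into its fragment number."""
    value = 0
    for base in header:
        value = value * 4 + BASES.index(base)
    return value

def split_header(header):
    """Assumed layout: two bases each for zone, region, area, district."""
    names = ("zone", "region", "area", "district")
    return {name: header[2 * i:2 * i + 2] for i, name in enumerate(names)}

def make_fragments(dna):
    """Cut a long sequence into header + message + checksum fragments."""
    fragments = []
    for number, start in enumerate(range(0, len(dna), MESSAGE_LEN)):
        header = index_to_header(number)
        fragments.append(header + dna[start:start + MESSAGE_LEN] + header)
    return fragments

def reassemble(fragments):
    """Check each fragment's checksum, then join messages in header order."""
    parsed = []
    for fragment in fragments:
        header = fragment[:HEADER_LEN]
        message = fragment[HEADER_LEN:-HEADER_LEN]
        checksum = fragment[-HEADER_LEN:]
        if header != checksum:
            raise ValueError("header and checksum disagree: possible mutation")
        parsed.append((header_to_index(header), message))
    return "".join(message for _, message in sorted(parsed))

dna = "TCCTTATGTATTTAGT" * 3
fragments = make_fragments(dna)
print(fragments[1][:HEADER_LEN])                   # AAAAAAAT
print(split_header("AAAAAATT"))
print(reassemble(list(reversed(fragments))) == dna)  # True
```

Even when the fragments come back in the wrong order, the headers are enough to line them up again, and a header that no longer matches its checksum flags a damaged fragment.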

So, let's say the information has been encoded and placed in lots of different cells of bacteria. How then does someone retrieve the data on the other end? The reader would take the DNA and run it through what's known as next-generation high-throughput sequencing, or NGS. This particular type of sequencing analyzes and compares multiple copies of the same sequence and then uses majority-voting to figure out which bases are correct if parts of the data have decayed. Then the compression algorithms could be reversed to restore the raw data to its original form.
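Majority voting itself is simple. The Python sketch below assumes the copies have already been lined up and are all the same length, which real sequencing pipelines have to work much harder to achieve; the function name is just for illustration.

```python
from collections import Counter

def majority_vote(reads):
    """Pick the most common base at each position across several copies."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes aligned reads of equal length")
    consensus = []
    for column in zip(*reads):            # one column per position
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

reads = [
    "TCCTTATG",
    "TCCTTATG",
    "TCGTTATG",   # one copy has decayed at position 3
    "TCCTAATG",   # another has decayed at position 5
    "TCCTTATG",
]
print(majority_vote(reads))   # TCCTTATG
```

As long as most copies agree at each position, a few mutated copies get outvoted.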


The last step would be snapping the fragments back together in the correct order so that the DNA strands could be translated back into useful data. This is where we go from just data storage to data encryption. The person trying to read the data needs a formula that will reveal the right order of the headers and checksums - without that formula, the data remains meaningless.
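The article doesn't describe the formula itself, so here's a purely illustrative Python sketch of the idea: a secret key determines the order in which the pieces are stored, and only someone holding the same key can put them back. The names are hypothetical, and Python's random module is not cryptographically secure, so treat this as a toy rather than real encryption.

```python
import random

def secret_order(count, key):
    """Derive a repeatable shuffle of 0..count-1 from a secret key."""
    order = list(range(count))
    random.Random(key).shuffle(order)   # same key -> same shuffle
    return order

def scramble(pieces, key):
    """Store the pieces in the key-dependent order."""
    return [pieces[i] for i in secret_order(len(pieces), key)]

def unscramble(scrambled, key):
    """Use the key to put every piece back in its original position."""
    order = secret_order(len(scrambled), key)
    restored = [None] * len(scrambled)
    for position, original_index in enumerate(order):
        restored[original_index] = scrambled[position]
    return restored

pieces = ["TCCT", "TATG", "TATT", "TAGT"]
stored = scramble(pieces, "my secret key")
print(unscramble(stored, "my secret key") == pieces)   # True
```

Without the key, a reader has all the fragments but no way to tell which of the many possible orderings is the right one.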

That's the theory - how about the application? Well, let's hear straight from the researchers themselves:

This rci-system is feasible in DH5-alpha strain of E. coli, as supported by extracted plasmid DNA size. It is found that the size of the DNA extracted is consistent with that of DNA stored in the plasmid before extraction. There is no loss of DNA, implying that no large deletion has occurred during the experimental procedure. In the first trial, we encoded a short message in a single vector, together with two inverted repeats. We designed primers which targets the encoded message either in normal orientation or reverse-complementary orientation. Both sets of primers could be used to generate PCR products, indicating that encoded message exists in both recombinated and normal forms. Sequencing results confirmed the correctness of the PCR product.


The possibilities of this biotechnology are truly amazing. A single gram of E. coli cells could hold up to 900,000 gigabytes (or 900 terabytes) of data, meaning a gram of these bacteria has almost 500 times the storage capacity of a top-of-the-line commercial hard drive.


Indeed, my best hard drive is a 1.5 terabyte drive that weighs just about exactly one kilogram. If I had that hard drive's weight in storage bacteria, I'd have 900 petabytes of storage space that could sit unobtrusively in the corner of my desk. Of course, we don't yet know the precise practical applications - it's quite possible this will remain strictly used for complex encryption work.
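For anyone who wants to check the math, here it is in a few lines of Python, using decimal units (1,000 terabytes to a petabyte) and the figures quoted above. The variable names are just for illustration.

```python
TB_PER_GRAM = 900        # quoted capacity of one gram of E. coli
DRIVE_TB = 1.5           # capacity of the hard drive
DRIVE_GRAMS = 1000       # the drive weighs about one kilogram

bacteria_tb = TB_PER_GRAM * DRIVE_GRAMS   # a kilogram of bacteria, in terabytes
bacteria_pb = bacteria_tb / 1000          # the same figure in petabytes
ratio = bacteria_tb / DRIVE_TB            # how many such drives that equals

print(bacteria_pb)   # 900.0
print(ratio)         # 600000.0
```

In other words, weight for weight, the bacteria would hold as much as 600,000 of those drives.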

Now, there does seem to be one potential concern with using E. coli to store data: isn't E. coli dangerous? It appears there's not too much to worry about there - the researchers used non-virulent strains of the bacteria, and the bacteria can't do much more than store the data and reproduce. The DNA sequences that represent the data are total gibberish when it comes to encoding potentially dangerous proteins.


For more, check out the researchers' website and presentation on their new biotechnology.