DNA storage systems aren't commercially ready yet, but they hold the potential to store petabytes of data on just one gram of DNA, offering extremely dense storage.

While Microsoft researchers have been working on an automated DNA storage and retrieval system, researchers from North Carolina State University reckon they've cracked another challenge that will arise as high-capacity DNA storage systems emerge in the future.

Current data-retrieval techniques would be overwhelmed by the sheer volume of DNA in future systems, where information needs to be retrieved from databases holding zettabytes of data.

The researchers detail in a new paper that they used "chemical handles to selectively extract unique files from a complex database of DNA mimicking 5TB of data." The data was mimicked because it's still too expensive to order DNA databases that large. They also designed a nested file system that can handle exascale databases.

The researchers call their system DENSE data storage, or DNA Enrichment and Nested SEparation. Existing file-access methods use a random-access technique, which the researchers argue would be overwhelmed by larger databases.

"Two of the big challenges here are, how do you identify the strands of DNA that contain the file you are looking for? And once you identify those strands, how do you remove them so that they can be read – and do so without destroying the strands?" said James Tuck, one of the authors of the paper and an associate professor of electrical and computer engineering at NC State.

Fellow author Albert Keung, assistant professor of chemical and biomolecular engineering at the university, explains that previous systems append 'primer-binding sequences' to the DNA strands that store information. The problem is that there are only about 30,000 binding sequences available, meaning that such systems are limited to 30,000 file names.

"You could use a small DNA primer that matches the corresponding primer-binding sequence to identify the appropriate strands that comprise your desired file. However, there are only an estimated 30,000 of these binding sequences available, which is insufficient for practical use. We wanted to find a way to overcome this limitation."

Their system uses two nested primer-binding sequences. It identifies all the DNA strands containing the initial binder sequence and then runs a search of the group identified to find strands with the second binder sequence. This technique allows for 900 million file names.
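The 900 million figure follows from squaring the roughly 30,000 available binding sequences. The sketch below is a toy illustration only, not the researchers' method: the `Strand` class, its integer `outer` and `inner` labels, and the `select_file` helper are hypothetical stand-ins for chemistry that happens in a test tube.

```python
from dataclasses import dataclass

PRIMERS_AVAILABLE = 30_000  # estimate quoted in the article


def address_space(primers: int, levels: int) -> int:
    """Distinct file names when `levels` binding sequences are nested."""
    return primers ** levels


@dataclass(frozen=True)
class Strand:
    outer: int    # hypothetical label for the first binding sequence
    inner: int    # hypothetical label for the second, nested sequence
    payload: str  # stand-in for the data-carrying part of the strand


def select_file(pool: list[Strand], outer: int, inner: int) -> list[Strand]:
    # Stage 1: keep only strands carrying the outer binding sequence.
    stage1 = [s for s in pool if s.outer == outer]
    # Stage 2: search within that subset for the inner binding sequence.
    return [s for s in stage1 if s.inner == inner]


pool = [
    Strand(7, 1, "A"),
    Strand(7, 2, "B"),
    Strand(8, 1, "C"),
    Strand(7, 2, "D"),
]

single_level = address_space(PRIMERS_AVAILABLE, 1)  # 30,000 names
two_level = address_space(PRIMERS_AVAILABLE, 2)     # 900,000,000 names
selected = [s.payload for s in select_file(pool, outer=7, inner=2)]

print(single_level, two_level, selected)
```

In this toy model, each extra level of nesting multiplies the number of addressable files by the number of available binding sequences, which is why two levels are enough to move from thousands of file names to hundreds of millions.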

The second challenge was to develop an alternative method for extracting the file to read it.

"Existing techniques use polymerase chain reaction (PCR) to make lots (and lots) of copies of the relevant DNA strands, then sequence the entire sample. Because there are so many copies of the targeted DNA strands, their signal overwhelms the rest of the strands in the sample, making it possible to identify the targeted DNA sequence and read the file," the researchers explain.

Part of the researchers' low-copy solution was to attach molecular tags to the primers used to find the right DNA strands. They also used "magnetic microbeads coated with molecules that bind specifically to a given tag".

The microbeads latch on to the tags of targeted DNA strands and then are retrieved with a magnet, which brings the selected DNA with them.

"This system allows us to retrieve the DNA strands associated with a specific file without having to make many copies of each strand, while also preserving the original DNA strands in the database," Keung said.
