Knock knock, Neo: digital code now melds with the biological.

This post was co-authored by Alice Chang of CyberReboot.

Glow-in-the-dark plants, editing humans and NOW what?

BioQuest readers likely know that biotechnology is shaping our lives now as profoundly as information technology has since the 1960s. Molecular biologists, using the increasingly powerful tools of genome engineering and synthetic biology, are designing and building organisms that will feed us, fuel vehicles, cure diseases, and produce fine chemicals and new materials, soon at an industrial scale. As I discussed in my last blog post, even our non-biological data is beginning to be stored in DNA molecules, where it may last for hundreds of centuries.

Some observers of the blistering pace of advances in synthetic biology threw up their hands (if not their dinners) in dismay last month when they read the latest: now scientists have built DNA molecules that can hack a computer. Reactions have ranged from “terrifying” to “unrealistic”. But…can you really hack a computer with a piece of DNA?

Well, yes, really…that is, if you target DNA analysis software that was written with scant attention to security, and then introduce a further vulnerability into that software yourself…and if you use a very short piece of malware…and pick an exploit that is easy to encode and synthesize in DNA…

In other words, the DNA synthesis task was mundane, and the malware was simple. But if neither the biology nor the hack is impressive, is there reason for concern? We discuss here why this experiment is noteworthy and what it portends for the future of computer security.

Out looking for trouble…

Peter Ney and colleagues at Tadayoshi (Yoshi) Kohno’s lab at the University of Washington have made careers out of looking for trouble. More precisely, their interest in computer security and data privacy motivated them to seek and demonstrate vulnerabilities in networked computers that run cars, “smart house” appliances, fitness trackers and many other systems that affect our lives, often without our notice. In all these technology sectors, Kohno’s group has found vulnerabilities: opportunities for the insertion of malicious code that can disable our cars, collect data from our homes, and even alter the function of implanted medical devices. As a result, software engineers have taken notice and security has (hopefully) improved across what we now call the “Internet of Things”.

…and lately finding it in vulnerable DNA processing software.

The group turned their attention to the world of DNA sequencing and analysis. Since the discovery of DNA and its function, biology has become increasingly an information science. The cost of sequencing, or “reading,” DNA has plummeted by many orders of magnitude in the last 10–12 years. We are therefore now awash in a veritable ocean of DNA sequences, and powerful computers are required to capture, assemble, and analyze this enormous and growing data set.

Until recently, this pursuit had been largely academic, devoted to learning how DNA sequences shape and control living things, and how changes in DNA sequence not only produce natural variation in, say, humans, but can also lead to disease. However, as more and more sequence data accumulates, and its practical applications become increasingly valuable, the motivation to steal or manipulate sequence data is becoming stronger. The increasing use of synthetic biology to manufacture valuable materials and chemicals, identify people, solve crimes, and create patentable medical products makes DNA sequences an attractive storehouse of value, and a target for manipulation.

The problem is that the security of the software being used to analyze DNA hasn’t kept up with the accelerating pace of DNA sequencing and analysis. Much of the bioinformatics software now in use was adapted from software written for academic purposes, which was created before there was a general awareness of the significant need to protect both the tools and the intellectual property they help create.

DNA, at one level, is just a physical store of coded information that can be read by a sequencing instrument and stored in a computer. Kohno’s group set out to test whether malicious code could be inserted into a computer by encoding it in a physical form, an actual DNA molecule. After all, his group has inserted test malware into systems by many other means: through unprotected web forms, wireless connections (Bluetooth, WiFi, Zigbee), QR code scanners, documents delivered by email, and any other means by which data can flow into a computer.

How DID they hack a computer with DNA?

In computer security, attacks often leverage a software flaw, or “vulnerability,” using a toolset, or “exploit,” to achieve a desired outcome. The group set out to encode an exploit in a DNA molecule so that when the DNA was sequenced, the encoded exploit would become computer code that would be executed and allow the “hackers” to assume control of the computer. Scientists use a series of programs in sequence to collect, assemble, and analyze DNA. To save effort, Kohno’s group chose to exploit a program early in this pipeline that handles small stretches of DNA sequence. That meant the malware they encoded had to be short, within the size of a single fragment of a DNA molecule that can be sequenced without the need for further assembly (about 300 bases, or characters, long).

A short exploit also allowed the synthesis of the malware into a single short DNA molecule. As it happens, current DNA synthesis is also constrained to short molecules (100–200 characters in length). So, to exploit a program early in the pipeline, malware had to be both written into, and readable from, a single molecule’s sequence without the need for further assembly with other fragments of code (more on this below). A single short DNA molecule is also inexpensive to order from a synthesis company, in their case, well under $500.
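For a sense of the arithmetic involved: DNA has four bases, so each base can carry two bits, one byte takes four bases, and a ~43-byte payload needs on the order of 170–180 bases. The codec below uses a toy base-to-bits mapping for illustration; it is an assumption, not necessarily the encoding Ney and colleagues used.

```python
# Toy byte<->DNA codec: 2 bits per base, so one byte becomes 4 bases.
# The A/C/G/T-to-bits mapping is an illustrative assumption, not the
# authors' actual encoding scheme.
BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BITS_FOR_BASE = {b: v for v, b in BASE_FOR_BITS.items()}

def bytes_to_dna(payload: bytes) -> str:
    """Encode each byte as four bases, most-significant bits first."""
    bases = []
    for byte in payload:
        for shift in (6, 4, 2, 0):
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)

def dna_to_bytes(seq: str) -> bytes:
    """Decode a base string (length divisible by 4) back into bytes."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)

payload = b"\x31\xc0\x50"          # placeholder bytes, not real shellcode
encoded = bytes_to_dna(payload)
assert dna_to_bytes(encoded) == payload
assert len(encoded) == 4 * len(payload)  # 43 bytes ≈ 172 bases before padding
```

At this density, a 176-base molecule comfortably fits a few dozen bytes of payload, which is why the exploit had to be so compact.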

The group encountered difficulties along the way. Some DNA sequences are harder to synthesize into molecules than others. With current methods, repetitive sequences (say, 20 copies of “GATC” in a row) are difficult to make with high accuracy. (Typically, the desired accuracy needs to be well over 99.9%, to ensure that all your molecules carry accurate code.) Likewise, long runs of the same character (such as “TTTTTTTTTTTTT”) are difficult to make. Malware, however, often contains such repeats, so the group had to adapt their malware design and, as it turned out, rewrite part of the target software program to permit a successful attack.

A tube-ful of malware (picture from Ney et al.)

The exploit

Despite the novelty of the molecular medium, the attack itself is an age-old classic in the computer security field: exploiting a program’s lack of input validation to force execution down an unintended path. In fact, their initial attack was modeled on the buffer overflow attack first made famous in hacker Aleph One’s 1996 Phrack magazine article, “Smashing the Stack for Fun and Profit.”[1] In this attack, the submitted input is larger than the memory allocated for it, so it overwrites adjacent memory, including the saved return address, redirecting execution to attacker-supplied code that starts a command-line shell for the attacker’s use. In constructing the attack, Aleph One (and many subsequent attackers, following his convention) used a block of repeating characters called a NOP-slide to funnel program execution toward the attack’s next instruction.
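Python's memory is bounds-checked, so a real overflow needs a language like C, but the mechanism can be simulated with a list standing in for a stack frame. This is a toy model of the flaw, not working exploit code:

```python
def vulnerable_copy(stack: list, buffer_size: int, user_input: list) -> None:
    """Copy input into a 'buffer' at the bottom of a simulated stack frame.

    There is deliberately no check of len(user_input) against buffer_size --
    the classic flaw behind a buffer overflow.
    """
    for i, byte in enumerate(user_input):
        stack[i] = byte

# Simulated frame: an 8-byte buffer followed by the saved return address.
BUF_SIZE = 8
stack = [0] * BUF_SIZE + ["<return address>"]

# Benign input stays inside the buffer; the return address survives.
vulnerable_copy(stack, BUF_SIZE, [1] * BUF_SIZE)
assert stack[BUF_SIZE] == "<return address>"

# Oversized input (here padded with 0x90, the x86 NOP opcode used in a
# NOP-slide) spills past the buffer and overwrites the return slot, so the
# function "returns" wherever the attacker chose.
vulnerable_copy(stack, BUF_SIZE, [0x90] * BUF_SIZE + ["<attacker address>"])
assert stack[BUF_SIZE] == "<attacker address>"
```

The NOP padding matters because the attacker rarely knows the exact address of the injected code; landing anywhere in the slide works.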

As mentioned in the section above, blocks of repeated characters are challenging to synthesize into a DNA molecule, and challenging to sequence in a single “read” as well. Even the most compact versions of this attack are too long for the current state of DNA technology. While programmatic techniques may have overcome the repeated block structure, the length issue proved to be unavoidable, and the team opted for a different attack strategy that ultimately proved successful.

Their second attempt used the return-to-libc attack method, which avoided the need for a long NOP-slide by repurposing existing linked library functions for the attacker’s use (in this case, opening a network connection to a server and redirecting input and output to it). The result was console access on the vulnerable machine. This second attack was encoded in 43 bytes, or 176 bases of DNA, well within the current read-length capacity of DNA sequencers.

To get the exploit to work, however, the authors still needed to handicap their computing environment. First, they disabled common operating-system security features such as stack canaries and address-space randomization, which would have complicated their attack. (Indeed, one of their recommendations for mitigating risk is simply to use the security mechanisms commonly available on modern computing platforms.) Second, they modified the target program (an existing open-source program) to introduce a more easily exploitable vulnerability. They noted, however, that the vulnerability they introduced is not uncommon in open-source utilities downstream of the sequencer; they merely skipped the vulnerability-discovery phase in order to focus on whether a molecular medium is a practical attack vector.

Given these handicaps, why do we care?

The medium is the message: a new set of “attack surfaces”

While neither the biology nor the malware in the study was remarkable on its own, the authors illustrated yet another technology arena rife with previously unconsidered vulnerabilities. As they point out, the “attack surfaces” include both physical and virtual means of malicious-code insertion. The physical routes all require encoding malware into a DNA sequence, synthesizing the molecules, and sending them to be sequenced by a target lab. They include:

Physical sample contamination. Hackers could introduce malware-containing DNA into clinical, forensic, or environmental samples, so that the contaminating DNA is co-purified with the sample and then sequenced.

Multiplexed sequencing. Hackers could introduce malware-containing DNA sequence into a legitimate sample by co-sequencing them in a lab. In order to increase throughput, scientists mix DNA samples and sequence them in one batch. DNA from each sample has a unique sequence tacked onto one end of its DNA fragments, so that after sequencing, individual reads can be binned into data files for each sample (again, using software). Nevertheless, DNA sequencers sometimes wrongly assign sequences from one sample into the data file of another sample sequenced in the same batch. This “sample bleed” is small: hundreds of reads, of 100–300 characters each, out of millions of reads total per sample. When Yoshi’s team co-sequenced their malware molecules with a genomic DNA sample, sure enough, a handful of malware molecule sequences were found in the co-sequenced DNA data file (and vice versa — some genomic sequences were found in the malware data file as well). This phenomenon makes it possible to intentionally insert malware or other DNA-encoded information into an otherwise pristine DNA sample with which it is co-sequenced.
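The binning step, called demultiplexing, can be sketched as follows. The barcodes, sample names, and reads are invented for illustration, and sample bleed is modeled in the simplest possible way: a read that ends up carrying the wrong tag lands in the wrong sample's file.

```python
def demultiplex(reads, barcode_to_sample, barcode_len=6):
    """Bin each read by the index barcode at its front.

    Barcodes and bin names are illustrative; real sequencers read the
    index separately and can misassign it at a low rate ("sample bleed").
    """
    bins = {sample: [] for sample in barcode_to_sample.values()}
    bins["undetermined"] = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return bins

barcodes = {"AAGGTT": "genomic_sample", "CCTTAA": "malware_sample"}
reads = [
    "AAGGTT" + "ACGTACGTAC",   # correctly tagged genomic read
    "CCTTAA" + "TTGGCCAATT",   # correctly tagged malware read
    "AAGGTT" + "TTGGCCAATT",   # bleed: malware insert under the genomic tag
]
bins = demultiplex(reads, barcodes)
assert "TTGGCCAATT" in bins["genomic_sample"]  # stray read, wrong file
```

The software doing this binning sees every read in the batch, which is exactly why stray malware-encoding reads can cross into another sample's data file.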

Of course, DNA sequence data doesn’t need to be embodied in a physical molecule in order to be used as a means of data input for a malware attack. Other means by which DNA-encoded malware could be used to attack bioinformatics programs include:

Publicly accessible databases. Sequence data is routinely added to public sequence databases, such as GenBank, by users who upload their sequences through a freely available portal. Any user who downloads data containing sequence-encoded malware and processes it with programs lacking proper security controls faces risk of attack.

DNA synthesis services. Just as DNA molecules are submitted to commercial and academic facilities for sequencing, DNA sequences are also designed by synthetic biologists and submitted to “foundries” to be converted into DNA molecules. These sequences could also contain malware that targets the computational platform of the DNA synthesis facility.

What does this mean for the synthetic biology community?

The number of bioinformatics programs now in use is rather large, because of the historically bespoke nature of bioinformatics programming. The number of potential vulnerabilities across the DNA synthesis and sequencing workflow is therefore also large. Attacks via DNA molecules are currently constrained by the same problems faced by the University of Washington team regarding what sequences can be effectively written into DNA, and limited by the need to insert the molecules into the physical DNA sequencing pipeline. However, our ability to both sequence and synthesize DNA is improving. The physical limitations discussed here will likely be reduced or eliminated over the next five years or so. Of course, these constraints are relieved entirely now if malware is simply encoded into sequence and uploaded to a database. If DNA-encoded malware is downloaded from a public database and processed by vulnerable software in a user’s analysis, an exploit may be successfully executed.

What do we do?

While industry leaders (quoted here) and the authors themselves believe the threat remains remote, the entire synthetic biology enterprise, both academic and industrial, is now on notice that their systems could be vulnerable. Their publication is timely, as the economic value of products of synthetic biology will only increase with time, motivating hackers to access, modify, or steal intellectual property that is often protected as trade secret rather than by patent. Software developers in bioinformatics need to analyze their current toolkit, fix vulnerabilities, and adopt coding best practices. In short, if software security did not matter in biology previously, it does now. It is time to take notice, because synthetic biology is becoming fully fledged industrial biology, and genomics now extends into the realm of information science and storage.
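One of those best practices is simply to validate sequence input before it reaches any parser or fixed-size buffer. A minimal sketch of the idea follows; the limit, alphabet, and function name are invented for illustration, not taken from any real pipeline.

```python
MAX_READ_LEN = 1000        # illustrative cap; set per-platform in practice
VALID_BASES = set("ACGTN")

def validated_read(raw: str) -> str:
    """Reject reads that are oversized or contain unexpected characters
    before any downstream buffer or parser ever sees them."""
    if len(raw) > MAX_READ_LEN:
        raise ValueError(f"read length {len(raw)} exceeds {MAX_READ_LEN}")
    if not set(raw) <= VALID_BASES:
        raise ValueError("read contains non-nucleotide characters")
    return raw

assert validated_read("ACGTN") == "ACGTN"
try:
    validated_read("ACGT!")   # e.g. stray bytes from a crafted input file
except ValueError:
    pass                      # rejected before reaching the analysis code
```

Checks this cheap would have blocked the oversized, unchecked input at the heart of the attack described above.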

Is this Norton’s next cybersecurity product? (Image courtesy of freepik.com)

The ironic twist (of the double helix…)

I have occasionally heard or read the phrase “biological virus”, which to my ear sounds like “wooden lumber”. Of course, viruses are biological; it’s the computer “virus” that is the metaphor. Well, until now, it seems. Eventually someone will be able to write into DNA molecules malware that not only executes an exploit but then copies and distributes itself: a true computer virus, embedded in a real virus genome. Could such a virus then also direct its own synthesis in a DNA foundry, as well as in a biological host? It’s still science fiction for now, but Ney, Kohno and co-workers have taken the first step toward closing the gap between self-perpetuating code in silicon and self-perpetuating code in carbon. Stay tuned…the information security and biology communities have just collided.