With the newest DNA sequencing technology starting to reach the market, we're seeing a bit of a bifurcation. Some of the methods can do long reads, covering hundreds of bases, and provide data that's appropriate for assembling a genome that's never been sequenced before. Others produce lots of shorter reads, which can only be aligned to a genome that we know the sequence of already. What good is repeating a completed genome? Potentially quite a lot, if that genome happens to be human and, more particularly, yours, since it can provide information on medically relevant issues like disease risks and drug efficacy. The goal here is to make this so cheap that sequencing a person's genome could be routine.

A big step in that direction may have been taken by a company called Complete Genomics, which describes the methods it used to sequence three human genomes in a paper that will be released by Science today. The system described in the paper combines some clever variants of well-known molecular biology techniques to read massive numbers of DNA fragments, getting about 65 bases of sequence from each. And, because the materials used for the reactions are so common, even the enzymes can be purchased cheaply. That allows Complete Genomics to bring an entire human genome in while spending less than $5,000 on materials. All that, plus an error rate of less than one base in 100,000.

For comparison, the completion of Jim Watson's genome, done just a few years ago, is estimated to have cost $20 million.

Building DNA "nanoballs"

Both our introduction to DNA sequencing and our tutorial on DNA ligation will be invaluable for understanding the technique. There are a lot of subtleties to it, but the basic outlines aren't actually that hard to follow.

The first steps in preparing the DNA for sequencing involve linking short stretches of DNA with a series of known primers in a small circle. The authors start by taking a complete genome, and using ultrasound to fragment the DNA into linear pieces that, on average, are about 450 base pairs long. These are ligated to a known stretch of DNA, amplified by PCR, and then looped into a circle by ligating the ends of the PCR products. Now, they've got a known bit of DNA that they can start sequencing from linked to their unknown genomic DNA.

Here's where the clever bit starts. The company's technique can only sequence a short distance from primer sites, so they simply insert a few more known sequences a short distance away from the one they ligated in place. There are a number of proteins, called restriction enzymes, that cut DNA at known sequences. A subset of these (which I'd previously considered pretty useless) don't cut at the sequence they recognize, but at a very specific distance away from the recognition sequence. So, the authors include recognition sequences for an enzyme called Acu I in their insert, and use it to open their loop about a dozen bases away. This lets them ligate in an additional sequence precisely 13 base pairs from the first.
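To make the "cut at a fixed distance" idea concrete, here's a minimal sketch that treats the opened loop as a simple string. The recognition site string, the 13-base offset measured from the end of the site, and the function names are all placeholders chosen for illustration; they haven't been checked against the real enzyme's specification, and the sketch ignores the second DNA strand entirely.

```python
# Placeholder values for illustration only; not the enzyme's verified parameters.
RECOGNITION_SITE = "CTGAAG"
CUT_OFFSET = 13  # bases between the end of the site and the cut


def cut_downstream(dna, site=RECOGNITION_SITE, offset=CUT_OFFSET):
    """Cut a linear DNA string a fixed distance past the recognition site.

    Returns (left, right). Raises ValueError if the site is absent or the
    cut position would fall past the end of the sequence.
    """
    start = dna.find(site)
    if start == -1:
        raise ValueError("recognition site not found")
    cut_at = start + len(site) + offset
    if cut_at > len(dna):
        raise ValueError("sequence too short to cut at the requested offset")
    return dna[:cut_at], dna[cut_at:]


def insert_adaptor(dna, adaptor, site=RECOGNITION_SITE, offset=CUT_OFFSET):
    """Open the DNA at the fixed offset and splice a known sequence into the gap."""
    left, right = cut_downstream(dna, site, offset)
    return left + adaptor + right
```

The point of the toy is that the position of the new insert depends only on where the recognition site sits, not on what the unknown genomic DNA next to it happens to be.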

They then repeat the process in the opposite direction, bringing the DNA circles up to 3 inserts. With that, they switch to EcoP15 I, and add a fourth and final insert, closing a circle that now has four pieces of known sequence separated by two 13 and two 26 base long inserts of genomic DNA.

DNA ligase is used to close the human genomic DNA into a loop that incorporates DNA with a known sequence. That sequence directs a restriction enzyme to open the loop, allowing a different known sequence to be inserted. By repeating this process, a small loop with four known sequences is created.

Rolling circles

At this point, however, the process has left us with a big mix of single molecules to work with. Sequencing one, or even a few, DNA molecules is always challenging because of the signal-to-noise issue: with fewer molecules, you get less signal, and errors creep in when that signal gets swamped by noise. So, the authors amplify their single molecules using a process called rolling circle replication.

If you recall from our first installment, a polymerase supplied with a primer and some nucleotides will start copying the DNA until it runs off the end of a linear piece. But here, we're dealing with a circular piece—there's no end to run off—and the polymerase will quickly loop around to where it started, running into the double-stranded DNA that it had recently produced. At this point, most polymerases will stop.

Most, but not all. Some viruses have polymerases that will simply separate the two strands of DNA, and keep copying away, going around the loop until they run out of nucleotides to add (it's a great way to make lots of viruses quickly). In the hands of Complete Genomics, rolling circle replication gets them lots of copies of their sequences of interest, making sure that they can pick the signal out of the noise.

In rolling circle replication, the DNA polymerase runs around a circular piece of DNA, displacing its previous work, and producing many copies of the original circle, linked together as a single molecule.
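As a rough illustration, the loop below walks around a circular template until a nucleotide budget runs out. It's a toy under stated assumptions: the function name and the `bases_available` parameter are made up for this sketch, and it copies the template letters directly rather than producing the complementary strand a real polymerase would make.

```python
def rolling_circle(circle, bases_available):
    """Copy around a circular template until the nucleotide supply runs out.

    Returns the product as one long string of linked copies.
    Simplification: bases are copied as-is, not complemented.
    """
    if not circle:
        raise ValueError("template must not be empty")
    product = []
    position = 0
    for _ in range(bases_available):
        product.append(circle[position])
        # Wrap around instead of stopping: a circle has no end to run off.
        position = (position + 1) % len(circle)
    return "".join(product)
```

Calling `rolling_circle("ACGT", 10)` gives two full copies plus a partial third, all in a single string, which is the property that matters for what comes next.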

Even more importantly, all these copies are chemically linked together; there's no need to do anything special to ensure that they're all physically kept in proximity. The authors call these structures "DNA nanoballs." They distribute them on a plate etched using photolithography (the DNA sticks to places that are etched) in order to separate them sufficiently for imaging.

Sequencing by ligation

So far, we've discussed using DNA polymerase for sequencing reactions, but it's also possible to perform sequencing using DNA ligases, an approach pioneered by ABI's SOLiD system. The idea is that a specific enzyme, T4 DNA ligase, will link two neighboring pieces of DNA, but only if they're base paired properly. So, the reaction will work when the sequence matches perfectly, but even a single mismatch will cause it to fail, or at least make it very inefficient in comparison.

It's possible to use this sort of proofreading activity to identify sequences. So, given a mix of DNA fragments and a primer, ligase will preferentially link the one that precisely matches the sequence next to the primer. This preference extends for about six bases past the site at which the link is being made; beyond that, mismatches are tolerated.

In theory, it should be possible to label every possible combination of five bases, and see which one gets ligated. But there are over 1,000 possible five base sequences, and we simply don't have the ability to identify 1,000 different labels. So, the trick to sequencing by ligation is to only query a single base at a time using four labels.

That can be accomplished by synthesizing different sets of DNA fragments, where each set contains each of the potential 1,000+ five base sequences. In the first set, every fragment that starts with A gets a yellow fluorescent label, ones that start with T get a red one, and so on. In the next set, the yellow label gets put on the fragments with an A in the second position, red for ones with T at that location, etc.

A set of DNA fragments where different molecules contain each of the possible bases in each location (represented by an N). For Set 1, any fragment that starts with an A gets a yellow label, T gets a red, etc. For the next set, the same applies to the second base; five sets are needed to cover all five bases.
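The labeling scheme is easy to sketch in code. In the snippet below, yellow for A and red for T come from the description above; the colors for C and G, along with all of the names, are assumptions made for the sake of the example.

```python
from itertools import product

BASES = "ACGT"
# A -> yellow and T -> red follow the text; the C and G colors are assumed.
LABELS = {"A": "yellow", "T": "red", "C": "green", "G": "blue"}
PROBE_LENGTH = 5


def all_probes(length=PROBE_LENGTH):
    """Every possible fragment of the given length (4**length of them)."""
    return ["".join(p) for p in product(BASES, repeat=length)]


def labeled_set(query_position, length=PROBE_LENGTH):
    """Map each fragment to a label chosen by the base at query_position (0-based)."""
    return {probe: LABELS[probe[query_position]] for probe in all_probes(length)}


# One set per position: set 0 is labeled by the first base, set 1 by the second, ...
probe_sets = [labeled_set(i) for i in range(PROBE_LENGTH)]
```

Each set contains the same 1,024 fragments; only the rule for assigning the four labels changes from one set to the next.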

You can then use these sets to work your way down the sequence one base at a time. Given set one, the ligase will incorporate any fragments that match perfectly. Once the reaction mix is washed away, the fluorescent signal will identify the base at position one. The ligated DNA can then be stripped off the DNA nanoballs, and a new primer and fragment set 2 added, allowing the next base to be read.

Given a primer and the first set of labeled DNA fragments, DNA ligase will preferentially link those fragments that are exact matches to the sequence. When the excess fragments are washed away, a fluorescent signal can be read. The newly ligated DNA can then be removed, allowing the process to be repeated with the next set of fragments.
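Here's a toy version of that read-strip-repeat cycle, assuming an idealized ligase that only ever joins the perfectly matching fragment. The names and the colors for C and G are assumptions, and fragments are written as the sequence they read (so matching is plain string equality rather than base pairing).

```python
from itertools import product

BASES = "ACGT"
# A -> yellow and T -> red follow the text; the C and G colors are assumed.
LABELS = {"A": "yellow", "T": "red", "C": "green", "G": "blue"}
COLOR_TO_BASE = {color: base for base, color in LABELS.items()}


def labeled_set(query_position, length=5):
    """All fragments of the given length, labeled by the base at query_position."""
    return {
        "".join(p): LABELS[p[query_position]]
        for p in product(BASES, repeat=length)
    }


def ligate_and_image(probe_set, target):
    """Idealized ligase: only the fragment identical to the target is joined.

    Returns the color seen once the unligated fragments are washed away.
    """
    return probe_set[target]


def read_bases(sequence_next_to_primer, length=5):
    """Read `length` bases, one per cycle, using one labeled set per position."""
    target = sequence_next_to_primer[:length]
    if len(target) < length:
        raise ValueError("not enough sequence next to the primer")
    calls = []
    for position in range(length):
        color = ligate_and_image(labeled_set(position, length), target)
        calls.append(COLOR_TO_BASE[color])
        # In the lab, the ligated fragment would be stripped off here
        # before the next set is added.
    return "".join(calls)
```

The same fragment gets ligated in every cycle; what changes is which of its positions the color reports on.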

There are four separate pieces of genomic DNA present, and five bases can be read in both directions from each of the four inserts, so this method alone can produce 40 bases of sequence in total.

Complete Genomics lives up to its name

It's not entirely clear, but some of the language in the paper suggests that the authors attempted to build a human genome using these 40 base reads, and failed. So they extended each read by five more bases by adding an additional step, namely ligating on an unlabeled five-base fragment before performing the labeled ligation. This gets them 10 bases in each direction from each insert. Accounting for overlap within some of the inserts, this nets them a total of 66 bases from each DNA nanoball.
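The 40 and 66 figures can be reproduced with a bit of bookkeeping. The overlap model below is my own reading rather than something spelled out in the paper's description here: it assumes the two 13-base stretches of genomic DNA are each read inward from both flanking inserts, so any bases read beyond 13 are the same positions counted twice.

```python
ADAPTORS = 4             # known inserts in each loop
DIRECTIONS = 2           # reads run both ways from each insert
SHORT_INSERTS = 2        # genomic stretches only 13 bases long
SHORT_INSERT_LENGTH = 13


def total_bases(bases_per_direction):
    """Distinct genomic bases read per nanoball under the assumed overlap model."""
    raw = ADAPTORS * DIRECTIONS * bases_per_direction
    # A short stretch is read from both ends; reads longer than the stretch
    # can hold cover the same positions twice.
    overlap = max(0, 2 * bases_per_direction - SHORT_INSERT_LENGTH)
    return raw - SHORT_INSERTS * overlap


print(total_bases(5))   # five bases per direction
print(total_bases(10))  # ten bases per direction
```

With five bases per direction nothing overlaps (10 is less than 13), giving 40. With ten, each short stretch is covered by 20 read bases for 13 positions, so 7 are duplicates, and 80 minus 14 gives 66.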

But, because each of these nanoballs was so small, it's possible to do massive amounts of sequencing in parallel, simply by washing different solutions across the surface they were stuck to. The authors were able to get anywhere from 45- to 87-fold coverage of the genome—meaning that, on average, each base in the genome had been sequenced 45 or 87 times, respectively. That's just an average, of course, as chance will ensure that some sequences are under- or overrepresented. Still, given an algorithm designed to work with these specific results and the reference human genome sequence, Complete Genomics was able to get excellent coverage of the genome.
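To get a feel for the scale involved, here's a back-of-the-envelope sketch. It assumes a human genome of roughly 3 billion bases, 66 usable bases per nanoball, and a Poisson model for how unevenly reads land; the Poisson model is an idealization of mine, not something taken from the paper.

```python
import math

GENOME_SIZE = 3_000_000_000  # approximate human genome size (assumed round figure)
READ_LENGTH = 66             # bases obtained per nanoball


def reads_for_coverage(fold, genome_size=GENOME_SIZE, read_length=READ_LENGTH):
    """Number of reads needed to reach a given average fold coverage."""
    return math.ceil(fold * genome_size / read_length)


def poisson_at_most(k, mean):
    """P(depth <= k) if per-base depth followed a Poisson distribution."""
    term = math.exp(-mean)  # probability of depth 0
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # step from P(i - 1) to P(i)
        total += term
    return total


print(reads_for_coverage(45))   # nanoballs needed for 45-fold coverage
print(poisson_at_most(10, 45))  # chance a base is covered 10 times or fewer
```

Under these assumptions, 45-fold coverage takes on the order of two billion nanoballs, and pure chance alone would leave very few bases poorly covered. Real coverage is less even than the Poisson ideal, which is why the evenness of coverage matters.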

On average, small differences between genomes appear about once every thousand bases, so the error rate of the technique becomes rather important; you want to identify the real differences, and not get sidetracked by spurious ones. The authors calculate their false positive rate at only one for every 100,000 bases. We talked with the Broad Institute's Co-director of Genome Sequencing, Chad Nusbaum, about the technique in September. He's seen some preliminary data from Complete Genomics, along with a draft of the paper, and described himself as "very optimistic about what we'll be able to do with their data."
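A quick calculation shows why that error rate matters. The sketch below takes real differences at one per thousand bases and false positives at one per 100,000 bases (the error rate quoted at the top of the article), and assumes a genome of roughly 3 billion bases in which every real difference is detected. Both the genome size and the perfect detection are simplifying assumptions.

```python
def expected_calls(genome_size, variant_rate, false_positive_rate):
    """Expected true variants, false positives, and the false share of all calls.

    Simplification: assumes every real difference is detected.
    """
    true_variants = genome_size * variant_rate
    false_positives = genome_size * false_positive_rate
    false_share = false_positives / (true_variants + false_positives)
    return true_variants, false_positives, false_share


true_variants, false_positives, false_share = expected_calls(
    genome_size=3_000_000_000,       # assumed round figure
    variant_rate=1 / 1_000,
    false_positive_rate=1 / 100_000,
)
print(f"{true_variants:,.0f} real, {false_positives:,.0f} spurious, "
      f"{false_share:.1%} of calls false")
```

With these inputs, real differences outnumber spurious ones by about a hundred to one, so roughly 1 percent of the differences reported would be errors.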

Right now, however, it's clear that Complete Genomics is a work in progress. The authors indicate they tweaked some of their reaction conditions between genomes two and three, which improved the evenness of their coverage, so there may be further improvements possible. There are some indications in the text that the process isn't fully automated yet, so there may still be some stumbling blocks there. Finally, it's important to emphasize that we don't (yet) know how to assemble a genome sequence from 66 base fragments, so the technique will only work when there's a high-quality reference sequence to tell us where each of these fragments belongs.

But, really, none of that should detract from the eye-popping $4,400 figure, which is the cost of the chemicals and enzymes consumed during the process. That doesn't cover all the facilities and personnel costs but, once this thing is automated, I'd expect that a complete genome will cost less than $10,000, possibly significantly less.

Science, 2009. DOI: 10.1126/science.1181498