It's rare for a month to go by without some aspect of DNA sequencing making the headlines. Species after species has seen its genome completed, and the human genome, whether it's from healthy individuals or cancer cells, has received special attention. A dozen or more companies are attempting to bring new sequencing technology to market that could eventually drop the cost of sequencing down to the neighborhood of a new laptop. Arguably, it's one of the hottest high-tech fields on the planet.

But, although these methods can differ, sometimes radically, in how they obtain the sequence of DNA, they're all fundamentally constrained by the chemistry of DNA itself, which is remarkably simple: a long chain of alternating sugars and phosphates, with each sugar linked to one of four bases. Because the chemistry of DNA is so simple, the process of sequencing it is straightforward enough that anyone with a basic understanding of biology can probably understand the fundamentals. The new sequencing hardware may be very complex, but all the complexity is generally there to just sequence lots of molecules in parallel; the actual process remains pretty simple.

In a series of articles, we'll start with the very basics of DNA sequencing, and build our way up to the techniques that were used to complete the human genome. From there, we'll spend time on the current crop of "next-generation" sequencing hardware, before going on to examine some of the more exotic things that may be coming down the pipeline within the next few years.

The basics of copying DNA

A short stretch of DNA

Anyone who's made it through biology knows a bit about the structure of the double helix. Half of one is shown above, to illustrate its three components: its backbone is made up of alternating sugars (blue) and phosphates (red), and each sugar is linked to one of four bases (green). In this case, all of the bases shown are adenine (A), although they could potentially be guanine (G), cytosine (C), or thymine (T). In the double helix, the bases undergo base pairing to partners on the opposite strand: A with T, C with G.
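The pairing rules are simple enough to write down as a lookup table. Here is a minimal Python sketch; the names `PAIRS` and `complement` are our own, and it ignores the fact that the two strands of a real helix run in opposite directions.

```python
# Base pairing rules from the text: A with T, C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for each position of `strand`, in the same order."""
    return "".join(PAIRS[base] for base in strand)


print(complement("AAAA"))  # TTTT
print(complement("GATC"))  # CTAG
```

Applying the function twice gets you back to the original strand, which is why either half of the helix carries enough information to rebuild the other.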

When a cell divides and DNA needs to be replicated, the double helix is split, and enzymes called polymerases use each of the two halves as a template for a new opposing strand; the base pairing rules ensure that the copying is exact, except for rare errors. Historically, DNA sequencing has relied on the exact same process of copying DNA. In fact, the enzymes that make copies of DNA within a cell are so efficient that biologists have used a modified polymerase to perform sequencing.

Adding a base to DNA

In the animation shown at right, a string of T's is base paired with a partial complement of A's on an opposing strand. The DNA polymerase, which isn't shown, is able to add additional nucleotides (a sugar + base combination with phosphates attached) under two conditions: they're in the "triphosphate" form, with three phosphate groups in a row, and they base pair successfully with the complementary strand. As the red highlight indicates, the polymerase causes the hydroxyl group (OH) at the end of the existing strand to react with the triphosphate, linking the two together as part of the growing chain. When that reaction is done, there's a new hydroxyl group ready to react, allowing the cycle to continue. By moving down the strand and repeating this reaction, a new molecule of DNA with a specific sequence is created.
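As a toy model of that cycle, the Python sketch below extends a partial strand one nucleotide at a time. The function `extend_strand` and its `supply` argument (the set of bases available in triphosphate form) are illustrative inventions, and all of the real chemistry is reduced to a dictionary lookup.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def extend_strand(template: str, growing: str, supply: set) -> str:
    """Extend `growing` along `template` until the template ends
    or the base needed next is missing from `supply`."""
    strand = growing
    while len(strand) < len(template):
        # The next nucleotide must base pair with the template at this position.
        needed = PAIRS[template[len(strand)]]
        if needed not in supply:
            break  # nothing that pairs is available, so the polymerase stalls
        strand += needed  # each addition leaves a fresh end ready for the next cycle
    return strand


# The situation in the animation: a string of T's with a partial complement of A's.
print(extend_strand("TTTTTTTT", "AAA", {"A", "C", "G", "T"}))  # AAAAAAAA
```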

From copying to sequencing

From a sequencing perspective, having a new copy of DNA isn't especially helpful. What we want to know is the order of the bases along the strand. Sequencing works because we can get the process to stop in specific places and identify the base where it stops.

The simplest way to do this is to mess with the chemistry. Instead of supplying the DNA with a normal nucleotide, it's possible to synthesize one without the hydroxyl group that the polymerase uses to add the next base. As the animation here shows, the base can be added to the growing strand normally, but, once in place, the process comes to a crashing halt. We've now stopped the process of DNA replication.

Stopping DNA polymerization

Of course, if you supply the polymerase with nothing but terminating bases, it will never get very far. So, for a sequencing reaction, researchers use a mix of nucleotides where the majority are normal but a small fraction lack the hydroxyl group. Now, most of the time, the polymerase adds a normal nucleotide, and the reaction continues. But, with a certain probability, a terminator will be put in place, and the reaction stops. If you perform this reaction with lots of identical DNA molecules, you'll wind up with a distribution of lengths that slowly tails off as fewer and fewer unterminated molecules are left. The point at which this tailing off takes place is dictated by the fraction of terminator nucleotides in the reaction mix.
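To see how the terminator fraction sets the tail of that distribution, here is a rough Python sketch. It assumes the simplest possible model, in which every position has the same independent chance of receiving a terminator; the function names and the example fractions (1 percent and 5 percent) are ours, not values from any real protocol.

```python
def stop_probability(position: int, terminator_fraction: float) -> float:
    """Chance that a strand receives its terminator exactly at `position` (1-based)."""
    keep_going = 1 - terminator_fraction
    return keep_going ** (position - 1) * terminator_fraction


def still_growing(position: int, terminator_fraction: float) -> float:
    """Fraction of strands that remain unterminated after `position` additions."""
    return (1 - terminator_fraction) ** position


for fraction in (0.01, 0.05):
    print(fraction, round(still_growing(100, fraction), 3))
# 0.01 0.366
# 0.05 0.006
```

Under this model, a 1 percent terminator mix leaves over a third of the strands still growing after 100 bases, while a 5 percent mix has stopped nearly all of them by then. That is the trade-off behind choosing the fraction: more terminator gives stronger signal from short fragments but very little from long ones.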

Now we just need to know what base is present when the reaction stops. This is possible by making sure that only one of the four nucleotides given to the polymerase can terminate the reaction. If all the C's, T's and G's are normal, but some fraction of the A's are terminators, then that reaction will produce a population of DNA molecules that all end at A. By setting up four reactions, one for each base, it's possible to identify the base at every position.
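A short Python sketch makes the bookkeeping concrete. Given the strand being synthesized, it lists the fragment lengths each of the four reactions can produce. The function name and the example sequence are made up for illustration, and lengths are counted from the first base added rather than including any starting material.

```python
def terminated_lengths(new_strand: str) -> dict:
    """Map each base to the fragment lengths its terminator reaction can produce."""
    reactions = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        # A terminator version of `base` can only stop the strand where `base` belongs.
        reactions[base].append(position)
    return reactions


print(terminated_lengths("GATTACA"))
# {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
```

Every length from 1 to 7 shows up in exactly one of the four reactions, which is what lets the set of reactions cover every position.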

There are only two more secrets to DNA sequencing. First, you need to make sure every polymerase starts copying in the same place, otherwise you'll have a collection of molecules with two randomly located ends. This part is easy, since DNA polymerases can only add nucleotides to an existing strand. So, researchers can "prime" the polymerase by seeding the reaction with a short DNA molecule that base pairs with a known sequence that's next to the one you want to determine.

The other trick is that you need to figure out how long each DNA molecule is in the large mix of reaction products that you're left with. The negative charge on phosphates makes this easy, since it ensures that DNA molecules will move when placed in an electric field. So, if you start the reaction mix on one side of an aqueous polymer mesh (called a gel) and run a current through the solution, the DNA will worm its way through the mesh. Shorter molecules move faster, longer ones slower, allowing the population of molecules to be separated based on their sizes. By running the four reactions down neighboring lanes on a gel, you'll get a pattern that looks like the one below, which can be read off to determine the sequence of the DNA molecule.

DNA sequencing. Given a supply of DNA molecules and primers, the polymerase makes a series of fragments that stop when a terminating base is incorporated. The fragments appear as bands in one of the four lanes that run across the gel at bottom.
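Reading a gel like this amounts to sorting the bands by size and noting which lane each one sits in. The Python sketch below does that for a hypothetical set of band positions; `read_gel` and the example data are our own, and real gels are far messier than a tidy dictionary.

```python
def read_gel(lanes: dict) -> str:
    """Read a sequence from four lanes, each a list of fragment lengths."""
    # Shorter fragments travel farther, so sorting by length gives the reading order.
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical bands: one lane per terminator reaction.
lanes = {"A": [2, 5, 7], "C": [6], "G": [1], "T": [3, 4]}
print(read_gel(lanes))  # GATTACA
```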

Going high(er) throughput

We're now at the state of the art from when I was a graduate student back in the early 1990s and, trust me, it was anything but artful. The presence of the DNA, marked by those dark bands, came from a short-lived radioisotope incorporated into the nucleotides. That meant you had to collect everything involved in the process and pay someone to store it until it decayed to background. The gels were flexible enough that they would shift or bend at the slightest provocation, making the order of bases difficult to read. But not so flexible that they wouldn't tear if suitably disturbed. All told, it took a full day to create something from which, if you were lucky, you could read two hundred bases down each lane, making each gel good for about a kilobase of sequence.

The human genome is about 3 gigabases. Clearly, this wasn't going to cut it, and people were beginning to discuss all manner of exotic approaches, like reading single molecules with a scanning tunneling microscope.
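A back-of-the-envelope calculation in Python shows the size of the gap. It uses the figures above (roughly a kilobase per gel, one gel per day) and assumes a single person reading every base exactly once with no overlap between reads, which is a generous simplification.

```python
GENOME_BASES = 3_000_000_000  # about 3 gigabases
BASES_PER_GEL = 1_000         # about a kilobase per gel
GELS_PER_DAY = 1              # assumption: one gel per person per day

gels_needed = GENOME_BASES // BASES_PER_GEL
years_needed = gels_needed / GELS_PER_DAY / 365

print(gels_needed)          # 3000000
print(round(years_needed))  # 8219
```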

Fortunately, a couple of changes breathed new life into the old approach. For starters, people got rid of the radioactivity by replacing it with a fluorescent tag. Not only was this a whole lot more convenient, but it enabled a simple four-fold improvement in throughput. Go to any outdoor event, and the glow sticks should indicate that it's possible to craft molecules that fluoresce in a variety of different colors.

By picking four fluorescent molecules whose colors are decently spread out (blue for G's, green for A's, yellow for T's, and red for C's, for example) and attaching each one to a specific terminating nucleotide, it's possible to link the termination position with the identity of the base there. What once required four separate reactions could now be run at once in a single solution.

The next trick was to get rid of most of the gel. As we noted above, molecules work their way through the gel based on their size, but you needed a long gel if you wanted to image a lot of them at once. The solution, it turned out, was not to image them at once—something that, before the switch from radioactivity to fluorescence, wasn't really possible.

All you really need is just enough gel to separate things out slightly. You can put a gate at the end of the gel and image the fluorescent activity there. One by one, based on their size, the different molecules will pass through the gate, and glow a specific color based on the base at that position. Instead of a couple hundred bases, it was now possible to get about 700 bases of sequence from a single reaction. Thanks to digital imaging, the data, an example of which is shown below, was easy to interpret. Sequences came as a computer file, ready to be plugged into various analysis programs.

The data generated by fluorescent DNA sequencing.
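In software terms, the read-out becomes a matter of noting colors in arrival order. The Python sketch below uses the example color scheme from earlier (blue for G, green for A, yellow for T, red for C); the function names are hypothetical, and a real instrument has to pull these calls out of noisy, overlapping peaks.

```python
DYE_FOR_BASE = {"G": "blue", "A": "green", "T": "yellow", "C": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def label_fragments(new_strand: str) -> list:
    """One fragment per position: (length, color of its terminating base)."""
    return [
        (length, DYE_FOR_BASE[base])
        for length, base in enumerate(new_strand, start=1)
    ]


def read_at_gate(fragments: list) -> str:
    """Fragments pass the gate shortest first; translate each color back to a base."""
    arrival_order = sorted(fragments)  # tuples sort by length first
    return "".join(BASE_FOR_DYE[color] for _, color in arrival_order)


# All four kinds of fragment share one tube, so their starting order is irrelevant.
mixed = list(reversed(label_fragments("GATTACA")))
print(read_at_gate(mixed))  # GATTACA
```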

With all of these in place, DNA sequencing was ready for the same sorts of processes that revolutionized many areas of technology: automation and miniaturization. Instead of a grad student or technician painstakingly adding everything that was needed into individual tubes, a robot could dispense all the reaction ingredients into a small plastic plate that could hold about 100 individual samples. A second robot could then pull the samples and deposit them into a machine that read out the sequencing information. Large gels were replaced by narrow capillaries.

The new sequencing machines could do all of this for many samples in parallel, and the larger sequencing centers had dozens of these machines. As the bottlenecks were opened wider, the Human Genome Project shot past its planned schedule, and a flood of genomes followed.

But with the increased progress came increased expectations. Ultimately, researchers didn't just want to have a human genome, but the ability to sequence any human genome, from an individual with a genetic disease to the genome of a cancer cell, in order to personalize medicine. That, once again, has set off a race for new and exotic sequencing technology. We'll discuss the first wave of these so-called "next generation" sequencers in a future installment.