Applied Biosystems, now part of Invitrogen, pioneered a sequencing-by-ligation process, marketing it under the name SOLiD. The process has some interesting features and is the only sequencing approach to include a degree of built-in error detection, which can drop its error rates below those of traditional sequencing. But it was the first of its kind, far more complex than previous methods, and ABI's own literature on it skipped past key technical details; as a result, it confused most people. When I talked with several people involved in genome sequencing, none would let me finish the sentence "I don't understand how SOLiD works..."; they'd all interrupt by saying "Nobody understands SOLiD sequencing."

Fortunately, I have a friend at ABI, and I now understand it. It's really quite clever.

The process starts with a step that's shared by the other two major sequencing techniques, which we'll term tethered PCR. Tethered PCR creates a small population of identical molecules to sequence, but keeps them in close physical proximity so that they can be sequenced as a group.

DNA on a bead

Miniaturization helped radically increase the rate of DNA sequencing, but it has its limits, the primary one being our ability (or even a robot's ability) to accurately measure out the increasingly minuscule amounts of materials needed for ever smaller reactions. Eventually, the accuracy plunges and reactions stop working because they don't have the right ratios of ingredients. The solution to this problem was to create a large reaction mix, and then use a physical-chemical process to split it up into reactions that are smaller than any human could accurately measure.

Two of the techniques (SOLiD and 454) split up the reactions using an emulsion, which you may remember from high school chemistry. When two immiscible liquids (oil and water, for example) get put together and shaken, the water will form tiny droplets in the oil. In most cases, these droplets will quickly bump into each other, merge, and grow, separating the liquids out. But, with specific types of oil, it's possible to create a stable emulsion. All you have to do is replace the water with a PCR reaction mix, and a decent-sized emulsion will contain a staggering number of miniature PCR reactions, potentially one in each drop.

Of course, that doesn't do you any good if you can't isolate the products contained in each individual bubble when the reaction is done. That's where tethering comes in. Each PCR reaction still contains two sets of primers, but one of them is chemically linked to a tiny bead. As the reaction proceeds and more of this primer gets used, the bead simply gets coated with DNA, keeping all the reaction products in one place, even when the emulsion gets broken up at the end of the PCR process. The other primer is often linked to a chemical (biotin, which will be familiar to the biologically inclined) that can be used to separate beads covered with DNA from those that didn't participate in a reaction.

DNA fragments are ligated onto a primer-coated bead, then subjected to PCR, coating the bead with a series of identical molecules. The other primers used in the PCR reaction have a molecular tag that enables successful reactions to be purified.

It's worth pausing to think about the statistics of the reaction for a bit. Any bead that started in a reaction with two different DNA molecules is useless for sequencing, since it will produce a mix of two signals. As a result, the amount of DNA added to the reaction is controlled so that, on average, very few of the beads start with two or more DNA molecules. At these levels, most of the reactions will be empty, in that they have no DNA to amplify. (These empty reactions are removed by the purification step described above.)
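To put rough numbers on that tradeoff, here is a minimal sketch. It assumes the number of DNA molecules landing in each reaction follows a Poisson distribution, which is our modeling assumption rather than anything ABI has stated; the function name `droplet_fractions` and the loading levels are purely illustrative.

```python
import math

def droplet_fractions(mean_templates):
    """Fractions of reactions holding 0, exactly 1, and 2+ DNA molecules.

    Assumes Poisson-distributed loading (an illustrative assumption).
    """
    p_empty = math.exp(-mean_templates)
    p_single = mean_templates * math.exp(-mean_templates)
    p_multi = 1.0 - p_empty - p_single
    return p_empty, p_single, p_multi

# Hypothetical average numbers of DNA molecules per reaction.
for mean in (0.05, 0.1, 0.5):
    empty, single, multi = droplet_fractions(mean)
    # Among reactions that amplified anything, the share that is a useless mix.
    mixed_share = multi / (single + multi)
    print(f"mean={mean:.2f}  empty={empty:.3f}  single={single:.3f}  "
          f"multi={multi:.4f}  mixed share={mixed_share:.3f}")
```

Under this assumption, lowering the average load shrinks the share of mixed beads, but only at the cost of leaving more reactions empty, which is the compromise described above.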

Setting up things so that the majority of reactions are wasted probably seems like, well, a waste, but there are two things to consider here. First, each reaction is so small that it uses very little material, so this probably saves on materials in the long run. The other thing is that the approach allows massive numbers of reactions to be set up in parallel with minimal human or robotic intervention. Most of these reactions may be wasted time, but it's still a net win, because the process completes more useful ones than we could do otherwise.

Going long with ligation

As we mentioned in our Complete Genomics story, DNA ligases only check whether the five or six bases on either side of the link they're making match. In practical terms, this means that we can only sequence the five bases closest to the primer. Since we can only distinguish about four fluorescent molecules easily, that means we can only examine one of these five bases at a time; repeated reactions are needed to get data from all five bases.

Complete Genomics got around the five-base limit by simply sequencing the closest five bases, then linking an unlabeled five-base sequence on, and sequencing the next five; they stopped when they had 10 bases of sequence.

SOLiD takes a different approach, with ABI figuring that, if you're ligating some bases on anyway, you might as well use those to extend the reads. But nothing is simple when it comes to SOLiD, and ABI actually uses eight-base DNA fragments for each read step, which apparently increases the efficiency of the ligase reaction.

But it also creates problems. If you read out from that primer, you're only getting information about every eighth position on the DNA, making the data really sparse. You also need to get rid of the label at the end of the DNA fragment to make sure you can read the next one.

The SOLiD technique has a simple solution: lop off the last three bases, and make the label go away with them. In between the fifth and sixth bases, the primers contain a modified phosphate group, with a sulfur replacing one of the oxygens. Adding a bit of AgNO3 triggers a reaction that breaks the bonds at the sulfur, disconnecting the last three bases, which can be removed with a small bit of heating. When they're melted off, they take the label with them. That leaves a five-base fragment behind, ready for extension.

Each SOLiD reaction involves a series of reactions in which every fifth base is queried (vertical reaction cycle). Five of these series are performed, each with a primer one base shorter than the last, to read all the bases (horizontal reaction sets).

Repeating this process allows SOLiD sequencing to walk down the DNA, getting information about every fifth position. ABI's data shows that the accuracy remains high out to about 35 bases, and some preliminary data suggests it's possible to go even further down the DNA. Once the end of the read is reached, the sequencing hardware strips the newly added DNA off, and starts again with a primer that's one base shorter, which gets information about the next base over. Repeat this a total of five times, and every position on the DNA has been queried.
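The bookkeeping is easier to see laid out explicitly. The sketch below is a simplified model, not ABI's actual protocol: it assumes each ligation cycle advances five bases, that the same slot in each five-base step is queried, and that the full-length primer queries the last base of each step. The 35-base read length comes from the figure quoted above; the function name is hypothetical.

```python
STEP = 5          # bases left behind after each ligation-and-cleavage cycle
READ_LENGTH = 35  # the "about 35 bases" mentioned in the text

def queried_positions(primer_offset, read_length=READ_LENGTH, step=STEP):
    """0-based template positions queried in one round of ligation cycles.

    primer_offset is how many bases shorter this primer is than the first one.
    Simplifying assumption: the full-length primer queries the last base of
    each five-base step, and each shorter primer shifts that by one base.
    """
    first = step - 1 - primer_offset
    return list(range(first, read_length, step))

rounds = {offset: queried_positions(offset) for offset in range(STEP)}
for offset, positions in rounds.items():
    print(f"primer n-{offset}: {positions}")

# Every position should turn up exactly once across the five rounds.
covered = sorted(p for positions in rounds.values() for p in positions)
print("all positions covered once:", covered == list(range(READ_LENGTH)))
```

Each round on its own is sparse, hitting seven positions spaced five apart; only the combination of all five primers fills in the read.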

Double-checking base changes

That process is fairly hard to follow to begin with, but ABI's real mind-bender is how it reads the information, as hinted at above. As we saw with Complete Genomics, using four-color readouts nicely matches the four bases in DNA, allowing a single color to represent each base. That may be the easiest way to think of things, but ABI would argue it's not the best way to actually perform sequencing.

A given color is compatible with four two-base combinations.

If we think of things in terms of information content, we've got four fluorescent tags that are stable and easily distinguishable. Since there are four bases, it's easiest to simply match one color per base. But it's also possible to query two bases at a time—since there are 16 possible two-base combinations, each color has to represent four potential two-base combinations, as shown here.
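For readers who prefer a table they can generate themselves, here is a small sketch of a two-base color code. The specific assignment is an assumption on our part: it numbers the bases A, C, G, T as 0 to 3 and picks the color by XOR-ing the two numbers, which matches the commonly described SOLiD scheme and the examples used in this article (blue for AA, red for AT), but it is not taken from ABI's documentation.

```python
from collections import defaultdict

BASES = "ACGT"
COLORS = ("blue", "green", "yellow", "red")

def color_of(pair):
    """Color for a two-base combination (assumed XOR-style assignment)."""
    return COLORS[BASES.index(pair[0]) ^ BASES.index(pair[1])]

# Group all 16 two-base combinations by the color that reports them.
table = defaultdict(list)
for first in BASES:
    for second in BASES:
        table[color_of(first + second)].append(first + second)

for color in COLORS:
    print(color, table[color])
```

Whatever the exact assignment, the key property is the one described above: 16 combinations shared among four colors, so each color stands for four of them, and knowing one base of a pair plus the color pins down the other.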

Now, by itself, this isn't especially useful; you may have narrowed down the identity of a given base, but you've not definitively identified it. SOLiD relies on having another two-base read that partly overlaps the first (obtained from a primer that was one base shorter) to nail a given base's identity definitively. Looking at it in terms of information again, given two overlapping reads, there are 16 possible two-color combinations, spread out over three bases. Part of this information goes to definitively identifying the middle base; the remainder provides some information about both flanking bases, but not enough to identify them definitively. Only additional staggered reads will be able to nail their identities down.

The next issue is that, given the chart shown above and a series of colors, you'll quickly find that there are always four sequences that are compatible with any given color series. So, for example, given the series of colors below, it's possible to assume the first base combination, indicated by blue, is AA. The next combination is red and starts with A, so must be AT, the one after that is red and starts with T... But it's just as likely that the first two bases were something else that causes a blue signal, which changes not only the first two bases, but every base combination afterwards.

A given series of colors is compatible with four different sequences, so knowing the identity of the first base is critical.

The problem comes from the fact that you can never firmly establish the identity of the first base you read using SOLiD data. This may seem a bit useless at first, so it's important to remember that the unknown sequence happens to be sitting next to PCR primers with a sequence that's always known. So, the trick is to always start the sequencing with at least one base of overlap with the known primer sequence. If we know the primer ended with a T, then we know the sequence series must be the bottom one on this list.
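The decoding logic can be sketched in a few lines. As before, the color assignment (bases numbered A, C, G, T as 0 to 3, color chosen by XOR) is our assumption, picked to agree with the examples in the text, and the color series is made up for illustration rather than taken from the figure.

```python
BASES = "ACGT"
COLOR_INDEX = {"blue": 0, "green": 1, "yellow": 2, "red": 3}

def decode(colors, first_base):
    """Turn a color series into bases, given the base before the first color."""
    sequence = first_base
    for color in colors:
        previous = BASES.index(sequence[-1])
        # Under the assumed scheme, the previous base plus the color
        # determines the next base uniquely.
        sequence += BASES[previous ^ COLOR_INDEX[color]]
    return sequence

colors = ["blue", "red", "red", "green"]  # hypothetical color series
for start in BASES:
    print(start, "->", decode(colors, start))
```

The four starting bases give four different, equally valid sequences; starting from A reproduces the AA, AT, TA walk described earlier. Anchoring the first color to the last base of the known primer is what collapses those four candidates down to one.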

Why use this mind-warping method of dealing with DNA's information content? Because a lot of the sequencing that's currently happening involves resequencing the human genome, looking for changes associated with cancer and genetic diseases. In these cases, identifying single-base differences is critical, and having confidence that you're not looking at a spurious sequencing error is essential for drawing any conclusions.

In every other technique, a single base difference will only change the signal at one point in the sequence reading process. In SOLiD, because of the overlapping information, that base change will cause differences in two separate signals, read in completely different ligation reactions. This provides a much higher confidence that any differences seen in SOLiD represent real differences in the underlying DNA sequence; as a result of SOLiD's built-in error catching, ABI estimates an error rate that's well below that of any other system on the market, at least out to 25 bases. Since most sequencing involves several passes at the same sequence, the ultimate error rate for a complete genome is extremely low.
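A quick way to convince yourself of this is to encode a reference and a variant and compare the colors. This sketch again assumes the XOR-style color assignment (A, C, G, T numbered 0 to 3), and the two sequences are invented for the example.

```python
BASES = "ACGT"

def encode(sequence):
    """Color indices (0-3) for each overlapping pair of bases (assumed scheme)."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(sequence, sequence[1:])]

def changed_colors(reference, sample):
    """Indices at which the two color series disagree."""
    pairs = zip(encode(reference), encode(sample))
    return [i for i, (ref, obs) in enumerate(pairs) if ref != obs]

reference = "TACGGTCA"  # hypothetical reference sequence
variant = "TACGATCA"    # same sequence with a single base changed (G to A)
print(changed_colors(reference, variant))  # [3, 4]
```

A genuine single-base change always shows up as two adjacent color changes, because the altered base takes part in two overlapping pairs. A lone color change with no neighbor to back it up is therefore a good candidate for a measurement error rather than a real difference.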

ABI is obviously working on ways to extend the sequence reads but, even in its current form, SOLiD is more than capable of generating sequence that can be used to construct a human genome, given that everything can be aligned to known sequences.

Still, if researchers are willing to accept a higher error rate, there are new technologies that can take them out as far as four hundred bases in a single read. Stay tuned for when we cover the catchily named pyrosequencing.