A few years back, a company called Oxford Nanopore announced it was developing a radically different way of sequencing DNA. Its approach involved taking single strands of the double helix and stuffing them through a protein pore. With a small bit of current flowing across the pore, the four bases of DNA each created a distinct (if tiny) change in the voltage as it passed through. These could be used to read the DNA one base at a time as it wiggled through the pore.

After several years of slow progress, Oxford Nanopore announced that its sequencing hardware would be as distinctive as its wetware: a USB device that could fit comfortably in a person's hand. As the first devices went out to users, it became clear that the device had some pros and cons. On the plus side, the device was quick and could be used without requiring a large facility to support it. It could also read very long stretches of DNA at once. But the downside was significant: it made lots of mistakes.

With a few years of experience, people are now learning to make the most of the devices, as demonstrated by a new paper in which researchers use one to help sequence a human genome. By using the machine's long reads (in one case, nearly 900,000 bases from one DNA molecule), the authors were able to get data out of areas of the human genome that had resisted characterization. They were also able to distinguish between the two sets of chromosomes (one from mom, one from dad) and locate sites of epigenetic control in many areas of the genome.

In light of all the distinct information it can provide, the machine's error rate seems like less of a problem.

Errors and corrections

We have DNA sequencing machines that make very few mistakes. Unfortunately, they're only good for reading DNA in chunks of about 200 bases or so. Computer software has to recognize the cases in which these small chunks overlap and use them to build up larger sequences. This process fails when DNA is repetitive or when very similar sequences show up in multiple areas of the genome—the software simply has no way of telling what goes where.
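The ambiguity that repeats create can be seen in a toy version of overlap detection. This is only a minimal sketch, not how production assemblers work: the function names, the minimum overlap of three bases, and the example reads are all made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def successors(read: str, reads: list[str], min_len: int = 3) -> list[str]:
    """Reads whose start overlaps the end of `read` by at least `min_len` bases."""
    return [r for r in reads if r != read and overlap(read, r, min_len) >= min_len]


# Hypothetical reads: the first ends inside a repeat ("CTCTC"),
# and the other two both start with that same repeat.
repeat_reads = ["GGACTCTC", "CTCTCAAG", "CTCTCGTT"]
candidates = successors("GGACTCTC", repeat_reads)
```

Here `candidates` contains both of the other reads, each overlapping by the same five bases, so nothing in the short reads themselves says which one belongs next. A read long enough to span the whole repeat would settle the question.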

As we saw with the axolotl genome, it's possible to use longer, error-prone reads to make sense of the mess. The high-accuracy method provides the sequence, while the longer reads tell us how these sequences link up into larger pieces. There will still be gaps, but they will be fewer, and more sequences will be found in large chunks as opposed to small fragments. While the axolotl genome relied on machines from Pacific Biosciences, the nanopore system would work in this regard, too.

Or at least it should. Part of the goal of the new paper was to confirm this, and a lot of the paper involves figuring out how to get the best sequence possible out of the authors' nanopores. For example, they tried two different software packages to interpret the voltage data coming out of their machine and found that a community-developed, open source package that uses a neural network gave the best data. Combining nanopore reads with shorter, high-quality fragments raised the overall accuracy of the genome assembly to 99.88 percent, which shows that the combined approach works.

But the researchers went well beyond that. On its own, the nanopore sequence had an accuracy rate of only 92 percent. Combining multiple reads of the same sequence from the same machine boosted the accuracy above 97 percent. A separate software package could then compare cases where different reads disagreed and make a decision as to which were likely to be right; this boosted accuracy up to 99.44 percent. This is not as good as having short, high-quality reads, but it's close enough for many purposes. Adding the high-quality short reads boosted accuracy to 99.96 percent.
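To see why multiple reads of the same stretch help, consider a simple majority vote at each position. This is a simplified sketch and not the software the researchers used: it assumes the reads are already aligned, are all the same length, and contain only substitution errors, while real nanopore errors also include inserted and deleted bases. The sequences below are invented.

```python
from collections import Counter


def consensus(reads):
    """Most common base in each column of equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def accuracy(seq, truth):
    """Fraction of positions where `seq` matches `truth`."""
    return sum(a == b for a, b in zip(seq, truth)) / len(truth)


# Hypothetical true sequence and three reads, each with one error
# at a different position (so each read is 90 percent accurate).
truth = "ACGTACGTAC"
reads = ["ACGTACGTAA", "TCGTACGTAC", "ACGTTCGTAC"]
combined = consensus(reads)
```

Because the three errors fall at different positions, each one is outvoted and `combined` matches `truth` exactly. When errors tend to recur at the same positions, as they can with real data, voting alone helps less, which is one reason a further correction step is useful.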

The nanopores also offered some distinct advantages. For example, the activity of genes can be altered by what are called epigenetic modifications: chemical alterations of some bases that don't change the DNA sequence. These changes also slightly alter the voltage readings as a base passes through, allowing the researchers to identify where they had taken place in the genome.

Going long

We also inherit two copies of each chromosome (excepting the X and Y of males): one from mom and one from dad. While these copies are different, most of the underlying DNA is identical for long stretches, making it impossible for short DNA reads to be used to determine which chromosome is which. As a result, while you can tell where differences are present, it has been impossible to say which differences were inherited together, on the same chromosome. The long reads from nanopores make this possible.
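The idea can be illustrated with a toy example. In this sketch, each read is represented as a mapping from a position that differs between the two chromosome copies to the base seen there; the positions, bases, and function name are hypothetical, and real phasing software has to cope with sequencing errors as well.

```python
from itertools import combinations


def cooccurring_alleles(reads):
    """Collect pairs of (position, base) observations seen on the same read."""
    links = set()
    for read in reads:
        observed = sorted(read.items())
        links.update(combinations(observed, 2))
    return links


# Short reads each cover only one of the two variable positions.
short_reads = [{100: "A"}, {100: "G"}, {5000: "C"}, {5000: "T"}]
# Long reads span both positions on a single molecule.
long_reads = [{100: "A", 5000: "T"}, {100: "G", 5000: "C"}]

short_links = cooccurring_alleles(short_reads)
long_links = cooccurring_alleles(long_reads)
```

The short reads show that both positions vary but produce no links between them, so `short_links` is empty. The long reads show that the A at position 100 travels with the T at position 5000, and the G with the C, which is exactly the "inherited together" information described above.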

Finally, the researchers decided to make the reads as long as possible. DNA is a long, thin molecule, and manipulating solutions of long DNA tends to break it into small fragments, as the motion of the fluid will create shear and stretching forces. If you're really careful, however, these can be minimized. When the authors took these precautions, the typical read length provided by the nanopore machine shot up to more than 100,000 bases; one read reached 882,000.

That was large enough to cover some gaps left behind by the original project that sequenced the human genome. One of these was 50,000 bases long and included a duplication of a gene. Another had eight copies of a repeated sequence in rapid succession. Over time, it should be possible to use this approach to really bring the genome to completion.

The work, however, did identify some shortcomings on the software side. A common file format used to hold DNA data, for example, isn't specced to handle sequences this long. Because of that, some analysis software couldn't work with the nanopore reads at all. As a result of these compatibility problems, the team was forced to rely on a very processor-intensive algorithm for some of its analysis.

The impressive results that did come out of this analysis suggest that it will be well worth putting the effort into getting the software up to date.

Nature Biotechnology, 2017. DOI: 10.1038/nbt.4060