Scientists are working on creating genome graphs to develop a better understanding of how our DNA influences our lives.

When Benedict Paten stares at his computer monitor, he sometimes gazes at what looks like a map of the worst subway system in the world. The screen is sprinkled with little circles that look like stations. Some are joined by straight lines — sometimes a single path from one circle to the next, sometimes a burst of spokes radiating out in many directions. And sometimes the lines bend into sweeping curves that soar off on express routes to distant stations.

A rainbow palette of colors makes it a little easier to digest the complexity. But if you stare a little too long, vertigo sets in.

This map is not a guide to any city on Earth. It is a sketch of the human gene pool.

Sixteen years ago, two teams of scientists announced they had assembled the first rough draft of the entire human genome. If you wanted, you could read the whole thing — 3.2 billion units, known as base pairs.

Today, hundreds of thousands of people have had their genomes sequenced, and millions more will be completed in the next few years.

But as the numbers skyrocket, it’s becoming painfully clear that the original method that scientists used to compare genomes to each other — and to develop a better understanding of how our DNA influences our lives — is rapidly becoming obsolete. When scientists sequence a new genome, their reconstructions are far from perfect. And those imperfections sometimes cause geneticists to miss a mutation known to cause a disease. They can also make it harder for scientists to discover new links between genes and diseases.

Paten, a computational biologist at the University of California, Santa Cruz, belongs to a cadre of scientists who are building the tools to look at genomes in a new way: as a single network of DNA sequences, known as a genome graph.

Erik Garrison using vg software (Wellcome Trust Sanger Institute)

The genome graph is only just starting to crack open its cocoon. Paten and his colleagues hope to release the first open-access genome graph, built from the genomes of more than 1,000 people, within a year. A company called Seven Bridges rolled out a beta version of a proprietary graph earlier this month.

Paten hopes that other scientists will soon recognize the value of the genome graph. At first, the benefits will come in the form of more accurate, more complete genome sequences. As more scientists use the genome graph, though, it will be able to grow, its loops and paths capturing information about millions of people. And then things will get really interesting.

“It will be used everywhere in a multitude of ways we haven’t even imagined yet,” Paten said. “I want to show you your path through the map so you can see how you fit. In many ways, it’s the human story.”

Human genomes are mostly identical to each other. So once scientists assembled one human genome sequence, they began using it as a reference for assembling new ones. Just as geographic maps use longitude and latitude, the reference genome provided geneticists with a coordinate system.

The standard method for sequencing a genome begins with chopping up a person’s DNA into short fragments. Researchers make many copies of each fragment and then quickly read them all at the same time. The hardest part is figuring out where these fragments, known as reads, originally came from in the genome.

To find that location, scientists can slide each read along the human reference genome, looking for a matching stretch. Every now and then, however, they will find anomalies: the read and the reference genome differ by one or two bases. Those differences are so small they don’t cause much of a headache.
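The sliding comparison described above can be sketched in a few lines of Python. This is a toy illustration, not how production mappers work (real tools rely on indexes rather than brute-force scanning). The function name `map_read`, the `max_mismatches` parameter, and the short sequences are all made up for the example.

```python
def map_read(reference: str, read: str, max_mismatches: int = 2) -> list[tuple[int, int]]:
    """Slide the read along the reference and report every offset where
    it fits with at most `max_mismatches` differing bases."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        # Count positions where the read and the reference disagree.
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Hypothetical toy data: the read differs from the reference by one base.
reference = "ACGTTAGCCATGGA"
read = "TAGCGAT"
hits = map_read(reference, read)
print(hits)
```

Each hit is a pair of the offset in the reference and the number of mismatched bases, which is the kind of small difference a geneticist would later inspect as a possible variant.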

Once the scientists have finished mapping all the reads in a person’s genome, they can scrutinize the variants. If geneticists suspect that a child has a rare genetic disorder, they can check for the mutation that causes it. If that child has a disease new to science, the geneticists may try to find other children with it as well, and look for a mutation they all share in precisely the same spot in the genome.

Mapping to a reference genome works well — except when it doesn’t.

When DNA gets copied, cells sometimes make an error and accidentally chop out a section. The flip side is also true: Mutations can cause a stretch of DNA to be accidentally duplicated.

These insertions and deletions can make it difficult — even impossible — to match a read to the place it belongs in the reference genome. The result is that even the best genome reconstructions have gaps. Insertions and deletions can also lead a computer astray, tricking it into mapping reads to places where they don’t belong. When this happens, errors creep into a genome reconstruction.

For years, scientists have made the best of this bad situation. They’ve written software programs that use special tricks to figure out where hard-to-map reads belong. But these programs can take a long time to work through a single genome, and even then they fail to map many fragments.

Yet scientists have also known about a potential solution for a long time. Back in 2002, three University of California, Los Angeles, scientists — Christopher Lee, Catherine Grasso, and Mark Sharlow — argued for getting rid of the reference genome. Instead, they borrowed ideas from a branch of math called graph theory. They showed how it might be possible to represent genomes in a network of DNA sequences. Each person’s genome would become a different path through the same network.

“They’re quite old ideas,” said Gil McVean of the University of Oxford. Yet those ideas lay fallow for years. When scientists could only look at a limited number of genomes, using a reference genome worked well enough to keep them happy and busy.

It wasn’t until McVean and other scientists started trying to assemble a lot of genomes that they got frustrated. McVean helped lead a major study called the 1000 Genomes Project from 2008 to 2015. His team of scientists created the biggest catalog of human variation to date.

While they were able to reconstruct most of the genomes accurately, some regions ended up riddled with errors and gaps. “You come across the bits that you know are important, but they’re clearly total rubbish,” said McVean.

Paten and his colleagues were experiencing the same frustrations in their own work on genomes. “We said, ‘Hang on a minute — it doesn’t make any sense for us to use one genome as our common coordinate system for everything we do,’” said Paten. “What we’re doing now is kind of stupid.”

A genome graph should, in theory, be a lot smarter. In 2014, Paten and McVean joined forces with their colleagues to find a way to build one. They are working through the Global Alliance for Genomics and Health, a coalition of 400 research institutions that is trying to build a framework for the worldwide sharing of genetic and clinical information. Paten and McVean are leading a group at the alliance to set up rules for a competition. Teams of scientists are now developing different sets of tools and testing them out.

“We need a system to find out who’s got the best graph,” said McVean.

The goal that these scientists are all moving toward is simple in concept, if hard to make real.

Let’s say you wanted to make a graph of two people’s genomes. You can start with a single string of bases that are identical in both of them. As soon as you encounter a spot where they differ by a single base, you can draw a fork, with each path leading to a different base, and then join them to another identical sequence with another fork. To read each person’s genome, all you need are driving directions so that you know which way to turn at each fork. Those forks can also take you to longer sequences inserted into one person’s genome and missing from the other.
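A minimal sketch of that two-person graph, in Python, might look like the following. It is only an illustration under simplifying assumptions: it leans on the standard library's `difflib.SequenceMatcher` to find the shared and differing stretches, which is a generic text-diffing tool and not the alignment method genome graph software actually uses. The names `build_graph` and `spell` and the two short sequences are invented for the example.

```python
from difflib import SequenceMatcher


def build_graph(genome_a: str, genome_b: str):
    """Return (nodes, path_a, path_b). Nodes hold DNA sequences; each path
    is the list of node ids one person's genome follows through the graph."""
    nodes: list[str] = []
    path_a: list[int] = []
    path_b: list[int] = []

    def add_node(seq: str) -> int:
        nodes.append(seq)
        return len(nodes) - 1

    matcher = SequenceMatcher(None, genome_a, genome_b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical stretch: one shared node that both paths pass through.
            shared = add_node(genome_a[a1:a2])
            path_a.append(shared)
            path_b.append(shared)
        else:
            # A fork: each person gets a private node, or skips ahead
            # if the stretch is missing from their genome.
            if a2 > a1:
                path_a.append(add_node(genome_a[a1:a2]))
            if b2 > b1:
                path_b.append(add_node(genome_b[b1:b2]))
    return nodes, path_a, path_b


def spell(nodes: list[str], path: list[int]) -> str:
    """Follow the 'driving directions' and read out the genome."""
    return "".join(nodes[i] for i in path)


# Hypothetical toy genomes: B carries an inserted "TT" and one changed base.
genome_a = "GATTACAGATTACA"
genome_b = "GATTACATTGATTGCA"
nodes, path_a, path_b = build_graph(genome_a, genome_b)
print(nodes, path_a, path_b)
```

Following either path through the node list spells out that person's full sequence, while the shared nodes are stored only once.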

Adding a third person’s genome to this network is also simple. You just add extra paths to any unique variants. The more genomes in the network, the easier it gets to add new ones.

And eventually, the network becomes so fleshed out you can use it to do something else: to assemble reads into a new genome. Because the genome graph contains so many complete genomes, you can — in theory — get a much more accurate reconstruction. If a read doesn’t match anything in one genome, it may match one of the others. And once you’ve assembled that new genome, you can add it to the graph, too.
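To make that idea concrete, here is a toy Python sketch of matching a read against a graph rather than a single sequence. It assumes a small hand-built graph with no loops and no empty nodes, and it only looks for exact matches; real graph mappers use indexes and tolerate mismatches. The function `find_read` and all the data are hypothetical.

```python
def find_read(nodes: dict[int, str], edges: dict[int, list[int]], read: str) -> list[tuple[int, int]]:
    """Return every (node, offset) from which the read can be spelled out
    exactly by walking forward along some route through the graph."""

    def extend(node: int, offset: int, remaining: str) -> bool:
        seq = nodes[node][offset:]
        if len(remaining) <= len(seq):
            # The rest of the read ends inside this node.
            return seq.startswith(remaining)
        if not remaining.startswith(seq):
            return False
        # This node matched in full; try every fork leading out of it.
        rest = remaining[len(seq):]
        return any(extend(nxt, 0, rest) for nxt in edges.get(node, []))

    return [
        (node, offset)
        for node in nodes
        for offset in range(len(nodes[node]))
        if extend(node, offset, read)
    ]


# Hypothetical graph: a one-base fork (node 1 or 2) and an optional insertion (node 4).
nodes = {0: "ACGT", 1: "A", 2: "G", 3: "CCTT", 4: "GGGG", 5: "ACA"}
edges = {0: [1, 2], 1: [3], 2: [3], 3: [4, 5], 4: [5]}

# One person's genome, used here as a stand-in for a single reference.
reference = "".join(nodes[i] for i in [0, 1, 3, 5])

read = "TGCCTTGG"  # crosses both the alternate base and the insertion
print(read in reference)              # the single reference has no exact match
print(find_read(nodes, edges, read))  # the graph does
```

The read fails against the lone reference because it crosses two variants, yet it fits a route through the graph that another genome has already contributed.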

McVean and his colleagues published a proof-of-concept graph last December. Instead of a whole genome, they focused on a region 4.5 million bases long called the MHC region. It contains about 240 genes that are crucial to the immune system, and it’s a notorious nightmare of insertions and deletions. By creating a graph of thousands of people’s MHC regions, McVean and his colleagues were able to fill in some of the gaps that earlier attempts couldn’t make sense of.

Most of the graph-building teams in the Global Alliance competition are like McVean’s, based at research centers and publishing all the details of their graphs. But the private sector is getting into the genome graph race as well.

Earlier this month, Seven Bridges, a biomedical data analysis company, delivered a beta version of its own graph to its clients, which include large-scale genomic databases in the United States and other parts of the world.

The US Department of Veterans Affairs has signed an agreement with Seven Bridges to use the company’s tools to analyze the department’s own database, known as the Million Veteran Program. The National Cancer Institute also plans to use the Seven Bridges tools to study tumors.

The company’s founder, Deniz Kural, worked as a grad student on the 1000 Genomes Project. That experience let him see all the problems with using a reference genome. “At some point I got frustrated missing the same variants over and over again,” said Kural.

Kural made genome graphs one of the main goals of the company, where he now serves as CEO. After years of development, its graph tools are now more accurate and often faster than the best software that relies on the reference genome. Kural declined to be more specific until Seven Bridges publishes the details in a peer-reviewed journal.

Seven Bridges has gotten some criticism in the genome graph community, because it’s selling software rather than just building open-source tools. Kural said the investment the company has attracted ($45 million came in this February) will let it reach the goal of large-scale genome graphs first. “To reach that scale, you need the longevity that grant funding ultimately will not be able to provide,” said Kural.

Paten, however, has serious concerns about Seven Bridges. “Their effort is entirely proprietary, and they’re attempting to patent everything, which to me is hugely troubling,” he said. Building large-scale graphs will have to be a collaborative effort, said Paten, one that will require open-source tools. “We need sunlight and transparency.”

Paten, McVean, and other researchers are still building the tools they’re going to use to make the first open-source genome graph. One of the big challenges that remains, they said, is time. A graph of a million genomes may allow scientists to map reads more accurately than before. But instead of searching a single reference genome for a match, they’ll have to search the entire labyrinth of the graph.

“The graphs we’re talking about are not the answer to that problem,” said McVean.

Paten’s team is working on a graph built from the 1000 Genomes Project data. “It’s on our roadmap to get it out in the next year,” he said.

The people in that database come mostly from a few big populations, like Han Chinese and Europeans. To make the graph better reflect human diversity, Paten and his colleagues plan to fold in data from the newly completed Simons Genome Diversity Project, which has 300 genomes from 142 populations around the world.

The big question that hangs over genome graphs is whether other scientists will be willing to switch over. “I’m kind of torn about it,” said the Broad Institute’s Daniel MacArthur, who studies how common different mutations are in our species.

The shortcomings of the standard approach are obvious, according to MacArthur. “We know that we miss things as a result,” he said. But he and other scientists have developed a kit of powerful tools for studying genomes against a reference. And they’ve used that kit successfully for years.

MacArthur suspects that shifting from a single reference genome to a graph made up of thousands or millions of genomes would make his work vastly more complicated. “It blows my mind to think about it,” he said.

Yet MacArthur sees the genome graph as one of the most important developments in his field. “This graph-based way of thinking is fundamentally a new way of thinking about the genome,” he said. “I think long-term this is likely to be the direction that the field is heading. But right now, I’m in no hurry.”

The graph builders like McVean and Paten recognize that they’re going to have to persuade people like MacArthur to join their cause. At the moment, they’re working on software that will make the transition as painless as possible, with minimal mind-blowing.

If a large number of genome researchers start using the same graph, Paten said, it will be able to grow tremendously.

“Now we enter completely speculative territory,” he said.

A universal genome graph would do more than just absorb more genomes. It could also store information about who the genome came from — their ancestry, their medical history, and so on. If it got big enough, Paten said, scientists could program the graph to discover hidden links between genes and health.

“We could build a learning system,” Paten speculated. “I think that would be a beautiful thing. It would be transformative.”