The bottleneck in genomics is reference content, not the speed of your annotator.

Today, Google announced that it is using its considerable computational infrastructure to look up genomics data in public datasets. It can take a genome and “annotate” it by joining the data against publicly available reference datasets, sifting through 88 gigabytes of human DNA sequence variants in under a minute. This is a big step for genomic infrastructure: computing speed and power are critical for scaling genomics beyond a few leading hospitals, labs, and companies.

Here’s a sample of the result. Clearly, speed isn’t our problem:

You see all that data? Yeah, me neither. Most positions on the genome return null. In other words: genome in, garbage out. You can run the lookup really quickly, but it won’t help you interpret your DNA in the clinic. It’s kind of like Googling for answers in a world with no websites.
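To make the shape of the problem concrete, here is a minimal sketch of annotation as a lookup against a sparse reference table. Everything below is hypothetical and illustrative; it is not Google’s API, schema, or output.

```python
# Toy sketch of annotation-as-a-join. All names, positions, and
# annotations are hypothetical; this is not any real service's API or data.

# A tiny "reference codex": annotations keyed by (chromosome, position, alt allele).
REFERENCE = {
    ("chr7", 117559590, "A"): {"gene": "CFTR", "significance": "pathogenic"},
    ("chr17", 43094692, "T"): {"gene": "BRCA1", "significance": "uncertain"},
}

def annotate(variants):
    """Join each variant against the reference; unknown positions come back as None."""
    return {v: REFERENCE.get(v) for v in variants}

# A handful of variant calls from a hypothetical genome.
calls = [
    ("chr7", 117559590, "A"),  # hits the reference
    ("chr1", 1234567, "G"),    # no reference entry
    ("chr2", 7654321, "C"),    # no reference entry
]

for variant, annotation in annotate(calls).items():
    print(variant, "->", annotation)
# Two of the three lines print "-> None": the lookup is fast, but the answer is empty.
```

The join itself is trivial and parallelizes well, which is why speed is the easy part; the hard part is that the reference dictionary is mostly empty.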

We still have a long way to go. We need to “fill in the nulls”; otherwise, we’ll have no idea what the genome means. And since potentially life-defining decisions rest on genomic interpretation, this is a big deal.

Running genome annotation against the same unreliable data over and over again won’t yield a better result, and doing it faster won’t either. If we are to solve genomics, we need to start taking data exchange, provenance, and curation more seriously. Our understanding of the genome is only as good as the underlying reference codex we use.
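As one sketch of what taking provenance seriously could look like, a reference record could carry its source, version, and curation trail alongside the annotation itself. The fields and values below are assumptions for illustration, not an existing standard.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: a reference annotation that carries its own provenance.
# Field names, sources, and versions are assumptions, not an existing standard.

@dataclass
class Provenance:
    source: str     # database or publication the record came from
    version: str    # release or build of that source
    retrieved: str  # when the record was pulled into the codex
    curator: str    # who reviewed it, if anyone

@dataclass
class ReferenceRecord:
    variant: tuple                  # (chromosome, position, alt allele)
    annotation: dict                # the interpretation itself
    provenance: List[Provenance] = field(default_factory=list)

record = ReferenceRecord(
    variant=("chr7", 117559590, "A"),
    annotation={"gene": "CFTR", "significance": "pathogenic"},
    provenance=[Provenance("ExampleDB", "2014-03", "2014-03-12", "curation team")],
)

# A downstream annotator can then report not just the answer, but where the
# answer came from and how fresh it is.
print(record.annotation, "from", record.provenance[0].source, record.provenance[0].version)
```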

What genomics needs is better data. Data you can trust.