GATK tries to solve the problem with a lot of statistics. DNA-sequencing machines sometimes make mistakes, so the GATK team studied where the machines tend to made mistakes. (The letters GTG are particularly error-prone, to give just one example.) They thought long and hard about things like “the statistical models underlying the Hidden Markov model,” per DePristo. GATK then gives its best guess for the actual letter at a certain location in DNA.

DeepVariant, on the other hand, still does not know anything about DNA-sequencing machines. But it has digested a lot of data. Neural networks are often analogized as layers of “neurons” that deal in progressively more complex concepts—the first layer might respond to light, the second shapes, the third actual objects. As DeepVariant is trained with data, it learns which connections between “neurons” to strengthen and which to ignore. Eventually, it can sort the actual mutations from the errors.

To fit the DNA-sequencing data to an image-recognition AI, the Google team came up with a work-around: Just make it an image! When scientists want to investigate a mutation, they’ll often pull up the aligned snippets, like so:

Google

“If humans are doing this as a visual task, why not present this as a visual task?” says Poplin. So they did. The letters—A, T, C, or G—got assigned a red value; the quality of the sequencing at that location a green value; and which strand of DNA’s two strands it is on a blue value. Together, they formed an RGB (red, green, blue) image.

Google

And then it was simply a matter of feeding the neural network data. “It changes the problem enormously from thinking super hard about the data to looking for more data,” says DePristo.

Between publishing a preprint about DeepVariant last December and the release this week, the team continued improving the tool. Instead of three layers of data—represented by red, green, and blue—at any location in the genome, DeepVariant now considers seven. It would no longer make any sense as an image to the human eye. But to a machine, what’s just a few more layers of numbers?

To be clear, DeepVariant itself is unlikely to change genetics research. It is better than GATK, but only slightly so—and it is half as fast depending on the conditions. It does, however, lay the groundwork for AI’s influence in future genetics research.

“The test will really be how it can translate to other technologies,” says Manuel Rivas, a geneticist at Stanford. New sequencing technologies like Oxford Nanopore are becoming popular. If DeepVariant can quickly learn variant calling under these new conditions—remember the humans took five years with GATK—that could speed up the adoption of new sequencing technologies.

DePristo says that the idea of layering data on top of each location in the genome could easily be applied to other problems in genetics—the more important of which is predicting the effects of a mutation. You might imagine layering on, for example, data on when genes are active or not. DeepVariant started off with just three layers of data. Now it has seven. Eventually it might be dozens. It won’t make much sense to a human brain anymore, but to an AI, sure.