Could deep learning help paleontologists and geneticists hunt for ghosts?

When modern humans first migrated out of Africa 70,000 years ago, at least two related species, now extinct, were already waiting for them on the Eurasian landmass. These were the Neanderthals and Denisovans, archaic humans who interbred with those early moderns, leaving bits of their DNA behind in the genomes of people of non-African descent today.

But there have been growing hints of an even more convoluted and colorful history: A team of researchers reported in Nature last summer, for instance, that a bone fragment found in a Siberian cave belonged to the daughter of a Neanderthal mother and a Denisovan father. The finding marked the first fossil evidence of a first-generation human hybrid.

Unfortunately, it’s very rare to find such fossils. (Our knowledge of Denisovans, for instance, is based on DNA extracted from a mere finger bone.) Many other ancestral pairings could easily have transpired, including ones that involved hybrid groups from earlier crosses — but they might be practically invisible when it comes to physical evidence. Clues to their occurrence may instead survive only in some people’s DNA, and even then, they may be subtler than the signs of Neanderthal and Denisovan genes there. Statistical models have helped scientists infer the existence of a couple of these populations without fossil data: For example, according to research published in late 2013, patterns of genetic variation in ancient and modern humans point to an unknown human population having interbred with Denisovans (or their ancestors). But experts believe these methods inevitably overlook a great deal, too.

Who else contributed to today’s genomes? What did these so-called ghost populations look like, where did they live, and how often did they interact and mate with other human species?

In a paper published last month in Nature Communications, researchers showed the potential for deep learning techniques to help fill in some of the missing pieces, pieces that experts may not have even been aware of. They used deep learning to sift out evidence of another ghost population: an unknown human ancestor in Eurasia, likely a Neanderthal-Denisovan hybrid or a relative of the Denisovan line.

The work points to the future usefulness of artificial intelligence in paleontology, not only for identifying unforeseen ghosts but also for uncovering the very faded footprints of the evolutionary processes that have shaped who we’ve become.

The Search for Subtle Signatures

Current statistical methods involve examining four genomes at a time for shared traits. It’s a test of similarity, but not necessarily of actual ancestry, because there are many different ways of interpreting the small amounts of genetic mixture it uncovers. For instance, such analyses might suggest that a modern-day European shares certain traits with the Neanderthal genome that a modern-day African does not. But that doesn’t necessarily prove that those genes came from interbreeding between the Neanderthals and the ancestors of Europeans. The latter could instead have bred with a different population, one closely related to Neanderthals but not the Neanderthals themselves.

We just don’t know, because in the absence of physical evidence to indicate when, where and how those ancient hypothetical sources of genetic variation might have lived, it’s difficult to say which of many possible inferred ancestries is most probable. The technique “is powerful because of its simplicity, but it leaves a lot on the table in terms of understanding evolution,” said John Hawks, a paleoanthropologist at the University of Wisconsin-Madison.
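The article does not name the four-genome test. One widely used comparison of this kind is the ABBA-BABA test, also called Patterson's D, and the sketch below assumes that this is the sort of test meant. The sequences are made-up toy data in which "0" is the ancestral allele and "1" the derived one, and the function name d_statistic is hypothetical.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Return (D, abba, baba) for four aligned sequences of biallelic sites."""
    abba = baba = 0
    for a1, a2, a3, ao in zip(p1, p2, p3, outgroup):
        # A site is informative only if P3 carries the derived allele,
        # i.e. differs from the outgroup.
        if a3 == ao:
            continue
        if a1 == ao and a2 == a3:
            abba += 1  # P2 shares the derived allele with P3, P1 does not
        elif a2 == ao and a1 == a3:
            baba += 1  # P1 shares the derived allele with P3, P2 does not
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return d, abba, baba

# Toy alignment (illustrative only, not real genomes).
african     = "0100100101"  # P1
european    = "1010110011"  # P2
neanderthal = "1111110110"  # P3
chimpanzee  = "0000000000"  # outgroup, defines the ancestral allele

d, abba, baba = d_statistic(african, european, neanderthal, chimpanzee)
print(f"ABBA={abba} BABA={baba} D={d:.3f}")
```

A positive D says only that P2 shares more derived alleles with P3 than P1 does. It cannot tell whether those alleles came from P3 itself or from an unsampled relative of P3, which is the ambiguity described above.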

The new deep learning method is an attempt to do better, by seeking to explain levels of gene flow that are too small for the usual statistical approaches to detect, and by offering a far vaster and more complicated range of models to do so. Through training, the neural network can learn to classify various patterns in genomic data based on what demographic histories most likely gave rise to them, without being told how to make those connections.

This use of deep learning can uncover ghosts we didn’t even suspect. For one, there’s no reason to think that Neanderthals, Denisovans and modern humans were the only three populations in the picture. According to Hawks, there could very well have been dozens.

Jason Lewis, an anthropologist at Stony Brook University in New York, shares that view. “Our imagination has been constrained by our focus on living people or on the fossils we’ve found from Europe, Africa and western Asia,” he said. “What deep learning techniques can do, in a strange way, is refocus the possibilities. The approach is no longer limited by our imagination.”

The Real Value of Simulated Histories

Deep learning might seem like an unlikely solution to paleontologists’ problem because such methods normally need massive amounts of training data. Take one of their most common applications, image classification. When experts train a model to, say, identify images of cats, they have thousands of pictures to train it with, and they can tell whether it’s working because they know what a cat should look like.

But the dearth of relevant anthropological and paleontological data available forced researchers who wanted to use deep learning to get clever, by creating data of their own. “We were kind of playing dirty,” said Oscar Lao, a researcher at the National Center of Genomic Analysis in Barcelona and one of the study’s authors. “We could use an infinite amount of data to train the deep learning engine, because we were using simulations.”

The researchers generated tens of thousands of simulated evolutionary histories based on differing combinations of demographic details: the number of ancestral human populations, their sizes, when they diverged from one another, their rates of intermixing and so on. From those simulated histories, the scientists generated a massive number of simulated genomes for present-day people. They trained their deep learning algorithm on these genomes, so that it learned which kinds of evolutionary models were most likely to produce given genetic patterns.

The team then set the artificial intelligence loose to infer the histories that best fit actual genomic data. Eventually, the system concluded that a previously unidentified human group had also contributed to the ancestry of people of Asian descent. From the genetic patterns involved, those humans were themselves probably either a distinct population that arose from the interbreeding of Denisovans and Neanderthals around 300,000 years ago, or a group that descended from the Denisovan lineage shortly afterward.
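The loop described above (simulate many histories, train on their genomes, then classify real data) can be sketched in miniature. Everything here is an assumption made for illustration: the toy model in simulate_d is not the study's simulator, a nearest-centroid classifier stands in for the neural network, each simulated genome set is reduced to a single D-like summary statistic, and the "observed" value is a placeholder, not a measurement.

```python
import random

def simulate_d(ghost, n_sites, rng):
    """Simulate one toy history and return a D-like summary of its genomes."""
    # Assumed toy model: a ghost contributing a fraction f of ancestry raises
    # the chance that an informative site shows the ABBA pattern.
    f = rng.uniform(0.05, 0.20) if ghost else 0.0
    p_abba = 0.5 + f / 2
    abba = sum(rng.random() < p_abba for _ in range(n_sites))
    baba = n_sites - abba
    return (abba - baba) / n_sites

def make_dataset(n_histories, n_sites, rng):
    """Return (statistic, has_ghost) pairs from randomly drawn histories."""
    data = []
    for _ in range(n_histories):
        ghost = rng.random() < 0.5
        data.append((simulate_d(ghost, n_sites, rng), ghost))
    return data

def train_centroids(data):
    """Learn the mean statistic for each class of history."""
    with_ghost = [x for x, ghost in data if ghost]
    without = [x for x, ghost in data if not ghost]
    return sum(with_ghost) / len(with_ghost), sum(without) / len(without)

def predict(centroids, x):
    """Return True if x lies closer to the ghost centroid than to the other."""
    mu_ghost, mu_none = centroids
    return abs(x - mu_ghost) < abs(x - mu_none)

rng = random.Random(0)  # fixed seed so the run is repeatable
centroids = train_centroids(make_dataset(400, 2000, rng))

# Check the classifier on fresh simulated histories it has not seen.
held_out = make_dataset(200, 2000, rng)
accuracy = sum(predict(centroids, x) == ghost for x, ghost in held_out) / len(held_out)
print(f"held-out accuracy on simulated histories: {accuracy:.2f}")

# Placeholder "observed" statistic, NOT a real measurement.
observed_d = 0.12
print("ghost population inferred:", predict(centroids, observed_d))
```

Because training data come from a simulator, more can be generated on demand. The classifier can still only choose among the kinds of history it was shown.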

This isn’t the first time that deep learning has been used in this way. A handful of labs in the field have been applying similar methods to address other threads of evolutionary investigation. One research group, led by Andrew Kern at the University of Oregon, has used a simulation-based approach and machine learning techniques to differentiate between various models of how species, including humans, evolved. They found that most of the adaptations favored by evolution don’t rely on the emergence of beneficial new mutations in populations, but instead on the expansion of genetic variants that already existed.

The application of deep learning “to these new questions,” Kern said, “is yielding exciting results.”

Hype Versus Hope for the New Tool

Of course, there are big caveats. For one, if actual human evolutionary history does not resemble the simulated models on which these deep learning methods are trained, then the techniques will produce incorrect results. That’s a problem Kern and others have been trying to tackle, but a lot of work remains to be done to provide greater assurances of accuracy.

“I think AI is overhyped in applications to genomics,” said Joshua Akey, an ecologist and evolutionary biologist at Princeton University. “Deep learning is a fantastic new tool, but it’s just another method. It’s not going to solve all the mysteries and complications that we want to learn about in human evolution.”

Some experts are even more skeptical. “My judgment is that the density and quality of the data are not ideal for much other than thoughtful and intelligent nonartificial analyses,” David Pilbeam, a paleontologist at Harvard University and the Peabody Museum, wrote in an email.

Still, in the opinion of other paleontologists and geneticists, it’s a good step forward, something that could be used to make predictions about possible future fossil discoveries and about the genetic variation that should have existed among humans thousands of years ago. “I think that deep learning is going to really give population genetics a boost,” Lao said.

The same could be true for other fields in which we have access to data but not the process that produced it. Around the same time that Kern and other population geneticists and evolutionary biologists were developing simulation-based AI techniques to address their questions, physicists were doing so to figure out how to sift through the tons of data produced at the Large Hadron Collider and other particle accelerators. Geological research and earthquake prediction methods have also started to benefit from these kinds of deep learning approaches.

“Where this leads, I really don’t know. We’ll have to see,” said Nick Patterson, a computational biologist at the Broad Institute of the Massachusetts Institute of Technology and Harvard University. “But it’s always good to see new methods. We’ll use anything we can if it seems to be good at answering the questions we want to answer.”

This article was reprinted on Wired.com.