With machine learning, researchers get new clues in the hunt for the source of mysterious viruses

When a new virus crops up in people, health authorities face an urgent question: Where did it come from?

Thousands of viruses are out there in the wild, circulating in animal hosts and only gaining attention when they infect people. Viruses can make that jump in various ways — sometimes through direct contact, sometimes via an intermediary like a mosquito or tick. But researchers don’t have great tools to quickly determine the reservoirs that house the viruses or the “vectors” by which they were transmitted.

On Thursday, researchers unveiled a new system, based on machine learning models, that identifies patterns in the genomes of viruses to offer a hypothesis about their hosts and vectors.

The system, which was described in a paper published in Science, remains fairly crude for now; it can tell you that a virus likely resides in bats, for example, but not which species. And it’s not entirely accurate: For known viruses, it was able to identify the general type of vector 90.8 percent of the time and host reservoir type 71.9 percent of the time.

But its designers say they hope it can steer disease detectives in the right direction as they race to respond to enigmatic viruses.

“It’s always been a challenge to know how you do a search in a more informed way,” said Daniel Streicker of the University of Glasgow, the senior author of the paper. “Our hope is that these models give people at least one new piece of information, which is maybe we should start here.”

The study focused on RNA viruses, more than 200 types of which are known to infect humans, from SARS to Zika to Ebola. Every year, scientists uncover a few new species that threaten human health. The fear is that as human populations and cities grow, there will be more encounters with wildlife and an increased risk of viral spillover.

Understanding which animals have spread those emerging viruses is crucial for surveillance and stemming transmission.

“We’d want to know where that risk came from, and we’d want to know it quickly,” said Mark Woolhouse, a professor of infectious disease epidemiology at the University of Edinburgh, who was not involved in the research.

To develop the new system, the researchers created datasets with information about the genomes of viruses with known vectors and reservoirs.

First, they built a model based on the fact that genetically similar viruses are more likely to have similar hosts and vectors. Though it’s not guaranteed, Streicker said, “if you are a virus, and all of your most closely related viruses are found in rodents, then it’s a pretty good bet that the host for that new virus will be a rodent.”
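To make that "closest relatives" logic concrete, here is a minimal sketch in Python. It is not the researchers' model: the similarity measure (shared three-letter fragments), the function names, and the toy sequences and host labels are all assumptions made up for illustration.

```python
from collections import Counter


def kmers(seq, k=3):
    """Return the set of overlapping k-letter fragments in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity between the fragment sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def predict_host(query, references, n_neighbors=3):
    """Guess a host type by majority vote among the most similar references."""
    ranked = sorted(references, key=lambda ref: similarity(query, ref[0]), reverse=True)
    votes = Counter(host for _, host in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Hypothetical toy genomes with known host types (not real viral sequences).
REFERENCES = [
    ("AUGGCUAGCUAGGAUCCGAU", "rodent"),
    ("AUGGCUAGCUAGGAUCCGAA", "rodent"),
    ("AUGGCUAGCUUGGAUCCGAU", "rodent"),
    ("CCCGGGUUUAAACCCGGGUU", "bat"),
    ("CCCGGGUUUAAACCCGGGUA", "bat"),
]

# A hypothetical newly sequenced virus that closely resembles the rodent group.
NEW_VIRUS = "AUGGCUAGCUAGGAUCCGAC"

print(predict_host(NEW_VIRUS, REFERENCES))
```

As Streicker notes, a vote like this is a good bet rather than a guarantee: a virus whose nearest known relatives live in rodents may still turn up somewhere else.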

Second, they studied so-called genomic biases. Different combinations of the letters that make up DNA (or, in the case of these viruses, RNA) can produce the same protein building blocks. And for reasons that scientists don’t understand, a virus’s bias toward one combination over another tracks the bias shown by its host, which allows scientists to tease out virus-host patterns.
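One way to picture such a bias is to count how often a genome uses each of the interchangeable three-letter "words" (codons) for the same building block. The sketch below assumes codon usage is the kind of bias in question, which the article does not spell out. It covers only two amino acids, alanine and glycine, and the sequence is invented for illustration.

```python
from collections import Counter

# Partial table: RNA codons that encode the same amino acid.
# Only two amino acids are listed to keep the example short.
SYNONYMOUS = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Gly": ["GGU", "GGC", "GGA", "GGG"],
}


def codon_counts(rna):
    """Count codons, reading from the first letter and dropping any trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def codon_bias(rna):
    """For each amino acid present, return the share of each synonymous codon."""
    counts = codon_counts(rna)
    bias = {}
    for amino_acid, codons in SYNONYMOUS.items():
        total = sum(counts[c] for c in codons)
        if total:
            bias[amino_acid] = {c: counts[c] / total for c in codons}
    return bias


# Hypothetical toy sequence: three GCU, one GCA, two GGC.
TOY_RNA = "GCUGCUGCUGCAGGCGGC"

print(codon_bias(TOY_RNA))
```

In this toy genome, alanine is spelled GCU three times out of four. A profile of such preferences across a whole genome is the sort of signal that could be compared between a virus and candidate hosts.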

With those datasets in hand, the researchers built a machine-learning model. The idea is that if scientists discover a new virus, they can sequence its genetic information and then hypothesize what kind of host it lives in, whether it relies on a vector to spread, and, if so, what type of vector.

For now, the model can only say which of 11 classes of animals is likely the host, and which of four vector types (mosquito, midge, sandfly, or tick) serves as the vector. Outside experts said it’s a start.

“We need to identify species rather than broadly defined ‘types’ such as sandflies or rodents,” Woolhouse wrote in a commentary also published in Science Thursday. “Even then, empirical confirmation would still be necessary.”

In an interview, Woolhouse added that he hoped the model was “the first step of many,” indicating that future iterations could improve on this initial system.

Other experts said even a general hypothesis could prove useful. Peter Daszak, a disease ecologist and president of the nonprofit EcoHealth Alliance, who was not involved in the new research, said that if disease detectives investigating a mysterious outbreak could wager that cases of encephalitis were being caused by a virus found in, say, bats, it could still inform how they responded even if they did not know which species of bat.

The model’s limitations are a reflection of all that scientists don’t know about the world of viruses. The researchers only had information from a few hundred viruses to build their model, when there are untold thousands out there that remain beyond our awareness and understanding.

It’s not known how many of those viruses would be able to infect humans, and the new model wasn’t designed to predict whether a virus is pathogenic — that is, disease causing.

But the prediction program arrives as scientists have embarked on ambitious efforts to catalogue viruses around the world, including an endeavor called the Global Virome Project, in which Daszak’s group is participating. Researchers aim to get ahead of future viral threats before they strike.

As scientists uncover new viruses, those details could also be fed back into the machine-learning program to enhance its accuracy and specificity, the researchers said. After all, the more genomic information it has, and the more viruses that information comes from, the more precise its predictions could be.

“This is the way we’re going to be doing virology in the future,” Daszak said.