This year, hundreds of people got diarrhea from Salmonella-contaminated beef, tahini, kratom, even pet guinea pigs, to name just a few. Salmonella bacteria normally hang out in animal intestines — and now, scientists are using machine learning to identify which animal’s intestines emitted the unpleasant and dangerous bacteria.

Researchers led by Xiangyu Deng at the University of Georgia trained an algorithm to recognize genetic differences between Salmonella strains pooped out by four common hosts: pigs, cows, poultry, and wild birds. When the team tested the algorithm on Salmonella genomes from eight outbreaks over the past 20 years, it correctly identified the animal sources for seven of them, according to a study published online today in the journal Emerging Infectious Diseases. The algorithm still needs more training, but it’s a start toward learning where Salmonella infections come from, and curbing the foodborne illness.

There are about 1.2 million Salmonella infections every year in the US. They cause diarrhea, fever, and cramps, and kill around 450 people annually. Salmonella usually spreads through poop, and infected animals often aren’t toilet-trained. They can get that poop everywhere: on their fur, feathers, bedding, food, you name it. So people can contract Salmonella by touching an infected animal and then touching their mouths, or by eating food contaminated by infected animal poop.

If Salmonella turns up in beef, you might reasonably expect that the infection came from a cow. But Salmonella can also turn up in unexpected places, like in tahini, on cantaloupes, or even in drugs like cannabis and kratom. In those cases, it can be hard to identify the pestilential pooper: livestock, reptiles, rodents, and dogs can all spread Salmonella, along with other, less common animals. And knowing where the Salmonella is coming from is key to curbing the contagion, or preventing it in the first place.

That’s why Deng’s work could be an important advance. It’s still early, and the algorithm needs to learn how to identify other sources of Salmonella besides the four it was trained on. It also can only recognize the sources of a single type of Salmonella, and there are more than 2,500 out there. But it’s another weapon in the public health arsenal, says Nikki Shariat, a professor at Gettysburg College who specializes in Salmonella and was not involved in the research. Outbreak investigations depend on epidemiology and interviews that track down what people ate and when. “This is an extra prong to that,” she says. And it could help speed up the process.

Bill Marler, an attorney who specializes in food safety, says a tool that helps track down the source of foodborne outbreaks could benefit policymakers. “Now you have this really amazing evidence through whole genome sequencing that this stuff came from this place,” he says. “Then really the question is, what can you do both from a food safety perspective, or a regulatory perspective, to solve the problem?”

To create the tool, Deng and his team used machine learning to identify genetic characteristics that could help them distinguish Salmonella that leaks out of different animal sources. One hypothesis is that when a Salmonella population becomes established in a particular host like, say, pigs, its genome might change a little over time. And Deng’s team suspected that if they had an algorithm sort through enough genomes, the program might learn how to recognize the genetic fingerprint of the Salmonella that comes out of a pig, and distinguish that from the Salmonella in a cow pat.

There are lots of different types of Salmonella, but the team focused on one of the most common here in the US, called Salmonella Typhimurium. They tracked down more than 1,400 Salmonella Typhimurium genome sequences from around the world. Some, they sequenced themselves. Others, they found in public health databases online. The collection included samples that were known to come from common sources like pigs, cows, poultry, and wild birds, which the researchers used to train their algorithm.
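To make the idea of training on labeled genomes a little more concrete, here is a minimal Python sketch. It is not the team's method: it assumes each genome has already been boiled down to a set of genetic markers, and the marker names (`m1`, `m2`, `m3`), the function `train_host_profiles`, and the tiny dataset are all hypothetical. All it does is record how often each marker shows up among the genomes from each known host.

```python
from collections import Counter, defaultdict


def train_host_profiles(samples):
    """For each host, compute the fraction of its genomes carrying each marker.

    `samples` is a list of (host_label, set_of_markers) pairs. The markers are
    hypothetical stand-ins for whatever genetic features a real model would use.
    """
    marker_counts = defaultdict(Counter)  # host -> marker -> number of genomes
    genomes_per_host = Counter()          # host -> number of genomes seen
    for host, markers in samples:
        genomes_per_host[host] += 1
        marker_counts[host].update(markers)
    return {
        host: {m: n / genomes_per_host[host] for m, n in counts.items()}
        for host, counts in marker_counts.items()
    }


# Made-up toy data for illustration only; NOT taken from the study.
toy_samples = [
    ("pig", {"m1", "m2"}),
    ("pig", {"m1"}),
    ("cow", {"m3"}),
    ("cow", {"m2", "m3"}),
]
profiles = train_host_profiles(toy_samples)
```

In this toy example, marker `m1` appears in every pig genome and in no cow genome, which is the kind of host-specific signal a real classifier would look for across many more genomes and features.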

The team tested their algorithm against samples from eight outbreaks that public health investigators had traced back to their animal sources. And just by looking at the genome, the program accurately attributed seven of the eight outbreaks to their animal sources. It was especially good at identifying poultry and swine sources. Researchers put in the genetic sequence of a Salmonella sample linked to a turkey pot pie outbreak, and the program spat back out the correct ID: poultry. When the team analyzed the evolution of the Salmonella strains linked to livestock, they noticed an interesting pattern: those strains only cropped up around 1990, and quickly spread across the US. “We suspect that industrialized livestock production may play a role in their spread and distribution,” Deng says.

Identifying the sources of historical outbreaks and tracking the spread of specific strains is one thing; identifying the unknown source of a human Salmonella infection is another. And not all Salmonella outbreaks come from the species that the algorithm was trained on. When the team tested the algorithm on samples that came from people, it managed to identify a cow, pig, poultry, or wild bird as the source for roughly a third of them. The rest were ambiguous. That could mean that the Salmonella strain infecting those people was a generalist strain that prefers to circulate between multiple host species. “They just jump around to different hosts and there’s no way for us to predict which source they came from,” Deng says. Right now, the tool doesn’t work for those.
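One way to picture an "ambiguous" result is as a classifier declining to name a host when no single candidate clearly wins. The Python sketch below illustrates that idea only. The function `attribute_source`, the score values, and the 0.5 cutoff are assumptions made up for this example; the article does not say how the team's algorithm decides that a sample is ambiguous.

```python
def attribute_source(host_scores, min_confidence=0.5):
    """Return the top-scoring host, or "ambiguous" if no host clearly wins.

    `host_scores` maps a host name to a non-negative score from some model.
    `min_confidence` is a hypothetical cutoff on the winner's share of the
    total score; it is an assumption of this sketch, not a value from the study.
    """
    total = sum(host_scores.values())
    if total <= 0:
        return "ambiguous"  # no evidence for any host
    best_host = max(host_scores, key=host_scores.get)
    if host_scores[best_host] / total >= min_confidence:
        return best_host
    return "ambiguous"


# Made-up scores for illustration only.
clear_case = {"pig": 0.8, "cow": 0.1, "poultry": 0.05, "wild bird": 0.05}
split_case = {"pig": 0.3, "cow": 0.3, "poultry": 0.2, "wild bird": 0.2}

clear_result = attribute_source(clear_case)
split_result = attribute_source(split_case)
```

Here the first sample is attributed to pigs because one host dominates, while the second comes back ambiguous because the scores are spread across several hosts, much as a generalist strain's might be.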

The ambiguous IDs could also just mean that the strain came from a creature the algorithm wasn’t trained on, like a fish, or a turtle. That means that the algorithm still needs more genomes to learn from, Deng says. “As we sequence more genomes I’m sure the number will go up,” he says. As it stands, the algorithm is a proof of concept. “A little bit of information is better than no information at all,” he says. “There’s still a long way to go.”

Updated 3:15PM ET December 12th, 2018: The language has been updated to clarify how the Salmonella genome might develop signals for specific host associations.