No researcher could read all the papers in their field – but machines are making discoveries in their own right by mining the scientific literature

Computers read between the lines (Image: Robin Lynne Gibson/Getty)

IN MAY last year, a supercomputer in San Jose, California, read 100,000 research papers in 2 hours. It found completely new biology hidden in the data. Called KnIT, the computer is one of a handful of systems pushing back the frontiers of knowledge without human help.

KnIT didn’t read the papers like a scientist – that would have taken a lifetime. Instead, it scanned for information on a protein called p53, and a class of enzymes that can interact with it, called kinases. Also known as “the guardian of the genome”, p53 suppresses tumours in humans. KnIT trawled the literature searching for links that imply undiscovered p53 kinases, which could provide routes to new cancer drugs.
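The underlying idea can be sketched in a few lines: score each candidate kinase by how similar the terms that co-occur with it across the literature are to the terms that co-occur with known p53 kinases. Everything below — the toy corpus, the entity names, the scoring — is invented for illustration and is not KnIT's actual pipeline.

```python
from collections import Counter

# Toy "papers": each is a set of terms extracted from one abstract.
# Entirely invented data, for illustration only.
papers = [
    {"CHK2", "phosphorylation", "p53", "DNA damage"},
    {"CHK2", "phosphorylation", "tumour suppressor"},
    {"AURKA", "phosphorylation", "DNA damage", "mitosis"},
    {"PLK1", "mitosis", "spindle"},
]

def context(entity):
    """Count the terms that co-occur with an entity across the corpus."""
    bag = Counter()
    for paper in papers:
        if entity in paper:
            bag.update(paper - {entity})
    return bag

def similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = lambda v: sum(x * x for x in v.values()) ** 0.5
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Rank candidates by how closely their textual context resembles
# that of a known p53 kinase (here CHK2, taken as the reference).
profile = context("CHK2")
candidates = ["AURKA", "PLK1"]
ranked = sorted(candidates,
                key=lambda k: similarity(context(k), profile),
                reverse=True)
```

In this toy corpus, AURKA outranks PLK1 because it shares "phosphorylation" and "DNA damage" with the CHK2 context — the kind of indirect textual signal that, at scale, can flag a protein worth testing in the lab.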

Having analysed papers published up to 2003, KnIT identified seven of the nine p53 kinases discovered over the subsequent ten years. More importantly, it also flagged what appeared to be two p53 kinases unknown to science. Initial lab tests support the findings, although the team wants to repeat the experiment to be sure.

KnIT is a collaboration between IBM and Baylor College of Medicine in Houston, Texas. It is the latest step into a weird world where autonomous machines make discoveries that are beyond scientists, simply by rifling more thoroughly through what we already know, and faster than any human can.

In a paper to be presented at the Conference on Knowledge Discovery and Data Mining in New York City this week, the researchers say that society is better at generating new information than at analysing what it already has. “This leads to deep inefficiencies in translating research into progress for humanity,” they write. KnIT aims to iron out that inefficiency.

“In general, new p53 kinases are discovered at a rate of one per year,” says Olivier Lichtarge, who leads the work at Baylor. “We hope to greatly accelerate that rate of discovery.”

Studying kinases is important for cancer research, but the Baylor team thinks the approach can extend beyond biomedical studies to all areas of science. And if KnIT’s algorithmic discoveries hold up, they point to a future in which everyone could have a personalised algorithm trawling and making sense of the scientific literature to figure out cures for their ailments, including ones tailored at a genetic level.

Expanding KnIT to other areas of biology or the physical sciences isn’t straightforward. “We could run into big problems when we try and generalise to more proteins and genes,” Lichtarge says. And in subjects like physics, results tend to be presented using equations and graphs rather than words. However, data-mining groups are working to retrieve information from these too.

The idea that new knowledge can be unearthed by finding links between disparate strands of research was first crystallised in 1986 by information scientist Don Swanson at the University of Chicago. He analysed a database of scientific literature manually to deduce that fish oil might be a good treatment for Raynaud’s syndrome, a circulatory disorder, because studies showed that fish oil could reverse certain conditions also seen in Raynaud’s. His hunch turned out to be right.
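Swanson's approach is often called the "ABC" model: if the literature links A to B, and separately links B to C, but no paper connects A and C directly, then A–C is a candidate discovery. A minimal sketch, using invented term pairs loosely modelled on his fish-oil example:

```python
# Pairs of terms reported together somewhere in the literature
# (invented for illustration, not Swanson's actual data).
links = {
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
}

def linked(a, c):
    """True if the two terms already co-occur directly."""
    return (a, c) in links or (c, a) in links

def neighbours(term):
    """All terms directly linked to the given term."""
    return ({y for x, y in links if x == term} |
            {x for x, y in links if y == term})

def abc_candidates(a):
    """Terms C reachable from A via some intermediate B,
    with no direct A-C link in the literature."""
    out = set()
    for b in neighbours(a):
        for c in neighbours(b) - {a}:
            if not linked(a, c):
                out.add(c)
    return out
```

Here `abc_candidates("fish oil")` surfaces Raynaud's syndrome via the intermediates "blood viscosity" and "platelet aggregation" — the same shape of inference Swanson performed by hand, which systems like KnIT now run over millions of papers.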

Modern science has given us a far larger and more intricate haystack than the one Swanson picked through, but machine intelligence is now sorting through it to find new connections.

Ross King of the University of Manchester, UK, has developed a different kind of automated system, Eve, which he claims has already discovered a novel drug against malaria. Rather than extracting new knowledge from the literature, Eve robotically runs lab experiments focused on finding new drugs to treat neglected diseases. King is keeping the discovery secret until the work is published, but will say that the compound is an ingredient in several brands of toothpaste.

The webs of knowledge that computers create in this automated pursuit of discovery are useful to non-scientists, too. Sophia Ananiadou at the University of Manchester works on FACTA+, a searchable database holding huge amounts of information about cancer, mined from the literature. Although it is designed to help cancer researchers, she says the public could use it to learn more about diseases they have been diagnosed with, without having to read scientific papers themselves.

The purpose of data mining can also be flipped. Instead of finding new insights into specialised topics, systems like KnIT can find holes in existing research that need to be plugged.

Natasa Miskov-Zivanov of Carnegie Mellon University, Pittsburgh, is working on using similar techniques to build computational models of cells that can be used to test drugs. Normally, models take time to develop, with input from experimental biologists and theorists. But Miskov-Zivanov’s models build themselves quickly and automatically using results in the literature. The models can then be tested by scientists in the lab.

Miskov-Zivanov’s work is funded by the US defence agency DARPA as part of its Big Mechanism project, which aims to find new knowledge hidden in big data. “It takes several years to develop a meaningful model of what’s going on in a cell, but what we’re doing could speed up the process a lot,” she says. That would, in turn, speed up drug testing.

New breakthroughs could come by analysing scientific literature across disciplines – physics on the scale of cells and molecular biology, for instance. “I don’t think we could ever understand this huge, complicated puzzle without automated help,” says King. “There just aren’t enough PhDs in the world to do the experiments.”

This article appeared in print under the headline “Automated discovery”