It’s the proverbial needle in a haystack. The more information there is online, the easier it is to overlook what matters most. Now an automated tool has been set the Herculean task of mining every science paper it can find online to help researchers come up with new ideas.

Semantic Scholar, which launches today from the Seattle-based Allen Institute for Artificial Intelligence (AI2), can automatically read, digest and categorise findings from the estimated 2 million scientific papers published each year. Up to half of these papers are never read by more than three people. The system aims to identify previously overlooked connections and information.

“Our vision is of a scientist’s apprentice, giving researchers a very powerful way to analyse what’s going on in their field,” says Oren Etzioni, director of AI2. “If you’re a medical researcher, you could ask ‘what’s the latest on these drug interactions?’ Or even a query in natural language like, ‘what are papers saying about middle-aged women with diabetes and this particular drug?’”


The system works by crawling the web for publicly available scientific papers, then scanning the text and images within them. By identifying citations and references in the text, Semantic Scholar can work out which are the most influential or controversial papers. It also highlights key phrases found in similar papers, extracting and indexing the datasets and methods each researcher used.
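To make the citation idea concrete, here is a minimal sketch in Python. It is not AI2's method, which the institute has not detailed here. It assumes that "influence" can be approximated by simply counting how many other papers cite a given paper, and the paper IDs and the `rank_by_citations` helper are hypothetical names invented for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each paper ID maps to the IDs it cites.
# A real system would extract these links from the papers' reference lists.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": [],
    "paper_d": ["paper_c", "paper_a"],
}


def rank_by_citations(citation_map):
    """Return (paper, count) pairs, most-cited first (ties broken by ID).

    Assumption: incoming-citation count stands in for "influence".
    """
    # Start every known paper at zero so uncited papers still appear.
    counts = Counter({paper: 0 for paper in citation_map})
    for source, cited in citation_map.items():
        for target in set(cited):  # count each citing paper at most once
            if target != source:   # ignore self-citations
                counts[target] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for paper, count in rank_by_citations(citations):
        print(paper, count)
```

A raw count like this says nothing about whether a paper is controversial or merely popular. Telling those apart would require reading how each citation is used in the surrounding text, which is the harder problem the article describes.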

Contemporary science produces such huge volumes of research that no human can possibly read and understand everything in a single field – let alone all of science.

No Renaissance minds

“With millions of papers coming out each year, there are no Renaissance men or women anymore,” says Etzioni. “People’s eyes glaze over and they miss that key paper or technique that they could use, in a medical case, to save somebody’s life.”

AI2 is not the only organisation intent on digitising and analysing the world’s scientific discoveries. Last year, a system using IBM’s Watson AI technology, called The Knowledge Integration Toolkit (KnIT), mined 100,000 papers to successfully predict the interactions of a tumour-suppressing protein. IBM says KnIT is now fully automated to work without human oversight. The Defense Advanced Research Projects Agency (DARPA) in the US is also working on technology, codenamed Big Mechanism, to read all the scientific papers on certain types of cancer and use that knowledge to identify potential treatments. It is scheduled for completion by the end of 2017.

Kenneth Forbus of Northwestern University in Evanston, Illinois, is confident that services like this will prove useful in the future. “Machines that help us filter could increase the rate at which we find, if not diamonds in the rough, then at least useful nuggets,” says Forbus. “One might miss something, but professors already routinely use graduate students and colleagues for the same service, so the risks are well-understood.”

At launch, Semantic Scholar is focusing on computer science papers. It will then gradually expand its scope to include biology, physics and the remaining hard sciences, learning from how users interact with the software as it goes.

“We have very specific goals along the way for semantic intensity – how deep into a paper our system can get to see what it’s about,” says Etzioni. “Ultimately, perhaps a human scientist doesn’t have to read it at all.”

Image credit: Peter Ginter/Getty