When oceanographer Peter Wiebe sat down recently to write a paper on findings from his January cruise to the Red Sea, he wanted to examine all data sets on plankton in the region. He knew other researchers had been sampling the organisms for years, but there was a problem: He didn’t know where to find those data sets.

“These data centers are kind of black holes,” says Wiebe, who works at the Woods Hole Oceanographic Institution in Massachusetts. “The data go in, but it’s very hard to figure out what’s in there and to get it out.”

That could soon change. Wiebe is working with a group of computer scientists to lay the groundwork for a smarter academic search engine that would help geoscientists find the exact data sets and publications they want in the blink of an eye, instead of spending hours scrolling through pages of irrelevant results on Google Scholar. The group officially kicked off their project, called GeoLink, yesterday at the American Geophysical Union (AGU) fall meeting in San Francisco, California. The research effort is part of EarthCube, an initiative funded by the National Science Foundation (NSF) to upgrade cyberinfrastructure for the geosciences.

Over the next 2 years, Wiebe and colleagues will build computer programs that can extract information from AGU conference abstracts, NSF awards, and geoscience data repositories and then digitally connect these resources in ways that make them more accessible to scientists. A pilot project that concluded this year, known as OceanLink, has already developed some of the underlying design. If the new project garners sufficient community interest, the researchers could eventually turn it into a comprehensive one-stop search hub for the geosciences, says computer scientist Tom Narock of Marymount University in Arlington, Virginia, another principal investigator on the project.

Projects like GeoLink are part of a growing effort by the scientific community to make literature reviews more efficient by leveraging the increasing ability of computers to process texts—a much needed service as millions of new papers come out every year. A similar initiative from the Allen Institute for Artificial Intelligence (AI2) in Seattle, Washington, is developing an intelligent academic search engine for computer science. Called Semantic Scholar, it is expected to be fully released by the end of 2015. Eventually, the institute plans to expand Semantic Scholar’s coverage to include other subjects, says AI2 Chief Executive Officer Oren Etzioni.

Existing academic search engines boast extensive coverage of scientific literature. (Google Scholar alone indexes about 160 million documents by some calculations.) Their reliance on keyword searches, however, often means users get more junk than treasure. That frustrates scientists such as Wiebe, who wants to find papers related to specific research questions such as “growth of plankton in the Red Sea.” Search engines also don’t typically include raw data sets.

In contrast, GeoLink and Semantic Scholar aim to build fine-grained search engines tailored to specific subject areas by tapping into deeper semantic processing that helps computers establish scientifically meaningful connections between publications. When a scientist types in “plankton in the Red Sea,” for example, the search engine would not only recognize it as a string of characters that shows up in papers, but would also know the researchers who investigated the topic, the cruises they took, the instruments they used, and the data sets and papers they published. Google has applied similar techniques to improve its main search engine, but projects like GeoLink benefit from input from scientists with extensive knowledge of the subject area, who identify meaningful links that computer scientists then translate into code.
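The kind of linking described above is often modeled as subject–predicate–object triples, the basic unit of the semantic Web. The sketch below is purely illustrative; the entity names and predicates are hypothetical, not GeoLink’s actual schema, but it shows how a query for one entity can surface the researcher, cruise, and papers connected to it.

```python
# Hypothetical triples linking a data set to a researcher, a cruise, and a
# paper. The identifiers here are invented for illustration only.
triples = [
    ("dataset:red_sea_plankton", "collectedBy", "person:P_Wiebe"),
    ("dataset:red_sea_plankton", "collectedOn", "cruise:red_sea_jan"),
    ("cruise:red_sea_jan", "usedInstrument", "instrument:plankton_net"),
    ("paper:plankton_growth", "usesDataset", "dataset:red_sea_plankton"),
]

def related(entity, triples):
    """Return every entity directly linked to `entity`, in either direction."""
    out = set()
    for subj, _pred, obj in triples:
        if subj == entity:
            out.add(obj)
        if obj == entity:
            out.add(subj)
    return out

# One lookup on the data set surfaces the researcher, cruise, and paper.
print(sorted(related("dataset:red_sea_plankton", triples)))
```

A keyword engine would only find documents containing the literal phrase; a triple store like this can traverse the graph to answer questions the text never states explicitly.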

The potential of these projects goes beyond helping scientists find the right papers quickly, says computer scientist C. Lee Giles of Pennsylvania State University, University Park. By extracting information on methods and results from a paper and pooling the data together, search engines like Semantic Scholar could automate the process of literature review and comparison.

For example, Etzioni says, it would take a talented computer science graduate student weeks of extensive reading to gain an overview of techniques used in the last 5 years to perform dependency parsing (a task in natural language processing), the data sets produced, and the accuracy rates. And they’d probably miss a few things. In contrast, Semantic Scholar could potentially compile the techniques and results into a neat table within seconds. “We are imagining techniques that go way beyond just paper recommendation, to the point where we are really generating novel insights.”

Such an instant overview would especially benefit junior scientists and interdisciplinary researchers entering a new field of study, says computer scientist Christina Lioma of the University of Copenhagen. It would also enable scientists to identify emerging trends in a field and adjust their research directions accordingly, Giles says.

Realizing the technology’s potential, however, partially depends on having publicly accessible, text-minable literature for computers to read. Although governments are increasingly pushing for such open access, allowing machines to mine the full texts of papers held behind journal paywalls remains a contentious issue. For now, the GeoLink project will mine only publicly available abstracts of studies. (Semantic Scholar receives its papers from CiteSeerX, a digital library co-founded by Giles that covers 4 million open-access computer science papers.)

Computer scientists still have a lot of work to do to improve the accuracy of text processing, Giles says. For example, machines still trip up over tasks like identifying that “P. Wiebe” and “Peter Wiebe” refer to the same person.
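To see why name matching trips machines up, consider a minimal prefix-matching heuristic like the one below. It is an assumption-laden sketch, not any system’s actual method: the function name and logic are invented for illustration, and real entity resolution draws on far richer signals such as affiliations, co-authors, and topics.

```python
# Naive heuristic (illustrative only): an abbreviated author name matches a
# full name if each abbreviated part is a prefix of the corresponding part.
def names_may_match(abbreviated, full):
    a_parts = abbreviated.replace(".", "").split()
    f_parts = full.split()
    if len(a_parts) != len(f_parts):
        return False
    return all(f.lower().startswith(a.lower())
               for a, f in zip(a_parts, f_parts))

print(names_may_match("P. Wiebe", "Peter Wiebe"))   # matches
print(names_may_match("P. Wiebe", "Paula Wilson"))  # rejected
```

The heuristic’s weakness is exactly the ambiguity the article describes: “P. Wiebe” would match “Paul Wiebe” just as readily as “Peter Wiebe,” so disambiguation needs context beyond the name string itself.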

Nonetheless, Giles believes that the semantic Web approach “is the Web of the future.”