Illustration of the deep web versus the surface web: most web users see only the 'tip of the iceberg'. Credit: NASA/JPL-Caltech

The Defense Advanced Research Projects Agency (DARPA) is reportedly developing computer tools as part of its Memex programme to probe the little-seen part of the online world referred to as the “Deep Web”. Data scientists at NASA have also joined the effort to harness the benefits of Deep Web searching for scientific advancement.

A Google search on any given topic produces a vast amount of information; however, the results that pop up tell only part of the story: the “surface web”. In fact, it is estimated that less than 5% of the Internet may be accessed this way. The rest is a colossal store of information, sometimes called the “Deep Web”, that is overlooked by conventional search engines. Michael Bergman, founder of BrightPlanet, likens conventional searching to “[placing] a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, [unseen].” This information may be useful for tracking criminals and mapping the spread of certain conditions. It might also prove useful for searching images and data from spacecraft.

Chris Mattmann, principal investigator for the Memex work at NASA’s Jet Propulsion Laboratory (JPL), says, “We’re developing next-generation search technologies that understand people, places, things and the connections between them.” Memex may also be capable of searching images, videos, pop-up ads, forms and scripts, in addition to performing traditional text-based searches. “We’re augmenting Web crawlers to behave like browsers… in other words, executing scripts and reading ads in ways that you [might] when you usually go online. This information is normally [un]catalogued by search engines,” Mattmann said. The search tool may be able to track an object across many frames of a video, or even across different videos.

Memex also aims to improve the search for published scientific data, making it more accessible to scientists. The technology might also be applied to NASA’s large data centres, such as the Physical Oceanography Distributed Active Archive Center, which collates and organises NASA’s ocean and climate data. On a more widely relatable level, Memex may make PDF documents easier to search, helping internet users to find the information they need more quickly.

Additionally, NASA may feed data returned from the Curiosity rover’s many cameras and scientific instruments to Memex, using the search tool to spot patterns and links in observations of Mars more easily. “Searching visual information about a particular planetary body [might] greatly facilitate the work of scientists in analysing geological features. Scientists analysing imaging data from Earth-based missions that monitor phenomena such as snowfall and soil moisture [might] similarly benefit,” according to NASA’s JPL.

All of the code written for Memex is open source. In software development, open source is a decentralised model in which a free licence gives anybody access to the source code, the computer programme’s “blueprints”, meaning anybody may improve the design. Memex is being designed in this way to encourage collaborative improvement, with changes shared within the community: others may download and modify the programme, then publish their version (or fork) back to the community. “We are developing open source, free, mature products and then enhancing them using DARPA investment and easily transitioning them via our roles to the scientific community,” Mattmann pointed out. JPL is one of 17 teams working on Memex as part of this initiative.

Memex is a sister project to XDATA, DARPA’s previous big data project, which was also aimed at analysing huge volumes of data with defense, government and civilian applications.

What other productive information might science glean from the “Deep Web”?