A few years ago, a breakthrough in machine learning suddenly enabled computers to recognize objects shown in photographs with unprecedented—almost spooky—accuracy. The question now is whether machines can make another leap, by learning to make sense of what’s actually going on in such images.

A new image database, called Visual Genome, could push computers toward this goal, and help gauge the progress of computers attempting to better understand the real world. Teaching computers to parse visual scenes is fundamentally important for artificial intelligence. It might not only spawn more useful vision algorithms, but also help teach computers to communicate more effectively, because language is so intimately tied to representation of the physical world.

Visual Genome was developed by Fei-Fei Li, a professor who specializes in computer vision and who directs the Stanford Artificial Intelligence Lab, together with several colleagues. “We are focusing very much on some of the hardest questions in computer vision, which is really bridging perception to cognition,” Li says. “Not just taking pixel data in and trying to make sense of its color, shading, those sorts of things, but really turn that into a fuller understanding of the 3-D as well as the semantic visual world.”

Li and colleagues previously created ImageNet, a database containing more than a million images tagged according to their contents. Each year, the ImageNet Large Scale Visual Recognition Challenge tests the ability of computers to automatically recognize the contents of images.

In 2012, a team led by Geoffrey Hinton at the University of Toronto built a large and powerful neural network that could categorize images far more accurately than anything created previously. The technique used to enable this advance, known as deep learning, involves feeding thousands or millions of examples into a many-layered neural network, gradually training each layer of virtual neurons to respond to increasingly abstract characteristics, from the texture of a dog’s fur, say, to its overall shape.
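To make the idea of training a layered network from examples more concrete, here is a minimal sketch in plain Python. It is not the Toronto team's model: the two-layer size, the XOR toy task, the learning rate, and every function name are assumptions chosen purely for illustration. It only shows the general pattern described above, in which examples are fed through stacked layers and each layer's weights are nudged a little after every example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """Create one layer: a weight row per neuron, plus one bias per neuron."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, layers):
    """Feed x through every layer; return the activations of each stage."""
    activations = [x]
    for weights, biases in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * p for w, p in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return activations


def train_step(x, target, layers, lr):
    """One gradient-descent update on a single example (squared error)."""
    acts = forward(x, layers)
    # Error signal at the output layer.
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], target)]
    for i in reversed(range(len(layers))):
        weights, biases = layers[i]
        prev = acts[i]
        # Pass the error back to the layer below before changing the weights.
        if i > 0:
            next_delta = [
                sum(weights[j][k] * delta[j] for j in range(len(delta)))
                * prev[k] * (1.0 - prev[k])
                for k in range(len(prev))
            ]
        else:
            next_delta = []
        for j, d in enumerate(delta):
            for k, p in enumerate(prev):
                weights[j][k] -= lr * d * p
            biases[j] -= lr * d
        delta = next_delta


def total_loss(examples, layers):
    return sum(
        0.5 * (forward(x, layers)[-1][0] - t[0]) ** 2
        for x, t in examples
    )


# Toy task (an assumption for illustration): learn XOR from four examples.
rng = random.Random(0)
layers = [make_layer(2, 4, rng), make_layer(4, 1, rng)]
examples = [
    ([0.0, 0.0], [0.0]),
    ([0.0, 1.0], [1.0]),
    ([1.0, 0.0], [1.0]),
    ([1.0, 1.0], [0.0]),
]

loss_before = total_loss(examples, layers)
for _ in range(5000):
    for x, t in examples:
        train_step(x, t, layers, lr=0.5)
loss_after = total_loss(examples, layers)
```

Comparing `loss_before` with `loss_after` shows how much the repeated small adjustments reduced the error. Real image classifiers follow the same loop at a vastly larger scale, with many more layers and millions of photographs instead of four toy examples.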

The Toronto team’s achievement set off both a boom of interest in deep learning and a sort of renaissance in artificial intelligence more generally. Deep learning has since been applied in many other areas, making computers better at other important tasks, such as processing audio and text.

The images in Visual Genome are tagged more richly than in ImageNet, including the names and details of various objects shown in an image; the relationships between those objects; and information about any actions that are occurring. This was achieved using a crowdsourcing approach developed by one of Li’s colleagues at Stanford, Michael Bernstein. The plan is to launch an ImageNet-style challenge using the data set in 2017.
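One way to picture this richer kind of tagging is as a small graph of objects and the relationships between them. The sketch below is a hypothetical schema written for illustration only; the class names, fields, and sample annotation are assumptions and do not reflect the actual Visual Genome file format.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    attributes: list = field(default_factory=list)  # details such as "wooden"


@dataclass
class Relationship:
    subject: int    # index into SceneAnnotation.objects
    predicate: str  # a relation or action, e.g. "on" or "typing on"
    target: int     # index into SceneAnnotation.objects


@dataclass
class SceneAnnotation:
    objects: list
    relationships: list

    def describe(self):
        """Render each relationship as a short phrase."""
        return [
            f"{self.objects[r.subject].name} {r.predicate} {self.objects[r.target].name}"
            for r in self.relationships
        ]


# A made-up annotation for a single office photo.
scene = SceneAnnotation(
    objects=[
        SceneObject("man", ["seated"]),
        SceneObject("laptop", ["open"]),
        SceneObject("desk", ["wooden"]),
    ],
    relationships=[
        Relationship(0, "typing on", 1),
        Relationship(1, "on", 2),
    ],
)

phrases = scene.describe()
print(phrases)  # ['man typing on laptop', 'laptop on desk']
```

A plain object tag would stop at "man", "laptop", and "desk". Storing the links between them is what lets an algorithm be trained and tested on what is happening in the picture, not just on what appears in it.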

Algorithms trained using examples in Visual Genome could do more than just recognize objects, and ought to have some ability to parse more complex visual scenes.

“You’re sitting in an office, but what’s the layout, who’s the person, what is he doing, what are the objects around, what event is happening?” Li says. “We are also bridging [this understanding] to language, because the way to communicate is not by assigning numbers to pixels—you need to connect perception and cognition to language.”

Li believes that deep learning will likely play a key role in enabling computers to parse more complex scenes, but that other techniques will help advance the state of the art.

The resulting AI algorithms could perhaps help organize images online or in personal collections, but they might have more significant uses, enabling robots or self-driving cars to understand a scene properly. They could conceivably also be used to teach computers more common sense, by helping them learn which concepts are physically plausible and which are not.

Richard Socher, a machine-learning expert and the founder of an AI startup called MetaMind, says this could be the most important aspect of the project. “A large part of language is about describing the visual world,” he says. “This data set provides a new scalable way to combine the two modalities and test new models.”

Visual Genome isn’t the only complex image database out there for researchers to experiment with. Microsoft, for example, has a database called Common Objects in Context, which shows the names and positions of multiple objects in images. Google, Facebook, and others are also pushing the ability of AI algorithms to parse visual scenes. Research published by Google in 2014 showed an algorithm that can provide basic captions for images, with varying levels of accuracy (see “Google’s Brain-Inspired Software Describes What It Sees in Complex Images”). And, more recently, Facebook showed a question-and-answer system that can answer very simple queries about images (see “Facebook App Can Answer Basic Questions About What’s in Photos”).

Aude Oliva, a professor at MIT who studies machine and human vision, has developed a database called Places2, which contains more than 10 million images of different specific scenes. This project is meant to inspire the development of algorithms capable of describing the same scene in multiple ways, as humans tend to do. Oliva says Visual Genome and similar databases will help advance machine vision, but she believes that AI researchers will need to draw inspiration from biology if they want to build machines with truly human-like capabilities.

“Humans draw their decision and intuition on lots of knowledge, common sense, sensory experiences, memories, and ‘thoughts’ that are not necessarily translated into language, speech, or text,” Oliva says. “Without knowing how the human brain creates thoughts, it will be difficult to teach common sense and visual understanding to an artificial system. Neuroscience and computer science are the two sides of the AI coin.”