Both the DeepMind and CMU approaches use deep reinforcement learning, popularized by DeepMind’s Atari-playing AI. A neural network is fed raw pixel data from a virtual environment and uses rewards, like points in a computer game, to learn by trial and error (see “10 Breakthrough Technologies 2017: Reinforcement Learning”).

Normally, the goal might be to achieve a high score inside the game, but here the two AI programs were given commands like “go to the green pillar” and then had to navigate to the correct object to receive rewards.
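To make the reward setup concrete, here is a minimal sketch of how a command-conditioned reward could be computed in a toy grid world. It is an illustration only, not code from either paper: the `SceneObject` class, the `parse_command` helper, and the reward values of 1.0 and -0.2 are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    color: str
    shape: str
    position: tuple  # (x, y) grid cell; hypothetical toy layout


def parse_command(command):
    """Pull (color, shape) out of commands shaped like 'go to the green pillar'."""
    words = command.lower().split()
    return words[-2], words[-1]


def reward(command, agent_position, objects, hit=1.0, miss=-0.2):
    """Return `hit` on the commanded object, `miss` on any other object, else 0."""
    target = parse_command(command)
    for obj in objects:
        if obj.position == agent_position:
            return hit if (obj.color, obj.shape) == target else miss
    return 0.0  # the agent is not standing on any object yet


scene = [
    SceneObject("green", "pillar", (3, 1)),
    SceneObject("red", "ball", (0, 2)),
]

print(reward("go to the green pillar", (3, 1), scene))
print(reward("go to the green pillar", (0, 2), scene))
```

In this sketch the agent never sees the parsed target. It only receives the command text, its view of the scene, and the number that comes back, which is what forces it to work out what the words mean.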

By running through millions of training scenarios at accelerated speeds, both AI programs learned to associate words with particular objects and characteristics, which let them follow the commands. They even learned to understand relational terms like “larger” or “smaller” to discriminate between similar objects.

Most important, both programs could “generalize” what they learned to unseen situations. If training scenarios contained pillars and also red objects, they could carry out the command “go to the red pillar” even if they had never seen one in training.
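One simple way to test this kind of generalization is to hold back a specific word combination during training while making sure each individual word still appears. The sketch below shows such a split. The color and shape vocabularies and the `to_command` helper are assumptions for illustration, not the lists used by either team.

```python
from itertools import product

# Hypothetical vocabularies; the real experiments used their own object sets.
colors = ["red", "green", "blue"]
shapes = ["pillar", "ball", "torch"]

# The combination the agent must never see during training.
held_out = {("red", "pillar")}

all_pairs = list(product(colors, shapes))
train_pairs = [pair for pair in all_pairs if pair not in held_out]
test_pairs = [pair for pair in all_pairs if pair in held_out]


def to_command(pair):
    """Turn a (color, shape) pair into a navigation command."""
    color, shape = pair
    return f"go to the {color} {shape}"


# Every word in a held-out command must still occur somewhere in training,
# otherwise the test would measure guessing rather than recombination.
train_words = {word for pair in train_pairs for word in pair}
covered = all(word in train_words for pair in test_pairs for word in pair)

print(len(train_pairs), "training combinations")
print("held-out command:", to_command(test_pairs[0]))
print("all held-out words seen in training:", covered)
```

An agent that succeeds on the held-out command has combined "red" and "pillar" from separate experiences rather than memorizing whole phrases.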

This makes them much more flexible than previous rule-based systems, says Chaplot. The CMU team mixed the visual and verbal input in a way that focused the AI’s attention on the most relevant information, while DeepMind gave its system extra learning objectives, such as predicting how its view will change as it moves, that boosted its overall performance. Since the two approaches tackle the problem from different angles, Chaplot says combining them could provide even better performance.
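One common way to let an instruction steer attention over visual input is multiplicative gating, where the text produces a weight between 0 and 1 for each visual feature channel. The sketch below illustrates that general idea in plain Python and makes no claim to reproduce the CMU team’s code. The two feature channels and the instruction scores are hand-set, hypothetical values; in a trained system both would be learned.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_channels(feature_maps, instruction_scores):
    """Scale each visual channel by a gate derived from the instruction."""
    if len(feature_maps) != len(instruction_scores):
        raise ValueError("need exactly one instruction score per channel")
    gates = [sigmoid(score) for score in instruction_scores]
    return [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_maps)
    ]


# Two hypothetical 2x2 channels: channel 0 responds to green, channel 1 to red.
feature_maps = [
    [[0.9, 0.1], [0.0, 0.2]],
    [[0.1, 0.8], [0.3, 0.0]],
]

# Hypothetical encoding of a command about a green object:
# a high score for the green channel, a low score for the red one.
instruction_scores = [6.0, -6.0]

gated = gate_channels(feature_maps, instruction_scores)
for channel in gated:
    print([[round(value, 3) for value in row] for row in channel])
```

After gating, the green channel passes through almost unchanged while the red channel is pushed close to zero, so later layers see mostly the information the command asked about.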

The DeepMind researchers didn’t respond to requests for comment.

“These papers are preliminary, but I think it’s pretty exciting to see the progress they are making,” says Pedro Domingos, a professor at the University of Washington and the author of The Master Algorithm, a book about different machine-learning methods.

The research follows a trend in AI that involves bringing together hard problems like language and robotic control. Counterintuitively, this makes both challenges easier, Domingos says. That’s because understanding language is easier if you can access the physical world it refers to, and learning about the world is easier with some guidance.

Because the approach requires millions of training runs, Domingos is not convinced that pure deep reinforcement learning will ever crack the real world. He thinks DeepMind’s AlphaGo, often held up as a benchmark of AI progress, actually shows the importance of incorporating a variety of AI approaches.

Michael Littman, a professor at Brown University who specializes in reinforcement learning, says the results are “impressive” and the visual input is far more challenging than that used in earlier work. Most previous attempts to use simulators to ground language have been restricted to simple 2-D environments, he notes.

But Littman echoes Domingos’s concerns about the approach’s real-world scalability and points out that the commands are generated in a formulaic way based on goals set by the simulator. That means they’re not really representative of the imprecise and contextual commands humans are likely to give machines in real life.

“I’m worried that people might look at the examples of the network responding intelligently to verbal commands and extrapolate that these networks understand language and navigation much more deeply than they actually do,” says Littman.