Montezuma’s Revenge, the classic Atari platform game, has finally been fully solved by machine learning, researchers from Uber AI Labs claim.

Reinforcement learning (RL) eggheads have been fascinated with the old 1980s game for a while. It features an Indiana Jones-like character named Panama Joe, who goes around exploring tombs and fending off enemies to find hidden treasure. The game is an ideal environment for studying the problem of sparse rewards.

Montezuma’s Revenge is tricky because its rewards are few and far between. A lot of intermediary steps, such as collecting various items to unlock rooms and defeating enemies, are required before any reward is given. It’s therefore not easy for machines to figure out: the path to success is not obvious.

But researchers at Uber – yes, that Uber – believe they have managed to do it with a new suite of algorithms dubbed Go-Explore. Unlike previous attempts to beat the game, this version of Panama Joe learns how to play the game without copying moves from human gameplay, a technique employed by DeepMind and OpenAI.

The Uber code also achieved the highest score so far, reaching over 400,000 points on average, and solved all three levels of the game. That is the key result: previous research efforts by others struggled to get much further than the first level, while Uber's software player, using new techniques, completed all three.

Explore, rinse and repeat

The Go-Explore work was led by Jeff Clune, a senior research manager at Uber AI Labs and an associate professor at the University of Wyoming, and the approach relies mostly on search algorithms rather than fancier AI methods.

The game is represented as a series of states that are encoded as “cells”. The researchers describe each cell with the position of the bot, the current room the player is in, the level being played, and the number of keys collected so far.
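The cell description above can be sketched as a small value type. This is a minimal illustration of the representation as the article describes it; the field names are hypothetical, not Uber's actual code.

```python
from dataclasses import dataclass

# A minimal sketch of the "cell" described above. The fields match the
# article's description; the names are illustrative, not Uber's code.
@dataclass(frozen=True)
class Cell:
    x: int      # agent's x position (coarsely discretised)
    y: int      # agent's y position
    room: int   # current room the player is in
    level: int  # level being played
    keys: int   # number of keys collected so far

# Frozen dataclasses are hashable, so cells can index an archive dict.
archive = {}
cell = Cell(x=8, y=12, room=1, level=1, keys=0)
archive[cell] = {"score": 0, "trajectory": []}
```

Because two game states that map to the same cell compare equal, the archive automatically groups the game's vast state space into a manageable set of entries.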

Go-Explore chooses a particular cell to start from and begins randomly exploring the game from that position. If a new path reaches a cell with a higher score, the trajectory is stored in an archive, and the system returns to those promising cells to explore from them again, improving over time.
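The choose-return-explore loop can be sketched in a few lines against a toy deterministic environment. This is an illustrative reconstruction of the loop as described, not Uber's actual code: the one-dimensional `ToyEnv` stands in for the Atari emulator, and the state itself serves as the cell.

```python
import random

class ToyEnv:
    """Toy deterministic stand-in for the emulator: the state is a position
    on a number line, and the sole reward sits at position 5."""
    def restore(self, state):
        self.pos = state
    def step(self, action):  # action is -1 or +1
        self.pos += action
        return self.pos, (1 if self.pos == 5 else 0), self.pos == 5

def go_explore(env, iterations=2000, horizon=10, seed=0):
    rng = random.Random(seed)
    # Archive maps cell -> best entry (restorable state, score, trajectory).
    archive = {0: {"state": 0, "score": 0, "trajectory": []}}
    for _ in range(iterations):
        cell = rng.choice(list(archive))   # 1. choose an archived cell
        entry = archive[cell]
        env.restore(entry["state"])        # 2. return to it exactly
        traj = list(entry["trajectory"])
        score = entry["score"]
        for _ in range(horizon):           # 3. explore randomly from it
            action = rng.choice([-1, 1])
            state, reward, done = env.step(action)
            traj.append(action)
            score += reward
            # Keep a cell if it is new, or was reached with a better score.
            if state not in archive or score > archive[state]["score"]:
                archive[state] = {"state": state, "score": score,
                                  "trajectory": list(traj)}
            if done:                       # reached the reward
                return traj
    return None
```

The `restore` call is the step that leans on determinism: because the emulator can be reset to a saved state exactly, no learned policy is needed to get back to a promising cell.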

This is repeated until the AI eventually works out how to complete a level, and because this first phase relies on search, it doesn’t require the hefty compute of the neural networks traditionally used. The second phase, which uses imitation learning, does, however. Imitation learning lets the agent retrace the actions taken along those high-scoring pathways, so it can copy them and apply them in different states and in new levels.
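The retracing step itself is simple in a deterministic emulator: replaying a stored action sequence from the same start state reproduces the same path and score. Below is a self-contained sketch with hypothetical names, again using a number-line toy in place of the emulator; in the full system, it is the neural-network policy trained to imitate such trajectories that accounts for the heavy compute.

```python
# Toy deterministic transition standing in for the emulator: the agent
# moves along a line and the sole reward sits at position 5.
def toy_step(state, action):
    state += action
    return state, (1 if state == 5 else 0)

def replay(step_fn, start_state, actions):
    # In a deterministic emulator, replaying the stored actions from the
    # same start state yields the same final state and total score.
    state, total = start_state, 0
    for action in actions:
        state, reward = step_fn(state, action)
        total += reward
    return state, total
```

Replaying `[1, 1, 1, 1, 1]` from state 0, for instance, lands on position 5 and collects the single reward.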

This makes the agent more robust, and the researchers reckon that their work could be useful for robotics. Bots could explore different solutions to a task in simulation first and then execute the best one in the real world.

But, not everyone is convinced. The method proposed by Uber AI Labs works because Montezuma’s Revenge is deterministic. The game has the same layout every time it’s played so it’s easy to memorise and work out how to beat it. The real world, however, isn’t like a computer game.


“Many simulated environments make assumptions that do not hold in the real world, for example, the assumption that the world is deterministic, its states are fully observable, the agent's actions are timely, or that the world stands still while the agent computes actions and makes learning updates,” Rupam Mahmood, AI research lead at Kindred AI, a robotics startup, told The Register.

"Unfortunately, such differences between the real world and simulated environments have made learning with physical robots difficult for the current learning methods."

It may be more suitable for tasks where the environment doesn’t change too much, Julian Togelius, an associate professor focused on applying AI for games, told El Reg. “Say you had a warehouse with robots running around, this method could help teach the robots how to navigate if the layout of the warehouse stays the same.

“Go-Explore throws the whole RL problem overboard and decides to simply search instead, and exploit the fact that the environment is deterministic and the simulation is resettable.”

Both researchers also said that, at the moment, specific details on the algorithms remain unclear as a full peer-reviewed academic paper has yet to be published. ®