Recently at NeurIPS 2018 in Montreal, I witnessed Uber’s Jeff Clune present Go-Explore, their solution to Montezuma’s Revenge, the Atari game famous for posing a very difficult exploration problem to the current generation of RL algorithms.

Go-Explore requires a simplified state representation. Image from https://eng.uber.com/go-explore/

But Uber’s claim was met with controversy.

What was the controversy?

Afterwards at the poster session, I saw David Silver of Google’s DeepMind discussing the result with Jeff Clune, and they appeared to differ on what problem was actually being tackled, in terms of the assumptions needed for Go-Explore to work.

The general controversy around their claim of a SOTA result involves disagreements about whether the simplifying assumptions the method needs are valid. The main points of contention have been covered online:

Go-Explore requires resetting the environment to previously visited states for continued exploration. This can be done either by forcing the simulator directly into that state, or by replaying a memorized sequence of actions, which only works if the environment is fully deterministic. There is pushback on both the resetting and the deterministic replay. Atari games are generally deterministic, so if you find a perfectly timed sequence of actions you can reliably score well, but winning that way does not demonstrate intelligence; see e.g. the deliberately dumb “Brute” agent described in Machado et al. 2017. Sticky actions were designed precisely to cripple dumb approaches like Brute. Uber has said their approach can handle sticky actions, but only at test time, and this is not sufficient for some researchers: they want to see methods that handle sticky actions at train time, since otherwise the bulk of the learning happens on an easier variant of the problem. (The sketch below illustrates why determinism is so exploitable.)
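To make the determinism point concrete, here is a minimal sketch of Brute-style open-loop replay. This is my own illustration, not Uber’s code: `env` is assumed to be a Gym-style Atari environment, and for simplicity the stickiness is simulated in the replay loop, whereas Machado et al. put it inside the emulator itself.

```python
import random

def replay(env, actions, sticky_prob=0.0, seed=0):
    """Replay a memorized action sequence open-loop (no observations used).

    With sticky_prob=0.0 (deterministic ALE), the same score is reproduced
    on every run, which is exactly what Brute-style agents exploit. With
    sticky_prob=0.25, the previous action is sometimes repeated, the replay
    drifts off-trajectory, and the memorized sequence stops working.
    NOTE: real sticky actions live inside the emulator; simulating them
    here is purely for illustration.
    """
    rng = random.Random(seed)
    env.reset()
    total, prev_action = 0.0, 0
    for action in actions:
        if rng.random() < sticky_prob:
            action = prev_action          # emulator "ignores" the new action
        _, reward, done, _ = env.step(action)
        total += reward
        prev_action = action
        if done:
            break
    return total
```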

The use of downsampled screen pixels to summarize state is a brittle and non-general method. Uber’s Joost Huizinga noted that it does not even work on some ALE games, like Pitfall. He also says that other (presumably more general) state representations might be substituted; this remains to be seen. (A sketch of this kind of representation follows below.)
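For concreteness, here is a minimal sketch of such a downsampled “cell” representation, assuming a standard 210x160 RGB Atari frame. The cell dimensions and pixel depth below are illustrative placeholders, not necessarily Uber’s exact settings.

```python
import numpy as np

def frame_to_cell(frame, cell_h=8, cell_w=11, depth=8):
    """Collapse a (210, 160, 3) Atari frame into a tiny quantized image.

    Many distinct game states map to the same cell, which is what lets
    Go-Explore maintain a tractable archive of states to return to.
    Dimensions and pixel depth here are illustrative, not Uber's settings.
    """
    gray = frame.mean(axis=2)                      # RGB -> grayscale
    h, w = gray.shape
    # Crop so the frame divides evenly, then block-average down
    gray = gray[: h - h % cell_h, : w - w % cell_w]
    small = gray.reshape(cell_h, h // cell_h,
                         cell_w, w // cell_w).mean(axis=(1, 3))
    # Quantize intensities to `depth` levels and return a hashable key
    return (small / 256 * depth).astype(np.uint8).tobytes()
```

The archive can then key on these bytes: any two frames that collapse to the same cell count as the same “place” worth returning to.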

The addition of domain-specific knowledge (the character’s x-y position, the current room and level, keys held). This part is optional, but they get their best results by including this knowledge.
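A hypothetical hand-coded counterpart to the pixel cell above, with feature choices and grid size as my guesses rather than Uber’s exact scheme:

```python
def domain_cell(x, y, room, level, keys, grid=16):
    """Cell built from game-specific features (the optional variant).

    Coarse-graining x and y lets nearby positions share a cell, while
    room, level, and key count stay exact. All values are illustrative.
    """
    return (level, room, keys, x // grid, y // grid)
```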

Some pointed out that this method is less reinforcement learning than planning, and Uber’s Adrien Ecoffet replied that “The original ALE paper (https://arxiv.org/abs/1207.4708) tried a few planning algorithms, all of which got a flat 0 on Montezuma’s.” So at the very least, the originators of ALE considered planning approaches worth comparing against. Note that that very paper distinguishes the two classes of approaches succinctly in section 3: “Planning and reinforcement learning are two different AI problem formulations that can naturally be investigated within the ALE framework”.

Overall it would appear that the tension could be stated as follows:

Uber is motivated to claim SOTA and downplay the importance of its simplifying assumptions. Uber historically has not been a big player in RL research, so it is motivated to be seen as credible here.

Critics of the approach (DeepMind arguably being the most capable organization for RL research today) are motivated to keep the standards for SOTA very high: ostensibly to protect the sanctity of the field, but perhaps also so they can obtain SOTA eventually for themselves. DeepMind was instrumental in making Atari and Montezuma well-known benchmarks in the first place (DQN inventor Vlad Mnih started the deep RL revolution with Atari, and ALE paper author Marc G. Bellemare used to work for DeepMind), so it would be ironic if they lost this milestone to relative newcomers over a disagreement about the problem definition.

My own view is that, while the result is interesting, the controversy itself and what it means for RL research is even more interesting. I want to look at this situation through a specific lens:

The RL research world is itself a competitive, multi-agent reinforcement learning environment.

Researchers and institutions are RL agents.

Rewards are the media attention and accolades obtained by achieving, or making a convincing claim of achieving, State Of The Art results.

Agents (researchers) are continually learning how to optimize for this reward signal. It is well known that RL agents will discover hacks, glitches, shortcuts, and workarounds to obtain reward in unexpected ways.

Media attention for a claimed SOTA may not be identical to moving the actual SOTA forward. So there may be a gap between the reward function the agents experience and what we actually want to incentivize (true SOTA on truly hard problems).

The reporters who write headlines like “Uber AI ‘reliably’ completes all stages in Montezuma’s Revenge” simply do not have the sophistication to ask about the detailed assumptions behind a SOTA claim. Even people in RL seem to disagree on these points. And Uber published only a blog post, without a complete research paper that would contain all the nitty-gritty details and caveats.

What should DeepMind do?

DeepMind needs to close the gap between the globally ideal reward function (advancing the true SOTA) and the reward function experienced by agents (media accolades) by further clarifying what it sees as the true SOTA requirements. Some may claim that this has already been done, but if that were sufficiently true we would not be in this situation. The requirements should be offered both in technical depth, so that researchers can distinguish precisely between claims within and outside these bounds, and in layman’s terms, so that the media can digest them.

After clarifying the SOTA requirements, DeepMind should split the problem into two versions of Montezuma’s Revenge: one with the full SOTA constraints as DeepMind sees them (MR-0), and one without those constraints (MR-1). For example, MR-0 would require stochasticity through sticky actions at train time and not just test time, and would not allow resetting the simulator to specific states. (A sketch of the two variants follows below.)
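As a rough sketch of how the two variants might be pinned down using the OpenAI Gym Atari builds available at the time (MR-0/MR-1 are this post’s proposed names, not an existing benchmark):

```python
import gym

# MR-1: the easier, deterministic variant Go-Explore was trained on.
env_mr1 = gym.make("MontezumaRevengeNoFrameskip-v4")  # repeat_action_probability = 0.0

# MR-0: sticky actions at train time as well. Gym's "-v0" Atari builds are
# registered with repeat_action_probability = 0.25, per Machado et al.,
# so with probability 0.25 the emulator repeats the agent's previous action.
env_mr0 = gym.make("MontezumaRevengeNoFrameskip-v0")

# MR-0 would additionally forbid emulator snapshot/restore tricks such as
#   state = env_mr0.unwrapped.clone_full_state()
#   env_mr0.unwrapped.restore_full_state(state)
# which is the "reset to a previously visited state" shortcut.
```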

As for the pixel downsampling issue: it does not work on some other ALE games, so it clearly does not generalize. This suggests a hybrid benchmark that includes MR-0 along with *-0 variants of the other games, as a measure of the ability to generalize.

Presumably DeepMind would find solving MR-1 relatively simple. If they cared to, they might overwhelm the Go-Explore claim with one or more alternative solutions to the problem that, as far as possible, do not draw heavily on the Go-Explore methods. This would minimize the impact of Uber’s SOTA claim.

To see the effect of this, imagine that right after AlphaZero, another group had offered two very different ways to solve Go at the same level of performance; that would have strongly curtailed the perceived lead that DeepMind has in AI. That has not happened with AlphaZero and Go.

Whether they choose to knock down MR-1 or not, clearly defining the SOTA requirements would leave MR-0 as a legitimate, clearly defined, as-yet-unsolved grand challenge for AI, presumably for DeepMind or other researchers to solve in the coming years (or months, given the recent pace of RL research).

What should Uber do?

Uber does not have the deep bench of top RL researchers that DeepMind has, so its options are more limited. Presumably it wants publicity and bragging rights for its efforts, and it is getting them with its current strategy. Whether or not it meets certain researchers’ criteria for ALE, or generalizes well, I think Go-Explore is innovative and scrappy, and it brought the spotlight to Uber AI.

So Uber should keep doing what it’s doing: obtaining the results it can, vocally making its case for those results, and in some sense “gaming” the system to maximize its reward.

In doing so, Uber is forcing the community to clarify assumptions behind grand challenges, which will ultimately improve the arena for RL research.

(And this is not to say Uber’s methods are not valuable or useful; these concerns may be orthogonal to whether or not they can be considered SOTA for the MR-0 benchmark.)

tl;dr To use Uber’s language: the Montezuma/Go-Explore controversy provides an opportunity to “robustify” deep RL research itself.

If you enjoyed this article, send me endorphins by adding a clap or two below!