RL Weekly 37: Observational Overfitting, Hindsight Credit Assignment, and Procedurally Generated Environment Suite

by Seungjae Ryan Lee


Dear readers, Happy NeurIPS! This week, I have made my summaries more concise to improve the reading experience. I hope that this change makes the newsletter easier to digest. I look forward to your feedback, either by email or through the feedback form. Your input is always appreciated. - Ryan

Observational Overfitting in Reinforcement Learning

Xingyou Song1, Yiding Jiang1, Yilun Du2, Behnam Neyshabur1

1Google 2MIT

What it says

Observational overfitting is a phenomenon “where an agent overfits due to properties of the observation which are irrelevant to the latent dynamics of the MDP family.” For example, in the saliency map from the paper, the score and the background objects are highlighted in red because they are strongly correlated with progress. This can hinder generalization: the authors report that simply covering the scoreboard with a black rectangle during training resulted in a 10% increase in test performance. The authors use a Linear Quadratic Regulator (LQR) to validate the phenomenon, and find that overparametrization may help as a form of “implicit regularization.” The authors also try ImageNet networks (AlexNet, Inception, ResNet, etc.) on CoinRun environments, and show that overparametrization improves generalization to the test set.
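To make the scoreboard-masking idea concrete, here is a minimal sketch of blacking out a rectangular region of an image observation. It is an illustration only, not the authors' code: the function name `mask_region`, the 64x64 frame size, and the rectangle coordinates are all assumptions.

```python
import numpy as np


def mask_region(frame, top, left, height, width):
    """Return a copy of `frame` with one rectangle set to zero (black).

    `frame` is assumed to be a (rows, cols, channels) image array.
    The rectangle position is hypothetical; in practice it would be
    wherever the scoreboard is drawn in the observation.
    """
    masked = frame.copy()  # leave the original observation untouched
    masked[top:top + height, left:left + width] = 0
    return masked


# Toy all-white frame standing in for a game observation.
frame = np.full((64, 64, 3), 255, dtype=np.uint8)
masked = mask_region(frame, top=0, left=0, height=8, width=24)
```

Applying such a mask to every training observation removes one source of information that correlates with progress but is irrelevant to the dynamics.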


Hindsight Credit Assignment

Anna Harutyunyan1, Will Dabney1, Thomas Mesnard1, Nicolas Heess1, Mohammad G. Azar1, Bilal Piot1, Hado van Hasselt1, Satinder Singh1, Greg Wayne1, Doina Precup1, Rémi Munos1

1DeepMind

What it says

Estimating the value function is a critical part of RL, as it quantifies how choosing an action in a state affects future return. The reverse of this is the credit assignment question: “given an outcome, how relevant were past decisions?” The authors define the “hindsight distribution” of an action as the conditional probability, over trajectories, that the first action of the trajectory was that action, given some outcome (either state-conditional or return-conditional). This learned hindsight distribution can be used to better estimate value functions or policy gradients. The authors validate new algorithms that use Hindsight Credit Assignment on a few diagnostic tasks.
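One way to picture the hindsight distribution is as a conditional frequency: among trajectories that started in a given state and ended with a given outcome, how often was each action the first one taken? The sketch below counts this on toy data. It is a tabular illustration under stated assumptions, not the authors' algorithm (the paper learns the distribution rather than counting it); `estimate_hindsight` and the sample tuples are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_hindsight(samples):
    """Empirical P(first action | start state, outcome) from samples.

    `samples` is an iterable of (state, first_action, outcome) tuples,
    where `outcome` could be a future state or a return.
    """
    counts = defaultdict(Counter)
    for state, action, outcome in samples:
        counts[(state, outcome)][action] += 1

    hindsight = {}
    for key, action_counts in counts.items():
        total = sum(action_counts.values())
        hindsight[key] = {a: n / total for a, n in action_counts.items()}
    return hindsight


# Toy data: from state "s", "left" always leads to return 1,
# while "right" leads to return 1 only half the time.
samples = [
    ("s", "left", 1),
    ("s", "left", 1),
    ("s", "right", 1),
    ("s", "right", 0),
]
h = estimate_hindsight(samples)
```

In this toy data, observing a return of 1 makes "left" the more likely first action in hindsight, while a return of 0 points only to "right".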


Leveraging Procedural Generation to Benchmark Reinforcement Learning

Karl Cobbe1, Christopher Hesse1, Jacob Hilton1, John Schulman1

1OpenAI

What it says

OpenAI (Cobbe et al.) released a set of 15 new environments similar to the CoinRun environment released last year, where the environments are “procedurally generated.” Having content procedurally generated in many aspects (level layout, game assets, entity spawn location and timing, etc.) encourages the agent to learn a policy robust to such variations. Procedurally generated environments also allow for a natural division into training and test sets by generating different environments for each.
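The train/test division can be sketched with a seeded generator: each seed deterministically produces one level, so disjoint seed ranges give separate training and test sets. This is a toy illustration, not the API of the released environments; `generate_level`, the tile alphabet, and the seed ranges are all assumptions.

```python
import random


def generate_level(seed, width=16):
    """Hypothetical generator: a 1-D strip of floor '.', gap '_', coin 'o'.

    The same seed always yields the same layout, because the random
    number generator is created fresh from that seed.
    """
    rng = random.Random(seed)
    # Floor appears three times in the alphabet so it is the most common tile.
    return "".join(rng.choice("..._o") for _ in range(width))


# Disjoint seed ranges: the agent never trains on a test seed.
train_seeds = range(0, 200)
test_seeds = range(200, 300)

train_levels = [generate_level(s) for s in train_seeds]
test_levels = [generate_level(s) for s in test_seeds]
```

Disjoint seeds do not strictly guarantee distinct layouts, but with a large enough level space collisions become unlikely, and test performance then measures generalization to unseen levels rather than memorization.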


External resources

Here is some more exciting news in RL:

Reinforcement Learning: Past, Present, and Future Perspectives

The recording of a NeurIPS 2019 presentation on RL by Katja Hofmann (Microsoft Research) is available online.

Stable Baselines - Reinforcement Learning Tips and Tricks

Stable Baselines, a major, well-maintained fork of OpenAI Baselines, released a set of tips and tricks for RL.

Winner Announced for NeurIPS 2019: Learn to Move - Walk Around

The winners of each track of the NeurIPS 2019: Learn to Move - Walk Around challenge were announced.

New State-of-the-art for Hanabi

Facebook AI wrote a blog post on how they built a new bot that achieves state-of-the-art performance in Hanabi, a collaborative card game.