This report describes a submission to the OpenAI Retro Contest with the goal of performing few-shot transfer reinforcement learning. The submission concentrates on DQN-based approach due to its off-policy nature and uses Noisy-Net instead of Distributional RL as the head of Rainbow. Replay buffer was initialized with randomized or pretrained expert policy in DAGGER-like fashion. Implemented #Exploration and Self-Supervised Prediction exploration algorithms and experimented with Soft Q-Learning to increase diversity of learned policies. Distilled Q-function across training environments using cross-entropy of advantage and used encoder network features learned during distillation to speed up convergence and transfer. The approach described in this report achieved 4819.08 points on the leaderboard test set, landed 15th spot and missed TOP10 by 105 points (overall top score was 6115.20).

Introduction

The goal of the contest was to perform few-shot transfer reinforcement learning by pretraining on provided 47 levels of Sonic The Hedgehog game and then maximizing average episode return on different, but similar levels of the game while being limited to 1M “shots” on each evaluation level. The contest organizers provided additional 11 validation levels and had two separate sets of test levels: one for leaderboard during the competition and one for final evaluation after the competition in order to avoid overfitting to the competition test set. During the competition it was allowed to sequentially submit solutions for leaderboard test evaluation thus enabling to perform substantial hyperparameter search.

Contest organizers also provided two reinforcement learning baseline implementations based on PPO (on-policy policy gradient) and Rainbow (off-policy Q-learning) algorithms. Both algorithms were evaluated on validation set either without any pretraining and with jointly pretraining one set of parameters across all training levels. Interestingly policy gradient based PPO algorithm did relatively poorly on validation without any pretraining, but benefited a lot from joint pretraining. At the same time Rainbow did relatively well without pretraining, but benefited relatively less from joint pretraining.

In addition a separate simple baseline JERK was provided that ignored observations and tried performing random exploration of action space and repeating action sequences that were previously beneficial. Some of submissions successfully extended this approach further despite the fact that originally deterministic game environment was made slightly non-deterministic by randomly delaying actions.

The approach described below achieved 4819.08 points on the leaderboard test set, landed 15th spot and missed TOP10 by 105 points (overall top score was 6115.20). It was a great learning experience for trying out various RL algorithms. The algorithm was run only 4 times on the leaderboard test set so hopefully it doesn’t overfit too much, although the variance can be quite large due to having only 5 levels in the leaderboard test set. Interestingly it achieves around 4K points on the validation set of 11 levels, so high test evaluation variance is pretty much a given.

Although final contest results have not been published yet, it appears that the provided Rainbow baseline hyperparameters could be tuned substantially to achieve better results on the leaderboard test set. In particular Rainbow baseline target network update period was set to 8K (32K in the original Rainbow paper), while some contestants experimented with setting it as low as 64. In retrospect such setting makes intuitive sense as the number of allowed steps in evaluation is only 1M, while the original Rainbow paper achieved most of its progress in 10M+ steps. Post-competition experiments on validation set confirm the benefits of low target network update rate, however note that the default value was used in the final submission.

Overview

The approach described in this report can be summarized as follows:

Concentrated on DQN-based approach due to its off-policy nature that allows reusing previously collected samples thus potentially leading to better sample efficiency, as compared to on-policy policy gradient methods such as TRPO or A3C.

Initialized replay buffer with randomized, pretrained or potentially arbitrary expert policy and randomly switched between expert and learned policy akin to apprenticeship learning and DAGGER.

Implemented a #Exploration and Self-Supervised Prediction for curiosity-driven exploration to alleviate the problem of sparse rewards, which are especially challenging in the context of maze navigation.

Experimented with improving exploration and transfer from training to evaluation by rewarding diversity of learned policies using Soft Q-Learning.

Trained on each training level separately, but then distilled the joint policy either using cross-entropy or MSE of Q-values. Repeated initialize-train-distill cycle multiple times similar to Policy Distillation.

Trained Noisy-Net instead of Distributional RL as the network head of Rainbow in order to perform policy distillation and to be more robust to varying reward scale.

Besides changing the network head, no other changes were made to baseline convolutional policy or value function network architecture of Rainbow. The input is four downscaled gray-scale screen frames of resolution 84x84. It would be interesting to experiment with recurrent policy or value function networks, however efficiently implementing replay buffer for learning recurrent Q-function is somewhat involved.

The rewards in the contest were proportional to changes in the x-coordinate of the main character: the further to the right the character goes the higher the reward. Similarly to the baseline the rewards were modified to never decrease the cumulative reward, which unfortunately means that in complex mazes there might be no additional rewards for large number of steps. Given that most RL algorithms operate with reward discount 0.99 there is essentially no planning horizon visibility beyond 100 steps. For this reason curiosity-driven exploration rewards become essential.

The raw action space of Sonic The Hedgehog is 12 buttons that can be on or off, however the provided contest baselines operated with 7 predefined button combinations. No changes to the baseline action space were made, although it seems that adding NOOP action would benefit a lot in some environments. It would be interesting to try training DQN in the original 2¹² action space although preserving sample-efficiency could prove challenging.

Exploration

In order to reward exploration the following approaches were applied:

Noisy-Net, being already part of Rainbow, performs exploration in parameter space of the Q-function by sampling network parameters from a normal distribution defined by learned mean and variance, the latter presumably going down to zero as the learning converges. Noisy-Net is essentially a more efficient way of bootstrapping DQN parameters.

#Exploration is a relatively simple way of performing curiosity-driven exploration based on the idea of extending upper-confidence bound from multi-armed bandit setting to RL with uncountable continuous observations. The main idea is to encode visited states as binary codes using a modified autoencoder and maintaining separate visitation counter for each encoding value. The exploration reward is proportional to 1/sqrt(N(encode(state)).

One disadvantage of curiosity-driven exploration is that as the training progresses the exploration rewards tend to decrease. To compensate for this behavior additional local exploration reward was used that was proportional to 1/N_local(encode(state)), where visitation counters were reset at the start of each episode. Additionally due to the nature the task rewards the curiosity rewards were weighted by the distance from the origin.

Pretrained convolutional autoencoder with 64-sigmoid embedding layer was used to encode observations to binary 64 bit codes by rounding embedding values to closest integer. During pretraining uniform noise was added to the autoencoder embedding values to force them to be close to either 0 or 1. No additional autoencoder regularization was used. Autoencoder was not additionally finetuned during the runtime.

Self-supervised prediction-based curiosity-driven exploration learns an embedding of states and then uses it in forward and inverse prediction models:

inverse model tries to predict which action caused the transition from given current state to the next state as described by their embeddings,

forward model minimizes MSE of predicting embedding of the next state given embedding of the current state and one-hot encoding of the action, with stop-gradient applied to the embedding layer of the next state.

The model is trained to minimize weighted sum of inverse and forward prediction losses in parallel with actual execution of the policy with exploration reward proportional to the forward loss. The original work used on-policy A3C algorithm, so in DQN setting additional replay buffer had to be introduced. The embedding layer was pretrained separately to speed-up convergence and transfer however fine-tuning of the embedding layer was allowed during runtime. Self-supervised exploration was not used in final submission.

Diversity

Presumably diversity of learned policies is beneficial for transfer from training levels to evaluation levels. As mentioned earlier exploration reward methods tend to stop rewarding policy diversity at convergence. In principle, for model-based curiosity methods some diversity is preserved as long as the learned model keeps (catastrophically) forgetting states that haven’t been observed for a long time. However there is no direct way of adjusting or optimizing for learned policy diversity.

Some experiments were performed with augmenting PPO, NoisyNet, and Distributional RL with entropy regularization, including Soft Q-Learning that interprets Q-values as energies. Entropy regularization was not used in final submissions, but was used during pretraining together with annealing temperature to zero, consequently sotf-max of actions slowly becoming arg-max. In practice no substantial improvement of learned policies was observed, although the experiments were too superficial to be significant.

Another interesting avenue would have been to directly maximize entropy of (sequences of) of actions and observations generated by a policy:

Initialization & Distillation

As mentioned in the overview using off-policy DQN-based methods allows pre-populating replay buffer with partial demonstrations from expert policies. Each episode starts with sampling actions from trained policy by default, however with large probability it switches to expert policy. As the learning progresses the probability of switching from expert to learned policy increases and the probability of switching from learned to expert policy decreases.

Two expert initializations were used:

Random expert which repeatedly randomly picks left or right direction and goes in that direction for randomly chosen number of steps while periodically jumping for randomly chosen number of steps.

Policy expert from previous policy distillation iteration.

In principle any other source of demonstrations could be used to prepopulate replay buffer.

In practice random expert worked substantially better in transfer setting presumably because it could perform long correlated action sequences (eg moving right for 100 steps while jumping). It would be interesting to see if recurrent policy expert would provide better initialization. Policy expert was still useful for speeding up convergence on training levels after distillation step described below.

After each iteration of pretraining on each training level separately, last 25% of learned policy actions were collected together with observations and Q-values, followed by distillation step:

Policy distillation by optimizing cross-entropy between collected actions and distilled policy action probabilities.

Value function distillation by optimizing MSE between collected Q-values and distilled Q-values of actions.

Advantage distillation by optimizing cross-entropy between collected actions and probability distribution of actions defined by softmax of distilled Q-values of actions.

Advantage distillation turned out to be the most beneficial in transfer setting, although it was used only for learning representations of observations that are beneficial for learning Q-function: the whole output layer was kept randomly initialized during transfer, which also enabled reusing learned weights for training both Noisy-Net and Distributional RL flavor of Rainbow. In practice the final submission was done using Noisy-Net.

In summary the approach is a variation of Policy Distillation, Actor-Mimic and DAGGER. A more modern but presumably also more expensive approach would be Distral that applies similar ideas in the context of simultaneous joint multi-task training, performs distillation in action-space, and applies Soft Q-Learning. Another (somewhat) off-policy approach is IMPALA that uses off-policy correction for actor-critic in the setting of distributed training of multi-task joint policy.

Evaluation Results

The DQN model submitted for evaluation on leaderboard test set consisted of:

Rainbow with Noisy-Net head

Random expert replay buffer initialization,

Q-function with encoding network initialized using advantage distillation

#Exploration exploration reward with pretrained convolutional autoencoder with 64 convolutional filter dimensions and 64 embedding dimensions

Target update interval: 8192

The following DQN models were evaluated on leaderboard test set:

DQN model with replay buffer initialized with distilled policy for 100K steps, resulting in score: 4057.54

DQN model with replay buffer initialized with random expert for 100K steps, resulting in score: 4683.12

DQN model with replay buffer initialized with random expert for 200K steps, resulting in score: 4819.08

Additionally the following PPO model was evaluated on leaderboard test set:

PPO model with policy initialized using policy distillation, resulting in score: 3262.52

No exploration reward was used with PPO because it didn’t provide any benefits during validation.

Leaderboard submissions were performed from git master branch. The final leaderboard submission described above achieved score 4819.08 landed at 15th spot and missed TOP10 by 105 points. For comparison, the best submission on the leaderboard reached 6115.20 points.

The DQN model on validation set consisted of:

Rainbow with Noisy-Net head,

Random expert replay buffer initialization for the first 100K steps,

Q-function with encoder network initialized using advantage distillation,

#Exploration curiosity exploration reward with pretrained convolutional autoencoder having 64 convolutional filter dimensions and 64 embedding dimensions with autoencoder weights fixed during runtime,

Self-Supervised Prediction curiosity exploration reward with pretrained autoencoder having 32 convolutional filter dimensions and 256 embedding dimensions with autoencoder weights finetuned during runtime.

This DQN model was evaluated on validation set with different settings of Rainbow target update interval:

8192 — resulting in mean score 3158.64 and final score 3879.57 ,

and final score , 1024 — resulting in mean score 3456.33 and final score 4507.38 ,

and final score , 64 — resulting in mean score 3677.11 and final score 4566.99.

In addition the same DQN model was evaluated with replay buffer initialized with distilled policy expert instead of random expert with different settings of Rainbow target update interval:

8192 — resulting in mean score 3099.08 and final score 4014.38 ,

and final score , 1024 — resulting in mean score 3442.56 and final score 4353.47 ,

and final score , 64 — resulting in mean score 3204.20 and final score 4285.58.

Final validation was performed post-competition and used git branch advanced_exploration. As can be seen from these results, replay buffer initialization with random expert leads to improved validation scores. Validation results have also shown that using Q-function with encoder network initialized using advantage distillation, resulted in about ~10% improvement in validation score. However, there was no conclusive evidence of usefulness of exploration rewards on validation set, which might be due to allowed number of steps being limited to 1M. Exploration rewards provably did help on some training levels. Finally, changing target network update interval to 64 from the default 8192 improves validation score by ~15%. Presumably similar change to the final submission would increase its leaderboard score from 4819.08 to ~5500.

Alternative Approaches

In principle access to the simulator and human demonstrations opens avenues for various alternative approaches.

Imitation learning from human demonstrations could substantially reduce exploration and consequently pretraining time. Given that human demonstrations have action space that is likely distinct from action space used by RL algorithms, it would make more sense to apply eg generative adversarial imitation learning only on sequences of observations without actions. A more simple approach could be to try reward shaping with value function learned from demonstrations although it is unclear if it is possible to learn good value function only from positive demonstrations. Another approach could be feeding human demonstration observations without actions into the replay buffer to train the value function part of the dueling network architecture.

Having access to simulator and human demonstrations opens the avenue to implementing curriculum learning: replay human demonstrations for N steps of an episode and then let the learned policy do the rest. It could be done either in reverse curriculum mode where the length of human demonstration replay is slowly decreasing or in random curriculum mode where the length of human replay is picked randomly at each episode. Some very simple initial experiments were performed, although the results were rather discouraging, partly because of instability of the simulator in replay mode.

Having access to a simulator that allows resetting to a previously seen state enables applying offline Monte-Carlo Tree Search (MCTS) techniques to finding optimal actions and then distilling them to an actual policy that operates on observations.

Finally, one could train a policy that can observe ROM of the simulator at each step and then perform Guided Policy Search to produce policy that operates on actual observations.

Acknowledgements

I’d like to thank Ethan Caballero, Tambet Matiisen, Cosmin Negruseri, Alex Shim, and Seth Stafford for fruitful discussions and for reviewing this document. Naturally, special thanks go to OpenAI for organizing this contest.