Multi-task learning - allowing a single agent to learn how to solve many different tasks - is a longstanding objective for artificial intelligence research. Recently, there has been a lot of excellent progress, with agents like DQN able to use the same algorithm to learn to play multiple games including Breakout and Pong. These algorithms were used to train individual expert agents for each task. As artificial intelligence research advances to more complex real world domains, building a single general agent - as opposed to multiple expert agents - to learn to perform multiple tasks will be crucial. However, so far, this has proven to be a significant challenge.

One reason is that there are often differences in the reward scales our reinforcement learning agents use to judge success, leading them to focus on tasks where the reward is arbitrarily higher. For example, in the Atari game Pong, the agent receives a reward of either -1, 0, or +1 per step. In contrast, an agent playing Ms. Pac-Man can obtain hundreds or thousands of points in a single step. Even if the size of individual rewards is comparable, the frequency of rewards can change over time as the agent gets better. This means agents tend to focus on those tasks which have large scores, leading to better performance on certain tasks, and far worse on others.

To resolve these kinds of issues, we developed PopArt, a technique that can adapt the scale of scores in each game so the agent judges the games to be of equal learning value, no matter the scale of rewards available in each specific game. We applied a PopArt normalisation to a state-of-the-art reinforcement learning agent, resulting in a single agent that can play a whole set of 57 diverse Atari video games, with above-human median performance across the set.

Broadly speaking, deep learning relies on the weights of a neural network being updated so that its output moves closer to the desired target output. This also applies when neural networks are used in the context of deep reinforcement learning. PopArt works by estimating the mean and the spread of these targets (such as the score in a game). It then uses these statistics to normalise the targets before they are used to update the network’s weights. Using normalised targets makes learning more stable and robust to changes in scale and shift. To obtain accurate estimates - of expected future scores for example - the outputs of the network can then be rescaled back to the true target range by inverting the normalisation process. If done naively, each update to the statistics would change all unnormalised outputs, including those that were already very good. We prevent this from happening by updating the network in the opposite direction whenever we update the statistics, this can be done exactly. This means we get the benefit of well-scaled updates, while keeping the previously learnt outputs intact. It is for these reasons that we call our method PopArt: it works by Preserving Outputs Precisely while Adaptively Rescaling Targets.