In a preprint paper published this week by DeepMind, Google parent company Alphabet’s U.K.-based research division, a team of scientists describe Agent57, which they say is the first system that outperforms humans on all 57 Atari games in the Arcade Learning Environment data set.

Assuming the claim holds water, Agent57 could lay the groundwork for more capable AI decision-making models than have been previously released. This could be a boon for enterprises looking to boost productivity through workplace automation; imagine AI that not only automatically completes mundane, repetitive tasks like data entry, but also reasons about its environment.

“With Agent57, we have succeeded in building a more generally intelligent agent that has above-human performance on all tasks in the Atari57 benchmark,” wrote the study’s coauthors. “Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score got.”

Arcade Learning Environment

As the researchers explain, the Arcade Learning Environment (ALE) was proposed as a platform for empirically assessing agents designed for general competency across a range of games. To this end, it offers an interface to a diverse set of Atari 2600 game environments intended to be engaging and challenging for human players.

Why Atari 2600 games? Chiefly because they’re (1) varied enough to claim generality, (2) interesting enough to be representative of settings that might be faced in practice, and (3) created by an independent party and thus free of experimenter’s bias. Agents are expected to perform well in as many games as possible, making minimal assumptions about the domain at hand and without the use of game-specific information.

DeepMind’s own Deep Q-Networks was the first algorithm to achieve human-level control in a large number of the Atari 2600 games. Subsequently, an OpenAI and DeepMind system demonstrated superhuman performance in Pong and Enduro; an Uber model learned to complete all stages of Montezuma’s Revenge; and DeepMind’s MuZero taught itself to surpass human performance on 51 games. But until now, no single algorithm had exceeded the human benchmark across all 57 games in ALE.

Reinforcement learning challenges

To achieve state-of-the-art performance, DeepMind’s Agent57 runs on many computers simultaneously and leverages reinforcement learning (RL), where AI-driven software agents take actions to maximize some reward. Reinforcement learning has shown great promise in the video game domain: OpenAI’s OpenAI Five and DeepMind’s own AlphaStar RL agents beat 99.4% of Dota 2 players and 99.8% of StarCraft 2 players, respectively, on public servers. But it’s by no means perfect, as the researchers point out.

There’s the problem of long-term credit assignment, or determining the decisions most deserving of credit for the positive (or negative) outcomes that follow, which becomes especially difficult when rewards are delayed and credit needs to be assigned over long action sequences. Then there’s exploration and catastrophic forgetting; hundreds of actions in a game might be required before a first positive reward is seen, and agents are susceptible to becoming stuck looking for patterns in random data or abruptly forgetting previously learned information upon learning new information.

To address these challenges, the DeepMind team built on top of Never Give Up (NGU), a technique developed in-house that augments the reward signal with an internally generated intrinsic reward sensitive to novelty at two levels: short-term novelty within an episode and long-term novelty across episodes. (Long-term novelty rewards encourage visiting many states throughout training, across many episodes, while short-term novelty rewards encourage visiting many states over a short span of time, like within a single episode of a game.) Using episodic memory, NGU learns a family of policies for exploring and exploiting, with the end goal of obtaining the highest score under the exploitative policy.
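
The two levels of novelty can be illustrated with a toy sketch. It is not NGU's actual mechanism: it assumes states are hashable and measures novelty with simple visit counts, and the class name, the `beta` weight, and the bonus formulas are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


class NoveltyBonus:
    """Toy count-based novelty bonus with two time scales (illustrative only)."""

    def __init__(self, beta=0.3):
        self.beta = beta                  # weight of the intrinsic reward (assumed value)
        self.episode_counts = Counter()   # visits within the current episode
        self.lifetime_counts = Counter()  # visits across all episodes

    def start_episode(self):
        # Short-term memory is wiped at every episode boundary.
        self.episode_counts.clear()

    def augmented_reward(self, state, extrinsic_reward):
        self.episode_counts[state] += 1
        self.lifetime_counts[state] += 1
        # Short-term novelty: shrinks as a state repeats within this episode.
        short_term = 1.0 / math.sqrt(self.episode_counts[state])
        # Long-term novelty: a multiplier that decays from 2 toward 1 as a
        # state becomes familiar over the whole of training.
        long_term = 1.0 + 1.0 / math.sqrt(self.lifetime_counts[state])
        intrinsic_reward = short_term * long_term
        return extrinsic_reward + self.beta * intrinsic_reward
```

In this sketch, revisiting a state inside one episode pays less each time, while the first visit in a fresh episode pays more again, though never as much as the very first visit of training.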

One shortcoming of NGU is that it collects the same amount of experience following each of its policies regardless of their contribution to the learning progress. Agent57, by contrast, adapts its exploration strategy over the course of an agent’s lifetime, which enables it to specialize to the particular game it’s learning.

Agent57

Agent57 is architected such that it collects data by having many actors feed into a centralized repository (a replay buffer) that a learner can sample. The replay buffer holds sequences of transitions that are regularly pruned. These sequences come from actor processes that interact with independent copies of the game environment, and they are prioritized for sampling.
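
A minimal, single-process sketch of this actor-and-learner arrangement might look like the following. The class and function names are hypothetical, the transitions are placeholders rather than real game data, and the priorities are random stand-ins, since how priorities are computed is not described here.

```python
import random
from collections import deque


class SequenceReplayBuffer:
    """Fixed-capacity store of transition sequences with priority sampling."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest entry when full, which stands
        # in for the regular pruning of old sequences.
        self.items = deque(maxlen=capacity)

    def add(self, sequence, priority):
        self.items.append((sequence, float(priority)))

    def sample(self, batch_size, rng=random):
        sequences = [seq for seq, _ in self.items]
        weights = [priority for _, priority in self.items]
        # Draw with replacement, in proportion to each sequence's priority.
        return rng.choices(sequences, weights=weights, k=batch_size)

    def __len__(self):
        return len(self.items)


def run_actor(actor_id, buffer, num_sequences, seq_len=3, rng=random):
    """Hypothetical actor that writes placeholder sequences to the buffer."""
    for n in range(num_sequences):
        # Each placeholder transition records (actor, sequence index, step).
        sequence = [(actor_id, n, t) for t in range(seq_len)]
        priority = rng.random() + 1e-3  # random stand-in, always positive
        buffer.add(sequence, priority)
```

A learner would then call `buffer.sample(batch_size)` repeatedly while several actors keep adding fresh sequences; in a real distributed system the actors and the learner would run as separate processes.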

The DeepMind team used two different AI models to approximate each state-action value, which specifies how good it is for an agent to perform a particular action in a given state under a given policy. This split lets Agent57 adapt to the scale and variance associated with each model’s corresponding reward. They also incorporated a meta-controller running independently on each actor that can adaptively select which policies to use both at training and evaluation time.
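
As a rough illustration of the two-model idea, the sketch below assumes one model scores actions by the game's own (extrinsic) reward and the other by the novelty-driven (intrinsic) reward, and that the two estimates are merged with a plain weighted sum. The function names and the weight `beta` are hypothetical, and the weighted sum is a simplification chosen for illustration.

```python
def combined_action_values(q_extrinsic, q_intrinsic, beta):
    """Merge two per-action value estimates into one score per action."""
    return [qe + beta * qi for qe, qi in zip(q_extrinsic, q_intrinsic)]


def greedy_action(q_extrinsic, q_intrinsic, beta):
    """Return the index of the action with the highest combined value."""
    values = combined_action_values(q_extrinsic, q_intrinsic, beta)
    return max(range(len(values)), key=values.__getitem__)


# Made-up estimates for three actions in a single state.
q_ext = [1.0, 0.5, 0.0]   # from the model trained on game reward
q_int = [0.0, 0.2, 4.0]   # from the model trained on novelty reward

exploit_choice = greedy_action(q_ext, q_int, beta=0.0)   # ignores novelty
explore_choice = greedy_action(q_ext, q_int, beta=0.3)   # weighs novelty in
```

With these made-up numbers, setting `beta` to zero picks the action with the best game reward, while a positive `beta` tips the choice toward the most novel action. Keeping the estimates separate means each model only has to fit one reward scale.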

As the researchers explain, the meta-controller confers two advantages. By selecting which policies to prioritize during training, it lets Agent57 allocate more of the capacity of the network to better represent the state-action value function of the policies that are most relevant for the task at hand. Additionally, it provides a natural way of choosing the best policy in the family to use at evaluation time.
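
One way to picture such a meta-controller is as a multi-armed bandit whose arms are the policies in the family. The sketch below uses a basic upper-confidence-bound (UCB) rule as a stand-in; the class name, the `exploration` constant, and the assumption that episode returns are on a roughly comparable scale are illustrative choices, not details taken from the text above.

```python
import math


class PolicySelector:
    """UCB-style bandit over a family of policies (illustrative sketch)."""

    def __init__(self, num_policies, exploration=1.0):
        self.counts = [0] * num_policies
        self.mean_returns = [0.0] * num_policies
        self.exploration = exploration

    def select_for_training(self):
        # Try every policy once before trusting the statistics.
        for policy, count in enumerate(self.counts):
            if count == 0:
                return policy
        total = sum(self.counts)
        scores = [
            mean + self.exploration * math.sqrt(math.log(total) / count)
            for mean, count in zip(self.mean_returns, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, policy, episode_return):
        # Incremental running mean of the returns seen under this policy.
        self.counts[policy] += 1
        step = (episode_return - self.mean_returns[policy]) / self.counts[policy]
        self.mean_returns[policy] += step

    def select_for_evaluation(self):
        # At evaluation time, drop the exploration bonus and take the best mean.
        return max(range(len(self.mean_returns)), key=self.mean_returns.__getitem__)
```

During training, each actor would call `select_for_training()` before an episode and `update()` after it, so policies that pay off get chosen more often. `select_for_evaluation()` covers the second advantage: picking the single best policy once learning is done.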

Experiments

To evaluate Agent57, the DeepMind team compared it with leading algorithms including MuZero, R2D2, and NGU alone. They report that while MuZero achieved the highest mean (5661.84) and median (2381.51) scores across all 57 games, it catastrophically failed in games like Venture, achieving a score that was on par with a random policy. By contrast, Agent57 showed greater capped mean performance (100) than both R2D2 (96.93) and MuZero (89.92), taking 5 billion frames to surpass human performance on 51 games and 78 billion frames to surpass it in Skiing.
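
The gap between a high mean and a lower capped mean is easier to see with a small calculation. The sketch below assumes the common convention that each game's score is normalized so that 0 matches a random policy and 100 matches the human baseline, and that the capped mean clips every game to the range 0 to 100 before averaging. The function names and all the numbers are made up for illustration.

```python
def human_normalized_score(agent, random_policy, human):
    """Percentage score: 0 matches a random policy, 100 matches the human baseline."""
    return 100.0 * (agent - random_policy) / (human - random_policy)


def capped_mean(per_game_scores, cap=100.0):
    """Average of normalized scores after clipping each one to [0, cap]."""
    clipped = [max(0.0, min(cap, score)) for score in per_game_scores]
    return sum(clipped) / len(clipped)


def plain_mean(per_game_scores):
    return sum(per_game_scores) / len(per_game_scores)


# Hypothetical normalized scores on three games.
lopsided_agent = [5000.0, 250.0, 2.0]   # huge on two games, near-random on one
balanced_agent = [120.0, 110.0, 105.0]  # modestly above human everywhere

lopsided_summary = (plain_mean(lopsided_agent), capped_mean(lopsided_agent))
balanced_summary = (plain_mean(balanced_agent), capped_mean(balanced_agent))
```

In this made-up example the lopsided agent wins on the plain mean, but only the balanced agent reaches a capped mean of 100, because a capped mean of 100 requires matching or beating the human baseline on every game.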

The researchers next analyzed the effect of using the meta-controller. On its own, they say it enhanced performance by close to 20% compared with R2D2, even in long-term credit assignment games like Solaris and Skiing, where the agents had to collect information over long time scales to get the feedback necessary to learn.

“Agent57 finally obtains above human-level performance on the very hardest games in the benchmark set, as well as the easiest ones,” wrote the coauthors in a blog post. “This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance … Key improvements to use might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment.”