DeepMind's latest AI, appropriately named Agent57, can now play all 57 classic Atari 2600 games better than humans.

Although previous AI agents have been able to play some of these games over the last eight years, this represents the first time an AI agent can play all of the games well.

Earlier this week, the team published their results in a paper posted to the preprint server arXiv.

From 1977 to 1990, Atari released 57 games for its Atari 2600, including classic titles like Pong, Donkey Kong, and Space Invaders. These games represent part of the home console origin story and were created for humans, by humans. Fast forward more than 30 years from those first releases and things have drastically changed. A new AI, appropriately named Agent57, is now the undisputed master of the Atari 2600, beating average human players across all 57 titles.



It's not the first time that computer scientists have used video games as a benchmark to measure AI complexity. Since 2012, researchers have been testing out the strength of their artificial intelligence programs, known as "agents," by setting them loose in the Arcade Learning Environment (ALE), a platform for evaluating the development of general AIs, using Atari games. According to the paper that introduced ALE at the time, the interface presents "significant research challenges" in areas like reinforcement learning, model-based planning, imitation learning, and intrinsic motivation.



What's compelling about this work is that Agent57 has made leaps and bounds toward becoming what's known as a "general" AI agent. By definition, that means it should be able to complete a sufficiently wide set of tasks, sufficiently well. So rather than being killer at about half of the games, and just so-so on the other half, Agent57 can beat humans on all tasks.

"The ultimate goal is not to develop systems that excel at games, but rather to use games as a stepping stone for developing systems that learn to excel at a broad set of challenges," the researchers note on DeepMind's Agent57 webpage. When an agent performs well on a wide variety of tasks, as general agents do, they're considered intelligent.


Researchers commonly describe the performance of their AI agent by summarizing its achievements on multiple tasks, or games, as a single number, or score. Agents' mean or median scores on games are often used as a benchmark to check progression over time. A score of zero percent means the agent performs no better than random play, whereas anything over 100 percent means the agent performs better than the human benchmark.
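That scoring convention can be sketched as a simple linear normalization. The function name and the example numbers below are illustrative, not taken from the paper:

```python
def human_normalized(agent_score, random_score, human_score):
    """Map a raw game score onto the 0%-random / 100%-human scale
    commonly used for Atari benchmarks."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# An agent scoring exactly halfway between random play and human play:
print(human_normalized(agent_score=550, random_score=100, human_score=1000))  # 50.0
```

Anything the agent scores above the human baseline comes out over 100 percent on this scale.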

Historically, the agents' scores have increased, on average. That measurement can be problematic, though, because it doesn't accurately capture how many of the individual tasks the AI agent is actually excelling at—a really great score on a few games can skew the overall percentage. That means the score isn't a great measure of how general an agent is.

Let's pretend two AI agents are working against a benchmark that consists of 20 games, for example. Agent A receives a score of 500 percent across eight tasks, 200 percent across four tasks, and zero percent on eight tasks. The agent's mean score works out to 240 percent, while its median is 200 percent. Meanwhile, agent B earns a score of 150 percent on all tasks, meaning that both the mean and median are 150 percent.

Despite agent A having better performance by both measures, agent B is more general: it performs better than humans on more tasks than agent A.
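The arithmetic behind that comparison is easy to check directly. This is a minimal sketch of the two hypothetical agents described above:

```python
from statistics import mean, median

# Scores in percent across a 20-game benchmark (hypothetical agents from the text)
agent_a = [500] * 8 + [200] * 4 + [0] * 8
agent_b = [150] * 20

print(mean(agent_a), median(agent_a))  # 240 200.0
print(mean(agent_b), median(agent_b))  # 150 150.0

# Generality: on how many games does each agent beat the human baseline (>100%)?
print(sum(s > 100 for s in agent_a))  # 12
print(sum(s > 100 for s in agent_b))  # 20
```

Agent A wins on both summary statistics, yet agent B outperforms humans on every single game, which is the point the researchers are making about generality.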

Agent B is a general AI because it does sufficiently well on a sufficiently wide range of tasks. Agent A, meanwhile, ends up looking more like a one-trick pony. DeepMind

For that reason, median scores are usually a better indicator of how general an AI agent is. Still, researchers must pay attention to the tails of the score distribution, which helps correct for the fact that some games are simply easier than others. For instance, Donkey Kong could be perceived as more difficult than Pac-Man, due to having more controls. So researchers have also begun looking at performance on the hardest games, those at the fifth percentile, to better measure generality.
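A fifth-percentile check can be sketched with a simple nearest-rank calculation. This toy version, applied to the two hypothetical agents from the earlier example, is illustrative only:

```python
import math

def fifth_percentile(scores):
    """Nearest-rank 5th-percentile score across games: a look at the
    bottom tail of the distribution rather than its center."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(0.05 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

agent_a = [500] * 8 + [200] * 4 + [0] * 8
agent_b = [150] * 20

print(fifth_percentile(agent_a))  # 0   -- agent A fails outright on its hardest games
print(fifth_percentile(agent_b))  # 150 -- agent B still beats the human baseline
```

Looking at the bottom tail exposes exactly the weakness that mean and median scores hide.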

One of the novel additions to Agent57 is a meta-controller that can adapt the balance between exploration and exploitation. "Agent57 is built on the following observation: what if an agent can learn when it's better to exploit, and when it's better to explore?" the researchers note. "With this change, Agent57 is able to get the best of both worlds: above human-level performance on both easy games and hard games."
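The flavor of that idea can be sketched as a bandit-style controller that learns which of several policies (say, a more exploratory one versus a more exploitative one) pays off best. This is a toy illustration only, not Agent57's actual algorithm, which adapts over a whole family of exploration policies during training:

```python
import math

class MetaController:
    """Toy UCB1 bandit choosing among candidate policies by episode return."""

    def __init__(self, n_policies):
        self.counts = [0] * n_policies    # episodes each policy was chosen
        self.values = [0.0] * n_policies  # running mean episode return
        self.total = 0

    def choose(self):
        # Try every policy once, then trade off estimated value vs. uncertainty.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + math.sqrt(2 * math.log(self.total) / self.counts[i]),
        )

    def update(self, policy, episode_return):
        self.counts[policy] += 1
        self.total += 1
        # Incremental mean update for the chosen policy's return estimate.
        self.values[policy] += (episode_return - self.values[policy]) / self.counts[policy]
```

Run for enough episodes against two policies where the "exploit" policy reliably returns more, the controller concentrates its choices on that policy while still occasionally sampling the other.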



Amazingly, Agent57 could scale with increasing amounts of computation, the team found. That means the longer it trains, the higher its scores get. Still, this takes a lot of time, and the researchers would like to create an agent that is more efficient in the future.

"This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance," they said.

Instead, the researchers want to continue making improvements until today's algorithms can achieve optimal performance, leading to more advanced, more general AI agents.
