Google's AI subsidiary DeepMind has built its reputation on systems that learn to play games by playing against each other, starting with little more than the rules and a definition of what constitutes a win. That Darwinian approach of improvement through competition has allowed DeepMind to tackle complex games like chess and Go, where there are vast numbers of potential moves to consider.

But at least for tabletop games like those, the potential moves are discrete and don't require real-time decision-making. It wasn't unreasonable to question whether the same approach would work for completely different classes of games. Those questions now seem to be answered by a report in today's issue of Science, in which DeepMind reveals the development of an AI system that has taught itself to play Quake III Arena and can consistently beat human opponents in capture-the-flag games.

Not a lot of rules

Chess' complexity is built from an apparently simple set of rules: an 8x8 grid of squares and pieces that can only move in very specific ways. Quake III Arena, to an extent, gets rid of the grid. In capture-the-flag mode, both sides start in a spawn area and have a flag to defend. You score points by capturing the opponent's flag. You can also gain tactical advantage by "tagging" (read "shooting") your opponents, which, after a delay, sends them back to their spawn.

Those simple rules lead to complex play because maps can be generated procedurally, and each player is reacting to what they can see in real time, limited by their field of view and the map's features. Different strategies—explore, defend your flag, capture theirs, shoot your opponents—all potentially provide advantages, and players can switch among them at any point in the game.

This complexity makes for a severe challenge for systems that are meant to teach themselves how to play. There's an enormous gap between what might be useful at a given moment, and the end-of-game score that the systems have to judge their performance against. How do you bridge that gap?

For their system, which they call FTW, the DeepMind researchers built a two-level learning system. At the outer level, the system was focused on the end point of winning the game, and it learned overall strategies that helped reach that goal. You can think of it as creating sub-goals throughout the course of the game, directed in a way that maximizes the chances of an overall win. To improve the performance of this outer optimization, the DeepMind team took an evolutionary approach called population-based training. After each round of training, the worst-performing systems were killed off; their replacements were generated by introducing "mutations" into the best-performing ones.
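The cull-and-mutate loop described above can be sketched in a few lines. This is a minimal toy illustration of population-based training, not DeepMind's implementation: the agents, fitness function, cull fraction, and mutation step are all invented for the example (here an "agent" is just a dictionary of hyperparameters, and fitness rewards a learning rate near an arbitrary optimum of 0.01).

```python
import random

def evolve(population, evaluate, mutate, cull_fraction=0.2):
    """One generation: rank agents, drop the worst slice, and
    replace them with mutated copies of the best performers."""
    scored = sorted(population, key=evaluate, reverse=True)
    n_cull = max(1, int(len(scored) * cull_fraction))
    survivors = scored[:-n_cull]          # keep the top performers
    parents = scored[:n_cull]             # best agents seed replacements
    children = [mutate(dict(p)) for p in parents]  # mutate copies
    return survivors + children

# Toy fitness: how close the learning rate is to a hidden optimum.
def evaluate(agent):
    return -abs(agent["lr"] - 0.01)

# Toy mutation: perturb the hyperparameter rather than resampling it.
def mutate(agent):
    agent["lr"] *= random.choice([0.8, 1.2])
    return agent

random.seed(0)
pop = [{"lr": random.uniform(0.001, 0.1)} for _ in range(10)]
initial_best = max(pop, key=evaluate)
for _ in range(50):
    pop = evolve(pop, evaluate, mutate)
best = max(pop, key=evaluate)
```

Because the top performers survive each generation untouched, the best fitness in the population can only improve over time; mutation supplies the variation that selection then filters.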

Beneath that, there's a distinct layer that sets a "policy" based on the outer layer's decisions. So if the outer layer has determined that defending the flag is the best option at the moment, the inner layer will implement that strategy by checking the visual input for opponents while keeping close to the flag. For this, the researchers chose a standard neural network trained through reinforcement learning.
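The division of labor between the two layers can be sketched as follows. This is a structural illustration only, with hand-written rules standing in for both layers; in FTW, the goal names, observations, and both mappings are learned by neural networks rather than coded by hand.

```python
# Hypothetical sub-goals the outer layer might select among.
GOALS = ["defend_flag", "capture_flag", "explore", "engage"]

def outer_policy(game_state):
    """Stand-in for the outer layer: pick a sub-goal given the
    broad game situation. In FTW this mapping is learned."""
    if game_state["own_flag_taken"]:
        return "engage"
    if game_state["enemy_flag_visible"]:
        return "capture_flag"
    return "explore"

def inner_policy(observation, goal):
    """Stand-in for the inner layer: turn the current observation
    plus the outer layer's goal into a low-level action."""
    if goal == "capture_flag" and observation["enemy_flag_visible"]:
        return "move_toward_enemy_flag"
    if goal == "engage" and observation["opponent_visible"]:
        return "tag_opponent"
    return "search_map"

state = {"own_flag_taken": False, "enemy_flag_visible": True,
         "opponent_visible": False}
goal = outer_policy(state)           # -> "capture_flag"
action = inner_policy(state, goal)   # -> "move_toward_enemy_flag"
```

The key design point is that the inner layer never reasons about winning the game; it only serves whatever sub-goal the outer layer hands it, which is what lets the outer layer's strategy change mid-game without retraining the low-level behavior.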

Let the games begin

With the architecture in place, FTW was set to play itself on randomly generated maps in teams with one or more teammates. The goal was to get it to "acquire policies that are robust to the variability of maps, number of players, and choice of teammates and opponents, a challenge that generalizes that of ad hoc teamwork." The amount of effort required for this system to learn is pretty staggering; the researchers refer to going through 45,000 games as "early in training." Distinctive behaviors were still emerging 200,000 games in.

The researchers could track as FTW picked up game information. "The internal representation of [FTW] was found to encode a wide variety of knowledge about the game situation," they write. The agent first developed the concept of its own base, and it later figured out that there was an opposition base. Only once those ideas were in place did it figure out the value of picking up the flag. The value of killing your opponents came even later. Each of these had the chance to change future behavior: once FTW had figured out the location of the two teams' bases, most of its memory recalls focused on those areas of the map.

Some of these things ended up being remarkably specific. For example, in highly trained versions of the system, the neural network portion had an individual neuron dedicated to tracking whether a teammate had possession of the flag.

In the outer layer, many of the behaviors ended up recapitulating strategies used by human players. These include base defense and camping in the opponent's base. Other strategies, like following a teammate if they have the flag, were used for a while but later discarded.

With the training done, the researchers set a group of FTW players loose in a tourney with human opponents. By about 100,000 training matches, FTW could beat an average human player. By 200,000, it could beat a Quake expert, and its lead continued to expand from there. In the tournament, a team of two humans would typically capture 16 fewer flags per game than a team of FTW bots. The only time humans beat a pair of bots was when they were part of a human-bot team, and even then, they typically won only five percent of their matches.

What’s in a win?

That's not to say that FTW excelled in every aspect of the game. For example, humans' visual abilities made them better snipers. But at close range, FTW excelled in combat, in part because its reaction time was half that of a human's, and in part because its accuracy was 80 percent compared to the humans' 50 percent.

But FTW wasn't reliant on speed for its wins. The researchers artificially inflated its reaction time to be similar to that of a human and found that this only reduced the bots' edge, with teams of humans now able to win about 30 percent of the matches. That still left FTW with a significant advantage, suggesting that its overall strategy, not just its reflexes, was responsible for its wins.

While the computational resources needed to run bots through over 200,000 games of Quake are pretty massive, it's still impressive that FTW could start with nothing but raw pixel input and no overall picture of the game, and go on to figure out not only the rules but also strategies that consistently produce wins. While Quake III Arena is now two decades old, it does provide a high-pressure, multi-agent environment that represents a far more general problem.

For now, the DeepMind team is still thinking about a few limitations of the FTW system. One of the problems was that the bot population tended to converge on a set of similar approaches, something that's only really effective if all agents in the environment are the same. In many situations (including many multiplayer games), the agents can be specialized, requiring solutions that remain more generalized. The genetic approach used in the outer layer of FTW's reward system also tends to focus very quickly on a limited subset of effective solutions.

Those concerns suggest that DeepMind is looking at how to make FTW even more flexible than it already is.

Science, 2019. DOI: 10.1126/science.aau6249 (About DOIs).