RL Weekly 6: AlphaStar, Rectified Nash Response, and Causal Reasoning with Meta RL

by Seungjae Ryan Lee

Subscribe to RL Weekly Get the highlights of reinforcement learning in both research and industry every week.

AlphaStar: Pro-level Starcraft II AI

What it is

StarCraft II is a highly complex game that requires both macro-level management and micro-level control. DeepMind, a company famous for developing AlphaGo, AlphaGo Zero, and AlphaZero that play Go at a superhuman level, created a StarCraft II AI that won against top professional players. AlphaStar won all 5 matches against Dario “TLO” Wünsch from Team Liquid, and won 5 out of 6 matches against Grzegorz “MaNa” Komincz, also from Team Liquid.

The difficulty of creating a superhuman level StarCraft II agent originates from several different reasons. First, the game is “non-transitive”: there is no single best strategy that dominates all other strategies. Second, unlike Chess or Go, most crucial information (about the enemy) are hidden, so the agent must actively “scout” hidden areas to gain information. Third, the game requires long-term planning, and a lot of early-game decisions can impact late-game strategies. Fourth, the game has a rapid pace, so the agent is not given a long planning time. Finally, the agent has a large action space of about $10^{26}$.

Although the official algorithm has not yet been published, DeepMind has revealed some information about AlphaStar. It is a combination of many existing architectures: they specify that “the neural network architecture applies a transformer torso to the units, combined with a deep LSTM core, an auto-regressive policy head with a pointer network, and a centralised value baseline.” The agents are trained in a population-based multi-agent fashion.

Why it matters

Until now, best StarCraft II AI had many major components hand-crafted due to the complexity of the game, due to the various sources of complexity mentioned above. AlphaStar is the first AI in this game to achieve such high performance trained with reinforcement learning. Because the paper has not been published yet, it is too early to decide what made the difference: the agent architecture, the multi-agent structure, or simply the scale. However, the result by itself is quite worthwhile.

Since AlphaStar was showcased, there has been a lot of discussion about it. To my understanding, most of the complaints have been that AlphaStar did not win against professional players the way people wished for: instead of outsmarting the professional players, AlphaStar won with superior micro-level control. I cannot weigh my opinion on this issue since I am not a StarCraft expert, but this discussion itself shows very different opinions about people’s expectations of superhuman AI.

Read more

External Resources

Evaluating Agents on Nontransitive Games

What it is

“Nontransitive” games are games where relative dominance of strategies are not transitive. In nontransitive games, when strategy A is better than strategy B, and strategy B is better than strategy C, strategy C can still be better than A. One simple example of such game would be rock-paper-scissors. In this setting, it is possible that a single dominant agent does not exist.

Researchers at DeepMind instead suggest a different objective: “[discovering] the underlying strategic dimensions of the game, and the best ways of executing them.” Also, instead of optimizing a single agent, they suggest using populations: evaluating progress through performance of a population of agents, while ensuring that the population is diverse enough.

Why it matters

In nontransitive games, it is difficult to define the agent’s objective. If the opponent is fixed, an agent trained with reinforcement learning can learn to exploit the opponent’s strategy. However, if there are multiple opponents with different strategies, it is unclear how the agent should behave. Although PSRO_rN is only tested on two fairly simple games, it is exciting to see such a novel evaluation approach.

Also, although the paper does not directly mention AlphaStar, it mentions StarCraft as one such zero-sum game. It seems quite likely that PSRO_rN is the population-based multi-agent reinforcement learning method mentioned above.

Read more

Causal Reasoning with Meta RL

What it is

Researchers at DeepMind, UCL, and Harvard propose using a model-free RNN-based meta-reinforcement learning to train an agent to perform causal reasoning. An episode is divided into an “information phase” where the agent collects information, and the “quiz phase” where the agent is required to exploit the causal knowledge. Three experiments were conducted with observational, interventional, and conterfactual data settings. In observational setting, the agent could only observe during the information phase, whereas in the interventional setting it could intervene and change the environment. In the counterfactual setting, a counterfactual quiz is given.

Using techniques of Duan et al. and Wang et al., their “Active-Conditional Agent” is shown to have better performance than “Optimal Associative Baseline”, which achieves greatest reward without any causal knowledge.

Why it matters

Most existing applications of reinforcement learning are restricted to video games like AlphaStar or robotics, where the problem can easily be structured as a reinforcement learning problem by defining states, observations, actions, rewards, timesteps, and episodes. This paper uses reinforcement learning in an unconvential causal setting, and also shows its potential.

Read more

External Resources