By Matthew J.A. Smith, Mikayel Samvelyan, Tabish Rashid, University of Oxford

Editor’s note: The story below is a guest post written by current and former postgraduate students at the University of Oxford, a member of the NVIDIA AI Labs (NVAIL) program.

If you follow science news, you’ve probably heard about the latest machine-over-man triumph by DeepMind. This time, the new AlphaStar algorithm was able to defeat a professional in the popular competitive strategy game StarCraft II. While this may seem to imply that there is little room left for developing learning algorithms in the StarCraft II environment, the research community is only just beginning to employ this platform for AI development. In particular, researchers from the University of Oxford just released a new test suite that uses the flexibility of the StarCraft II platform to challenge other scientists to develop agents that can learn to collaborate, coordinate, and cooperate.

Multi-Agent Reinforcement Learning

Machine learning algorithms are becoming increasingly good at solving all kinds of challenges, including complex, competitive computer games like StarCraft II. However, one exciting new area of research — where current approaches fail — considers the situation where several agents must learn to work together in order to solve a challenging problem. This domain is usually referred to as Multi-Agent Reinforcement Learning or MARL.



The key theme of MARL is decentralization. This means that in MARL settings, while different agents may act in a shared environment, there are restrictions on the ability of agents to share information, observe the world around them, and take actions accordingly. Instead of a single “brain” collecting and coordinating all information from all actors, each agent must be equipped with the ability to reason and act on its own.
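To make the idea of decentralization concrete, here is a minimal sketch in Python. It is a toy illustration, not code from the SMAC release: the function names, the circular sight range, and the "move toward the nearest visible teammate" policy are all assumptions made for this example. The point is simply that each agent chooses its action from its own local observation, with no central controller seeing the whole map.

```python
import math


def local_observation(agent_idx, positions, sight_range):
    """Return (dx, dy) offsets of the other agents within sight_range of this agent."""
    ax, ay = positions[agent_idx]
    visible = []
    for j, (x, y) in enumerate(positions):
        if j == agent_idx:
            continue  # an agent does not observe itself
        if math.hypot(x - ax, y - ay) <= sight_range:
            visible.append((x - ax, y - ay))
    return visible


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def move_toward_nearest(observation):
    """Hypothetical policy: step toward the nearest visible teammate, else stay put."""
    if not observation:
        return (0, 0)
    dx, dy = min(observation, key=lambda o: math.hypot(o[0], o[1]))
    return (sign(dx), sign(dy))


def decentralized_actions(positions, policy, sight_range):
    """Each agent acts on its own observation only; nothing is shared between agents."""
    return [
        policy(local_observation(i, positions, sight_range))
        for i in range(len(positions))
    ]


positions = [(0, 0), (1, 0), (10, 0)]
actions = decentralized_actions(positions, move_toward_nearest, sight_range=2.0)
print(actions)  # [(1, 0), (-1, 0), (0, 0)]
```

The third agent is too far away to see anyone, so it stays where it is, even though a central controller with full information could have sent it toward the group.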



Picture a busy morning at a popular brunch spot that recently opened. In order for the morning to go smoothly, there must be coordination between the servers, hosts, chefs, and assistants. While each member of the staff needs to learn to do their particular job as well as possible (chopping vegetables, managing the queue, taking customer orders, etc.), they must also learn how their decisions will affect the other staff members. Imagine a team of chefs that can cook meals very quickly: they must deliberately slow down if the server isn’t able to keep up, lest the food be served cold.

This example illustrates one of the core challenges of MARL. Each agent, which may have unique skills, unique constraints, or access to unique information, must learn to reason over the behavior of other agents. This problem is often compounded by the fact that all agents are considering the behavior of all other agents: if I’m trying to understand your behavior, I also need to consider that you are making decisions based on what you expect me to do, and thus I need to reason over the fact that you are reasoning about me, which may also include you reasoning about my reasoning about your reasoning about my reasoning, ad infinitum.



Other challenges abound in MARL. Agents must also reason over other actors whose behavior may change over time. In our restaurant example, as a new dishwasher improves at their job, the chefs should learn to expect clean tools to be ready sooner, enabling them to prepare dishes faster and potentially to use better (cleaner) kitchen tools.



Finally, agents must be able to assign credit to themselves and others when things go well, or blame when things go poorly, as often it is difficult or impossible to determine exactly whose actions led to a particular outcome. Imagine an uncooperative worker that always blames others when things go wrong or a worker that always takes credit when things go well. These types of workers can make it difficult for a team to learn, whether it be from success or from failure.
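The credit assignment problem can be illustrated with a small sketch. The example below uses difference rewards, one well-known idea from the MARL literature, chosen here purely for illustration; we are not claiming it is the method used in SMAC or PyMARL. The toy reward function (the number of distinct targets being attacked) and all names are assumptions. Each agent’s credit is the team reward minus the team reward that would have resulted had that agent stayed idle.

```python
def team_reward(actions):
    """Shared reward: number of distinct targets attacked (None means idle)."""
    return len({a for a in actions if a is not None})


def difference_rewards(actions, default_action=None):
    """Credit for agent i = team reward minus the reward had agent i been idle."""
    total = team_reward(actions)
    credits = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = default_action  # replace only agent i's action
        credits.append(total - team_reward(counterfactual))
    return credits


# Two agents attack target "A", one attacks "B", and one is idle.
actions = ["A", "A", "B", None]
print(team_reward(actions))         # 2
print(difference_rewards(actions))  # [0, 0, 1, 0]
```

All four agents receive the same team reward of 2, which says nothing about who contributed. The counterfactual credits show that the agent attacking "B" was essential, while the two agents doubling up on "A" each get zero credit because either one alone would have sufficed. That is exactly the kind of ambiguity that makes credit assignment hard.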



Clearly, MARL poses unique learning challenges that single-agent reinforcement learning algorithms cannot handle. But how can we test algorithms for MARL without opening hundreds of failing brunch restaurants? A team of researchers has proposed a new set of benchmark tests which can be used to test and develop new learning algorithms that can handle these unique and challenging settings.



The StarCraft Multi-Agent Challenge

StarCraft II is a popular online multiplayer strategy game developed by Blizzard Entertainment. In a game of StarCraft II, players take on the role of a military commander in a sci-fi setting, directing units in real time to gather resources, build infrastructure, and perform combat operations. In a typical match, the player will control hundreds of units simultaneously, giving instructions to their entire army at once. This is how AlphaStar worked: a single reinforcement learning agent played the game, centrally controlling the actions of all units.

The developers of the StarCraft Multi-Agent Challenge (SMAC) have even grander ambitions for StarCraft II as an R&D platform for machine learning. They view the game as an opportunity to increase the difficulty and realism of learning challenges. In contrast to the approach taken by the AlphaStar team, the developers of SMAC chose to decentralize control of the StarCraft II units. This means that each individual unit perceives and can reason about the world in its own unique way. Clearly, this setting is more realistic: in the real world, there is no centralized control of people. Everyone perceives and operates in the world according to their own understanding. However, as a learning problem, this is much more challenging, for the reasons mentioned above. It is difficult to learn communication, to perform multi-agent credit assignment, and to reason over agents with changing behaviors.



In order to encourage scientific progress on these questions, the SMAC developers have released a set of challenge scenarios, each designed to test different aspects of MARL algorithms against StarCraft’s built-in (heuristic) AI. The scenarios range from simple skirmishes between small, identical forces to large battles between asymmetrical forces, in which agents must learn to work together and use the terrain to their advantage.
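To give a feel for what training against a scripted opponent looks like, here is a self-contained toy sketch of a decentralized episode loop. It does not use the SMAC library: the `ToySkirmish` class, its method names, and the health and damage numbers are all hypothetical stand-ins invented for this example, so consult the SMAC GitHub repo for the real interface. Each agent receives only its own local observation and its own list of available actions, the team shares a single reward, and the enemy follows a fixed heuristic.

```python
import random


class ToySkirmish:
    """Hypothetical stand-in for a combat scenario. This is NOT the SMAC API."""

    NOOP, ATTACK = 0, 1

    def __init__(self, n_agents=3, enemy_health=12):
        self.n_agents = n_agents
        self.start_enemy_health = enemy_health
        self.reset()

    def reset(self):
        self.health = [3] * self.n_agents
        self.enemy_health = self.start_enemy_health

    def get_obs(self, agent_id):
        # Local view only: the agent's own health and whether the enemy is alive.
        return (self.health[agent_id], self.enemy_health > 0)

    def get_avail_actions(self, agent_id):
        # Dead units can only no-op.
        if self.health[agent_id] <= 0:
            return [self.NOOP]
        return [self.NOOP, self.ATTACK]

    def step(self, actions):
        # Every living unit that attacks deals one point of damage.
        damage = sum(
            1 for i, a in enumerate(actions)
            if a == self.ATTACK and self.health[i] > 0
        )
        damage = min(damage, self.enemy_health)
        self.enemy_health -= damage
        # Scripted (heuristic) enemy: hit the first unit that is still alive.
        if self.enemy_health > 0:
            alive = [i for i in range(self.n_agents) if self.health[i] > 0]
            if alive:
                self.health[alive[0]] -= 1
        done = self.enemy_health <= 0 or all(h <= 0 for h in self.health)
        return float(damage), done  # shared team reward, episode-over flag


def random_policy(obs, avail_actions, rng):
    """Placeholder policy: ignores the observation and picks any allowed action."""
    return rng.choice(avail_actions)


def run_episode(env, rng, max_steps=100):
    env.reset()
    total = 0.0
    for _ in range(max_steps):
        # One independent decision per agent, based on that agent's own inputs.
        actions = [
            random_policy(env.get_obs(i), env.get_avail_actions(i), rng)
            for i in range(env.n_agents)
        ]
        reward, done = env.step(actions)
        total += reward
        if done:
            break
    return total


env = ToySkirmish()
print(run_episode(env, random.Random(1)))
```

A learning algorithm would replace `random_policy` with something that actually uses the observation and improves from the shared reward; the loop structure around it would stay the same.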



In addition to the SMAC scenarios, the developers have released a toolkit for developing and testing multi-agent learning algorithms, termed PyMARL. There, they have implemented several popular MARL algorithms and tested them against the SMAC scenarios. These benchmarks have been posted in a release paper, which specifies technical aspects of the challenge. For their benchmark results, the team trained agents for approximately 18 hours on an NVIDIA GTX 1080Ti GPU. By the standards of modern learning algorithms, training is therefore fairly quick, which makes development and debugging easier.



Conclusion

For researchers working on multi-agent reinforcement learning, DeepMind’s AlphaStar algorithm represents just the first step of testing and training learning algorithms to play StarCraft II. The game provides an ideal environment to develop challenging problems that require collaboration. The StarCraft Multi-Agent Challenge is a first step towards the development of agents that are able to reason over these types of problems, leading to more effective and, hopefully, more integrated multi-agent systems.



For more information on SMAC and MARL, check out the following links:

Technical blog

SMAC GitHub repo

Tools for building MARL systems

Release paper

