DeepMind demonstrates AlphaStar beating human players

A few days ago DeepMind announced the latest iteration of their StarCraft II machine learning bot, now known as AlphaStar. So much has changed since they announced the StarCraft II Learning Environment back in August 2017. David Silver, highly regarded for his reinforcement learning work on projects like AlphaGo, has joined the team, which seems to have grown considerably.

Back at BlizzCon in 2018, DeepMind demonstrated a bot that achieved a win rate of around 50% against the hardest game AI. Pretty good, but not mind-blowing. So what changed in the last few months? I imagine they would have loved to demonstrate the decisive defeat of TLO and MaNa back then, but it seems they weren’t ready.

From here on out I will speculate on decisions that may have been made by the team. I have no knowledge of the inner workings of the team or the neural network configurations they used, but to me they seem to have changed their approach in important ways to improve their chances of success.

To be open and clear, I am a contributor to the PySC2 framework and I have had some interactions with the DeepMind team. This may influence my opinion. I have never received any financial or other benefit from DeepMind. I did receive some financial reimbursement from Blizzard to attend the AI Summit, significantly less than my expenses.

The “whole map” view

When PySC2 first started, it only supported interaction via the “screen” and “minimap”. This is very similar to how a human interacts with the game. In the demonstration we saw that AlphaStar can see the entire map, in approximately the same amount of detail as the previous “screen” functionality.

A demonstration of the screen and mini-map observations in PySC2

This functionality seems to have been made possible with the introduction of raw units. Most significantly, the code comment mentions:

This differs from feature_units because it includes units outside the screen and hidden units, and because unit positions are given in terms of world units instead of screen units

The most important parts of this statement for me are “units outside the screen” and “world units”. Don’t read too much into the “hidden units” part; from my understanding it is not as nefarious as it sounds, as the API only exposes limited information here.
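For readers unfamiliar with the framework, here is a minimal sketch of how a PySC2 environment can be configured to expose raw units alongside the screen and minimap features. The map choice and dimensions are arbitrary; the flag and field names are as exposed by the public PySC2 API:

```python
from pysc2.env import sc2_env
from pysc2.lib import features

# Configure an environment exposing both the human-like screen/minimap
# feature layers and the whole-map raw interface.
env = sc2_env.SC2Env(
    map_name="Simple64",  # arbitrary map choice
    players=[sc2_env.Agent(sc2_env.Race.protoss),
             sc2_env.Bot(sc2_env.Race.protoss, sc2_env.Difficulty.very_hard)],
    agent_interface_format=features.AgentInterfaceFormat(
        feature_dimensions=features.Dimensions(screen=84, minimap=64),
        use_raw_units=True,  # adds the whole-map raw_units observation
    ),
    step_mul=8,
)

obs = env.reset()[0].observation

# raw_units covers every known unit on the map, with positions given
# in world coordinates rather than screen coordinates.
for unit in obs.raw_units:
    print(unit.unit_type, unit.alliance, unit.x, unit.y)
```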

A demonstration of the “whole map” observation and actions

To me this is the most significant change I have seen. While many may see it as a “cheat”, I really see it as a sensible reduction of difficulty and complexity. My speculation is that it’s a short-term adjustment to prove that their bot can do well in a simpler scenario.

Training a bot to understand how the screen fits into the entire world view is a complicated problem. It seems like it would be very difficult for the bot to understand that units off screen can continue to move around. There is also the risk that the bot learns that moving the screen reveals the enemy: if that leads to a negative result, the bot may be deterred from moving the screen near enemy units, unable to tell that the screen move did not cause them to materialise.

I have no doubt that DeepMind will eventually increase the complexity of their bot towards being more human-like in its limitations. I say this because it seems to be something co-lead Oriol Vinyals sees as his ultimate goal: a “pure” bot that competes with humans on the same level.

We have already seen some progress on this. When the bot played TLO and MaNa the first time, from my understanding of comments made by DeepMind, the screen movements were somewhat scripted. This seems sensible: the API allows the bot to observe in world space, but it must still act in screen space. The only exception to this would be if Blizzard has altered the API for DeepMind and has not yet released the changes (this happens).

Update: I have been informed that the screen movements were added after the fact by DeepMind to make the replay more interesting, and were not actually part of the bot’s interaction with the game.

Later, when the bot played MaNa in the live match, they mentioned that the bot had been trained to move the screen itself. This is no small feat! It increases the complexity of many actions, as the bot must now learn that moving the screen is part of the sequence to success. It also increases the action space considerably, depending on the resolution of the mini-map.
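To put rough numbers on that: in PySC2 the camera is moved with its own action function, targeted at a minimap coordinate, so every minimap cell is a distinct action. A small sketch, assuming a 64x64 minimap:

```python
from pysc2.lib import actions

# Moving the camera is itself an action, targeted at a minimap
# coordinate (here the centre of a 64x64 minimap).
camera_move = actions.FUNCTIONS.move_camera([32, 32])

# Every minimap cell is a distinct camera target, so learning to
# manage the camera adds thousands of possible actions on its own.
print(64 * 64)  # 4096 camera positions
```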

They switched from Terran to Protoss

If you look at the initial work back in 2017 you may discover that they originally had a Terran bot. This seems logical; Terran has some advantages over Protoss or Zerg:

There are no power or creep limitations on buildings, so placement is not as difficult. As long as there is room for the building, and the necessary prerequisites exist, it can be built.

Marines cost only minerals, can attack both ground and air units, and require no add-on buildings. This would have allowed the bot to achieve a pretty good amount of success early on.

DeepMind announces the StarCraft II Learning Environment

The main issue they faced, even early on, was that their bot would often lift its buildings and fly them away. This is mentioned in their paper.

The most successful agent, based on the fully convolutional architecture without memory, managed to avoid constant losses by using the Terran ability to lift and move buildings out of attack range. This makes it difficult for the easy AI to win within the 30 minute time limit.

In my experience, lifted buildings can be difficult to manoeuvre even for a scripted bot, and it seems likely that a neural network would get trapped in the “don’t lose” outcomes fairly easily.

There is also another “bug” that frustrated me and may have contributed to DeepMind’s decision to shift away from Terran. When attempting to place an add-on, the API does not correctly receive the “quick” action. This is the action a human takes when they press the related hotkey, with the add-on placed automatically next to the building. Unfortunately the API requires you to specify the screen location for the add-on, and if it is not close enough the building will lift, just as it would if there was not enough room. This then triggers the building landing issue, and the building must be landed via the correct command in order to place the add-on.
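To illustrate the difference, here is a sketch of the two forms the add-on build takes in PySC2, using a Tech Lab as the example. The coordinates are made up, and this is my reading of the issue rather than a confirmed reproduction:

```python
from pysc2.lib import actions

# What a human effectively does with the hotkey: a "quick" build that
# should place the add-on automatically (the action that, per the bug
# described above, was not correctly received).
quick = actions.FUNCTIONS.Build_TechLab_quick("now")

# What the API required instead: a screen-targeted build. If the point
# is not close enough to the selected building, the building lifts off,
# just as if there were no room for the add-on.
targeted = actions.FUNCTIONS.Build_TechLab_screen("now", (42, 37))
```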

So why would they choose Protoss instead of Zerg? Carriers, obviously! Seriously though, it seems like creep spread and the continual morphing of buildings and units would be very difficult for a bot to learn to manage.

Update: I have been informed that there is actually a bug relating to hatchery rally points that may be preventing effective Zerg training.

PySC2 improvements: worker counts

This was one of the incredible omissions from the original PySC2 that confused and infuriated me. I could not reliably script a bot to manage workers correctly, since there was no easy way to tell how many workers were on a base or vespene geyser, and yet this information is freely available to a human. If I could not script this manually, how would a machine learning bot do it?

Without proper worker management it was extremely difficult to gather minerals and gas efficiently, which made it very difficult to produce more advanced units and research upgrades. It was possible for your entire worker force to be on gas while you had no mineral income, or vice versa. This was a major roadblock to my progress.

I created the feature units functionality in PySC2 specifically to address this, exposing the ideal and current worker counts. This was later built upon as part of the raw units functionality. With this new information it would have been a lot easier for AlphaStar to effectively manage resources, as it could respond to raw numbers instead of trying to interpret unit locations on the screen.
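As a rough sketch of what this enables (the helper function is my own, assuming the raw units observation described earlier), a bot can now read base saturation directly:

```python
from pysc2.lib import features, units

# assigned_harvesters / ideal_harvesters are exposed per unit once
# feature units (or raw units) are enabled, so worker saturation can
# be read directly instead of inferred from screen positions.
def undersaturated_nexuses(obs):
    """Yield (assigned, ideal) worker counts for our under-worked bases."""
    for unit in obs.raw_units:
        if (unit.alliance == features.PlayerRelative.SELF
                and unit.unit_type == units.Protoss.Nexus
                and unit.assigned_harvesters < unit.ideal_harvesters):
            yield unit.assigned_harvesters, unit.ideal_harvesters
```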

Oriol describing bot interactions against the hardest game AI

PySC2 improvements: order length

One of the most important details of the game that is missing or difficult to extract from the API data is the progress of unit production and upgrades. While you can get this data when a building is selected, a human can see progress bars above the buildings indicating the progress of the current production item, so the bot is at a disadvantage if it must select a unit just to get access to this data. Additionally, when you select a production building you can see the number of items that are queued. This is not easy to extract from the API.

In order to at least get some insight into the production process, the order length attribute was added to raw units. While it doesn’t expose the production progress, it does expose the number of queued items. This helps a bot keep track of the production queue, giving it a more reliable view of the results of its actions.

From what I can tell, the order length attribute can also be used to track the number of actions a unit still has queued, such as moving, attacking, or constructing buildings. Once again this provides far more consistent information, something that was previously inaccessible and made it more difficult for a bot to track the results of its actions.
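A small sketch of how this might be used (again, the helper name is mine, assuming the raw units observation):

```python
# order_length reports how many orders a unit currently has queued.
# For a production building that is the production queue depth; for an
# army unit it is the number of pending commands (move, attack, etc.).
def production_queue_depths(obs, building_type):
    """Map each matching building's tag to the number of queued orders."""
    return {unit.tag: unit.order_length
            for unit in obs.raw_units
            if unit.unit_type == building_type}
```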

They adopted and abandoned the RGB layer

When I attended the AI Summit at BlizzCon 2017, one feature seemed to really get people excited: the RGB layer. This is essentially a graphical representation of the game, much the same as a human would see it. It is probably the purest way a bot could observe the game. My guess is that people (including DeepMind) thought this would open up the ability to adapt Atari bots to the StarCraft II environment, given that those were based entirely on the graphical game output.

Google presents a simple Pong agent at I/O in 2018

I don’t think the RGB layer could be adapted as they had hoped, and this approach appears to have been abandoned, for now. It’s possible the RGB layer will be reintroduced at some future date, since it does represent a significant challenge for machine learning: observing a small amount of visual information and relating it to an overall understanding of the unseen environment.
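For reference, requesting rendered RGB observations in PySC2 is a matter of configuration; a minimal sketch with arbitrary dimensions:

```python
from pysc2.lib import actions, features

# Request rendered RGB observations alongside the abstract feature
# layers; dimensions are arbitrary, and the action space must be pinned
# to one of the two observation spaces when both are enabled.
interface = features.AgentInterfaceFormat(
    feature_dimensions=features.Dimensions(screen=84, minimap=64),
    rgb_dimensions=features.Dimensions(screen=256, minimap=128),
    action_space=actions.ActionSpace.FEATURES,
)
```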

A single map

Being able to see a new map and immediately apply prior learning to play optimally in the new environment seems like a valid and complicated challenge. Even for a scripted bot it seems important to at least tell it where the natural and third expansions are located, and the possible locations of the enemy. Reducing the training to a single map is a clear and obvious way to reduce complexity.

I imagine this might actually be one of the next expansions of AlphaStar: using a concept such as Convolutional Neural Networks (CNNs) to understand interactions with smaller sections or features of a map that may be common or similar across other maps.

A single opponent race

You may have noticed that AlphaStar currently only competes in Protoss vs Protoss mirror matchups, something that has been tried before with OpenAI’s Dota 2 bot. There are a few benefits to this. First, it simplifies the information supplied to the bot about the enemy: instead of having to factor in the possibilities of all three races, the data set is far more limited.

Another benefit of a mirror matchup is that each game produces twice the amount of information. Both players contribute equally to the learning process: both what to do in order to win, and what not to do since it can lead to a loss. This process is commonly known as “self-play”, and is exercised by AlphaStar in the form of the AlphaStar League.

No doubt AlphaStar will eventually play all three races, however there are currently limitations and bugs in the API that might delay progress.

Supervised learning from human replays

AlphaGo was originally trained on human games, but eventually evolved into AlphaGo Zero, which learned only from playing against itself. It seems that AlphaStar is on the same path: first learning from human replays, and eventually, most likely, being trained purely on games against itself.

Training a bot on human replays is a pretty good way to ensure that your neural network structure is sound and produces reliable results; essentially you are testing that your bot can do a reasonable job of statistical analysis and outcome prediction. Once you have proven that, you can set it free to explore the unknown and come up with new strategies, like overproducing probes.
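As a loose illustration, and certainly not DeepMind’s architecture, supervised learning from replays amounts to behaviour cloning: predict the human’s action from the observation and minimise the cross-entropy against what the human actually did. A toy sketch in PyTorch with made-up shapes:

```python
import torch
from torch import nn

NUM_ACTIONS = 573  # roughly the size of PySC2's action function set
model = nn.Sequential(
    nn.Flatten(),                       # flatten an 84x84 feature layer
    nn.Linear(84 * 84, 512), nn.ReLU(),
    nn.Linear(512, NUM_ACTIONS),
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(observations, human_actions):
    """One supervised step: nudge the model towards the human's choice."""
    logits = model(observations)           # (batch, NUM_ACTIONS)
    loss = loss_fn(logits, human_actions)  # human_actions: (batch,) ids
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```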

TPUs

While there have been significant advances in neural network structures and machine learning techniques, the hardware factor cannot be ignored. Google has been developing processing hardware known as Tensor Processing Units (TPUs), designed specifically to perform neural network computations as quickly as possible.

Blizzard has developed a modified version of StarCraft II that can play essentially as fast as the computations can be performed, meaning entire games can happen within a few short minutes (or faster). The faster these games can be played, the sooner a researcher can identify whether or not their network architecture is producing the desired result.
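In PySC2 this speed is exposed through parameters like step_mul (game steps per agent step), with the game advancing as fast as the machine allows rather than in real time. A rough sketch of measuring throughput (map and settings are arbitrary):

```python
import time
from pysc2.env import sc2_env
from pysc2.lib import actions, features

env = sc2_env.SC2Env(
    map_name="Simple64",
    players=[sc2_env.Agent(sc2_env.Race.protoss),
             sc2_env.Bot(sc2_env.Race.protoss, sc2_env.Difficulty.easy)],
    agent_interface_format=features.AgentInterfaceFormat(
        feature_dimensions=features.Dimensions(screen=84, minimap=64)),
    step_mul=16,  # game steps advanced per agent step
)
env.reset()

# The game is not paced to real time, so throughput is bounded only by
# how fast the machine can simulate and the agent can act.
start = time.time()
for _ in range(100):
    env.step([actions.FUNCTIONS.no_op()])
elapsed = time.time() - start
print(f"{100 * 16 / elapsed:.0f} game steps per second")
```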

The accelerated process of designing and testing network architectures allows a much faster adaptation and evolution loop than ever before. It was mentioned during the demonstration that the bot was trained for a week or two between rounds of matches against its human opposition. As hardware capability increases, the same could be achieved in days or even hours.

Did they cheat?

I have seen a few people saying that AlphaStar only won because it used impossibly high APM to achieve insane micro. This might be true; I haven’t investigated the replays thoroughly enough to know one way or the other. In my opinion it seems unlikely that DeepMind would release replays like this to a highly technical audience and not expect them to find this information. The word “cheat” implies intentional malice, and I think that would be highly unlikely given the high-profile negative impact it would have on the team’s reputation.

AlphaStar competes against LiquidTLO

There are two scenarios I can think of immediately in defence of DeepMind. First, they simply miscalculated. This seems a bit unlikely, considering the amazing brains they have on their team. However, their metric may be slightly wonky if they are averaging the APM over a long period, with extremely high bursts followed by extreme lows. From what I can gather the bot was put together pretty quickly, so it’s possible they didn’t pay enough attention to this metric.
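To make the averaging concern concrete, here is a toy calculation with entirely made-up numbers showing how a whole-game mean can hide superhuman bursts:

```python
# Made-up numbers: 20 minutes of play, mostly calm, with a total of
# one minute spent in very fast micro bursts.
calm_minutes, calm_apm = 19.0, 150
burst_minutes, burst_apm = 1.0, 1500

total_actions = calm_minutes * calm_apm + burst_minutes * burst_apm
mean_apm = total_actions / (calm_minutes + burst_minutes)
print(mean_apm)  # 217.5 -- a plausible-looking average despite the bursts
```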

The second scenario I can think of is that they consider the extra APM to come from “ineffective” actions: not in the traditional EPM sense, but in compensating for differences between how a human interacts with the game and how the bot must interact with it via the API. These actions might include camera moves or information discovery (see my comments on production progress above).

Update: I have been informed that camera movements do not contribute to APM. I do have an untested theory that the way “entire map” actions are performed could bump the APM up.

In any case, even with a boosted APM I think it’s pretty incredible that the bot was able to learn to take advantage of this and perform well. I am sure that any issues relating to “unfair” APM will be dealt with.

To me, the fact that the bot can see the entire map in a level of detail that a human cannot is far more important. The APM issue seems solvable: simply adjust the way actions are performed and/or measured. Performing well without being able to see the entire map is a far more difficult obstacle to overcome.