DeepMind’s AI learned to play StarCraft. Next, Tesla’s AI will try to learn to drive.

Scaling up machine learning with massive training datasets

AlphaStar, a new AI from Google’s DeepMind, has learned to play StarCraft II. AlphaStar defeated MaNa – one of the top professional StarCraft players in the world – in five consecutive games in December. It remains an open question whether AlphaStar won through sheer mechanical superiority – the speed and precision of its clicks and keystrokes – rather than good strategy and tactics. MaNa defeated a new version of AlphaStar in a live match in January. The new version was deprived of an unfair advantage: the ability to see the entire game map at the same time.

Some StarCraft fans have complained that, despite DeepMind’s assurances, AlphaStar is able to execute a superhuman number of clicks and keystrokes. With precise enough unit control, it’s possible to have absurd, godlike power in StarCraft. MaNa already won when one unfair advantage was removed. If the mechanical aspects of AlphaStar were truly limited to human levels, its competitive strength against humans might falter. Hopefully DeepMind will put this to the test.

No matter how that shakes out, I think it’s safe to say that AlphaStar is genuinely compelling because it has learned some intelligent behaviours. Some of these are simple: if a group of units is moving across the map and one unit trails behind, AlphaStar will send the group back to collect the straggler. There is safety in numbers, and this is a habit human players pick up. Some behaviours are advanced: AlphaStar expends extra resources to create more workers than can be put to use mining. This might seem wasteful, but the backup workers allow AlphaStar to maintain the maximum mining rate after its base is attacked and workers are killed. According to one analysis I’ve seen, the extra productivity following an attack more than outweighs the cost of the backup workers. It seems like AlphaStar may have discovered a new, better strategy that until now has escaped even professional human players.

Even where AlphaStar has an unfair advantage against human players, what it has learned is still impressive. It deftly moves units in and out of the line of fire to avoid attacks while dealing damage. The fact that it knows it should do this, and knows how to do it, is remarkable. To me, this feels analogous to some of the robotics challenges that engineers are trying to solve today. A self-driving car needs to keep a safe distance from other road users, while still being aggressive enough to nudge around a car that’s parallel parking, or a truck that’s unloading in the middle of the street. That’s the same sort of task AlphaStar has learned: keep away when it’s dangerous, press ahead when it’s safe.

AlphaStar learned to play in two stages that used two different machine learning techniques.

Stage 1: Imitation learning

Blizzard, the company that makes StarCraft, has released millions of anonymized human-played games for use in AI research. This allowed DeepMind to use a technique called supervised imitation learning (one of a family of techniques that fall under the umbrella of imitation learning). Supervised imitation learning attempts to predict actions from states. In StarCraft, a state would be everything that AlphaStar knows or perceives about the current match at any given moment. An action would be anything it does in the game. After training on a huge number of state-action pairs from those millions of human-played games, the network predicts what a human would do in a given state and does that.

Supervised imitation learning is just the application of supervised learning, by far the most popular machine learning technique, to the problem of imitating human behaviour. A typical application of supervised learning is image recognition. It goes like this:

Pay humans to label 1 million photos of various objects with labels like “tree”, “pickle”, “doll”, “key”, etc. Train a neural network – a network of artificial neurons – on the set of 1 million labelled images. The network learns to predict labels like “pickle” and “doll” from images of pickles and dolls (if there are enough good examples of each object type). When you show the neural network new images of pickles and dolls that it hasn’t seen before, it can guess the correct label over 80% of the time.
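As a minimal sketch of that pipeline, the toy classifier below learns labels from labelled examples. A nearest-centroid rule stands in for the neural network, and the two-number “images” are made up for illustration – the principle is the same: learn a mapping from inputs to labels out of labelled examples.

```python
# Toy supervised learning: a nearest-centroid classifier standing in
# for a neural network. Each "image" is just a 2-number feature vector.

def train(examples):
    """examples: list of (features, label). Returns per-label centroids."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(centroids, features):
    """Return the label whose centroid is closest to the features."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

# A tiny stand-in for "1 million labelled images".
training_data = [
    ([0.9, 0.1], "pickle"), ([1.0, 0.2], "pickle"), ([0.8, 0.0], "pickle"),
    ([0.1, 0.9], "doll"),   ([0.2, 1.0], "doll"),   ([0.0, 0.8], "doll"),
]
model = train(training_data)
print(predict(model, [0.85, 0.15]))  # close to the "pickle" cluster
```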

Supervised imitation learning applies this same approach to states and actions, rather than images and labels. One way to think of it is that a human StarCraft player “labels” each state with the correct action.
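The same idea can be sketched for states and actions. The toy “policy” below simply memorizes the action humans most often took in each state; the state and action names are invented stand-ins for StarCraft’s vastly larger state and action spaces.

```python
# Toy supervised imitation learning: each recorded human game yields
# (state, action) pairs, and the learned policy predicts the action
# humans most often took in a given state.
from collections import Counter, defaultdict

def fit_policy(demonstrations):
    """demonstrations: list of (state, action) pairs from human play."""
    actions_seen = defaultdict(Counter)
    for state, action in demonstrations:
        actions_seen[state][action] += 1
    # For each state, imitate the most common human action.
    return {state: counts.most_common(1)[0][0]
            for state, counts in actions_seen.items()}

human_games = [
    ("low_minerals", "build_worker"),
    ("low_minerals", "build_worker"),
    ("low_minerals", "attack"),
    ("army_ready",   "attack"),
    ("army_ready",   "attack"),
]
policy = fit_policy(human_games)
print(policy["low_minerals"])  # humans usually built a worker here
```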

After using just supervised imitation learning, DeepMind estimates that AlphaStar was at the level of a human player in StarCraft’s Gold or Platinum league. That would put AlphaStar at roughly median human performance: above the bottom 30% of players, but below the top 30%. This is an impressive proof of concept for imitation learning.

I wonder if now more companies are going to start using imitation learning. Amazon holds a robotics contest every year where the challenge is to get a robot arm to move sundry objects from one box to another. With a big enough research budget, a company like Amazon could pay thousands of people to control robot arms and thereby create a dataset for imitation learning.

Stage 2: Reinforcement learning

Reinforcement learning is trial and error. An AI or “agent” takes a random action, and tries to determine its effect on its reward (i.e. its score in a points system devised by human engineers). For instance, an agent might observe the correlation (over a large number of trials) between taking a certain action and winning a game of StarCraft – or sub-goals that engineers have defined, like increasing resources or avoiding damage.
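Here is a minimal sketch of that trial-and-error loop, with two made-up actions whose hidden win rates stand in for the reward signal. The agent only sees wins and losses, tracks the average reward each action has earned, and gradually favours the better one.

```python
# Toy reinforcement learning: pure trial and error over two actions.
import random

random.seed(0)
true_win_rate = {"rush": 0.3, "expand": 0.6}  # hidden from the agent

totals = {a: 0.0 for a in true_win_rate}  # reward accumulated per action
counts = {a: 0 for a in true_win_rate}    # times each action was tried

for trial in range(2000):
    # Explore randomly 10% of the time; otherwise exploit the best guess.
    if random.random() < 0.1 or trial < len(true_win_rate):
        action = random.choice(list(true_win_rate))
    else:
        action = max(totals, key=lambda a: totals[a] / max(counts[a], 1))
    reward = 1 if random.random() < true_win_rate[action] else 0
    totals[action] += reward
    counts[action] += 1

best = max(totals, key=lambda a: totals[a] / max(counts[a], 1))
print(best)  # "expand" wins more often, so the agent settles on it
```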

Reinforcement learning is analogous to the Darwinian evolutionary process of random mutation and natural selection. Given enough exploration of possibility space – following paths of incremental improvement – remarkable things can evolve.

After being trained with imitation learning, DeepMind trained AlphaStar with a form of reinforcement learning called “self-play”. AlphaStar entered into an epic 200-year StarCraft tournament with 300 different versions of itself. The 300 AlphaStars played for a combined 60,000 years. The final form of AlphaStar that played against MaNa is an amalgam of all the AlphaStars from the tournament, a single agent that combines their strengths.

Imitation learning followed by reinforcement learning is a one-two punch I suspect we could see a lot of in the future. If you can use imitation learning to get an AI to a decent level of performance, you cut down on the amount of random exploration that has to be done. With an activity like StarCraft, where the space of possible actions is astronomical, and where complex sequences of actions are required (e.g. make workers to get resources to make buildings to make combat units to attack the opponent), imitation learning can create a starting point that might require a prohibitive amount of computation to get to with random exploration.
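The one-two punch can be sketched by stitching the two ideas together: warm-start the agent’s action preferences from human demonstrations, then let trial and error refine them. All numbers and action names here are illustrative.

```python
# Sketch of the two-stage recipe: imitation, then reinforcement.
import random

random.seed(1)

# Stage 1 (imitation): count how often humans chose each action.
demos = ["expand"] * 7 + ["rush"] * 3
prefs = {"rush": demos.count("rush"), "expand": demos.count("expand")}

# Stage 2 (reinforcement): refine the preferences by trial and error
# against hidden win rates, starting from the human-derived counts
# rather than from a blank slate.
win_rate = {"rush": 0.3, "expand": 0.6}
for _ in range(500):
    if random.random() < 0.1:            # occasional exploration
        action = random.choice(list(prefs))
    else:                                # otherwise exploit best guess
        action = max(prefs, key=prefs.get)
    prefs[action] += 1 if random.random() < win_rate[action] else -1

print(max(prefs, key=prefs.get))
```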

The AlphaStar approach to self-driving cars

AlphaStar raises a big question: is it harder for a neural network to learn to drive a car in a city than to learn to play StarCraft well enough to defeat a professional? If the answer is no, then self-driving cars are imminent. If the answer is yes, then we need to understand what makes driving a harder problem for neural networks than professional-level StarCraft.

In October 2016, Tesla announced that all the cars it produced from then on would have the hardware it believed to be sufficient for full self-driving. This hardware includes 360-degree cameras, ultrasonic sensors, and a forward-facing radar. Recently, reporter Amir Efrati cited unnamed sources who claim that Tesla is leveraging that hardware for supervised imitation learning:

Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations – for example, how to steer a curve on a road or avoid an object. …Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.

By current estimates, Tesla has sold about 390,000 cars with the latest hardware, which drive 12.5 million miles per day. The fleet is increasing by over 5,000 cars per week. The rate at which Tesla manufactures and sells cars is also expected to increase significantly over time, possibly to 8,000 per week by the end of 2019. This means the Tesla fleet is accumulating miles at a fast and accelerating rate. By the time the fleet reaches a little over 1 million cars, it will be driving 1 billion miles per month.
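The fleet arithmetic above checks out:

```python
# Sanity-check the fleet figures: 390,000 cars driving 12.5 million
# miles per day works out to roughly 32 miles per car per day, and at
# that rate a little over a million cars drive a billion miles a month.
cars = 390_000
miles_per_day = 12_500_000

miles_per_car_per_day = miles_per_day / cars
print(round(miles_per_car_per_day, 1))  # ≈ 32.1 miles per car per day

cars_for_billion_per_month = 1_000_000_000 / (miles_per_car_per_day * 30)
print(round(cars_for_billion_per_month))  # ≈ 1.04 million cars
```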

Tesla, and Tesla alone, has access to a massive dataset of state-action pairs for use in supervised imitation learning. If Tesla is as successful as DeepMind was with AlphaStar, its AI will learn to drive about as competently as the median human driver. That would be revolutionary. It sounds almost too good to be true, and yet that is not in itself a reason to doubt it.

Waymo (formerly the Google self-driving car project) tested out supervised imitation learning on a very small scale. Its neural network, ChauffeurNet, trained on 60 continuous days of driving, which translates to 25,000 to 75,000 miles, depending on average speed. When moved off course, ChauffeurNet was able to recentre itself in its lane with a 100% success rate. It also had a 90% success rate in nudging around a parked car, including scenarios where its starting speed was set so high that even a human driver might not have been able to avoid a collision.
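A quick check of that conversion: 60 continuous days amount to 1,440 hours of driving, so the 25,000–75,000 mile range implies average speeds of roughly 17 to 52 mph.

```python
# Convert 60 continuous days of driving into implied average speeds
# for the quoted 25,000-75,000 mile range.
hours = 60 * 24
print(hours)                     # 1440 hours of driving
print(round(25_000 / hours, 1))  # ≈ 17.4 mph at the low end
print(round(75_000 / hours, 1))  # ≈ 52.1 mph at the high end
```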

Tesla has the opportunity to try the same supervised imitation learning approach with 100,000 times more data. Even if the end result is not satisfactory, the agent trained with imitation learning can be dropped into a driving simulator. Somewhat like AlphaStar, it can share the road with many different versions of itself in order to experience at least a rough approximation of what it’s like driving amongst humans.

If reinforcement learning in simulation yields significant, measurable improvements in autonomous driving, that would unlock a new scale of funding and therefore a new scale of computational resources. DeepMind spent about $4 million worth of computation on training AlphaStar. For a technology that could easily generate billions in profit in a short amount of time, investing $400 million in computation would be a rational investment.

In support of the reinforcement learning approach, Mobileye (now owned by Intel) has apparently achieved some success by doing reinforcement learning from scratch – with no imitation learning beforehand. Mobileye has some interesting demo videos showing the system in action.

OpenAI (the non-profit co-founded by Elon Musk and Sam Altman, among others) used reinforcement learning from scratch to train an AI to play Dota 2, another popular competitive game. The AI, called OpenAI Five, has yet to defeat a professional team, but it did beat a team of humans who all individually rank in the top 0.5% of Dota players. OpenAI Five also apparently learned some complex behaviours, such as using spells and abilities at opportune times, and anticipating the movement of a fleeing opponent.

OpenAI’s Chief Scientist, Ilya Sutskever, says that prior to OpenAI’s work on Dota, the conventional wisdom among AI researchers was that reinforcement learning fundamentally wasn’t capable of solving hard problems. Sutskever argues that OpenAI Five demonstrates that, up until recently, reinforcement learning just wasn’t being used with the scale of training data required to solve hard problems. OpenAI Five trained on 180 years of self-play per day.

To summarize:

Waymo’s ChauffeurNet shows some interesting results for imitation learning applied to autonomous driving.

OpenAI Five is a compelling demonstration of reinforcement learning from scratch. Mobileye also claims to have had some success in applying reinforcement learning from scratch to autonomous driving.

And AlphaStar shows the effectiveness of combining both imitation learning and reinforcement learning.

Tesla, with its unique scale of training data for imitation learning, is the only company currently in a position to try the AlphaStar approach to autonomous driving. So far it’s only been reported that Tesla is using imitation learning, but it seems logical to augment the AI’s driving with reinforcement learning, especially if imitation learning falls short of human-level driving ability. A posting for an intern position says Tesla is looking for candidates with expertise in reinforcement learning, among other topics.

AlphaStar and OpenAI Five both show how autonomous cars might quickly progress from basically non-functional to superhuman in a relatively short timespan. AlphaStar’s training was completed in a matter of weeks. AI is a domain where progress – as measured by how intelligently an AI behaves – can be slow or non-existent for a long time, and then happen in a sudden flash, as if out of nowhere. I think this is because teams spend a long time plugging away at software engineering work, setting up the training system before finally letting it loose. Once training begins, sometimes it doesn’t take long to start producing unexpectedly good results.

This means that, with autonomous cars, we can’t simply assume that progress will continue at the same rate it has for the last 15 years. We can’t look to the past and assume the future will mirror it. There are new technologies now, being applied with a new scale of training data. If imitation learning and reinforcement learning fail, there has to be a good explanation of their failure – some specific reason why the behaviours of human drivers are not learnable by neural networks using these techniques. Why, exactly, might these techniques succeed or fail? That’s the deeper level of understanding we should strive for.