In v0.9 and v0.10 of ML-Agents, we introduced a series of features aimed at decreasing training time, namely Asynchronous Environments, Generative Adversarial Imitation Learning (GAIL), and Soft Actor-Critic. With our partner JamCity, we previously showed that the parallel Unity instance feature introduced in v0.8 of ML-Agents enabled us to train agents for their bubble shooter game, Snoopy Pop, 7.5x faster than with a single instance. In this blog post, we will explain how v0.9 and v0.10 build on those results and show that we can decrease Snoopy Pop training time by an additional 7x, enabling more performant agents to be trained in a reasonable time.

The purpose of the Unity ML-Agents Toolkit is to enable game developers to create complex and interesting behaviors for both playable and non-playable characters using Deep Reinforcement Learning (DRL). DRL is a powerful and general tool that can be used to learn a variety of behaviors, from physics-based characters to puzzle game solvers. However, DRL requires a large volume of gameplay data to learn effective behaviors, a problem for real games that are typically constrained in how much they can be sped up.

Several months ago, with the release of ML-Agents v0.8, we introduced the ability for ML-Agents to run multiple Unity instances of a game on a single machine, dramatically increasing the throughput of training samples (i.e., the agent’s observations, actions, and rewards) that we can collect during training. We partnered with JamCity to train an agent to play levels of their Snoopy Pop puzzle game. Using the parallel environment feature of v0.8, we were able to achieve up to 7.5x training speed up on harder levels of Snoopy Pop.

But parallel environments will only go so far—there is a limit to how many concurrent Unity instances can be run on a single machine. To improve training time on resource-constrained machines, we had to find another way. In general, there are two ways to improve training time: increase the number of samples gathered per second (sample throughput), or reduce the number of samples required to learn good behavior (sample efficiency). Consequently, in v0.9, we improved our parallel trainer to gather samples asynchronously, thereby increasing sample throughput.

Furthermore, we added Generative Adversarial Imitation Learning (GAIL), which enables the use of human demonstrations to guide the learning process, thus improving sample efficiency. Finally, in v0.10, we introduced Soft Actor-Critic (SAC), a trainer that has substantially higher sample efficiency than the Proximal Policy Optimization trainer in v0.8. Together, these changes improved training time by another 7x on a single machine. For Snoopy Pop, this meant that we were able to create agents that not only solve levels but solve them in the same number of steps as a human player. With the increased sample throughput and efficiency, we were able to train multiple levels of Snoopy Pop on a single machine, which previously required multiple days of training on a cluster of machines. This blog post details the improvements made in each subsequent version of ML-Agents, and how they affected the results in Snoopy Pop.

ML-Agents Toolkit + Snoopy Pop

We first introduced our integration of ML-Agents with Snoopy Pop in our ML-Agents v0.8 blog post. The figure below summarizes what the agent can see, what it can do, and the rewards that it receives. Note that compared to our previous experiments with Snoopy Pop, we decreased the magnitude of the positive reward and increased the penalty for using a bubble, forcing the agent to focus less on simply finishing the level and more on clearing bubbles in the fewest steps possible, just as a human player would. This is a much harder problem than just barely winning the level, and learning a good policy for it takes significantly longer.

ML-Agents v0.8: Running multiple, concurrent instances of Snoopy Pop

In ML-Agents v0.8, we introduced the ability to train with multiple Unity instances at the same time. While we are limited in how much we can speed up a single instance of Snoopy Pop, multi-core processors allow us to run multiple instances on a single machine. Since each play-through of the game is independent, we can trivially parallelize the collection of our training data.

Each simulation environment feeds data into a common training buffer, which is then used by the trainer to update its policy in order to play the game better. This new paradigm allows us to collect much more data without having to change the timescale or any other game parameters which may have a negative effect on the gameplay mechanics.

ML-Agents v0.9: Asynchronous Environments and Imitation Learning

In ML-Agents v0.9, we introduced two improvements: asynchronous environments, which increase sample throughput, and GAIL, which improves sample efficiency.

Asynchronous Environments

In the v0.8 implementation of parallel environments, each Unity instance takes a step in sync with the others, and the trainer receives all observations and sends all actions at the same time. For some environments, such as those provided with the ML-Agents toolkit, the agents take decisions at roughly the same constant frequency, and executing them in lock-step is not a problem. However, for real games, certain actions may take longer than others. For instance, in Snoopy Pop, clearing a large number of bubbles incurs a longer animation than clearing none, and winning the game and resetting the level takes longer than taking a shot. This means that if even one of the parallel environments takes one of these longer actions, the others must wait.

In ML-Agents v0.9, we enabled asynchronous parallel environments. As long as at least one of the environments has finished taking its action, the trainer can send a new action and take the next step. For environments with varying step times, this can significantly improve sample throughput.
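To make the difference concrete, here is a toy simulation of lock-step versus asynchronous stepping. It is only a sketch: the function names and step durations are made up for illustration, it ignores the time the trainer itself needs to choose actions, and it is not how ML-Agents implements either mode.

```python
import heapq

def lockstep_time(step_times):
    """Wall time when every environment waits for the slowest one each round.

    step_times[e][k] is the (hypothetical) duration of step k in environment e.
    """
    rounds = zip(*step_times)  # the k-th step of every environment
    return sum(max(durations) for durations in rounds)

def async_time_for_samples(step_times, num_samples):
    """Wall time to collect num_samples steps when environments run independently."""
    # Each heap entry is (finish_time, env_index, step_index) for a step in flight.
    heap = [(times[0], env, 0) for env, times in enumerate(step_times)]
    heapq.heapify(heap)
    collected, now = 0, 0
    while heap and collected < num_samples:
        now, env, k = heapq.heappop(heap)  # the next environment to finish
        collected += 1
        if k + 1 < len(step_times[env]):
            # That environment immediately starts its next step.
            heapq.heappush(heap, (now + step_times[env][k + 1], env, k + 1))
    return now

# Two environments, four steps each; a 5 stands for a long animation or a reset.
step_times = [[1, 1, 1, 5], [1, 5, 1, 1]]
print(lockstep_time(step_times))              # 12
print(async_time_for_samples(step_times, 8))  # 8
```

Both runs collect the same eight samples, but in lock-step each long step stalls the other environment, so the slow steps are paid for twice.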

Generative Adversarial Imitation Learning (GAIL)

In a typical DRL training process, the agent is initialized with a random behavior, performs random actions in the environment, and may happen upon some rewards. It then reinforces behaviors that produce higher rewards, and, over time, the behavior tends towards one that maximizes the reward in the environment and becomes less random.

However, not all optimal behavior is easy to find through random behavior. For example, the reward may be sparse, i.e., the agent must take many correct actions before receiving a reward. Or the environment may have many local optima, i.e., places the agent could go that appear to lead towards the maximum reward but are actually on an incorrect path. Both of these issues may be possible to solve using brute-force random search, but doing so requires many, many samples. They contribute to the millions of samples required to train Snoopy Pop. In some cases, the agent may never find the optimal behavior.

But what if we could do a bit better by guiding the agent towards a good behavior by providing it with human demonstrations of the game? This area of research is called Imitation Learning and was added to ML-Agents in v0.3. One of the drawbacks of Imitation Learning in ML-Agents was that it could only be used independently of reinforcement learning, training an agent purely on demonstrations but without rewards from the environment.

In v0.9, we introduced GAIL, which addresses both of these issues, based on research by Jonathan Ho and his colleagues. You can read more about the algorithm in their paper.

To use Imitation Learning with ML-Agents, you first have a human player (or a bot) play through the game several times, saving the observations and actions to a demonstration file. During training, the agent is allowed to act in the environment as usual and gather observations of its own. At a high level, GAIL works by training a second learning algorithm (the discriminator, implemented with a neural network) to classify whether a particular observation (and action, if desired) came from the agent, or the demonstrations. Then, for each observation the agent gathers, it is rewarded by how close its observations and actions are to those in the demonstrations. The agent learns how to maximize this reward. The discriminator is updated with the agent’s new observations, and gets better at discriminating. In this iterative fashion, the discriminator gets tougher and tougher—but the agent gets better and better at “tricking” the discriminator and mimicking the demonstrations.
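The loop described above can be sketched in a few lines. This is a toy stand-in, not the ML-Agents implementation: the discriminator here is a logistic regression rather than a neural network, the feature vectors are invented, and the conventions (label 1 for demonstrations, reward of -log(1 - D)) are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class Discriminator:
    """Logistic-regression stand-in for the neural-network discriminator."""

    def __init__(self, num_features, learning_rate=0.5):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def prob_demo(self, features):
        """Estimated probability that `features` came from the demonstrations."""
        z = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return sigmoid(z)

    def update(self, demo_batch, agent_batch):
        """One pass of stochastic gradient descent on cross-entropy (demo=1, agent=0)."""
        labelled = [(x, 1.0) for x in demo_batch] + [(x, 0.0) for x in agent_batch]
        for features, label in labelled:
            error = self.prob_demo(features) - label  # gradient w.r.t. the logit
            for i, x in enumerate(features):
                self.weights[i] -= self.learning_rate * error * x
            self.bias -= self.learning_rate * error

def gail_reward(discriminator, features, eps=1e-7):
    """Larger when the discriminator thinks the sample looks like a demonstration."""
    return -math.log(1.0 - discriminator.prob_demo(features) + eps)

# Hypothetical encoded (observation, action) pairs.
demo_samples = [[1.0, 0.0], [0.9, 0.1]]
agent_samples = [[0.0, 1.0], [0.1, 0.9]]

disc = Discriminator(num_features=2)
for _ in range(200):
    disc.update(demo_samples, agent_samples)

# A demonstration-like sample earns a larger reward than an agent-like one.
print(gail_reward(disc, [1.0, 0.0]) > gail_reward(disc, [0.0, 1.0]))
```

In real training the agent's samples change after every policy update, so the discriminator and the agent are updated in alternation rather than the discriminator being trained once on a fixed set.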

Because GAIL simply gives the agent a reward, leaving the learning process unchanged, we can combine GAIL with reward-based DRL by simply weighting and summing the GAIL reward with those given by the game itself. If we ensure the magnitude of the game’s reward is greater than that of the GAIL reward, the agent will be incentivized to follow the human player’s path through the game until it is able to find a large environment reward.
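As a minimal sketch of that weighted sum, assuming illustrative weights that are not the values we used for Snoopy Pop:

```python
def combined_reward(env_reward, gail_reward, env_strength=1.0, gail_strength=0.1):
    """Weighted sum of the game's reward and the GAIL reward for a single step.

    The default strengths are hypothetical; they only illustrate keeping the
    GAIL term smaller than the game's own reward.
    """
    return env_strength * env_reward + gail_strength * gail_reward

# Early in training the game gives no reward, so the GAIL term provides the signal.
early_step = combined_reward(env_reward=0.0, gail_reward=0.8)

# Once the agent finds a large game reward, that reward dominates the sum.
winning_step = combined_reward(env_reward=1.0, gail_reward=0.2)

print(early_step < winning_step)  # True
```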

ML-Agents v0.10: Soft Actor-Critic

Since its initial release, the ML-Agents Toolkit has used Proximal Policy Optimization (PPO) – a stable, flexible DRL algorithm. In v0.10, in the interest of speeding up your training on real games, we released a second DRL algorithm, SAC, based on work by Tuomas Haarnoja and his colleagues. One of the critical features of SAC, which was originally created to learn on real robots, is sample efficiency. For games, this means we don’t need to run the games as long to learn a good policy.

DRL algorithms fall into one of two categories: on-policy and off-policy. An on-policy algorithm such as PPO collects some number of samples, learns how to improve its policy based on them, then updates its policy accordingly. By collecting samples using its current policy, it learns how to improve itself, increasing the probability of taking rewarding actions and decreasing the probability of those that are not rewarding. Most modern on-policy algorithms, such as PPO, learn a form of evaluation function as well, such as a value estimate (the expected discounted sum of rewards to the end of the episode given the agent is in a particular state) or a Q-function (the expected discounted sum of rewards if a given action is taken at a particular state). In an on-policy algorithm, these evaluators estimate the series of rewards assuming the current policy is followed. Without going into much detail, this estimate helps the algorithm train more stably.
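For readers who prefer code to definitions, the "expected discounted sum of rewards" can be sketched as follows. The helper names and reward sequences are invented for the example; PPO learns this estimate with a neural network rather than by averaging whole episodes as done here.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards from the first step to the end of the episode."""
    total = 0.0
    for reward in reversed(rewards):
        # Working backwards, each earlier step discounts everything after it.
        total = reward + gamma * total
    return total

def monte_carlo_value(episodes, gamma=0.99):
    """Average discounted return over episodes that all start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)

# A reward of 1 arriving two steps from now is worth 0.5 * 0.5 today.
print(discounted_return([0, 0, 1], gamma=0.5))             # 0.25

# Two hypothetical episodes collected with the current policy.
print(monte_carlo_value([[0, 0, 1], [0, 1]], gamma=0.5))   # 0.375
```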

Off-policy algorithms, such as SAC, work a bit differently. Assuming the environment has fixed dynamics and reward function, there exists some optimal relationship between taking a particular action at a given state and getting some cumulative reward (i.e., what would the best possible policy be able to get?). If we knew this relationship, learning an effective policy would be really easy! Rather than learning how good the current policy is, off-policy algorithms learn this optimal evaluation function across all policies. This is a harder learning problem than in the on-policy case, because the real function could be very complex. But because you’re learning a global function, you can use all the samples that you’ve collected from the beginning of time to help learn your evaluator, making off-policy algorithms much more sample-efficient than on-policy ones. This re-use of old samples is called experience replay, and all samples are stored in a large experience replay buffer that can store hundreds (if not thousands) of games’ worth of data.
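A replay buffer is simple to sketch. The class below is an illustration only, with a hypothetical name and interface, and is not the buffer used inside ML-Agents.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions; the oldest are evicted first."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, observation, action, reward, next_observation, done):
        self.transitions.append((observation, action, reward, next_observation, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniform mini-batch, regardless of which policy produced each sample."""
        return rng.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)

# Store five toy transitions in a buffer that only has room for three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(observation=step, action=0, reward=0.0,
               next_observation=step + 1, done=False)

print(len(buffer))                   # 3
print(buffer.transitions[0][0])      # 2 (steps 0 and 1 were evicted)
batch = buffer.sample(batch_size=2)  # two old transitions, reused for an update
```

An on-policy trainer would discard these transitions after a single policy update; an off-policy trainer keeps sampling from them until they age out of the buffer.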

For our toolkit, we’ve adapted the original SAC algorithm, which was designed to do continuous action locomotion tasks, to support all of the features you’re used to in ML-Agents – Recurrent Neural Networks (memory), branched discrete actions, curiosity, GAIL, and more.

Performance results in Snoopy Pop

In our previous experiments, we demonstrated that for a complex level of Snoopy Pop (Level 25), we saw a 7.5x decrease in training time going from a single environment (i.e., v0.7 of ML-Agents) to 16 parallel environments on a single machine. This meant that a single machine could be used to find a basic solution to Level 25 in under 9 hours. Using this capability, we trained our agents to go further and master Level 25—i.e., solve Level 25 to human performance. Note this takes a considerably longer time than simply solving the level—an average of about 33 hours.

Here, we declare an agent to have “mastered” a level if it reaches average human performance (solves the level at or under the number of bubbles a human uses) over 1000 steps. For Level 25, this corresponds to 25.14 steps/bubbles shot, averaged from 21 human plays of the same level.

We then tested each improvement from v0.9 and v0.10 incrementally, measuring the time it takes to exceed human performance at the level. All in all, they add up to an additional 7x speedup in mastering the level! Each value shown is an average over three runs, as training times may vary between runs. Sometimes, the agent gets lucky and finds a good solution quickly. All runs were done on a 16-core machine with training accelerated by a K80 GPU. 16 instances were run in parallel during training.

For the GAIL experiments, we used the 21 human playthroughs of Snoopy Pop as demonstrations to train the agents. Note that the bubble colors in Level 25 are randomly generated, so in no way do the 21 playthroughs cover all possible board configurations of the level. If they did, the agent would learn very fast by memorizing and copying the player behavior. We then mixed a GAIL reward signal with the one provided by the Snoopy Pop game, so that GAIL can guide the agent’s learning early in the process but allow it to find its own solution later.

Configuration                      Time to Reach Human Performance (hours:minutes)   Sample Throughput (samples/second)
Parallel Environments (v0.8)       34:03                                             10.83
Asynchronous Environments (v0.9)   31:08                                             14.81
GAIL with PPO (v0.9)               23:18                                             14.51
SAC (v0.10)                        5:58                                              15.04
GAIL with SAC (v0.10)              4:44                                              15.28

Let’s visualize the speedup in graph format below. We see that the increase in sample throughput from asynchronous environments reduces training time without any changes to the algorithm. The bigger reductions in training time, however, come from improving the sample efficiency of training. Note that sample throughput did not change substantially between ML-Agents v0.9 and v0.10. Adding demonstrations and using GAIL to guide training meant that the agent used 26% fewer samples to reach the same level of performance, and we see a corresponding drop in training time. Switching to Soft Actor-Critic, an off-policy algorithm, meant that the agent solved the level with 81% fewer samples than vanilla PPO, and adding GAIL to SAC improves this further.
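The sample counts behind these percentages can be approximated from the results table, since total samples is just training time multiplied by throughput. The sketch below assumes that "vanilla PPO" refers to the asynchronous v0.9 run and uses the rounded table values, so it lands close to the quoted figures (roughly 27% and 81%) rather than matching them exactly.

```python
def hours(hhmm):
    """Convert an 'H:MM' duration string to fractional hours."""
    h, m = hhmm.split(":")
    return int(h) + int(m) / 60

# (time to reach human performance, samples/second), copied from the results table.
results = {
    "Parallel Environments (v0.8)": ("34:03", 10.83),
    "Asynchronous Environments (v0.9)": ("31:08", 14.81),
    "GAIL with PPO (v0.9)": ("23:18", 14.51),
    "SAC (v0.10)": ("5:58", 15.04),
    "GAIL with SAC (v0.10)": ("4:44", 15.28),
}

def total_samples(name):
    """Approximate samples used: training time in seconds times throughput."""
    time, throughput = results[name]
    return hours(time) * 3600 * throughput

def fewer_samples_pct(name, baseline):
    """Percentage reduction in samples relative to a baseline configuration."""
    return 100 * (1 - total_samples(name) / total_samples(baseline))

# Assumption: the PPO baseline is the asynchronous v0.9 run.
baseline = "Asynchronous Environments (v0.9)"
print(f"{fewer_samples_pct('GAIL with PPO (v0.9)', baseline):.1f}% fewer samples")
print(f"{fewer_samples_pct('SAC (v0.10)', baseline):.1f}% fewer samples")

# Overall speedup from v0.8 parallel environments to GAIL with SAC.
speedup = hours("34:03") / hours("4:44")
print(f"{speedup:.1f}x faster")
```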

These improvements aren’t unique to the new reward function and goal of reaching human performance. If we task SAC+GAIL with simply solving the level, as we had done in our previous experiments, we are able to do so in 1 hour, 11 minutes, compared with 8 hours, 24 minutes in those previous experiments.

Next steps

If you’d like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions. Please apply!

If you use any of the features provided in this release, we’d love to hear from you. For any feedback regarding the Unity ML-Agents Toolkit, please fill out the following survey and feel free to email us directly. If you encounter any issues or have questions, please reach out to us on the ML-Agents GitHub issues page.