Welcome to the first of Unity’s new AI-themed blog entries! We have set up this space as a place to share and discuss the work Unity is doing around Artificial Intelligence and Machine Learning. In the past few years, advances in Machine Learning (ML) have allowed for breakthroughs in detecting objects, translating text, recognizing speech, and playing games, to name a few. That last point, the connection between ML and games, is something very close to our hearts here at Unity. We believe that breakthroughs in Deep Learning are going to create a sea change in how games are built, changing everything from how textures and 3D models are generated, to how non-playable characters (NPCs) are programmed, to how we think about animating characters or lighting scenes. These blog entries are a creative space to explore all these emerging developments.

Who are these blog entries for?

It is our objective to inform Unity Game Developers about the power of AI and ML approaches in game development. We also want to show Artists the opportunities of using AI techniques in content creation. We will take this as an opportunity to demonstrate to ML Researchers the potential of Unity as a platform for AI research and development. This includes demonstrating to Industry the potential of Unity as a simulation platform for robotics and self-driving cars. And finally, we want to get Hobbyists and Students excited about both Unity and ML.

Over the next few months, we hope to use this space to start discussions and build a community around these concepts and use cases of Unity. Multiple members of the Unity ML team and other related teams within Unity will post here discussing the different connections between Unity and Machine Learning. Whenever possible, we will release open source tools, videos, and example projects to help the different groups mentioned above make use of the ideas, algorithms, and methods we have shared. We will be monitoring this space closely, and encourage the Unity community to contribute to the commentary as well.

Why Machine Learning?

To begin the conversation, we want to spend this first entry talking specifically about the relationship between ML and game AI. Most game AI that currently exists is hand-coded, consisting of decision trees with sometimes up to thousands of rules, all of which must be maintained by hand and thoroughly tested. In contrast, ML relies on algorithms which can make sense of raw data, without the need for an expert to define how to interpret that data.

Take, for example, the computer vision problem of classifying the content of an image. Until a few years ago, experts would write filters by hand that would extract useful features for classifying an image as containing a cat or dog. In contrast, ML, and in particular the newer Deep Learning approaches, needs only the images and class labels, and learns the useful features automatically. We believe that this automated learning can help simplify and speed up the process of creating games for developers both big and small, in addition to opening up the possibilities of the Unity platform being used in a wider array of contexts such as simulations of ML scenarios.

This automated learning can be applied specifically to the behavior of game agents, a.k.a. NPCs. We can use Reinforcement Learning (RL) to train agents to estimate the value of taking actions within an environment. Once they have been trained, these agents can take actions to receive the most value, without ever having to be explicitly programmed how to act. The rest of this post is going to consist of a simple introduction to RL and a walkthrough of how to implement a simple RL algorithm in Unity! And of course all the code used in this post is available in the GitHub repository here. You can also access a WebGL demo.

Reinforcement Learning with Bandits

As mentioned above, a core concept behind RL is the estimation of value, and acting on that value estimate. Before going further, it will be helpful to introduce some terminology. In RL, what performs the acting is called an agent, and what it uses to make decisions about its actions is called a policy. An agent is always embedded within an environment, and at any given moment the agent is in a certain state. From that state, it can take one of a set of actions. The value of a given state refers to how ultimately rewarding it is to be in that state. Taking an action in a state can bring an agent to a new state, provide a reward, or both. The total cumulative reward is what all RL agents try to maximize over time.

The simplest version of an RL problem is called the multi-armed bandit. This name is derived from the problem of optimizing pay-out across multiple slot machines, also referred to as “one-armed bandits” given their propensity for stealing quarters from their users. In this set-up, the environment consists of only a single state, and the agent can take one of n actions. Each action provides an immediate reward to the agent. The agent’s goal is to learn to pick the action that provides the greatest reward.

To make this a little more concrete, let’s imagine a scenario within a dungeon-crawler game. The agent enters a room, and finds a number of chests lined up along the wall. Each of these chests has a certain probability of containing either a diamond (reward +1) or an enemy ghost (reward -1).

The goal of the agent is to learn which chest is the most likely to have the diamond (say, for example, third from the right). The natural way to discover which chest is the most rewarding is to try each of the chests out. Indeed, until the agent has learned enough about the world to act optimally, much of RL consists of simple trial and error. Bringing the example above back to the RL lingo, the “trying out” of each chest corresponds to taking a series of actions (opening each chest multiple times), and the learning corresponds to updating an estimate of the value of each action. Once we are reasonably certain about our value estimates, we can then have the agent always pick the chest with the highest estimated value.

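These value estimates can be learned using an iterative process in which we start with an initial series of estimates V(a), and then adjust them each time we take an action and observe the result. Formally, this is written as:

$$V(a) \leftarrow V(a) + \alpha \, (r - V(a))$$

Here r is the reward obtained after taking action a, and α is a small step size (learning rate) between 0 and 1.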

Intuitively, the above equation is stating that we adjust our current value estimate a little bit in the direction of the obtained reward. In this way we ensure we are always changing our estimate to better reflect the true dynamics of the environment. In doing so, we also ensure that our estimates don’t become unreasonably large, as might happen if we simply counted positive outcomes. We can accomplish this in code by keeping a vector of value estimates, and referencing them with the index of the action our agent took.
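As an illustration of keeping a vector of value estimates and nudging the chosen entry toward each observed reward, here is a minimal Python sketch. It is not the code from the demo project: the function names, the step size of 0.02, and the chest probabilities are all assumptions made up for this example.

```python
import random


def update_value(values, action, reward, step_size=0.02):
    """Move the estimate for `action` a small step toward the observed reward."""
    values[action] += step_size * (reward - values[action])
    return values


def open_chest(diamond_probability, rng):
    """Return +1.0 for a diamond and -1.0 for a ghost."""
    return 1.0 if rng.random() < diamond_probability else -1.0


rng = random.Random(0)
chest_probs = [0.2, 0.5, 0.8]  # hypothetical diamond probability per chest
values = [0.0, 0.0, 0.0]       # one value estimate per action (chest)

for _ in range(2000):
    action = rng.randrange(len(chest_probs))  # pure trial and error
    reward = open_chest(chest_probs[action], rng)
    update_value(values, action, reward)

print(values)
```

Because each update is a weighted average of the old estimate and a reward in [-1, +1], the estimates stay within that range no matter how many chests are opened.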

Contextual Bandits

The situation described above lacks one important aspect of any realistic environment: it only has a single state. In reality (and any game world), a given environment can have anywhere from dozens (think rooms in a house) to billions (pixel configurations on a screen) of possible states. Each of these states can have its own unique dynamics in terms of how actions provide new rewards or enable movement between states. As such, we need to condition our actions, and by extension our value estimates, on the state as well. Notationally, we will now use Q(s, a) instead of just V(a). Abstractly, this means that the reward we expect to receive is now a function of both the action we take and the state we were in when taking that action. In our dungeon game, the concept of state can enable us to have different sets of chests in different rooms. Each of these rooms can have a different ideal chest, and as such our agent needs to learn to pick different actions in different rooms. We can accomplish this in code by keeping a matrix of value estimates, instead of simply a vector. This matrix can be indexed with [state, action].
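The matrix of estimates described above might look like the following Python sketch, where each row is a room and each column is a chest. The room layout, probabilities, step size, and names such as `update_q` are hypothetical, and the nested list stands in for whatever matrix type a real project would use.

```python
import random


def update_q(q, state, action, reward, step_size=0.05):
    """Nudge the estimate for this (state, action) pair toward the reward."""
    q[state][action] += step_size * (reward - q[state][action])


rng = random.Random(1)

# Hypothetical diamond probabilities: one row per room, one column per chest.
room_chest_probs = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
]
num_rooms = len(room_chest_probs)
num_chests = len(room_chest_probs[0])

# Q[state][action], initialised to zero.
q = [[0.0] * num_chests for _ in range(num_rooms)]

for _ in range(6000):
    state = rng.randrange(num_rooms)    # the room the agent finds itself in
    action = rng.randrange(num_chests)  # the chest it tries
    reward = 1.0 if rng.random() < room_chest_probs[state][action] else -1.0
    update_q(q, state, action, reward)

# The chest with the highest estimate in each room.
best_chest_per_room = [
    max(range(num_chests), key=lambda a: row[a]) for row in q
]
print(best_chest_per_room)
```

The only change from the single-state case is the extra index: the same update rule runs on whichever row corresponds to the current room.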

Exploring and Exploiting

There is one more important piece of the puzzle to getting RL to work. Before our agent has learned a policy for taking the most rewarding actions, it needs a policy that will allow it to learn enough about the world to be sure it knows what optimal actually is. This presents us with the classic dilemma of how to balance exploration (learning about the environment’s value structure through trial and error) and exploitation (acting on the environment’s learned value structure). Sometimes these two goals line up, but often they do not. There are a number of strategies for balancing these two goals. Below we have outlined a few approaches.

One simple, yet powerful strategy follows the principle of “optimism in the face of uncertainty.” The idea here is that the agent starts with high value estimates V(a) for each action, so that acting greedily (taking the action with the maximum value) will lead the agent to explore each of the actions at least once. If the action didn’t lead to a good reward, the value estimate will decrease accordingly, but if it did, then the value estimate will remain high, as that action might be a good candidate to try again in the future. By itself though, this heuristic is often not enough, since we might need to keep exploring a given state to find an infrequent, but large reward.
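A minimal Python sketch of optimistic initialisation follows. The starting value of 5.0, the step size, and the chest probabilities are illustrative assumptions; the only requirement is that the initial estimates sit well above the largest possible reward (+1 here).

```python
import random


def greedy_action(values):
    """Return the index of the highest estimate (first one wins on ties)."""
    return max(range(len(values)), key=lambda a: values[a])


rng = random.Random(2)
chest_probs = [0.2, 0.5, 0.8]  # hypothetical diamond probability per chest
values = [5.0, 5.0, 5.0]       # optimistic: far above the best reward of +1
pull_counts = [0, 0, 0]

for _ in range(500):
    action = greedy_action(values)
    reward = 1.0 if rng.random() < chest_probs[action] else -1.0
    # Any real reward is below the optimistic estimate, so the estimate drops
    # and a so-far-untried chest becomes the greedy choice.
    values[action] += 0.1 * (reward - values[action])
    pull_counts[action] += 1

print(pull_counts, values)
```

Even though the agent never acts randomly here, every chest is opened at least once, because each untried chest still carries its inflated starting estimate.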

Another strategy is to add random noise to the value estimates for each action, and then act greedily based on these new noisy estimates. With this approach, as long as the noise is smaller than the difference between the value of the true optimal action and the values of the other actions, the agent should converge to optimal value estimates.
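The noisy-greedy idea can be sketched in Python as below. Gaussian noise and the `noise_scale` parameter are assumptions chosen for illustration; the text does not specify the noise distribution.

```python
import random


def noisy_greedy_action(values, noise_scale, rng):
    """Add Gaussian noise to each estimate, then pick the largest noisy value."""
    noisy = [v + rng.gauss(0.0, noise_scale) for v in values]
    return max(range(len(noisy)), key=lambda a: noisy[a])


rng = random.Random(3)
values = [0.0, 0.0, 0.0]

# With equal estimates and sizeable noise, the choice is effectively random,
# so every action gets explored.
explored = {noisy_greedy_action(values, 1.0, rng) for _ in range(200)}
print(sorted(explored))

# With no noise, the rule reduces to plain greedy selection.
print(noisy_greedy_action([0.0, 1.0, 0.5], 0.0, rng))
```

In practice the noise scale would typically be reduced over time, so that the agent explores early on and exploits once its estimates are reliable.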

We could also go one step further and take advantage of the nature of the value estimates themselves by normalizing them, and taking actions probabilistically. In this case, if the value estimates for each action were roughly equal, then we would take actions with roughly equal probability. On the flip side, if one action had a much greater value estimate, then we would pick it more often. By doing this we slowly weed out unrewarding actions by taking them less and less. This is the strategy we use in the demo project.
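One common way to turn value estimates into action probabilities is a softmax, sketched below in Python. Treating the normalization as a softmax, and the `temperature` parameter, are assumptions for this example; the demo project may normalize its estimates differently.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Convert value estimates into probabilities that sum to 1."""
    max_value = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - max_value) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(values, rng, temperature=1.0):
    """Pick an action at random, weighted by its softmax probability."""
    probs = softmax(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


rng = random.Random(4)
print(softmax([0.0, 0.0, 0.0]))   # equal estimates give equal probabilities
print(softmax([0.8, -0.6, 0.0]))  # the highest estimate gets the largest share
print(sample_action([0.8, -0.6, 0.0], rng))
```

A lower temperature makes the choice closer to greedy, while a higher one makes it closer to uniform, which gives a single knob for trading exploration against exploitation.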

Going Forward

With this blog post and the accompanying code you should now have all the pieces needed to start working with multi-armed and contextual bandits in Unity. This is all just the beginning. In a follow-up post we will go through Q-learning in full RL problems, and from there start to tackle learning policies for increasingly complex agent behavior in visually rich game environments using deep neural networks. Using these more advanced methods, it is possible to train agents which can serve as companions or opponents in genres ranging from fighting and driving games to first-person shooters, or even real-time strategy games, all without writing rules, focusing on what you want the agent to achieve instead of how to achieve it.

In the next few posts we will also be providing an early release of tools to allow researchers interested in using Unity for Deep RL research to connect their models written with frameworks such as TensorFlow or PyTorch to environments made in Unity. On top of all that, we have a lot more planned for this year beyond agent behavior, and we hope the community will join us as we explore the uncharted territory that is the future of how games are made!

You can read the second part of this blog series here.