What Will I Learn?

In this article we are going to build a simple reinforcement learning (RL) agent that can successfully land a rocket in the video game Lunar Lander. RL is a massive topic and I’m not going to cover everything here in detail. Instead, the aim of this project is to get your hands dirty with some practical reinforcement learning and get a feel for it. More in-depth articles on the various topics will follow in the future. This project will cover the following:

- Core Principles of Reinforcement Learning
- Building a simple neural network with PyTorch
- Using the cross entropy method (CEM) and Deep Learning to safely land a 2D rocket
- Using OpenAI Gym to train intelligent agents that can solve various environments ranging from robotics to video games

The full code can be found here on my GitHub. If you want to quickly follow along with everything already set up, click the “run on FloydHub” button below. This will set up a workspace on FloydHub’s cloud platform with all the dependencies and environment requirements pre-installed.

Intro

Advancements in AI are skyrocketing at the moment (no pun intended). Researchers are developing groundbreaking ideas every year and we are still in the infancy of this amazing field. One of the most exciting areas in artificial intelligence is reinforcement learning. RL has been responsible for some of the coolest examples of AI, such as OpenAI developing an agent that beat the top pro human players in the e-sport DOTA 2! For an AI to be able to generalize and learn complex strategies in a game as sophisticated as DOTA 2 is a huge achievement and will push research even closer to achieving artificial general intelligence (AGI). Aside from achieving superhuman levels of performance in games, RL can be applied to a broad range of fields including financial trading, natural language processing, healthcare, manufacturing and education.

What is Reinforcement Learning?

Like most advances in artificial intelligence, RL is derived from studying the intelligence of humans. The core concepts of RL come from behaviorism, which basically says that everything we do in life is a reflex response to our current environment or a consequence of our past actions.

RL uses trial and error to learn to make the best decisions possible by achieving the best reward possible over a period of time. A good example of this is how you would train a dog. Every time the dog rolls over when you tell it to, it gets a treat (reward +1). Every time it pees on the rug it gets yelled at (reward -1). Over time it learns to do the things that earn the most positive rewards and avoid the things that earn the most negative rewards.

Because RL learns from a reward signal rather than from labelled examples, it doesn’t need to be told how to achieve something, it just needs to know what to achieve. This means it can find solutions to problems that humans may never have even thought of.

The Components Of Reinforcement Learning

As the agent makes its way through the environment it goes through a type of learning loop, shown in the figure below. The agent identifies what state it is currently in and decides the best action to take given that state. When this action is carried out in the environment, the agent observes the new state as well as the reward it received for taking that action in the previous state. Through this loop our agent learns.

The game loop of an RL environment. Image taken from here
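
The loop in the figure can be sketched in a few lines of plain Python. This is only an illustration: the CoinFlipEnv class, the random_agent function and the run_episode helper are hypothetical names invented here, not part of Gym. Only the reset/step shape of the interface is borrowed from the classic Gym style.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: match the coin that is shown, 5 times."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.steps_left = 0
        self.coin = 0

    def reset(self):
        self.steps_left = 5
        self.coin = self.rng.randint(0, 1)
        return self.coin  # the state the agent observes

    def step(self, action):
        # Reward +1 if the action matches the coin shown, -1 otherwise.
        reward = 1 if action == self.coin else -1
        self.coin = self.rng.randint(0, 1)
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.coin, reward, done


def random_agent(state, rng):
    # Placeholder policy: ignores the state and guesses at random.
    return rng.randint(0, 1)


def run_episode(env, agent, rng):
    state = env.reset()
    total_reward, done = 0, False
    while not done:
        action = agent(state, rng)              # decide on an action
        state, reward, done = env.step(action)  # act, then observe
        total_reward += reward                  # feedback used for learning
    return total_reward


episode_reward = run_episode(CoinFlipEnv(), random_agent, random.Random(1))
print(episode_reward)
```

An agent that actually used the state (here, simply copying it) would collect the maximum reward of 5, which is the kind of behaviour the learning loop is meant to discover.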

The Agent

The agent is our AI, the hero of our story. Fun fact, if you give your agent a name, it will train at least 2X faster! Let’s call our agent Tim. Tim contains the RL algorithm and neural network that allow it to make decisions based on what it has learned so far. In this project we will be using the cross entropy method as our RL algorithm.

Fair warning, it is possible to become emotionally attached to your agent while it is training. You may even find yourself shouting at the training logs on your monitor as little Timmy fails to improve for the 5th epoch in a row. This will also give you a nice preview of what kind of parent you will be (most likely impatient and overly critical).

The Environment

Lunar Lander Game

The environment is where our agent lives and carries out its tasks. In this project the environment is the Lunar Lander game. All you need to know is that the environment gives the agent the states as observations, along with the rewards the agent receives as it tries to beat the environment. Down the road you might want to build your own environments to try and solve specific tasks, but that’s a lot of work. So for now we are going to use OpenAI’s Gym. This provides us with a huge array of learning environments to pick from. If you are running this project locally, check out the OpenAI documentation for getting Gym set up here. If you don’t want to go through the hassle of setting this up, I’d advise using the “run on FloydHub” button at the start of the article.

State

The state/observation is just the current state of the environment. For example, in Tic-Tac-Toe the current state of the environment would be which pieces are on which squares. This simple environment could represent its state with a simple matrix, but for more complex environments we can use the pixel data from the screen as the current state. This relies on computer vision to understand what is going on in the current state. This is a more advanced feature and won’t be covered here.

Action

Like the state, this is pretty self-explanatory. For each step in the environment the agent carries out an action based on the current state it is in. In this project the lunar lander has 4 possible actions it can take: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
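
In code, each action is just an integer index. The mapping below is a small sketch that assumes the indices follow the order in which the actions are listed above, which is how the Gym documentation for the discrete Lunar Lander environment numbers them. The ACTIONS dictionary and describe helper are illustrative names, not part of Gym.

```python
# Action indices, assumed to follow the order listed in the text
# (0 = do nothing ... 3 = fire right orientation engine).
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe(action):
    """Translate an action index into a human-readable label."""
    return ACTIONS[action]


print(describe(2))
```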

Reward

The reward is simply the feedback the agent gets from interacting with the environment. This can be positive or negative and is a very important factor in how the agent learns. In the Lunar Lander example the agent receives a positive reward for moving closer to the landing zone and a negative reward for moving further away from it. The agent also receives a small negative reward every time it fires an engine. This is done in an attempt to teach the agent to land the rocket as quickly and efficiently as possible. If we were to simply give it a reward for landing the rocket the agent would be able to do it, but it might take much longer than it should and use excess fuel, because there would be no downside to doing so.

Assigning the correct reward function is very important and is a key factor in how the agent performs. Deciding the right reward for a task isn’t always easy or straightforward and can have huge effects on the agent. Remember, we want to tell the agent what to do, not how to do it. In this example we are telling the agent to land the rocket at a given location as efficiently as possible. How the agent does this is completely up to the agent.
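
To make the effect of the small penalty concrete, here is a toy calculation. All of the numbers (the landing bonus, the fuel cost and the number of engine firings) are hypothetical values chosen for illustration and are not the real Lunar Lander reward values.

```python
LANDING_BONUS = 100.0  # hypothetical reward for a successful landing
FUEL_COST = 0.25       # hypothetical penalty per engine firing


def episode_return(engine_firings, fuel_cost):
    """Total reward for an episode that ends in a successful landing."""
    return LANDING_BONUS - fuel_cost * engine_firings


# Two landings: a direct descent (40 firings) and a long hover (300 firings).
for name, firings in [("direct", 40), ("hover", 300)]:
    print(name,
          episode_return(firings, fuel_cost=0.0),        # landing bonus only
          episode_return(firings, fuel_cost=FUEL_COST))  # with fuel penalty
```

Without the penalty both landings score the same, so the agent has no reason to prefer the direct descent. With the penalty, the efficient landing is clearly better.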

Cross Entropy Method

Now that we have gone over the basics of what RL is, let’s dive into the method that our agent will use to learn. Our agent is going to play through hundreds of episodes of the environment and record the action taken and the state the agent was in for every step of the episode. The total reward received for that episode is also recorded. We will generate a batch of these episodes, ~100 episodes per batch. Once we have gathered the data from our batch of episodes we pick the episodes that performed the best in that batch. This is similar to the way evolutionary algorithms work, as it enforces the “survival of the fittest” methodology.

We then take these elite episodes and run them through our neural network. The states are used as the input and the actions taken are the target output. By doing this, our network learns what actions to take given a certain state.

There are many different types of RL models that have different methods of learning. For the moment all you need to know is that CEM is described by the following three labels: model-free, policy-based and on-policy. Below is a quick explanation of each.

Model Free vs Model Based: Model-based methods try to build a model of the environment and predict what will happen next in order to make the best decisions. Model-free methods directly connect observations to actions to make optimal decisions [1].

Policy Based vs Value Based: Policy-based agents build a policy over time that determines the probability of taking a certain action in a given state. Value-based agents calculate the value of taking every action and use that value to choose the best action [1].

Off Policy vs On Policy: This determines how our agent learns from experience. Off-policy methods learn from old data previously gathered by the agent with a different policy than the current one, whereas on-policy methods learn from new data as it comes in using the current policy [2].

Why Use CEM?

If you have done any research on RL you will quickly see that CEM doesn’t really get much love and is overshadowed by more popular methods like Deep Q-Networks (DQN) and Asynchronous Advantage Actor-Critic (A3C). So why are we not using one of those? CEM is a great method for beginners as it is simple and intuitive and can be written in ~100 lines of code. Not only is it easy to follow, but it performs quite well and will provide a great baseline to compare against in future projects.

CEM Building Blocks

To implement the deep cross entropy method we need to follow 4 steps

1. Generate Sessions:

Play through several episodes of the game environment with our current agent and save the actions, states and rewards used for each episode

2. Retrieve Elite Sessions:

We want to only learn off episodes that achieved a high score in that batch. We determine our elite threshold by taking some percentile of all the episode rewards for that batch of generated sessions.

3. Train On Elite Sessions

Now that we have the top X% of our batch, we train our model on those experiences. We use the recorded states as input and the actions carried out as the target output.

4. Rinse and Repeat

We keep repeating this process until our model converges on a successful policy.
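
The four steps can be sketched end to end on a toy problem before any neural network is involved. The example below is an illustration under stated assumptions rather than the code used later in this article: the corridor environment, the play_episode and cem functions, and the choice of keeping the top 20% of episodes by sorting are all made up for the sketch, and the policy is a simple table of probabilities instead of a network.

```python
import random

GOAL, MAX_STEPS = 4, 20
LEFT, RIGHT = 0, 1


def play_episode(policy, rng):
    """Walk a corridor from cell 0 to cell 4; every step costs 1 reward."""
    state, states, actions, total = 0, [], [], 0
    for _ in range(MAX_STEPS):
        action = rng.choices([LEFT, RIGHT], weights=policy[state])[0]
        states.append(state)
        actions.append(action)
        state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
        total -= 1
        if state == GOAL:
            break
    return states, actions, total


def cem(iterations=10, batch_size=100, elite_frac=0.2, seed=0):
    rng = random.Random(seed)
    # Tabular policy: one [P(left), P(right)] row per non-goal cell.
    policy = [[0.5, 0.5] for _ in range(GOAL)]
    mean_rewards = []
    for _ in range(iterations):
        # 1. Generate sessions with the current policy.
        batch = [play_episode(policy, rng) for _ in range(batch_size)]
        mean_rewards.append(sum(ep[2] for ep in batch) / batch_size)
        # 2. Retrieve the elite sessions (highest total reward).
        batch.sort(key=lambda ep: ep[2], reverse=True)
        elite = batch[: int(batch_size * elite_frac)]
        # 3. "Train" on the elite sessions: each row becomes the smoothed
        #    frequency of the actions the elite episodes took in that cell.
        counts = [[1, 1] for _ in range(GOAL)]
        for states, actions, _ in elite:
            for s, a in zip(states, actions):
                counts[s][a] += 1
        policy = [[c / sum(row) for c in row] for row in counts]
        # 4. Rinse and repeat.
    return policy, mean_rewards


policy, mean_rewards = cem()
print([round(row[RIGHT], 2) for row in policy])
print(mean_rewards[0], mean_rewards[-1])
```

The probability of moving right should climb in every cell and the mean reward per batch should rise towards the best possible value of -4.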

Now that we have our blueprint of how to build the agent the last thing we need to cover is applying deep learning to the CEM.

Deep Learning

Traditional CEM uses a matrix or table to hold the policy. This matrix contains all of the states in the environment and keeps track of the probability of taking each possible action while in each state. As you can imagine, this method is only suitable for environments with small, finite state spaces. In order to build agents that can learn to beat larger, more complex environments we can’t use a matrix to store our policy.

This is where deep learning comes into play. Instead of a matrix, we are going to use a neural network that learns to approximate what action to take depending on the state that was given as input to the network. If you are unfamiliar with neural networks, check out my previous article on building a neural network from scratch here. For this project we won’t be building the entire network from scratch. Instead we will be using the popular deep learning library PyTorch. The full code for our network can be seen here:

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F  # provides F.relu used in forward()
import torch.optim as optim


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(obs_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, n_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

PyTorch has a standard convention for building its networks. First we make a new class called Net which inherits from the nn.Module class. Here we initialise the two fully connected layers that make up the core of our neural network. The first fully connected layer (fc1) takes in a tensor the same size as our state and outputs a tensor the size of our hidden layer (in this case 200 nodes). The second fully connected layer (fc2) is our output layer. It takes in the output of the hidden layer and outputs a tensor the size of our action space (in this case 4), one number for each possible action our agent can take.

    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(obs_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, n_actions)
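
As a rough sanity check on the layer sizes described above, the number of trainable parameters can be worked out by hand. This sketch assumes an observation size of 8 (the length of the Lunar Lander state vector, which is not stated in this article) and that each fully connected layer has a bias term, which is the default for nn.Linear. The linear_params helper is a hypothetical name.

```python
def linear_params(n_in, n_out):
    # A fully connected layer has one weight per input-output pair
    # plus one bias per output.
    return n_in * n_out + n_out


obs_size, hidden_size, n_actions = 8, 200, 4  # obs_size=8 is an assumption

fc1_params = linear_params(obs_size, hidden_size)   # 8*200 + 200
fc2_params = linear_params(hidden_size, n_actions)  # 200*4 + 4
print(fc1_params, fc2_params, fc1_params + fc2_params)  # 1800 804 2604
```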

Next we need to write our forward method. Every nn.Module subclass needs one, and it is called automatically when we pass data into our network object. You can see below that we take in a tensor x, which is the game state the agent is observing. We pass that state through the first layer of our neural network and apply a ReLU activation function to the output of fc1. Next we take that output and pass it through our second layer. This value is then returned as the output of the whole network.

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

Usually we would add an activation layer after the final output layer, such as the softmax function. The softmax function takes the numbers that were output for each possible action and normalises them so that they are all positive and add up to 1. This lets us read them as the probability of taking each action.

Example of what the state of the game looks like as it is being passed through our network. This goes from the input to the hidden layer and finally through our activation function to give us our probability distribution.
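
The normalisation that softmax performs can be written out in a few lines of plain Python. This is a sketch for intuition only. The project itself relies on PyTorch’s built-in softmax, and the scores below are made-up network outputs.

```python
import math


def softmax(scores):
    # Subtract the maximum first so exp() cannot overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1, -1.0])  # hypothetical outputs for 4 actions
print([round(p, 3) for p in probs], sum(probs))
```

The largest score gets the largest probability, every probability is positive, and together they sum to 1.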

We would then calculate the cross entropy loss to find out how far off our predictions were. Instead of doing these two steps separately, we use the PyTorch class nn.CrossEntropyLoss later on in the project. This class carries out both the softmax activation and the cross entropy loss in one step, which is more numerically stable [1]. The only thing you need to remember is that we still need to apply the softmax activation function ourselves when we want to see the probability of taking an action given our current state.
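
To see why combining the two steps helps, here is a minimal plain-Python sketch of cross entropy computed directly from the raw scores using the log-sum-exp trick. It illustrates the idea and is not PyTorch’s actual implementation. The function names and the example scores are assumptions made for this sketch.

```python
import math


def log_sum_exp(scores):
    # log(sum(exp(s))) computed stably by factoring out the maximum.
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def cross_entropy(scores, target):
    # Equals -log(softmax(scores)[target]) without ever forming the softmax.
    return log_sum_exp(scores) - scores[target]


scores = [2.0, 1.0, 0.1, -1.0]        # hypothetical raw network outputs
loss_good = cross_entropy(scores, 0)  # target is the action the net favours
loss_bad = cross_entropy(scores, 3)   # target is the action the net rates lowest
print(round(loss_good, 3), round(loss_bad, 3))
```

The loss is small when the target action already has a high score and large when it does not, which is exactly the signal used to pull the network towards the elite actions.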

Generate Sessions

This is where we generate our batch of episodes. The agent will play through N episodes and gather the actions/states for each step so we can train our agent. Here is the full code for the method

def generate_batch(env, batch_size, t_max=1000):
    # Uses the module-level network `net` created in the training code.
    # Written against the classic Gym API: reset() returns the observation
    # and step() returns four values.
    activation = nn.Softmax(dim=1)
    batch_actions, batch_states, batch_rewards = [], [], []

    for b in range(batch_size):
        states, actions = [], []
        total_reward = 0
        s = env.reset()
        for t in range(t_max):
            # wrap the state in a batch of one and get action probabilities
            s_v = torch.FloatTensor([s])
            act_probs_v = activation(net(s_v))
            act_probs = act_probs_v.data.numpy()[0]
            # sample an action according to those probabilities
            a = np.random.choice(len(act_probs), p=act_probs)

            new_s, r, done, info = env.step(a)

            # record the step
            states.append(s)
            actions.append(a)
            total_reward += r

            s = new_s
            if done:
                batch_actions.append(actions)
                batch_states.append(states)
                batch_rewards.append(total_reward)
                break

    return batch_states, batch_actions, batch_rewards

First off we make our activation function, as we described earlier. Next we need three lists to store our episode data. The first two are batch_actions and batch_states. These are actually lists of lists: each index stores all of the actions/states for a particular episode. Then batch_rewards stores the total reward achieved during each episode.

    activation = nn.Softmax(dim=1)
    batch_actions, batch_states, batch_rewards = [], [], []

Next we iterate through our batch size, running an episode for each iteration. In our first loop we initialise two empty lists to store our actions/states for this episode. We also create a variable to count the total reward of the episode. These are our data variables. Finally we initialise our state variable s by calling env.reset(), which starts a new game.

    for b in range(batch_size):
        states, actions = [], []
        total_reward = 0
        s = env.reset()

Now we start a second loop that carries out a single step in the game environment per iteration, up until we reach our time limit for that episode. First we need to get our current state and pass it through our network. To do this we need to turn our state s into a torch float tensor so we can feed it into the network. Next we get the action probabilities from our network. Remember we have to apply our activation function to the prediction in order for the probabilities of the actions to all add up to 1 and be usable. Once we have retrieved our probability distribution we can decide what action to take. This is done by using numpy’s random.choice function. It will choose a “random” action based on the probabilities given. So if our policy says that action 1 has a probability of 0.7 and the three other actions each have a probability of 0.1, it is far more likely our action will be action 1.

        for t in range(t_max):
            s_v = torch.FloatTensor([s])
            act_probs_v = activation(net(s_v))
            act_probs = act_probs_v.data.numpy()[0]
            a = np.random.choice(len(act_probs), p=act_probs)
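
The weighted sampling step can be tried out on its own. The sketch below uses random.choices from the Python standard library as a stand-in for np.random.choice so that it runs without numpy. The probabilities are the example values from the text and the seed is arbitrary.

```python
import random
from collections import Counter

rng = random.Random(0)
act_probs = [0.1, 0.7, 0.1, 0.1]  # the example distribution from the text

# Draw 10,000 actions, each chosen with the probability given above.
samples = rng.choices(range(len(act_probs)), weights=act_probs, k=10_000)
freq = Counter(samples)
print({a: freq[a] / len(samples) for a in sorted(freq)})
```

Action 1 should come up roughly 70% of the time, but the other actions are still tried occasionally, which is what lets the agent keep exploring.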

Once we have decided which action to take, the action is carried out in the environment. This returns the new state, the reward received for taking that action, whether or not the episode is finished and any additional information the environment might provide. Now that we have the information from our updated environment we need to add the state, action and reward to our data variables. Finally we update our current state.

            new_s, r, done, info = env.step(a)

            states.append(s)
            actions.append(a)
            total_reward += r

            s = new_s

The last thing we need to do before moving on to the next step is check whether the episode has finished. If done is True we simply add our actions, states and total reward to their corresponding batch lists and then break out of the loop.

            if done:
                batch_actions.append(actions)
                batch_states.append(states)
                batch_rewards.append(total_reward)
                break

Once that is done just return our batch data

    return batch_states, batch_actions, batch_rewards
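
One practical caveat: the function above is written against the classic Gym interface, where reset returns only the observation and step returns four values. From Gym 0.26 onwards (and in its successor Gymnasium), reset returns an (observation, info) pair and step returns five values, with done split into terminated and truncated. The two helpers below are hypothetical adapters, not part of either library, sketching one way to accept both conventions. They are exercised here with fake return values rather than a real environment.

```python
def unpack_reset(result):
    """Return the observation from either reset() convention."""
    # Newer API: (observation, info_dict). Classic API: observation only.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def unpack_step(result):
    """Return (obs, reward, done, info) from either step() convention."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    return result


# Fake return values standing in for a real environment.
print(unpack_reset(([0.0] * 8, {})))
print(unpack_step(([0.0] * 8, -0.3, False, True, {})))
```

With adapters like these, the two calls inside the loop would become s = unpack_reset(env.reset()) and new_s, r, done, info = unpack_step(env.step(a)).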

Filter Elite Sessions

This method is used to select only the best episodes from the latest batch. First we find the reward threshold. In our case this is the 80th percentile, which keeps roughly the top 20% of episodes, but feel free to play around with that number. Then we just take the episode data from episodes with a reward ≥ our reward threshold. To do this we use the handy numpy percentile method. Just give it our list of rewards for our batch of episodes and our chosen percentile and it will do all of the terrifying math for us!

def filter_batch(states_batch, actions_batch, rewards_batch, percentile=50):
    # reward an episode must reach to count as elite
    reward_threshold = np.percentile(rewards_batch, percentile)

    elite_states = []
    elite_actions = []

    for i in range(len(rewards_batch)):
        # >= matches the rule described above (reward ≥ threshold)
        if rewards_batch[i] >= reward_threshold:
            # flatten the elite episode into individual (state, action) pairs
            for j in range(len(states_batch[i])):
                elite_states.append(states_batch[i][j])
                elite_actions.append(actions_batch[i][j])

    return elite_states, elite_actions
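
Here is a small worked example of the threshold calculation. The percentile function below is a plain-Python sketch of the linear-interpolation rule that np.percentile uses by default, written out so the arithmetic is visible. The five episode rewards are made up.

```python
def percentile(values, q):
    """Linear-interpolation percentile (the default rule in np.percentile)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100  # fractional index into the sorted list
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


rewards = [-200, -150, -50, 10, 120]  # hypothetical total rewards of 5 episodes
threshold = percentile(rewards, 80)
elite = [i for i, r in enumerate(rewards) if r >= threshold]
print(threshold, elite)
```

With five episodes the 80th percentile falls a fifth of the way between the two best rewards (about 32), so only the episode that scored 120 is kept as elite.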

Training

So now the core of our Deep CEM is complete. Now we just have to utilise our code so far and train the agent.

batch_size = 100
session_size = 100
percentile = 80
hidden_size = 200
learning_rate = 0.0025
completion_score = 200

env = gym.make("LunarLander-v2")
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n

# neural network
net = Net(n_states, hidden_size, n_actions)
# loss function
objective = nn.CrossEntropyLoss()
# optimisation function
optimizer = optim.Adam(params=net.parameters(), lr=learning_rate)

for i in range(session_size):
    # generate new sessions
    batch_states, batch_actions, batch_rewards = generate_batch(
        env, batch_size, t_max=5000)

    elite_states, elite_actions = filter_batch(
        batch_states, batch_actions, batch_rewards, percentile)

    optimizer.zero_grad()

    tensor_states = torch.FloatTensor(elite_states)
    tensor_actions = torch.LongTensor(elite_actions)

    action_scores_v = net(tensor_states)
    loss_v = objective(action_scores_v, tensor_actions)
    loss_v.backward()
    optimizer.step()

    # show results
    mean_reward = np.mean(batch_rewards)
    threshold = np.percentile(batch_rewards, percentile)
    print("%d: loss=%.3f, reward_mean=%.1f, reward_threshold=%.1f"
          % (i, loss_v.item(), mean_reward, threshold))

    # check if the environment has been solved
    if np.mean(batch_rewards) > completion_score:
        print("Environment has been successfully completed!")
        break  # stop training once the completion score is reached

This may look like a long block of code but it isn’t actually that scary. First things first, we need to initialise our parameters.

batch_size = 100
session_size = 500
percentile = 80
hidden_size = 200
learning_rate = 0.01
completion_score = 200

batch_size: how many episodes to run in each batch

session_size: how many training epochs to run, where each epoch runs one batch

percentile: used to determine our elite reward threshold

hidden_size: how many nodes are in the hidden layer of our network

learning_rate: denotes how much we update our network by during each training step (need to find a good middle ground for this one)

completion_score: the average reward over 100 episodes needed for the environment to be considered solved

All of these can be played around with. This next part simply initialises our learning environment

env = gym.make("LunarLander-v2")
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n

Next we need to set up our PyTorch neural network. This involves three things:

1. Initialise the network we made previously
2. Choose a loss function
3. Choose an optimiser

As you can see, we use CrossEntropyLoss and the Adam optimisation function.

# neural network
net = Net(n_states, hidden_size, n_actions)
# loss function
objective = nn.CrossEntropyLoss()
# optimisation function
optimizer = optim.Adam(params=net.parameters(), lr=learning_rate)

Now we get to our training loop. We run a loop for the number of sessions given. During each epoch(iteration) we run our generate_batch method to get our batch of episode data.

for i in range(session_size):
    # generate new sessions
    batch_states, batch_actions, batch_rewards = generate_batch(
        env, batch_size, t_max=5000)

Once that’s done we filter out the bad episodes and keep the elite ones by calling our filter_batch method.

    elite_states, elite_actions = filter_batch(
        batch_states, batch_actions, batch_rewards, percentile)

Once we have the elite episodes that we want to train on we go through the process of passing data through our neural network.

    optimizer.zero_grad()

    tensor_states = torch.FloatTensor(elite_states)
    tensor_actions = torch.LongTensor(elite_actions)

    action_scores_v = net(tensor_states)
    loss_v = objective(action_scores_v, tensor_actions)
    loss_v.backward()
    optimizer.step()

Before each training step we need to set the gradients tracked by our optimiser back to zero. For now all you need to know is that we’re resetting the optimiser so that gradients from the previous step don’t carry over. Next we turn our elite_states and elite_actions lists into torch tensors so they can be used with our network.

We then pass all of the elite episode states into our network. It goes through every state collected and predicts what the policy distribution should look like. Next we compare these predictions to the actions that were carried out in our elite episodes. Ideally we want our network’s predictions to be close to these.

To find out how far off our network was (the loss) we use the objective function (CrossEntropyLoss). Once we have calculated the loss we use the backward method to calculate the gradients of our loss (backpropagation). Finally our optimizer updates our network by calling the step method.
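
Written out, the loss being minimised on each training step looks roughly like this. The notation is introduced here for illustration and assumes the default mean reduction of CrossEntropyLoss: N is the number of elite (state, action) pairs, z_θ(s) is the vector of raw scores the network outputs for state s, and π_θ is the policy obtained by applying softmax to those scores.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i),
\qquad
\pi_\theta(a \mid s) = \frac{\exp\big(z_\theta(s)_a\big)}{\sum_{a'} \exp\big(z_\theta(s)_{a'}\big)}
```

Minimising this pushes up the probability the network assigns to the actions that were actually taken in the elite episodes.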

The last thing to do is show the results and check if we achieved an average score that is higher than the completion score

    # show results
    mean_reward = np.mean(batch_rewards)
    threshold = np.percentile(batch_rewards, percentile)
    print("%d: loss=%.3f, reward_mean=%.1f, reward_threshold=%.1f"
          % (i, loss_v.item(), mean_reward, threshold))

    # check if the environment has been solved
    if np.mean(batch_rewards) > completion_score:
        print("Environment has been successfully completed!")
        break  # stop training once the completion score is reached

Results