Creating Deep Neural Networks from Scratch: An Introduction to Reinforcement Learning

Part I: The Gym Environment and DNN Architecture

Reinforcement learning for pets! [Image credit: Stephanie Gibeault]

This post is the first of a three-part series that gives a detailed walk-through of a solution to the CartPole-v1 problem on OpenAI gym, using only numpy from the Python libraries. The solution is far from optimal (you can find those on the gym website); the focus here is on building it from first principles.

Prerequisites for running the code in this article are Python (3.x) with the gym and numpy modules installed.

When I first started looking at reinforcement learning in the OpenAI gym, I was unable to find any good resources on how to begin building the solution myself. There are very powerful libraries (like TensorFlow and PyTorch) that allow you to build incredibly complex neural networks and solve the CartPole problem easily, but I wanted to create the neural networks from scratch, as I believe there is value in understanding the core building blocks of modern machine learning techniques. I’m writing what I wish I had been able to find when I was trying to work on this. So let’s get started.

First, what is OpenAI gym? There is a good, short intro on their website, “Gym is a toolkit for developing and comparing reinforcement learning algorithms.” “The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.” What this means is that the engineering around building and rendering models that simulate real world scenarios is already done for us, so we can just focus on teaching an agent to play the game well.

The description above also mentions reinforcement learning. What is that? Here’s a Wikipedia summary: “Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward.”

To give you an analogy, think about how a dog is trained — favorable actions are positively reinforced (in the form of a treat) and unfavorable actions are discouraged. In a way, even we humans are complex reinforcement learning agents, trying to maximize the chance of achieving our goals by selecting actions that we think will ‘benefit’ us (in the form of greater reward) in the future. Here’s a figure that illustrates the cycle in reinforcement learning:

The reinforcement learning cycle [Image credit: Mohit Mayank]

The above figure shows an agent (the program that we will build) taking as inputs the state of the environment and reward from the previous action, selecting a subsequent action and feeding that back to the environment, before observing the environment once again.

Great, now that we have an understanding of the basic concepts in reinforcement learning, let’s go back to the problem we are trying to solve — CartPole. To begin with, take a look at the documentation for the CartPole problem specifically. The documentation gives a good overview of what we are trying to achieve. In a nutshell, we are in control of the base of a cart with a pole balanced vertically on top. Our goal is to prevent the pole from falling over for as long as possible. If the pole tips (in terms of its angle) past a certain point, the environment is reset. Below is a random agent working on the CartPole problem.

Random agent on Cartpole

As you can see, it’s not very good! But that is expected, since this agent disregards the current state of the environment and selects a random action at every time step. Let’s see if we can do better.

Implementation

Time for some code! We’ll start by importing the libraries that we will be using. We will need gym for the OpenAI environments as discussed above, and numpy for some math and matrix manipulations.

import gym
import numpy as np

Next, we need to import the environment that gym provides for the cartpole problem. Here’s how this is done:

env = gym.make('CartPole-v1')

We can also observe some of the features of this particular environment space by printing them:

print(env.action_space)       # Discrete(2)
print(env.observation_space)  # Box(4,)

There is much more information about environments and their workings in the docs, but the values above capture the basic elements that define this environment — the actions that can be performed, and the observations at every time step.

The action space is discrete and contains 2 values: 0 and 1. These correspond to the two actions that the agent is able to perform, i.e. push the cart towards the left or towards the right.

The observation space, on the other hand, is continuous and has four components (don’t be thrown off by the data structure Box(4,); for our purposes it just means an array containing four values). What do the four values mean? They are numbers that represent the state of the environment at that time — namely the cart’s position, the cart’s velocity, the pole’s angle, and the pole’s rotation rate. [https://github.com/openai/gym/issues/238#issuecomment-231129955]
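
To see what an observation actually looks like, you can reset the environment and print the array it returns. The numbers below are only illustrative (the starting state is randomized, so your values will differ):

print(env.reset())
# e.g. [ 0.0307  0.0014 -0.0308 -0.0313]
# [cart position, cart velocity, pole angle, pole rotation rate]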

A fundamental thing to understand here is that the meaning of these numbers is explained only for completeness. Our goal is not to interpret the values of either the action space or the observation space ourselves, but to let the agent learn the meaning of these values in context. Let us get back to our program and add code to get a basic loop running.

# Global variables
NUM_EPISODES = 10
MAX_TIMESTEPS = 1000

# The main program loop
for i_episode in range(NUM_EPISODES):
    observation = env.reset()
    # Iterating through time steps within an episode
    for t in range(MAX_TIMESTEPS):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            # If the pole has tipped over, end this episode
            break

The code above declares a main program loop over episodes and iterates through the time steps within each episode. In the inner loop, the program takes an action, observes the result and then checks whether the episode has concluded (either the pole has fallen over or the cart has gone off the edge). If it has, the inner loop ends and the environment is reset at the start of the next episode.

The line that selects the action randomly samples it from the available actions; in fact, this behaves exactly like the random agent shown earlier. Let’s change that by defining a custom method for selecting an action given the observation.

Now, we will define an agent that is (hopefully!) going to learn to be smarter in picking its actions given a state. We will model this agent in a class:

class RLAgent:
    # class encapsulating the reinforcement learning agent
    env = None

    def __init__(self, env):
        self.env = env

    def select_action(self, observation):
        # For now, ignore the observation and pick a random action
        return self.env.action_space.sample()

We also need to add an instance of RLAgent to our global variables and change the action selection to call this method on the instantiated agent. Note that at present select_action does the same thing as before, but we will change that later.
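
Concretely, the two changes look like this (both also appear in the full program listing later in the post):

model = RLAgent(env)  # added alongside the other global variables
# ... and inside the episode loop:
action = model.select_action(observation)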

Neural Networks

We will now create the elements of our neural net. A quick primer on neural networks: “An ANN (Artificial Neural Network) is a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it.”

This is what ours will look like,

The neural network

The above picture captures a neural network that has one input layer, two ‘hidden’ layers (layers 2 & 3) and an output layer. The observation from the environment provides the inputs to the input layer; these are ‘fed forward’ through the subsequent layers until they reach the output layer, whose values are used to select the action.

For instance, every node in layer 3 is a linear combination (weighted sum) of the nodes in layer 2, passed through an activation function. The weights used to calculate layer 3 are initialized randomly in a matrix and gradually tuned, through a process called stochastic gradient descent, to better predict the outputs. The activation function is a simple non-linear function that allows the network to learn non-linear rules in the underlying observation space.
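
As a toy illustration of that weighted-sum-plus-activation step (the sizes and numbers here are made up and are not part of the agent we are building), a single layer in numpy looks roughly like this:

import numpy as np

x = np.array([[0.5, -0.2, 0.1]])               # a (1, 3) input row vector
W = np.random.uniform(-0.5, 0.5, size=(3, 2))  # randomly initialized weights
z = np.dot(x, W)                               # weighted sums, shape (1, 2)
out = np.multiply(z, (z > 0))                  # ReLU activation: negative values become 0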

In our CartPole problem, there are 5 inputs (the 4 elements of the observation space plus a bias term) and 2 outputs (the two directions in which we can push the cart).

The neural net layers are going to be encapsulated in an NNLayer class,

class NNLayer:
    # class representing a neural net layer
    def __init__(self, input_size, output_size, activation=None, lr=0.001):
        self.input_size = input_size
        self.output_size = output_size
        self.weights = np.random.uniform(low=-0.5, high=0.5, size=(input_size, output_size))
        self.activation_function = activation
        self.lr = lr

This class captures three major things:

1. The dimensions (input and output sizes) of the layer.
2. The weights that connect the layer’s inputs to its outputs.
3. The activation function applied to the output (the default is None, i.e. a linear activation).
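
As a quick sanity check (assuming the relu activation function, which is defined later in this post), the first layer of our network could be constructed like this, giving a weight matrix that connects the 5 inputs to 24 hidden units:

layer = NNLayer(5, 24, activation=relu)
print(layer.weights.shape)  # (5, 24)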

We will now add the usage of this class to our RLAgent. First, we’ll edit the select action function,

def select_action(self, observation):
    values = self.forward(observation)
    if (np.random.random() > self.epsilon):
        return np.argmax(values)
    else:
        return np.random.randint(self.env.action_space.n)

Instead of randomly selecting an action every time, this function passes the state of the environment to our neural network and calculates the ‘expected reward’ for each action (the forward pass takes a (1,nᵢₙ) array and returns a (1,nₒᵤₜ) array of action values). It then selects the action with the greatest expected reward. Note that this function still selects a random action with probability epsilon. Epsilon, also known as the ‘exploration rate’, implements an important concept in reinforcement learning: the tradeoff between exploration and exploitation. Exploration helps the model avoid getting stuck in a local minimum by trying apparently sub-optimal actions from time to time, which may reveal greater rewards further down the road. Exploitation, on the other hand, allows the agent to use its knowledge of the current state to select the most profitable action. In most RL agents, epsilon starts out high (near 1.0) and is gradually reduced towards 0 over time as the agent becomes more confident in the learnt values of actions in a given state.
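
To get a feel for how quickly exploration fades with the multiplicative schedule we will use in the main loop below (multiplying epsilon by 0.995 every time step, with a floor of 0.01), here is a small back-of-the-envelope sketch:

epsilon = 1.0
steps = 0
while epsilon >= 0.1:
    epsilon *= 0.995
    steps += 1
print(steps)  # roughly 460 steps for epsilon to drop below 0.1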

The select_action function also calls self.forward (RLAgent.forward), and here’s the code for that function,

def forward(self, observation, remember_for_backprop=True):
    # treat the observation as a (1, n) row vector so each layer can append a bias column
    vals = np.copy(observation).reshape(1, -1)
    index = 0
    for layer in self.layers:
        vals = layer.forward(vals, remember_for_backprop)
        index = index + 1
    return vals

The RLAgent.forward function above is a simple loop. It passes the input (the observation for which we are trying to decide on a course of action) through the network and obtains a value for each action. Internally it calls the NNLayer.forward function, collecting the output of each layer and passing it on to the next. To complete the implementation of select_action, here is the last piece: the NNLayer.forward function. The remember_for_backprop parameter is a boolean that specifies whether certain intermediate values need to be stored to avoid recomputing them during the weight updates (this will be explained in more detail when we cover backpropagation in the next post).

# Compute the forward pass for this layer
def forward(self, inputs, remember_for_backprop=True):
    input_with_bias = np.append(np.ones((len(inputs), 1)), inputs, axis=1)
    unactivated = np.dot(input_with_bias, self.weights)
    output = unactivated
    if self.activation_function is not None:
        output = self.activation_function(output)
    if remember_for_backprop:
        # store variables for backward pass
        self.backward_store_in = input_with_bias
        self.backward_store_out = np.copy(unactivated)
    return output

This function:

1. Appends a bias term to the input.
2. Calculates the product of the input (with the bias) and the weight matrix at this layer.
3. Takes the output of step 2 and sends it through an activation function, if one has been defined for this layer (which in our case will be ReLU).

The shapes involved are sketched below.
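
Here is a hypothetical walk-through of those steps for the first layer, with a single CartPole observation (4 values) and 24 hidden units, using the same operations as the code above:

inputs = np.zeros((1, 4))                                                # one observation as a row vector
input_with_bias = np.append(np.ones((len(inputs), 1)), inputs, axis=1)  # shape (1, 5)
weights = np.random.uniform(-0.5, 0.5, size=(5, 24))                    # this layer's weight matrix
unactivated = np.dot(input_with_bias, weights)                          # shape (1, 24)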

Let’s also add the instantiation for these layers in the RLAgent’s init function,

def __init__(self, env):
    self.env = env
    self.hidden_size = 24
    self.input_size = env.observation_space.shape[0]
    self.output_size = env.action_space.n
    self.num_hidden_layers = 2
    self.epsilon = 1.0
    self.layers = [NNLayer(self.input_size + 1, self.hidden_size, activation=relu)]
    for i in range(self.num_hidden_layers - 1):
        self.layers.append(NNLayer(self.hidden_size + 1, self.hidden_size, activation=relu))
    self.layers.append(NNLayer(self.hidden_size + 1, self.output_size))
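
With these sizes, the three weight matrices come out as follows (assuming the agent is instantiated as model = RLAgent(env), as it will be in the main loop below):

model = RLAgent(env)
for layer in model.layers:
    print(layer.weights.shape)
# (5, 24)   4 observation values + bias -> 24 hidden units
# (25, 24)  24 hidden units + bias -> 24 hidden units
# (25, 2)   24 hidden units + bias -> 2 action values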

You can see above that we have a total of 2 hidden layers and 1 output layer. Also, in all but the output layer we use an activation function called the Rectified Linear Unit (ReLU). This is an extremely simple function that introduces sufficient non-linearity into our neural network. Here is the implementation for it,

def relu(mat):
    return np.multiply(mat, (mat > 0))

This function takes in a matrix and returns another matrix that has identical values wherever the original matrix was greater than 0, and 0 everywhere else. For example, applied to [[-1.0, 2.0]] it returns [[0.0, 2.0]]. Finally, let’s add the initialization of our agent and the epsilon decay to our main program loop. This is what the new main program looks like:

# Global variables
NUM_EPISODES = 10
MAX_TIMESTEPS = 1000
model = RLAgent(env)

# The main program loop
for i_episode in range(NUM_EPISODES):
    observation = env.reset()
    # Iterating through time steps within an episode
    for t in range(MAX_TIMESTEPS):
        env.render()
        action = model.select_action(observation)
        observation, reward, done, info = env.step(action)
        # epsilon decay
        model.epsilon = model.epsilon if model.epsilon < 0.01 else model.epsilon * 0.995
        if done:
            # If the pole has tipped over, end this episode
            print('Episode {} ended after {} timesteps, current exploration is {}'.format(i_episode, t + 1, model.epsilon))
            break

What does this model do currently? It initializes the weights of the neural network randomly and calculates values for actions in any given state based on these weights. However, we need a way for the agent to improve the values of these weights so that it can take the best action in any given state. As alluded to earlier, this is achieved through stochastic gradient descent and implemented via a technique called backpropagation. I will go into detail about backpropagation, along with the relevant reinforcement learning theory, in the next post.

To summarize, this is what we have accomplished so far:

1. Wrote the main program components for interacting with the CartPole environment in OpenAI gym.
2. Encapsulated the reinforcement learning agent and its component neural net layers in their respective classes.
3. Coded and initialized the neural network architecture for our deep reinforcement learning agent.
4. Implemented the ‘feed forward’ computation that propagates an observation of the environment through the neural network to calculate action values.

In the next post, we will aim to achieve the following:

1. Examine and formalize a notion of the ‘cumulative reward’ that an agent expects to receive in a particular state for the CartPole problem.
2. Understand how the agent should update the weights in its neural network to move its estimates closer to the correct cumulative rewards expected from taking a particular action in a particular state.
3. Implement backpropagation, the algorithm that achieves the goal described in point 2.

See you next time!