Creating Deep Neural Networks from Scratch, an Introduction to Reinforcement Learning

Part III: Reflections and Enhancements


This is the third and final post in a series designed to give a complete walkthrough to a solution for the cartpole problem on OpenAI gym, built from scratch without using standard machine learning frameworks like Pytorch or Tensorflow. The full code can be found here.

Part I laid the foundations: we discussed the neural net architecture and implemented forward propagation to calculate values for the agent’s actions. Part II delved into reinforcement learning theory, formalizing the notions of Q-values and DQNs, and also implemented backpropagation. Part III contains visualizations of and reflections on the agent’s performance under a few different configurations. This final part also completes the implementation and adds enhancements like the Adam optimizer. Here we focus less on the rigor behind hyperparameter choices and more on exploring configurations that can be tweaked to improve the model.

At the end of the last section, we had finished the implementation of our Cartpole agent. Time to see the results and the agent’s performance over time!

Tracking the Agent’s Improvement

Let’s take a full training run and follow the agent from random initialization to the end, when it has learned the art of balancing the pole. This training run took 141 episodes to achieve its goal (an average score of 195 over 100 consecutive episodes).
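As a quick aside, that “solved” criterion can be checked with a rolling average over the most recent scores. Here is a minimal sketch; the is_solved helper is illustrative (not part of the original code), with defaults matching the 195-over-100-episodes threshold above:

```python
import numpy as np

def is_solved(scores, window=100, threshold=195.0):
    # Solved when the mean over the last `window` consecutive
    # episode scores reaches the threshold.
    if len(scores) < window:
        return False
    return float(np.mean(scores[-window:])) >= threshold

print(is_solved([200.0] * 100))  # True
print(is_solved([100.0] * 100))  # False
```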

First, here’s the graph of the scores —

Now let’s see how the agent performs at 3 different points in the course of its training. In the first 5 runs the agent is still quite poor:

First few runs

Midway through its training, we can see the agent has made progress, although there is still room for improvement. Here are episodes 75 & 76:

Midway through training

Finally, towards the end of training the agent is able to balance the pole almost perfectly. Here’s the 138th run,

Final trained agent

We can see that the agent is pretty good by the end!

Target Network

In Part II’s section on Cumulative Reward and Action Values, we talked about how we were using a simplified version of the complete DQN implementation: the same weights calculated both the predicted action values and the target values. Instead, the target action values (experimental_values in the RLAgent.experience_replay method) should be calculated with a network whose weights are held fixed. Let’s go ahead and implement that. First, we add the initialization of the stored_weights parameter in NNLayer.__init__:

def __init__(self, input_size, output_size, activation=None, lr=0.001):
    self.input_size = input_size
    self.output_size = output_size
    self.weights = np.random.uniform(low=-0.5, high=0.5, size=(input_size, output_size))
    # Frozen copy of the weights, used as the target network
    self.stored_weights = np.copy(self.weights)
    self.activation_function = activation
    self.lr = lr

Recall that the calculation of experimental_values (through the next_action_values calculation) passed in the parameter remember_for_backprop=False. We can reuse this parameter to tell the network to use the stored weights rather than the current network weights. Edit the NNLayer.forward function:

# Compute the forward pass for this layer
def forward(self, inputs, remember_for_backprop=True):
    # inputs has shape batch_size x layer_input_size
    input_with_bias = np.append(inputs, 1)
    unactivated = None
    if remember_for_backprop:
        unactivated = np.dot(input_with_bias, self.weights)
    else:
        # Target-value pass: use the frozen weights
        unactivated = np.dot(input_with_bias, self.stored_weights)
    # store variables for backward pass
    output = unactivated
    ...

Finally, after every experience replay, we will update stored_weights to the new network weights. Add the update_stored_weights() call to the last bit of the experience_replay method:

...
for layer in self.layers:
    layer.update_stored_weights()
    layer.lr = layer.lr if layer.lr < 0.0001 else layer.lr * 0.99

Then add the NNLayer.update_stored_weights method:

def update_stored_weights(self):
    self.stored_weights = np.copy(self.weights)

Great, this relatively simple fix means that our target-value calculation no longer depends on our current weights.
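To see the decoupling in isolation, here is a toy sketch (the sizes and the fake gradient step are made up for illustration) reproducing the stored-weights logic: a weight update moves weights, but a pass through stored_weights is unchanged until the next sync.

```python
import numpy as np

# Toy version of the target-network decoupling: `weights` moves with
# training, while `stored_weights` stays frozen until explicitly synced.
rng = np.random.default_rng(0)
weights = rng.uniform(-0.5, 0.5, size=(5, 3))
stored_weights = np.copy(weights)

inputs = np.ones(4)
input_with_bias = np.append(inputs, 1)  # bias term, as in NNLayer.forward

target_before = np.dot(input_with_bias, stored_weights)
weights -= 0.01 * np.ones_like(weights)  # pretend a gradient step happened
target_after = np.dot(input_with_bias, stored_weights)

print(np.allclose(target_before, target_after))  # True: target pass unchanged
```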

Average Episodes to Solution

Now that we have gone through a typical run, let’s see how quickly the agent learns across many different training runs. To do this, we initialize a new agent from scratch for each run and count how many episodes it takes to reach the average-reward threshold.

Here’s the data over 50 runs.

Episodes to solution for each run

Apart from two runs that got stuck in local minima for a long time and took over 2,000 episodes to solve, almost all the other runs took under 200 episodes to converge. The average number of episodes to solve over the 50 runs, including the two anomalous runs, was 240.84.

Varying Batch Size

Average episode to solution over 20 runs

This plot shows how varying the batch size impacts the average episodes to solution (over 20 runs per batch size). I tested 4 different values: 5, 10, 20 and 40. The best performer in terms of average episodes to solve was batch size 20, which averaged about 173 episodes to solution. However, batch size 10 performs only half as many weight updates per replay, and still averaged only 304 episodes to solution, about 15% less than double the batch-size-20 figure. With batch size 40, although the algorithm usually converged extremely quickly (over 50% of solutions were at the lowest possible 100-episode mark), it was highly unstable in some runs and did not converge until well over 3,000 episodes.

Going forward we will use batch size 10 for the rest of these enhancements.
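For reference, the batch size enters the algorithm where we sample from the replay memory. A minimal sketch follows; the memory layout (a list of transition tuples) and the sample_minibatch name are assumptions for illustration, not the post's actual code:

```python
import random

def sample_minibatch(memory, batch_size=10):
    # Uniformly sample transitions without replacement; while the
    # buffer is still smaller than batch_size, take everything we have.
    return random.sample(memory, min(batch_size, len(memory)))

memory = [("state", "action", "reward", "next_state", i) for i in range(50)]
batch = sample_minibatch(memory)
print(len(batch))  # 10
```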

Adam Optimizer

So far, after calculating the gradients, our NNLayer.update_weights function has updated the layer weights using a learning rate that is continuously decreased over time until a minimum threshold is reached.
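As a standalone sketch of that baseline schedule (the decay_lr helper is illustrative; the floor and 1%-per-replay rate mirror the values used in experience_replay earlier):

```python
def decay_lr(lr, floor=0.0001, rate=0.99):
    # Multiplicative decay with a lower bound: once lr drops below the
    # floor it is left untouched, otherwise it shrinks by 1% per replay.
    return lr if lr < floor else lr * rate

lr = 0.001
for _ in range(3):
    lr = decay_lr(lr)
print(lr)  # roughly 0.00097
```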

Our current weight updates apply the same learning rate to every parameter in the weight matrix. We will now use the Adam optimization technique instead and see if it improves the results. Adam keeps track of an individual learning rate for every parameter in the network, using estimates of the first and second moments of the gradient with respect to that parameter. This often leads to faster convergence.

Refer to this post to understand the details of the following code. If you would like to go directly to hyperparameter configurations, feel free to skip the rest of this section on the implementation of Adam.

Let’s begin. We will change the update_weights method in NNLayer as follows:

def update_weights(self, gradient):
    m_temp = np.copy(self.m)
    v_temp = np.copy(self.v)

    # Update biased first and second moment estimates
    m_temp = self.beta_1 * m_temp + (1 - self.beta_1) * gradient
    v_temp = self.beta_2 * v_temp + (1 - self.beta_2) * (gradient * gradient)
    # Bias-corrected moment estimates
    m_vec_hat = m_temp / (1 - np.power(self.beta_1, self.time + 0.1))
    v_vec_hat = v_temp / (1 - np.power(self.beta_2, self.time + 0.1))
    self.weights = self.weights - np.divide(self.lr * m_vec_hat, np.sqrt(v_vec_hat) + self.adam_epsilon)

    self.m = np.copy(m_temp)
    self.v = np.copy(v_temp)

The beta_1, beta_2 and adam_epsilon parameters are constants used in the Adam optimizer and are almost never changed. The matrices m and v and the time parameter are variables updated over the course of training. They are all initialized in the layer’s __init__ method:

def __init__(self, input_size, output_size, activation=None, lr=0.001):
    ...
    self.lr = lr
    self.m = np.zeros((input_size, output_size))
    self.v = np.zeros((input_size, output_size))
    self.beta_1 = 0.9
    self.beta_2 = 0.999
    self.time = 1
    self.adam_epsilon = 0.00000001

We also replace the reduction of the layer learning rates with an increment of Adam’s time parameter; Adam automatically reduces the effective step size over time using time. Update the last 3 lines of the experience_replay method as follows:

...
for layer in self.layers:
    layer.update_time()
    layer.update_stored_weights()

The update_time() implementation just increases the time parameter by 1 each time.
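The method itself is a one-liner. Not shown in the snippets above, a stripped-down stand-in for NNLayer (only the step counter is sketched; the real layer also holds the weights and moment matrices) might look like:

```python
class NNLayer:
    # Stand-in with only the Adam step counter; the real layer holds
    # weights, moments, and the other attributes shown earlier.
    def __init__(self):
        self.time = 1  # matches the initialization above

    def update_time(self):
        # Advance Adam's bias-correction exponent once per replay.
        self.time += 1

layer = NNLayer()
layer.update_time()
print(layer.time)  # 2
```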

Compare our implementation with the article linked at the start of this section to verify it is indeed accurate!

Great, now that it is implemented, time to see if it actually performs better! Here’s the graph of the number of episodes to solution over 50 runs (batch size 10):

Episodes to solution with Adam optimizer

Although there is still some instability, Adam performs about 17% better than our old optimizer (261 episodes against 304).

Although this is not conclusive and the number of trials is quite small, it shows that Adam can be an effective technique in certain situations. A full analysis of the performance of Adam, along with notes on when to use this versus the other optimization techniques can be found in the original paper.

Hidden Layer Size

The size of the hidden layer also makes a difference. Here’s the average episodes to solution over 4 different hidden layer sizes: 12, 24, 48 & 96. Each of these experiments uses 2 hidden layers, as depicted in the neural network diagram in Part I.

Episodes to solution over different sizes for the hidden layer

There’s a downward trend in this plot, and the best performance is at layer size 96. Again, the small number of runs doesn’t provide conclusive evidence, but it suggests that more parameters generally improve the agent’s performance. The tradeoff, of course, is that the time and memory requirements of training larger networks are often much greater.

Number of Hidden Layers

So far, all our experiments have used 2 hidden layers. Trying 1 and 3 layers instead gives us the following results over 20 runs, with 96 hidden units in each layer:

Average episodes to convergence with 1 layer — 198.6
Average episodes to convergence with 2 layers — 163.75
Average episodes to convergence with 3 layers — over 1,000

With 3 layers the network takes a very long time to converge. Moreover, deeper neural networks suffer from other problems, like vanishing gradients, that need to be handled carefully.

Unfortunately, hardware limitations prevent me from doing a more thorough analysis of deeper neural networks.

Summary

In this part we completed the implementation of the algorithm to train our cartpole agent. Updated code including the Target Network implementation and Adam can be found here. I intentionally did not do a full analysis of the various hyperparameter configurations for this particular problem because of low generalizability, but here is an informative paper that looks at the effects of changing DQN configurations in great detail.

To summarize, here is what we have done in this part.

Analyzed and visualized a sample run showing the agent’s improvement, from random initialization to the end state of near-perfect balance.
Completed the implementation with the addition of a Target Network.
Added an Adam optimizer to replace the original blanket learning rate.
Explored a few different configurations for hyperparameters like batch size, hidden layer size and number of hidden layers.

Further Reading

If you’ve made it this far, you now have a complete implementation of the cartpole problem! You may want to:

Tweak this program further and figure out the optimal hyperparameter configurations. Can you get the average episodes to solution below 110 over 50 runs?
Move on to the other problems in the OpenAI gym environment. MountainCar is a good next step!
Take a look at the frontiers of reinforcement learning and Artificial General Intelligence. OpenAI tracks all the progress it has made on their blog.

Thank you for reading!

References