People love three things: large networks, auto-differentiation frameworks, and Andrej Karpathy’s code. I found this out very quickly when looking through implementations of the REINFORCE algorithm. In this post, I will go over some common traps of policy gradients, show a concise implementation on CartPole-v0, and explain how I computed the gradients of the log policy.

Understanding REINFORCE

If you’re following David Silver’s RL course, you probably just learned about Deep Q Networks. In that case, be sure to note that unlike DQN, REINFORCE must wait for an entire episode to finish before updating the weights. This means you must store your environment transitions in an array and loop through them again after the episode finishes.

Some people get confused by the cumulative future reward. It does not mean multiplying the gradient by the sum of the rewards we have seen up to the current time step; in fact, we must do the opposite. To find our term vₜ, we sum up all of the rewards from the current time step onward, discounting them exponentially by some rate gamma. Some people find it confusing that the amount of reward assigned to a state decreases as we move forward through time. The intuition here is that earlier actions should be rewarded more heavily because they have greater consequences. Note the discounted future reward function below:
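In other words, vₜ = rₜ + γ·rₜ₊₁ + γ²·rₜ₊₂ + … . As a small sketch in code (the rewards list and the default gamma here are illustrative assumptions):

def discounted_future_reward(rewards, t, gamma=0.99):
    # v_t: sum of the rewards from step t onward, each discounted by gamma^k
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))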

Finding the Gradient

Now that we understand these tricky areas, we just need to implement our weight updates. Easy right? Not quite. REINFORCE is deceptively simple as you have to find the gradient of your log policy. Now we could just write this in Tensorflow and call tf.gradients but we’re smart so lets figure it out ourselves. David Silver derives his gradients for a linear classifier with softmax policy this way but what if we have a different policy architecture? I’m going to show you how I went about finding my gradients.

Though definitely overkill for this problem, I am going to engineer some better features using an approximated RBF kernel that can be separated linearly (more on this in the advanced section below). I approached the gradient itself as I would for any computation graph: compute the gradients of each layer w.r.t. the previous one, and combine them with the chain rule. Follow colah’s post if you are unfamiliar with backprop. I use gradient checking to verify that each intermediate gradient is implemented correctly, as well as to confirm my final gradient w.r.t. the weights.

Let’s look at our very simple policy:

import numpy as np

# Softmax policy: linear in the (featurized) state, normalized over actions
def policy(state, w):
    z = state.dot(w)
    exp = np.exp(z)
    return exp / np.sum(exp)

The first thing we must take care of is finding the gradient of the log term w.r.t. the policy output.

Basically, this means that once we find the gradient of our policy w.r.t. our weights, we need to divide by the policy’s output (the probability of the action we took) to get the gradient of the log of the policy.
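In symbols, this is just the derivative of the logarithm combined with the chain rule:

∇_w log π(a|s) = ∇_w π(a|s) / π(a|s)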

Keeping this in mind and following Eli Bendersky’s derivation of the softmax function, we compute the full Jacobian of the softmax as follows:

def softmax_grad(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

Note that this Jacobian contains more information than we need, as all we are looking for is the gradient of the policy in state s for the action a we took. We can extract this from the full Jacobian by taking only the row at the index of that action (the softmax Jacobian is symmetric, so this is the same as the corresponding column).

dsoftmax = softmax_grad(probs)[action,:]

Great, now in layman’s terms we know “how does wiggling each of our raw network outputs affect the softmax output at action a”.

Now, applying the aforementioned division by our softmax output at the chosen action to get the gradient of the log, we get this:

dlog = dsoftmax / probs[0,action]

Remember, we only care about the gradient of the log probability of taking a specific action in a given state w.r.t. each weight. This is why we divide by a scalar: the probability our policy assigns to the action we took.

Finally, we apply the chain rule to figure out the gradient of the log policy w.r.t. our weights.

The final line will look like this (remember None just adds another dimension):

grad = state.T.dot(dlog[None,:])

Here is our gradient all together:

def softmax_grad(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

dsoftmax = softmax_grad(probs)[action, :]
dlog = dsoftmax / probs[0, action]
grad = state.T.dot(dlog[None, :])

Make sure you use gradient checking to ensure you have everything implemented correctly. Your checking code should look vaguely like this:

epsilon = 1e-8

w1 = np.copy(w)
w2 = np.copy(w)

# Perturb a single weight in each direction
w1[0, 0] += epsilon
w2[0, 0] -= epsilon

# Central difference approximation of d(log pi(a|s)) / dw[0, 0]
grad_check = (np.log(policy(state, w1)[0, action]) - np.log(policy(state, w2)[0, action])) / (2. * epsilon)

assert np.isclose(grad_check, grad[0, 0])

You might want to do this for the first few elements of your gradient.

Now we’re ready to implement the full REINFORCE algorithm. Make sure you have gym installed. CartPole can be solved with a linear classifier, but for more complicated environments we will need a nonlinear function approximator for the policy.
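For the snippets below, I’ll assume the environment has already been created with the classic gym API, something like this:

import gym

# CartPole-v0: 4-dimensional observation, 2 discrete actions (push left / push right)
env = gym.make("CartPole-v0")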

Advanced — Skip this section if you aren’t interested in using kernel methods for your policy.

Though definitely unnecessary for CartPole-v0, I am going to engineer some higher dimensional features using an approximated RBF kernel that can be separated linearly. For simplicity, we’re going to use scikit-learn to do this for us. Remember, you can alternatively just use a neural net with nonlinear activation functions to achieve the same effect.

We first need to gather some examples from the environment to fit the featurizer to.

observation_examples = []
env.reset()
for i in range(300):
    s, r, d, _ = env.step(1)
    observation_examples.append(s)
    # Start a new episode when the current one ends
    if d:
        env.reset()

Next we create and fit the featurizer:

import sklearn.pipeline
from sklearn.kernel_approximation import RBFSampler

# Create radial basis function samplers to convert states to features for nonlinear function approximation
featurizer = sklearn.pipeline.FeatureUnion([
    ("rbf1", RBFSampler(gamma=5.0, n_components=100)),
    ("rbf2", RBFSampler(gamma=2.0, n_components=100)),
    ("rbf3", RBFSampler(gamma=1.0, n_components=100)),
    ("rbf4", RBFSampler(gamma=0.5, n_components=100))
])

# Fit the featurizer to our samples
featurizer.fit(np.array(observation_examples))

We then call this method to transform our states after we receive them from the gym:

def featurize_state(state):
    # Transform states
    featurized = featurizer.transform([state])
    return featurized
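For example (assuming the policy weights w are resized to match the 400-dimensional feature vector rather than the raw 4-dimensional state):

state = featurize_state(env.reset())   # shape (1, 400): four RBF samplers x 100 components each
probs = policy(state, w)               # w would now be shaped (400, 2) instead of (4, 2)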

End of Advanced Section

Everything else is pretty standard as reinforcement learning implementations go. Here’s the gist:

CODE:

Raw Features (Simple)

Kernelized Features (Advanced and unnecessary for CartPole-v0)
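The gists above contain the full implementations. As a rough sketch only, here is what the raw-feature training loop looks like when the pieces from this post are put together (the hyperparameters and the classic gym step/reset API are assumptions for illustration, not necessarily the values used in the gists):

import gym
import numpy as np

# Illustrative hyperparameters (assumptions, not the gist's exact values)
NUM_EPISODES = 2000
LEARNING_RATE = 0.0025
GAMMA = 0.99

def policy(state, w):
    # Softmax policy, as defined earlier in the post
    z = state.dot(w)
    exp = np.exp(z)
    return exp / np.sum(exp)

def softmax_grad(softmax):
    # Full Jacobian of the softmax, as derived earlier in the post
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

env = gym.make("CartPole-v0")
w = np.random.rand(4, 2)  # linear policy weights: 4 state dims x 2 actions

for episode in range(NUM_EPISODES):
    state = env.reset()[None, :]  # shape (1, 4)
    grads, rewards = [], []
    done = False

    # REINFORCE is Monte Carlo: run the whole episode before updating
    while not done:
        probs = policy(state, w)
        action = np.random.choice(2, p=probs[0])

        # Gradient of log pi(a|s) w.r.t. w, exactly as derived above
        dsoftmax = softmax_grad(probs)[action, :]
        dlog = dsoftmax / probs[0, action]
        grads.append(state.T.dot(dlog[None, :]))

        state, reward, done, _ = env.step(action)
        state = state[None, :]
        rewards.append(reward)

    # Each step's gradient is scaled by its discounted future reward v_t
    for t, grad in enumerate(grads):
        v_t = sum(GAMMA ** k * r for k, r in enumerate(rewards[t:]))
        w += LEARNING_RATE * grad * v_t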

Hope this helped! Tweet me @sam_kirkiles if you have any questions.