The ideas in this summary are taken from the Proximal Policy Optimization paper.

PPO offers two key improvements to policy gradient methods:

1. The surrogate objective includes a simple first-order trust region approximation.
2. Multiple epochs of updates can be performed on the collected data.

The first improvement is what makes the second possible.

Clipped Objective Motivation

In the clipped objective, r is the probability ratio between the new policy and the old policy, A is an advantage estimate, and epsilon is a hyper-parameter that sets the clip range.
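For reference, the clipped surrogate objective can be written out as below. This follows the standard form given in the PPO paper; the subscript t (timestep) and the hats on the expectation and advantage are notation taken from the paper rather than from this summary.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\qquad
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[
  \min\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
\right]
```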

The purpose of the clipped surrogate objective is to stabilize training by constraining how much the policy changes at each step. The gradient is only a local approximation, so taking large steps could be harmful. Smaller steps also allow us to run multiple epochs on the collected samples.
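A minimal sketch of how the clipped objective constrains each sample's contribution, in plain Python. The function names and the default epsilon of 0.2 are assumptions chosen for illustration; a real implementation would operate on tensors and take the ratio from policy log-probabilities.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Clamp the ratio into [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the min makes the objective a pessimistic bound on r * A.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_objective(ratios, advantages, epsilon=0.2):
    """Average the per-sample objective over a minibatch (quantity to maximize)."""
    values = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(values) / len(values)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + epsilon, so there is no incentive to push the policy further in that direction within a single update.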

Clipping Effect on Gradients

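Clipping can only occur after the update from the first minibatch of the first epoch. Prior to this the new and old policies are identical, so the ratio is 1. When the clipped term is the one selected by the min (the ratio is above 1 + epsilon with a positive advantage, or below 1 - epsilon with a negative advantage), the gradient for that sample is zero. This means that some of the collected samples are actually “thrown away.”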
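The effect on gradients can be checked numerically. The sketch below estimates the derivative of the per-sample objective with respect to the ratio by central finite differences; the helper names, the epsilon of 0.2, and the test cases are illustrative assumptions, not taken from the paper.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def grad_wrt_ratio(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    up = clipped_surrogate(ratio + h, advantage, epsilon)
    down = clipped_surrogate(ratio - h, advantage, epsilon)
    return (up - down) / (2.0 * h)


cases = [
    (1.0, 1.0),    # ratio is 1 (before any update): gradient equals A
    (1.5, 1.0),    # ratio above 1 + eps, A > 0: clipped, gradient is zero
    (0.5, -1.0),   # ratio below 1 - eps, A < 0: clipped, gradient is zero
    (1.5, -1.0),   # ratio above 1 + eps, A < 0: unclipped term is active
]
grads = [grad_wrt_ratio(r, a) for r, a in cases]
```

The last case shows that a ratio outside the clip range does not always zero the gradient: when moving back toward the old policy would improve the objective, the unclipped term is the minimum and the sample still contributes.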

Update Scheme

The pseudocode above shows that we first collect T*N samples, where T is the number of rollout steps each actor takes between updates and N is the number of actors. From the rollouts we calculate generalized advantage estimates.
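As a rough sketch, the generalized advantage estimates for one actor's T-step rollout can be computed with a single backward pass. The function name, the gamma and lam defaults, and the assumption that no episode terminates inside the rollout are all illustrative choices; a practical version would also handle episode boundaries.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one actor's rollout.

    rewards[t] and values[t] = V(s_t) cover t = 0..T-1; last_value = V(s_T).
    Assumes no episode ends inside the rollout (no done flags).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursion: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting lam to 0 reduces each estimate to the one-step TD residual, while lam of 1 sums all discounted residuals to the end of the rollout.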

We now have a batch of T*N samples. From this batch we sample minibatches of size m. We perform an update on each minibatch, and at the next epoch we shuffle the batch to generate a new combination of samples in each minibatch. After repeating this for K epochs, we update the old parameters to reflect the current parameters.
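The epoch and minibatch schedule described above might look like the following sketch. The function ppo_update_schedule and its update_fn callback are hypothetical names introduced here; update_fn stands in for one gradient step on the clipped objective, and the tiny values of T, N, m, and K are only for illustration.

```python
import random


def ppo_update_schedule(batch_size, minibatch_size, num_epochs, update_fn, seed=0):
    """Call update_fn(indices) on every shuffled minibatch for num_epochs epochs."""
    rng = random.Random(seed)
    indices = list(range(batch_size))
    num_updates = 0
    for _ in range(num_epochs):
        # Reshuffle so each epoch groups the samples into different minibatches.
        rng.shuffle(indices)
        for start in range(0, batch_size, minibatch_size):
            update_fn(indices[start:start + minibatch_size])
            num_updates += 1
    # After the K epochs, the caller would copy the current parameters
    # into the old policy before collecting the next batch.
    return num_updates


T, N, m, K = 4, 2, 4, 3   # rollout steps, actors, minibatch size, epochs
seen = []                 # records which sample indices each update used
updates = ppo_update_schedule(T * N, m, K, seen.append)
```

Every sample is visited once per epoch, so each one is reused K times before new data is collected, which is what the clipped objective is meant to make safe.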