This is a simple explanation of the Policy Gradient algorithm in Reinforcement Learning (RL). First, some Supervised Learning background is presented; then the Policy Gradient method is studied as an extension of the Supervised Learning formulation. This post clarifies notation commonly used in Reinforcement Learning lectures and builds the foundation for studying advanced RL algorithms.

Supervised Learning Background¶

Let us start with a simple MultiLayer Perceptron (MLP) built for the task of classification. In a typical MLP, we have \(C\) output units, where \(C\) is the number of classes in the classification problem, and we train it with the cross entropy loss. The cross entropy between the prediction \(\mathbf{\hat{y}}\) (a \({C \times 1}\) vector) and the ground truth label \(\mathbf{y}\) (which is usually one-hot) is given by:

\[H(\mathbf{\hat{y}}, \mathbf{y}) = - \mathbf{y}^\top \log (\mathbf{\hat{y}}) = - \sum_{j=1}^{C} y_j \log(\hat{y}_j)\]

If we have \(N\) training samples, then the cross entropy loss over the samples will be \(\sum_{i=1}^{N} H(\mathbf{\hat{y}}_i, \mathbf{y}_i)\).
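The cross entropy loss above can be sketched in a few lines of numpy. This is an illustrative implementation, not a reference one; the function names and shapes are my own choices.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross entropy between a predicted distribution y_hat (shape (C,))
    and a one-hot ground-truth vector y (shape (C,))."""
    return -np.sum(y * np.log(y_hat))

def total_loss(Y_hat, Y):
    """Sum of per-sample cross entropies over N samples.
    Y_hat, Y: (N, C) arrays of predictions and one-hot labels."""
    return sum(cross_entropy(y_hat, y) for y_hat, y in zip(Y_hat, Y))

y_hat = np.array([0.7, 0.2, 0.1])  # model prediction over C = 3 classes
y = np.array([1.0, 0.0, 0.0])      # one-hot label for class 0
loss = cross_entropy(y_hat, y)     # only the true class contributes: -log(0.7)
```

Note that because \(\mathbf{y}\) is one-hot, the sum collapses to a single term: the negative log of the probability the model assigned to the true class.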

The above formulation can also be viewed from a different perspective. If we have \(N\) observed variables \(y_1 \dots y_N\) (ground truth labels for the data samples) and a function approximator (say, an MLP) with parameters \(\theta\), then the likelihood \(lik(\theta)\) is defined as:

\[lik(\theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta)\]

\(p(y_i | x_i, \theta)\) is the probability our model with parameters \(\theta\) assigns to the observed variable \(y_i\), given the input \(x_i\). In lay terms, if we are using a neural network, we compute the likelihood by multiplying, over all input samples \(x_i\), the probability at output unit \(j_i\), where \(j_i\) is the ground truth label of sample \(i\). By maximizing this likelihood function, we find the parameters \(\theta\) that make the observed data “most probable”. In practice, rather than maximizing the likelihood directly, we usually minimize the negative log likelihood, which is much easier to deal with mathematically:

\[\begin{equation} \label{eq:1} -\log lik(\theta) = - \sum_{i=1}^{N} \log(p(y_i | x_i, \theta)) \end{equation}\]
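Computing the negative log likelihood amounts to picking out \(p(y_i | x_i, \theta)\) at each sample's ground-truth unit and summing the negative logs. A minimal numpy sketch (the function name and array layout are my own):

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """probs: (N, C) predicted class probabilities, rows summing to 1.
    labels: (N,) integer ground-truth classes j_i.
    Selects p(y_i | x_i, theta) = probs[i, labels[i]] for each sample,
    then sums the negative logs."""
    n = probs.shape[0]
    return -np.sum(np.log(probs[np.arange(n), labels]))

probs = np.array([[0.7, 0.2, 0.1],   # sample 0, true class 0
                  [0.1, 0.8, 0.1]])  # sample 1, true class 1
labels = np.array([0, 1])
nll = negative_log_likelihood(probs, labels)  # -(log 0.7 + log 0.8)
```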

Thus, the Supervised Learning problem for the task of classification can be formulated as:

Minimize \(-\sum_{i=1}^{N} \log(p(y_i | x_i, \theta))\)

\(= -\sum_{i=1}^{N} \log(\hat{y}_i^{j_i})\) (\(j_i\) is the actual class of sample \(i\), and \(\hat{y}_i^{j_i}\) is the probability the MLP assigns to that class)

\(= -\sum_{i=1}^{N} \sum_{j=1}^{C} y_i^{j} \log(\hat{y}_i^{j})\) (since \(\mathbf{y}_i\) is a one-hot vector with a one at class \(j_i\), only the \(j = j_i\) term survives, so the single term can be rewritten as a summation over classes)

\(= \sum_{i=1}^{N} H(\mathbf{\hat{y}}_i, \mathbf{y}_i)\)

We have proved that minimizing the cross entropy loss \(\sum_{i=1}^{N} H(\mathbf{\hat{y}}_i, \mathbf{y}_i)\) is equivalent to minimizing the negative log likelihood \(-\sum_{i=1}^{N} \log(p(y_i | x_i, \theta))\).
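This equivalence is easy to verify numerically. The sketch below (array names and the softmax construction are my own choices) computes both quantities on random predictions and one-hot labels and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 4

# Random logits pushed through softmax give valid class probabilities.
logits = rng.normal(size=(N, C))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

labels = rng.integers(0, C, size=N)   # ground-truth classes j_i
one_hot = np.eye(C)[labels]           # one-hot vectors y_i

# Cross entropy loss summed over samples.
ce = -np.sum(one_hot * np.log(probs))

# Negative log likelihood: pick the true-class probability per sample.
nll = -np.sum(np.log(probs[np.arange(N), labels]))

assert np.isclose(ce, nll)  # the two losses are identical
```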