Machine learning involves manipulating probabilities. These probabilities are most often represented as normalised probabilities or as log-probabilities. The ability to shrewdly alternate between these two representations is a vital step towards strengthening the probabilistic dexterity we need to solve modern machine learning problems. Today's trick, the log derivative trick, helps us to do just this, using the properties of the derivative of the logarithm. This trick shines when we use it to solve stochastic optimisation problems, which we explored previously, and will give us a new way to derive stochastic gradient estimators. We begin by using the trick to define the score function, a function we have all used before, even if not under that name.

Score Functions

The log derivative trick is the application of the rule for the gradient of the logarithm of a function with respect to its parameters:

$$\nabla_\theta \log p(x; \theta) = \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}$$
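As a quick sanity check, we can verify the identity numerically with finite differences on a Gaussian density; the particular density, evaluation point, and step size below are illustrative choices:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def grad_mu(f, x, mu, sigma, eps=1e-6):
    """Central finite difference of f with respect to mu."""
    return (f(x, mu + eps, sigma) - f(x, mu - eps, sigma)) / (2 * eps)

x, mu, sigma = 0.3, 1.0, 2.0

# Left-hand side: gradient of the log-probability.
lhs = grad_mu(lambda x, m, s: math.log(normal_pdf(x, m, s)), x, mu, sigma)

# Right-hand side: gradient of the probability, divided by the probability.
rhs = grad_mu(normal_pdf, x, mu, sigma) / normal_pdf(x, mu, sigma)

print(lhs, rhs)  # both equal (x - mu) / sigma^2 = -0.175
```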

The significance of this trick is realised when the function is a likelihood function, i.e., a function of the parameters θ that gives the probability of a random variable x. In this special case, the gradient of the log-likelihood is called the score function, and the ratio on the right-hand side is the score ratio. The score function has a number of useful properties:

The score is the central computation for maximum likelihood estimation. Maximum likelihood is one of the dominant learning principles used in machine learning, used in generalised linear regression, deep learning, kernel machines, dimensionality reduction, and tensor decompositions, amongst many others, and the score appears in all these problems.

The expected value of the score is zero. Our first use of the log-derivative trick will be to show this.

$$\mathbb{E}_{p(x;\theta)}\left[\nabla_\theta \log p(x;\theta)\right] = \int p(x;\theta)\,\frac{\nabla_\theta p(x;\theta)}{p(x;\theta)}\,dx = \int \nabla_\theta p(x;\theta)\,dx$$

$$= \nabla_\theta \int p(x;\theta)\,dx = \nabla_\theta 1 = 0$$

In the first line we applied the log derivative trick, and in the second line we exchanged the order of differentiation and integration. This identity gives us the type of probabilistic flexibility we seek: it allows us to subtract from the score any term that has zero expectation, and this modification will leave the expected score unaffected (see control variates later).
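For a Gaussian N(μ, σ²), the score with respect to μ is (x − μ)/σ², and we can verify its zero expectation by simple Monte Carlo; the parameters and sample size here are illustrative:

```python
import random
import statistics

random.seed(0)
mu, sigma = 1.0, 2.0

# Score of N(mu, sigma^2) with respect to mu: (x - mu) / sigma^2.
score = lambda x: (x - mu) / sigma ** 2

samples = [random.gauss(mu, sigma) for _ in range(100_000)]
mean_score = statistics.fmean(score(x) for x in samples)
print(mean_score)  # close to zero
```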

The variance of the score is the Fisher information, which is used to determine the Cramér-Rao lower bound:

$$\mathcal{I}(\theta) = \mathbb{V}_{p(x;\theta)}\left[\nabla_\theta \log p(x;\theta)\right]$$
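For the mean of a Gaussian N(μ, σ²), the Fisher information is 1/σ², so the empirical variance of the score should match it; a minimal sketch with illustrative parameters:

```python
import random
import statistics

random.seed(0)
mu, sigma = 1.0, 2.0

# Score of N(mu, sigma^2) with respect to mu.
score = lambda x: (x - mu) / sigma ** 2

samples = [random.gauss(mu, sigma) for _ in range(100_000)]
var_score = statistics.variance(score(x) for x in samples)
print(var_score, 1 / sigma ** 2)  # empirical variance vs Fisher information 0.25
```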

We can now leap in a single bound from gradients of a log-probability to gradients of a probability, and back. But the villain of today's post now re-emerges: the troublesome expectation-gradient of Trick 4. We can use our new-found power, the score function, to develop yet another clever estimator for this class of problems.

Score Function Estimators

Our problem is to compute the gradient of an expectation of a function f:

$$\nabla_\theta \mathbb{E}_{p(z;\theta)}[f(z)] = \nabla_\theta \int p(z;\theta)\, f(z)\,dz$$

This is a recurring task in machine learning, needed for posterior computation in variational inference, value function and policy learning in reinforcement learning, derivative pricing in computational finance, and inventory control in operations research, amongst many others.

This gradient is difficult to compute because the integral is typically unknown and the parameters θ, with respect to which we are computing the gradient, are parameters of the distribution p(z; θ). Furthermore, we might want to compute this gradient when the function f is not differentiable. Using the log derivative trick and the properties of the score function, we can compute this gradient in a more amenable way:

$$\nabla_\theta \mathbb{E}_{p(z;\theta)}[f(z)] = \mathbb{E}_{p(z;\theta)}\left[f(z)\,\nabla_\theta \log p(z;\theta)\right]$$

Let's derive this expression and explore its implications for our optimisation problem. To do this, we will use one other ubiquitous trick, a probabilistic identity trick, where we multiply our expressions by 1: a one formed by the division of a probability density by itself. Combining the identity trick with the log-derivative trick, we obtain a score function estimator for the gradient:

$$\nabla_\theta \int p(z;\theta)\, f(z)\,dz = \int \nabla_\theta p(z;\theta)\, f(z)\,dz$$

$$= \int \frac{p(z;\theta)}{p(z;\theta)}\,\nabla_\theta p(z;\theta)\, f(z)\,dz$$

$$= \int p(z;\theta)\,\nabla_\theta \log p(z;\theta)\, f(z)\,dz$$

$$= \mathbb{E}_{p(z;\theta)}\left[f(z)\,\nabla_\theta \log p(z;\theta)\right] \approx \frac{1}{S}\sum_{s=1}^{S} f(z^{(s)})\,\nabla_\theta \log p(z^{(s)};\theta), \quad z^{(s)} \sim p(z;\theta)$$

A lot has happened in these four lines. In the first line we exchanged the derivative and the integral. In the second line, we applied our probabilistic identity trick, which allowed us to form the score ratio. Using the log-derivative trick, we then replaced this ratio with the gradient of the log-probability in the third line. This gives us our desired stochastic estimator in the fourth line, which we compute by Monte Carlo: first drawing samples from p(z; θ) and then computing the weighted gradient term.
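Here is a minimal sketch of the estimator on a toy problem, f(z) = z² with z ~ N(μ, σ²), chosen so that the true gradient ∇_μ E[z²] = 2μ is available for comparison; the parameter values and sample size are illustrative:

```python
import random

random.seed(0)
mu, sigma, S = 1.5, 1.0, 200_000

# f need not be differentiable: only evaluations f(z) are required.
f = lambda z: z ** 2

# Score function estimator: (1/S) * sum_s f(z_s) * grad_mu log p(z_s; mu).
grads = []
for _ in range(S):
    z = random.gauss(mu, sigma)
    score = (z - mu) / sigma ** 2  # score of N(mu, sigma^2) w.r.t. mu
    grads.append(f(z) * score)
estimate = sum(grads) / S

# Analytic check: E[z^2] = mu^2 + sigma^2, so grad_mu = 2 * mu = 3.0.
print(estimate)
```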

This is an unbiased estimator of the gradient. Our assumptions throughout this process have been simple:

The exchange of integration and differentiation is valid. This is hard to show in general, but will rarely be a problem. We can reason about its correctness by appealing to the Leibniz integral rule and the available analysis [1].

The function f(z) need not be differentiable. Instead, we should be able to evaluate it or observe its value for a given z.

Obtaining samples from the distribution p(z) is easy, since this is needed for Monte Carlo evaluation of the integral.

Like so many of our tricks, the path we have taken has been trodden by many other research areas, each sharing the tale of their adventure under a name related to their own problem formulation. These include the REINFORCE algorithm in reinforcement learning, the likelihood-ratio method in discrete-event simulation and computational finance, and black-box methods in variational inference.

Control Variates

To make this Monte Carlo estimator effective, we must ensure that its variance is as low as possible; the gradient will not be useful otherwise. To give us more control over the variance, we use the modified estimator:

$$\nabla_\theta \mathbb{E}_{p(z;\theta)}[f(z)] = \mathbb{E}_{p(z;\theta)}\left[(f(z) - \lambda)\,\nabla_\theta \log p(z;\theta)\right]$$

where the new term λ is called a control variate; control variates are widely used for variance reduction in Monte Carlo estimators. A control variate is any term introduced into the estimator that has zero mean. It can therefore always be introduced, since it does not affect the expectation; it does, however, affect the variance. The choice of control variate is the principal challenge in the use of score function estimators. The simplest method is to use a constant baseline, but other methods can be used, such as clever sampling schemes (e.g., antithetic or stratified sampling), delta methods, or adaptive baselines. The book by Glasserman [6] provides one of the most comprehensive discussions of the sources of variance and of techniques for variance reduction.
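A minimal sketch of the effect of a constant baseline, again on the toy problem f(z) = z² with z ~ N(μ, σ²); the baseline value chosen here (close to E[f(z)]) is an illustrative choice, not the optimal one:

```python
import random
import statistics

random.seed(0)
mu, sigma, S = 1.0, 1.0, 100_000
f = lambda z: z ** 2
lam = 2.0  # constant baseline, an illustrative choice near E[f(z)] = mu^2 + sigma^2

plain, baselined = [], []
for _ in range(S):
    z = random.gauss(mu, sigma)
    score = (z - mu) / sigma ** 2
    plain.append(f(z) * score)              # vanilla score function estimator
    baselined.append((f(z) - lam) * score)  # baseline subtracted: same mean

# Both estimators target grad_mu E[z^2] = 2 * mu = 2, but with different variances.
print(statistics.fmean(plain), statistics.fmean(baselined))
print(statistics.variance(plain), statistics.variance(baselined))
```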

Families of Stochastic Estimators

Contrast today's estimator with the one we derived using the pathwise derivative approach in Trick 4:

$$\nabla_\theta \mathbb{E}_{p(z;\theta)}[f(z)] = \mathbb{E}_{p(\epsilon)}\left[\nabla_\theta f(g(\epsilon; \theta))\right], \quad z = g(\epsilon; \theta),\; \epsilon \sim p(\epsilon)$$

We effectively have two approaches available to us; we can:

differentiate the function f, using pathwise derivatives, if it is differentiable; or

differentiate the density p(z), using the score function.

There are other ways to develop stochastic gradient estimators, but these two are the most general. They can be easily combined, allowing us to use the most appropriate estimator (the one that provides the lowest variance) at different parts of our computation. Using a stochastic computational graph provides one framework for achieving this.
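On a toy problem where f is differentiable, we can run both estimators side by side: f(z) = z² with z = μ + σε reparameterised, both targeting ∇_μ E[z²] = 2μ. The parameter values and sample size are illustrative:

```python
import random
import statistics

random.seed(0)
mu, sigma, S = 1.5, 1.0, 50_000
f = lambda z: z ** 2
df = lambda z: 2 * z  # the pathwise estimator needs the derivative of f

score_est, path_est = [], []
for _ in range(S):
    eps = random.gauss(0.0, 1.0)
    z = mu + sigma * eps  # reparameterised sample: z = g(eps; mu, sigma)
    score_est.append(f(z) * (z - mu) / sigma ** 2)  # score function estimator
    path_est.append(df(z) * 1.0)  # pathwise: df/dz * dz/dmu, with dz/dmu = 1

# Both estimate grad_mu E[z^2] = 2 * mu = 3; the pathwise estimator has
# much lower variance here because it exploits the gradient of f.
print(statistics.fmean(score_est), statistics.fmean(path_est))
print(statistics.variance(score_est), statistics.variance(path_est))
```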

Summary

The log derivative trick allows us to alternate between normalised probability and log-probability representations of distributions. It forms the basis of the definition of the score function, and is subsequently what enables maximum likelihood estimation and the important asymptotic analysis that has shaped much of our statistical understanding. Importantly, we can use it to provide a general-purpose gradient estimator for the stochastic optimisation problems that lie at the heart of many of the most significant machine learning problems we currently face. Monte Carlo gradient estimators are needed for the truly general-purpose machine learning we aspire to, making an understanding of them, and of the tricks upon which they are built, essential.


Some References