Preface

The intended audience of this article is someone who knows something about Machine Learning and Artificial Neural Networks (ANNs) in particular and who recalls that fitting an ANN required a technique called backpropagation. The goal of this post is to refresh the reader’s knowledge of ANNs and backpropagation and to show that the latter is merely a specialised version of automatic differentiation, a tool that all Machine Learning practitioners should know about and have in their toolkit.

Introduction

The problem is simple to state: we have a (highly) non-linear function, the cost function of an Artificial Neural Network (ANN), and we wish to minimise this so as to estimate the parameters / weights of the function.

In order to minimise the function, one obvious approach is to use steepest descent: start with random values for the parameters to be estimated, find the direction in which the function decreases most quickly, step a small amount in that direction and repeat until close enough.
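In code, the skeleton of this procedure might look something like the following minimal sketch (illustrative only: gamma is the step size and grad is assumed to be a function that returns the gradient of the cost at the current parameter values):

steepestDescent :: Double -> ([Double] -> [Double]) -> [Double] -> [[Double]]
steepestDescent gamma grad = iterate step
  where
    -- take one small step against the gradient from the current parameters
    step ws = zipWith (\w g -> w - gamma * g) ws (grad ws)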

But we have two problems:

We have an algorithm or a computer program that calculates the non-linear function rather than the function itself.

The function has a very large number of parameters, hundreds if not thousands.

One thing we could try is bumping each parameter by a small amount to get partial derivatives numerically

$$\frac{\partial E}{\partial w_i} \approx \frac{E(\ldots, w_i + \epsilon, \ldots) - E(\ldots, w_i, \ldots)}{\epsilon}$$

But this would mean evaluating our function many times and moreover we could easily get numerical errors as a result of the vagaries of floating point arithmetic.

As an alternative we could turn our algorithm or computer program into a function more recognisable as a mathematical function and then compute the differential itself as a function either by hand or by using a symbolic differentiation package. For the complicated expression which is our mathematical function, the former would be error prone and the latter could easily generate something which would be even more complex and costly to evaluate than the original expression.

The standard approach is to use a technique called backpropagation and the understanding and application of this technique forms a large part of many machine learning lecture courses.

Since at least the 1960s techniques for automatically differentiating computer programs have been discovered and re-discovered. Anyone who knows about these techniques and reads about backpropagation quickly realises that backpropagation is just automatic differentiation and steepest descent.

This article is divided into

Refresher on neural networks and backpropagation;

Methods for differentiation;

Backward and forward automatic differentiation and

Concluding thoughts.

The only thing important to remember throughout is the chain rule

$$(g \circ f)'(x) = g'(f(x)) \cdot f'(x)$$

in alternative notation

$$\frac{\mathrm{d}z}{\mathrm{d}x} = \frac{\mathrm{d}z}{\mathrm{d}y} \frac{\mathrm{d}y}{\mathrm{d}x}$$

where $y = f(x)$ and $z = g(y)$. More suggestively we can write

$$\frac{\mathrm{d}g}{\mathrm{d}x} = \frac{\mathrm{d}g}{\mathrm{d}y} \frac{\mathrm{d}y}{\mathrm{d}x}$$

where it is understood that $\mathrm{d}g / \mathrm{d}x$ and $\mathrm{d}y / \mathrm{d}x$ are evaluated at $x$ and $\mathrm{d}g / \mathrm{d}y$ is evaluated at $y = f(x)$.

For example,

$$\frac{\mathrm{d}}{\mathrm{d}x} \sqrt{3 \sin x} = \frac{1}{2\sqrt{3 \sin x}} \cdot 3 \cos x = \frac{3 \cos x}{2 \sqrt{3 \sin x}}$$

Acknowledgements

Sadly I cannot recall all the sources I looked at in order to produce this article but I have made heavy use of the following.

Neural Network Refresher

Here is our model, with $\boldsymbol{x}$ the input, $\hat{\boldsymbol{y}}$ the predicted output, $\boldsymbol{y}$ the actual output and $w^{(l)}$ the weights in the $l$-th layer. We have concretised the transfer function as $\tanh$ but it is quite popular to use the logistic (sigmoid) function.

$$\begin{aligned}
z_i^{(0)} &= x_i \\
a_j^{(l)} &= \sum_i w_{ji}^{(l)} z_i^{(l-1)} \\
z_j^{(l)} &= \tanh\bigl(a_j^{(l)}\bigr) \quad \text{for } l = 1, \ldots, L-1 \\
\hat{y}_j &= a_j^{(L)}
\end{aligned}$$

with the loss or cost function

$$E(\boldsymbol{w}; \boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2} \|\hat{\boldsymbol{y}} - \boldsymbol{y}\|^2 = \frac{1}{2} \sum_j \bigl(\hat{y}_j - y_j\bigr)^2$$

The diagram below depicts a neural network with a single hidden layer.

In order to apply the steepest descent algorithm we need to calculate the differentials of this latter function with respect to the weights, that is, we need to calculate

$$\frac{\partial E}{\partial w_{ji}^{(l)}}$$

Applying the chain rule

$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \frac{\partial E}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}}$$

Since

$$a_j^{(l)} = \sum_i w_{ji}^{(l)} z_i^{(l-1)}$$

we have

$$\frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)}$$

Defining

$$\delta_j^{(l)} \equiv \frac{\partial E}{\partial a_j^{(l)}}$$

we obtain

$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)}$$

Finding the $z_i^{(l)}$ for each layer is straightforward: we start with the inputs and propagate forward. In order to find the $\delta_j^{(l)}$ we need to start with the outputs and propagate backwards:

For the output layer we have (since $\hat{y}_j = a_j^{(L)}$)

$$\delta_j^{(L)} = \frac{\partial E}{\partial a_j^{(L)}} = \frac{\partial E}{\partial \hat{y}_j} = \hat{y}_j - y_j$$

For a hidden layer using the chain rule

$$\delta_j^{(l)} = \frac{\partial E}{\partial a_j^{(l)}} = \sum_k \frac{\partial E}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}$$

Now

$$a_k^{(l+1)} = \sum_i w_{ki}^{(l+1)} z_i^{(l)} = \sum_i w_{ki}^{(l+1)} \tanh\bigl(a_i^{(l)}\bigr)$$

so that

$$\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}} = w_{kj}^{(l+1)} \tanh'\bigl(a_j^{(l)}\bigr)$$

and thus

$$\delta_j^{(l)} = \tanh'\bigl(a_j^{(l)}\bigr) \sum_k w_{kj}^{(l+1)} \delta_k^{(l+1)}$$

Summarising

We calculate all $a_j^{(l)}$ and $z_j^{(l)}$ for each layer starting with the input layer and propagating forward. We evaluate $\delta_j^{(L)}$ in the output layer using $\delta_j^{(L)} = \hat{y}_j - y_j$. We evaluate $\delta_j^{(l)}$ in each layer using $\delta_j^{(l)} = \tanh'\bigl(a_j^{(l)}\bigr) \sum_k w_{kj}^{(l+1)} \delta_k^{(l+1)}$, starting with the output layer and propagating backwards. Use $\partial E / \partial w_{ji}^{(l)} = \delta_j^{(l)} z_i^{(l-1)}$ to obtain the required derivatives in each layer.

For the particular activation function $\tanh$ we have $\tanh'(a) = 1 - \tanh^2(a)$, so $\tanh'\bigl(a_j^{(l)}\bigr) = 1 - \bigl(z_j^{(l)}\bigr)^2$. And finally we can use the partial derivatives to step in the right direction using steepest descent

$$w_{ji}^{(l)} \leftarrow w_{ji}^{(l)} - \gamma \frac{\partial E}{\partial w_{ji}^{(l)}}$$

where $\gamma$ is the step size aka the learning rate.
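To make the summary concrete, here is a minimal sketch of these formulas for a network with a single hidden layer, assuming tanh hidden units, a linear output layer, squared error and no bias terms; the name backpropOneHidden and the list-of-rows representation of the weight matrices are purely illustrative.

import Data.List (transpose)

-- w1, w2 are weight matrices stored as lists of rows; x is the input and y the target.
-- Returns (dE/dw1, dE/dw2) using the delta formulas derived above.
backpropOneHidden :: [[Double]] -> [[Double]] -> [Double] -> [Double]
                  -> ([[Double]], [[Double]])
backpropOneHidden w1 w2 x y = (gradW1, gradW2)
  where
    -- forward sweep
    a1     = map (sum . zipWith (*) x) w1       -- a_j = sum_i w1_ji x_i
    z1     = map tanh a1                        -- hidden layer outputs
    yHat   = map (sum . zipWith (*) z1) w2      -- linear output layer
    -- backward sweep
    delta2 = zipWith (-) yHat y                 -- output deltas: yHat_k - y_k
    delta1 = zipWith (*)
               (map (\z -> 1 - z * z) z1)       -- tanh'(a_j) = 1 - z_j^2
               (map (sum . zipWith (*) delta2) (transpose w2))
    -- gradients: dE/dw_ji = delta_j * (input to that layer)_i
    gradW2 = [ [ d * z  | z  <- z1 ] | d <- delta2 ]
    gradW1 = [ [ d * xi | xi <- x  ] | d <- delta1 ]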

Differentiation

So now we have an efficient algorithm for differentiating the cost function for an ANN and thus estimating its parameters but it seems quite complex. In the introduction we alluded to other methods of differentiation. Let us examine those in a bit more detail before moving on to a general technique for differentiating programs of which backpropagation turns out to be a specialisation.

Numerical Differentiation

Consider a function $f$ whose derivative $f'$ we know in closed form; we can then easily compare a numerical approximation of the derivative with the exact result. The numeric approximation is given by

$$f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$$

In theory we should get a closer and closer approximation as $\epsilon$ decreases but, as the chart below shows, at some point the approximation worsens as a result of the fact that we are using floating point arithmetic. For a complex function such as one which calculates the cost function of an ANN, there is a risk that we may end up getting a poor approximation for the derivative and thus a poor estimate for the parameters of the model.
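The approximation itself is a one-liner (an illustrative helper, not part of the code developed later):

numDiff :: (Double -> Double) -> Double -> Double -> Double
numDiff f eps x = (f (x + eps) - f x) / eps   -- one-sided finite difference

Comparing, say, numDiff sin eps 1.0 against cos 1.0 shows the error shrinking as eps shrinks, until eps approaches roughly the square root of machine epsilon, after which cancellation in the numerator makes the estimate worse again.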

Symbolic Differentiation

Suppose we have the following program (written in Python)

import numpy as np

def many_sines(x):
    y = x
    for i in range(1, 7):
        y = np.sin(x + y)
    return y

When we unroll the loop we are actually evaluating

$$y = \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + x))))))$$

Now suppose we want to get the differential of this function. Symbolically this would be

$$\begin{aligned}
\frac{\mathrm{d}y}{\mathrm{d}x} ={}& \cos(x + \sin(x + \sin(x + \sin(x + \sin(x + \sin(x + x)))))) \times {} \\
& \bigl(1 + \cos(x + \sin(x + \sin(x + \sin(x + \sin(x + x))))) \times {} \\
& \quad \bigl(1 + \cos(x + \sin(x + \sin(x + \sin(x + x)))) \times {} \\
& \qquad \bigl(1 + \cos(x + \sin(x + \sin(x + x))) \times {} \\
& \qquad\quad \bigl(1 + \cos(x + \sin(x + x)) \bigl(1 + 2\cos(x + x)\bigr)\bigr)\bigr)\bigr)\bigr)
\end{aligned}$$

Typically the non-linear function that an ANN gives is much more complex than the simple function given above. Thus its derivative will be correspondingly more complex and therefore expensive to compute. Moreover calculating this derivative by hand could easily introduce errors. And in order to have a computer perform the symbolic calculation we would have to encode our cost function somehow so that it is amenable to this form of manipulation.
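For later comparison, the same program can be transcribed directly into Haskell (a hypothetical manySines, written here purely for illustration); because it is polymorphic in Floating, the dual-number machinery developed below applies to it unchanged.

manySines :: Floating a => a -> a
manySines x = iterate (\y -> sin (x + y)) x !! 6   -- apply y -> sin (x + y) six times, starting from x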

Automatic Differentiation

Reverse Mode

Traditionally, forward mode is introduced first as this is considered easier to understand. We introduce reverse mode first as it can be seen to be a generalization of backpropagation.

Consider the function

$$f(x) = \exp\bigl(\exp x + (\exp x)^2\bigr) + \sin\bigl(\exp x + (\exp x)^2\bigr)$$

Let us write this as a data flow graph.

We can thus re-write our function as a sequence of simpler functions in which each function only depends on variables earlier in the sequence.

$$\begin{aligned}
u_1 &= \exp x \\
u_2 &= u_1^2 \\
u_3 &= u_1 + u_2 \\
u_4 &= \exp u_3 \\
u_5 &= \sin u_3 \\
f   &= u_4 + u_5
\end{aligned}$$

In our particular example, since $u_4$ does not depend on $u_5$ we have

$$\frac{\partial f}{\partial u_5} = \frac{\partial}{\partial u_5}(u_4 + u_5) = 1$$

Further $u_5$ does not depend on $u_4$ so we also have

$$\frac{\partial f}{\partial u_4} = 1$$

Now things become more interesting as $u_4$ and $u_5$ both depend on $u_3$ and so the chain rule makes an explicit appearance

$$\frac{\partial f}{\partial u_3} = \frac{\partial f}{\partial u_4}\frac{\partial u_4}{\partial u_3} + \frac{\partial f}{\partial u_5}\frac{\partial u_5}{\partial u_3} = \exp u_3 + \cos u_3$$

Carrying on

$$\begin{aligned}
\frac{\partial f}{\partial u_2} &= \frac{\partial f}{\partial u_3}\frac{\partial u_3}{\partial u_2} = \frac{\partial f}{\partial u_3} \\
\frac{\partial f}{\partial u_1} &= \frac{\partial f}{\partial u_2}\frac{\partial u_2}{\partial u_1} + \frac{\partial f}{\partial u_3}\frac{\partial u_3}{\partial u_1} = 2 u_1 \frac{\partial f}{\partial u_2} + \frac{\partial f}{\partial u_3} \\
\frac{\partial f}{\partial x}   &= \frac{\partial f}{\partial u_1}\frac{\partial u_1}{\partial x} = \exp x \cdot \frac{\partial f}{\partial u_1}
\end{aligned}$$

Note that having worked from top to bottom (the forward sweep) in the graph to calculate the function itself, we have to work backwards from bottom to top (the backward sweep) to calculate the derivative.
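To see the two sweeps in code, here is a hand-rolled version for our example function (a hypothetical fAndGrad, written for illustration rather than taken from any library):

-- Forward sweep computes the intermediate values u1..u5; the backward sweep
-- accumulates the adjoints df/du_i from the bottom of the graph back up to x.
fAndGrad :: Double -> (Double, Double)
fAndGrad x = (f, dfdx)
  where
    -- forward sweep
    u1 = exp x
    u2 = u1 * u1
    u3 = u1 + u2
    u4 = exp u3
    u5 = sin u3
    f  = u4 + u5
    -- backward sweep (adjoints)
    df_du4 = 1
    df_du5 = 1
    df_du3 = df_du4 * exp u3 + df_du5 * cos u3
    df_du2 = df_du3                      -- u3 = u1 + u2
    df_du1 = df_du2 * 2 * u1 + df_du3    -- via u2 = u1^2 and via u3 = u1 + u2
    dfdx   = df_du1 * exp x              -- u1 = exp x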

So provided we can translate our program into a call graph, we can apply this procedure to calculate the differential with the same complexity as the original program.

The pictorial representation of an ANN is effectively the data flow graph of the cost function (without the final cost calculation itself), and calculating its differential by this procedure is exactly backpropagation.

Forward Mode

An alternative method for automatic differentiation is called forward mode and has a simple implementation. Let us illustrate this using Haskell 98. The actual implementation is about 20 lines of code.

First some boilerplate declarations that need not concern us further.

> {-# LANGUAGE NoMonomorphismRestriction #-}
>
> module AD (
>     Dual ( .. )
>   , f
>   , idD
>   , cost
>   , zs
>   ) where
>
> default ()

Let us define dual numbers

> data Dual = Dual Double Double
>   deriving (Eq, Show)

We can think of these pairs as first order polynomials in the indeterminate $\varepsilon$, $x + \varepsilon x'$, such that $\varepsilon^2 = 0$.

Thus, for example, we have

$$\begin{aligned}
(x + \varepsilon x') + (y + \varepsilon y') &= (x + y) + \varepsilon (x' + y') \\
(x + \varepsilon x')(y + \varepsilon y')    &= x y + \varepsilon (x y' + y x') \\
\log(x + \varepsilon x')                    &= \log x + \varepsilon \frac{x'}{x} \\
\sqrt{x + \varepsilon x'}                   &= \sqrt{x} + \varepsilon \frac{x'}{2\sqrt{x}} \\
&\ldots
\end{aligned}$$

Notice that these equations implicitly encode the chain rule. For example, we know, using the chain rule, that

$$\frac{\mathrm{d}}{\mathrm{d}x} \sqrt{\log x} = \frac{1}{2\sqrt{\log x}} \cdot \frac{1}{x}$$

And using the example equations above we have

$$\sqrt{\log(x + \varepsilon x')} = \sqrt{\log x + \varepsilon \frac{x'}{x}} = \sqrt{\log x} + \varepsilon \frac{x'}{2 x \sqrt{\log x}}$$

Notice that dual numbers carry around the calculation and the derivative of the calculation. To actually evaluate $\sqrt{\log x}$ at a particular value, say 2, we plug in 2 for $x$ and 1 for $x'$

$$\sqrt{\log(2 + \varepsilon \cdot 1)} = \sqrt{\log 2} + \varepsilon \frac{1}{4\sqrt{\log 2}}$$

Thus the derivative of $\sqrt{\log x}$ at 2 is $1 / (4\sqrt{\log 2})$.

With a couple of helper functions we can implement this rule ($\varepsilon^2 = 0$) by making Dual an instance of Num, Fractional and Floating.

> constD :: Double -> Dual
> constD x = Dual x 0
>
> idD :: Double -> Dual
> idD x = Dual x 1.0

Let us implement the rules above by declaring Dual to be an instance of Num. A Haskell class such as Num simply states that it is possible to perform a (usually) related collection of operations on any type which is declared as an instance of that class. For example, Integer and Double are both types which are instances of Num and thus one can add, multiply, etc. values of these types (but note one cannot add an Integer to a Double without first converting a value of the former to a value of the latter).

As an aside, we will never need the functions signum and abs and declare them as undefined; in a robust implementation we would specify an error if they were ever accidentally used.

> instance Num Dual where
>   fromInteger n             = constD $ fromInteger n
>   (Dual x x') + (Dual y y') = Dual (x + y) (x' + y')
>   (Dual x x') * (Dual y y') = Dual (x * y) (x * y' + y * x')
>   negate (Dual x x')        = Dual (negate x) (negate x')
>   signum _                  = undefined
>   abs _                     = undefined
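As a quick, purely illustrative sanity check in ghci, multiplying duals really does track derivatives: the second components below are $\mathrm{d}(3x)/\mathrm{d}x = 3$ and $\mathrm{d}(x^2)/\mathrm{d}x = 4$ at $x = 2$.

ghci> constD 3 * idD 2
Dual 6.0 3.0
ghci> idD 2 * idD 2
Dual 4.0 4.0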

We need to be able to perform division on Dual so we further declare it to be an instance of Fractional.

> instance Fractional Dual where
>   fromRational p    = constD $ fromRational p
>   recip (Dual x x') = Dual (1.0 / x) (- x' / (x * x))

We want to be able to perform the same operations on Dual as we can on Float and Double. Thus we make Dual an instance of Floating which means we can now operate on values of this type as though, in some sense, they are the same as values of Float or Double (in Haskell 98 only instances for Float and Double are defined for the class Floating).

> instance Floating Dual where
>   pi = constD pi
>   exp   (Dual x x') = Dual (exp x)   (x' * exp x)
>   log   (Dual x x') = Dual (log x)   (x' / x)
>   sqrt  (Dual x x') = Dual (sqrt x)  (x' / (2 * sqrt x))
>   sin   (Dual x x') = Dual (sin x)   (x' * cos x)
>   cos   (Dual x x') = Dual (cos x)   (x' * (- sin x))
>   sinh  (Dual x x') = Dual (sinh x)  (x' * cosh x)
>   cosh  (Dual x x') = Dual (cosh x)  (x' * sinh x)
>   asin  (Dual x x') = Dual (asin x)  (x' / sqrt (1 - x * x))
>   acos  (Dual x x') = Dual (acos x)  (x' / (- sqrt (1 - x * x)))
>   atan  (Dual x x') = Dual (atan x)  (x' / (1 + x * x))
>   asinh (Dual x x') = Dual (asinh x) (x' / sqrt (1 + x * x))
>   acosh (Dual x x') = Dual (acosh x) (x' / (sqrt (x * x - 1)))
>   atanh (Dual x x') = Dual (atanh x) (x' / (1 - x * x))

That’s all we need to do. Let us implement the function we considered earlier.

> f = sqrt . (* 3) . sin

The compiler can infer its type

ghci> :t f
f :: Floating c => c -> c

We know the derivative of the function and can also implement it directly in Haskell.

> f' x = 3 * cos x / (2 * sqrt (3 * sin x))

Now we can evaluate the function along with its automatically calculated derivative and compare that with the derivative we calculated symbolically by hand.

ghci> f $ idD 2
Dual 1.6516332160855343 (-0.3779412091869595)
ghci> f' 2
-0.3779412091869595

To see that we are not doing symbolic differentiation (it’s easy to see we are not doing numerical differentiation) let us step through the actual evaluation.
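Using nothing but the instance definitions above, the evaluation unfolds roughly as follows:

f (idD 2)
  = (sqrt . (* 3) . sin) (Dual 2 1)
  = sqrt ((* 3) (sin (Dual 2 1)))
  = sqrt ((* 3) (Dual (sin 2) (cos 2)))                             -- rule for sin
  = sqrt (Dual (3 * sin 2) (3 * cos 2))                             -- rule for (*) with constD 3
  = Dual (sqrt (3 * sin 2)) ((3 * cos 2) / (2 * sqrt (3 * sin 2)))  -- rule for sqrt

which is exactly the pair of values ghci printed above: no expression for the derivative is ever built and simplified; the numbers are computed as we go.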

A Simple Application

In order not to make this blog post too long let us apply AD to finding parameters for a simple regression. The application to ANNs is described in a previous blog post. Note that in a real application we would use the Haskell ad package and furthermore use reverse AD as in this case it would be more efficient.

First our cost function

> cost m c xs ys = (/ (2 * (fromIntegral $ length xs))) $
>                  sum $
>                  zipWith errSq xs ys
>   where
>     errSq x y = z * z
>       where
>         z = y - (m * x + c)

ghci> :t cost
cost :: Fractional a => a -> a -> [a] -> [a] -> a
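In symbols, this is the usual least-squares cost for fitting a straight line $y = m x + c$ to $N$ points:

$$\mathrm{cost}(m, c) = \frac{1}{2N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + c)\bigr)^2$$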

Some test data

> xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
> ys = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]

and a learning rate

> gamma = 0.04

Now we create a function of the two parameters in our model by applying the cost function to the data. We need the (partial) derivatives of both the slope and the offset.

> g m c = cost m c xs ys

Now we can use our Dual numbers to calculate the required partial derivatives and update our estimates of the parameters. We create a stream of estimates.

> zs = (0.1, 0.1) : map f zs
>   where
>
>     deriv (Dual _ x') = x'
>
>     f (c, m) = (c - gamma * cDeriv, m - gamma * mDeriv)
>       where
>         cDeriv = deriv $ g (constD m) $ idD c
>         mDeriv = deriv $ flip g (constD c) $ idD m

And we can calculate the cost of each estimate to check our algorithm converges and then take the estimated parameters when the change in cost per iteration has reached an acceptable level.

ghci> take 2 $ drop 1000 $ map (\(c, m) -> cost m c xs ys) zs
[1.9088215184565296e-9,1.876891490619424e-9]
ghci> take 2 $ drop 1000 zs
[(0.9998665320141327,2.0000191714150106),(0.999867653022265,2.0000190103927853)]

Concluding Thoughts

Efficiency

Perhaps AD is underused because of concerns about efficiency?

It seems that the Financial Services industry is aware that AD is more efficient than current practice albeit the technique is only slowly permeating. Order of magnitude improvements have been reported.

Perhaps AD is slowly permeating into Machine Learning as well but there seem to be no easy-to-find benchmarks.

If it were only possible to implement automatic differentiation in Haskell then its applicability would be somewhat limited. Fortunately this is not the case and it can be used in many languages.

In general, there are three different approaches:

Operator overloading: available for Haskell and C++. See the Haskell ad package and the C++ FADBAD approach using templates.

Source to source translators: available for Fortran, C and other languages, e.g., ADIFOR, TAPENADE; see the Wikipedia entry for a more comprehensive list.

New languages with built-in AD primitives. I have not listed any as it seems unlikely that anyone practicing Machine Learning would want to transfer their existing code to a research language. Maybe AD researchers could invest time in understanding what language feature improvements are needed to support AD natively in existing languages.