Introduction

In the previous article we discussed how to manipulate dataframes and train a Random Forest model to classify the species of irises. In this article we will use a Julia library called Flux to build a convolutional neural network that predicts handwritten digits from the popular MNIST data set. We will use an example from the model-zoo to create the model and run the training data through it, breaking down each statement in the example and discussing what it's doing for a better understanding.

With enough training data and variety of examples, machines can be trained to recognize images much as humans do. In this exercise, we'll take advantage of the power of a neural network to work through the MNIST images. The MNIST data set consists of thousands of images of handwritten digits. Every image is the same size, and every image represents a digit from 0 to 9. Below is a sample of what some of the data looks like:

As you can see, many of the images are clearly discernible as a number, but others less so. Convolutional neural networks are a special kind of deep, multilayer backpropagation neural network. The convolution operation helps the network recognize spatial patterns in the digit images, and introducing convolution into the network greatly reduces the time needed to process the gradients. There are other important functions involved in a deep learning CNN, and we'll discuss them as we go through the code:

Flux

Flux is a library whose framework lets your code read a lot like the underlying mathematics. It is also very straightforward to put together a neural network in Flux, as you will soon see. A very nice feature of Flux is that it lets you easily take advantage of the GPU on a machine that has the hardware; the GPU makes training your model much faster than the standard CPU alone. Another nice thing about Flux is that it lets you easily chain together the different kinds of layers you need in a neural network. Below is a simple example of a chained set of deep learning layers in Flux. This example has 2 dense layers and a softmax output layer.

model_deep = Chain(Dense(10, 5, σ), Dense(5, 2), softmax)

The Dense layer is simply a layer of neurons. In the example above the first dense layer has 10 inputs and 5 outputs; the next dense layer has 5 inputs (coming from the output of the previous layer) and 2 outputs. The softmax layer is a normalizing layer that ensures all the outputs add up to 1. This will be useful in our MNIST neural network: the softmax function will spit out the probability that our handwritten test digit image is each particular number, with the highest probability near 1 and the lowest near 0. For example, if a digit 9 is fed into the neural network, softmax will hopefully give the output representing 9 a probability close to 1 (e.g. 0.90) and the outputs representing all the other digits lower probabilities adding up to the remaining sum (e.g. 0.10).
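To see the normalization in action, here is a hand-rolled softmax. This is purely illustrative (in the model itself we use the softmax that Flux exports):

```julia
# Illustrative softmax: exponentiate each score, then divide by the total
# so the outputs form a probability distribution summing to 1.
my_softmax(x) = exp.(x) ./ sum(exp.(x))

scores = [2.0, 1.0, 0.1]   # raw outputs from the last dense layer
probs = my_softmax(scores)

sum(probs)       # 1.0 (up to floating point error)
argmax(probs)    # 1 -- the class with the largest raw score
```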

The sample code

Let's take a closer look at the model-zoo sample Julia code that works on the MNIST data set. First, make sure that Flux is installed on your computer by using the command Pkg.add("Flux"). Be sure to capitalize Flux, because the package name is case-sensitive. The first part of our Jupyter notebook in the example sets up the libraries we are going to use to build our model:
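The import block itself is not reproduced here; based on the description below, in the model-zoo example it looks roughly like this (the exact names may vary between Flux versions):

```julia
using Flux, Flux.Data.MNIST, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: partition
```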

These statements bring in Flux, the MNIST data set, some statistics functions, and some special functions provided in the Flux library for dealing with the neural network. They also bring in some additional functions for partitioning the data for training.

The next piece brings the MNIST data into forms the neural network can use. The images need to be arrays of real pixel values, and the labels need to be one-hot encoded arrays that describe each image:

imgs = MNIST.images()
labels = onehotbatch(MNIST.labels(), 0:9)

Looking closer at the results of the data, the images look like the following:

60000-element Array{Array{ColorTypes.Gray{FixedPointNumbers.Normed{UInt8,8}},2},1}:
 [Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0); Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0); … ]

Each digit image in the dataset is a 28x28 array of pixels. Looking closer at a single image, we can actually plot it using the plotting library. The first image is closest to the number 5; although it looks more like a backwards Z, it still looks more like a 5 than it does any other digit. As we train the CNN on other numbers, it will learn to classify this image as a 5.

The corresponding result label verifies the image is indeed a 5:





When we one-hot encode the label 5, the resulting vector looks like this:

[ false false false false false true false false false false ]

One-hot encoding lets us convert our categorical labels ("one", "two", "three", etc.) into machine-readable values. If a label is "one" in our result set, it would be one-hot encoded against the categories ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"] as [false, true, false, false, false, false, false, false, false, false]. In the matrix above, each column represents the one-hot encoded value of the label for the corresponding sample, with the position of the true entry indicating the digit. The first sample's label is 5 because it has a true in the position representing 5. There are 60,000 of these columns, each representing a different label in the MNIST set.
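A minimal sketch of the idea in plain Julia (Flux's onehotbatch does this for a whole label vector at once; onehot_digit here is a hypothetical helper for illustration only):

```julia
# One-hot encode a digit against the categories 0:9 -- exactly one
# entry is true, at the position corresponding to the digit.
onehot_digit(d) = collect(0:9) .== d

v = onehot_digit(5)
v[6]        # true: position 6 corresponds to the digit 5
count(v)    # 1: exactly one "hot" entry
```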

Partitioning the Data

Next we need to partition the data into batches of 1000 samples each to train the model we are creating. Flux gives us a library function to help with the partitioning operation:

train = [(cat(float.(imgs[i])..., dims = 4), labels[:, i])
         for i in partition(1:60_000, 1000)]

This comprehension breaks up the training data (both the images and the labels) into batches of 1000 samples each for training. The final training data has the following type: an array of tuples, with the first part of each tuple being the images and the second part the one-hot encoded labels.

Array{Tuple{Array{Float64,4},Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}},1}

If we look at train[1][1], the first part of the tuple, we can see it contains an array of 1000 images of 28x28 pixels:

train[1][1]

The train[1][2] data, the second part of the tuple, is 1000 columns of one-hot encoded labels. Note that if we evaluate the first 5 columns of one-hot encoded labels of our 1st batch, we come up with the values 5, 0, 4, 1, and 9 respectively. We'll see these values later as the first 5 labels in our test set for validating the model.

train[1][2]

The data is now ready to feed into the neural network model we are about to create, but before we do that, let's pull out a validation test set to feed into the model after we train it. We can take advantage of the GPU here.

train = gpu.(train)
tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> gpu
tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> gpu

Building the Model

Okay, now that we have examined and prepared the data, we can build our CNN (convolutional neural network) to train the digit images on their corresponding labels. The model will have 2 convolutional layers, 2 maxpool layers, a dense layer, and finally a softmax.

model = Chain(
    Conv((2, 2), 1 => 16, relu),
    x -> maxpool(x, (2, 2)),
    Conv((2, 2), 16 => 8, relu),
    x -> maxpool(x, (2, 2)),
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),
    softmax) |> gpu

The convolutional layer takes the following parameters: size, # input => # output, and a non-linear activation function (e.g. sigmoid, softmax, or relu). What does the convolutional layer do? The magic of the convolutional layer is that it convolves the digit image with a filter whose weights are learned during training. Convolution is best explained as one matrix traveling along another matrix, doing an element-wise multiplication against it at each position to produce a resulting convolved feature matrix. For further reading, check out this article that explains the convolution step very well with an animation of the convolution occurring between matrices. The Conv function above takes the feature matrix dimensions as its first parameter; in other words, the first layer will slide a 2x2 matrix over the image pixels to learn the feature detection filter. The first layer produces 16 outputs from 1 input, as indicated by 1=>16 in the second parameter. The relu function, the third parameter, simply sets all negative values computed in the image to 0. We do this because we want to introduce a non-linear function into our learning network, mimicking very much what our eyes do when detecting images. Other functions such as sigmoid or tanh do similar bounded things, but relu tends to be faster and gives good results.
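The sliding operation can be sketched in a few lines of plain Julia (CNN layers actually compute cross-correlation, and conv2x2 here is a hypothetical helper for illustration, not Flux's Conv):

```julia
# Slide a 2x2 filter over a matrix, taking the element-wise product sum
# at each position ("valid" convolution: the output shrinks by 1 each way).
function conv2x2(img, filt)
    h, w = size(img)
    [sum(img[i:i+1, j:j+1] .* filt) for i in 1:h-1, j in 1:w-1]
end

img = [1.0 2.0 3.0;
       4.0 5.0 6.0;
       7.0 8.0 9.0]
filt = [1.0 0.0;
        0.0 1.0]

conv2x2(img, filt)   # [6.0 8.0; 12.0 14.0] -- each entry is img[i,j] + img[i+1,j+1]
```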

Max Pooling

After the convolution layer, we apply another function called max pooling, which reduces the dimensionality of the features the network is learning while still maintaining the important feature information. The pooling function in our neural network slides a 2x2 window over different sections of the 2d input and pools each into a single value, so a 28x28 matrix pooled with a 2x2 window is reduced to a 14x14 matrix, a reduction by a factor of 4. The pooling function can take the max value, the average, or the sum; max is most commonly used, since it seems to be very effective.
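The same idea as a plain-Julia sketch (maxpool2x2 is a hypothetical stand-in for the maxpool used in the model):

```julia
# Collapse each non-overlapping 2x2 block to its maximum, halving both
# spatial dimensions (28x28 -> 14x14 in our model).
function maxpool2x2(x)
    h, w = size(x)
    [maximum(x[i:i+1, j:j+1]) for i in 1:2:h-1, j in 1:2:w-1]
end

x = [1 3 2 4;
     5 6 1 0;
     7 2 9 8;
     1 4 3 6]

maxpool2x2(x)   # [6 4; 7 9]
```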

Feature Extraction

Note that the first 4 layers (Convolution, MaxPool, Convolution, MaxPool) are used for feature extraction. These layers take a 28x28 image input and aggregate it into a simpler set of features representing the images. As more images pass over these 4 layers, the feature extractor portion learns a simplified set of features from the images fed through it. This makes it much easier for the neural net to classify digits in the remaining part of the network.

Classification

Let's look at our deep learning model again:

model = Chain(
    Conv((2, 2), 1 => 16, relu),
    x -> maxpool(x, (2, 2)),
    Conv((2, 2), 16 => 8, relu),
    x -> maxpool(x, (2, 2)),
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),
    softmax) |> gpu

We have a reshape function whose job is to get the pooled data into a form the dense neural net can process. The dense layer takes 288 inputs from the last maxpool function and trains on the data to produce 10 outputs. The outputs are then subjected to the softmax function, which essentially squashes them to values between 0 and 1; as we previously explained, all values produced by the softmax function total 1. The error between the output and the actual label value is fed back through the network to train the weights inside it. This is called backpropagation. The network uses a technique called gradient descent to adjust its weights based on the error. As the weights are adjusted with each new training sample (image, label), the model gets better and better at predicting the digit fed through it.

Training

In order to train our model in Flux, we will need three functions to pass as parameters:

- an objective function: lets the network measure how close the output is to the result, and is used for gradient descent
- an optimizer: a function that operates on the weight parameters of the network to decrease the loss and drive gradient descent
- an evaluation function: shows the progress of the training

Below are the three functions we'll use to pass into our training function:

loss(x, y) = crossentropy(model(x), y)
opt = ADAM(params(model))
accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))
evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)

The loss function maps to a built-in function called crossentropy, which compares the result of the model operating on the images against the actual image labels. The accuracy is computed by doing a onecold calculation (onecold is basically a reversal of the one-hot operation) on the predicted results of our test data and comparing it against the actual results of the test data. The accuracy function then takes the average of correct predictions over total predictions to produce a single decimal value. The evalcb function throttles the accuracy call to display every 10 seconds. The optimizer is ADAM, a stochastic optimization function whose description can be found here. Other optimizers, such as SGD (stochastic gradient descent), can also be tried.
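To make the onecold/accuracy pairing concrete, here is a plain-Julia sketch on a toy prediction matrix (my_onecold and my_accuracy are hypothetical stand-ins for Flux's onecold and the accuracy function above):

```julia
using Statistics   # for mean

# onecold reverses one-hot: pick the index of the largest entry per column.
my_onecold(m) = [argmax(m[:, j]) for j in 1:size(m, 2)]
my_accuracy(pred, truth) = mean(my_onecold(pred) .== my_onecold(truth))

# Toy example: 3 classes, 4 samples; predictions are probability columns.
pred  = [0.7 0.1 0.2 0.3;
         0.2 0.8 0.3 0.4;
         0.1 0.1 0.5 0.3]
truth = [1.0 0.0 0.0 1.0;
         0.0 1.0 0.0 0.0;
         0.0 0.0 1.0 0.0]

my_accuracy(pred, truth)   # 0.75 -- 3 of 4 columns match
```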

To train the model we will use the following Flux function containing the 3 functions we talked about:

Flux.train!(loss, train, opt, cb = evalcb)

The output below shows the evaluation of running Flux.train! one time through the data. The throttle call in evalcb displays the model's accuracy every 10 seconds as it's going through the training. Initially, the training produces only a 10% (0.099) accuracy on the test data, but by the end of training on the 60,000 images, it produces an accuracy of 56% (0.557).

accuracy(tX, tY) = 0.099
accuracy(tX, tY) = 0.113
accuracy(tX, tY) = 0.195
accuracy(tX, tY) = 0.168
accuracy(tX, tY) = 0.184
accuracy(tX, tY) = 0.323
accuracy(tX, tY) = 0.476
accuracy(tX, tY) = 0.557

If we run the data through the Flux.train! function 10 times, we start to approach accuracies of 96%. The output below shows the accuracy of the model after running the MNIST data through the model for the 10th time.

accuracy(tX, tY) = 0.952
accuracy(tX, tY) = 0.952
accuracy(tX, tY) = 0.954
accuracy(tX, tY) = 0.952
accuracy(tX, tY) = 0.956
accuracy(tX, tY) = 0.958
accuracy(tX, tY) = 0.951
accuracy(tX, tY) = 0.955

If we continued training our CNN, we would see accuracies as high as 98%!

Now let's feed our data through the model and look at the results. The first 13 labels of the data are as follows. We used onecold to get the label out of the one-hot encoded vector. Note that we need to subtract 1 from all digit labels, since Julia arrays start at 1 instead of 0.
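The index offset can be seen with a quick sketch, using argmax as a stand-in for onecold:

```julia
# The one-hot vector for the digit 5 against the categories 0:9.
hot5 = collect(0:9) .== 5

argmax(hot5)        # 6 -- Julia indexing starts at 1
argmax(hot5) - 1    # 5 -- subtract 1 to recover the digit
```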

Running the model on the first 13 images of our test set produce the following results:

Note that the CNN model predicted the first digit wrong, mistaking a 5 for a 3, but it predicted the remaining 12 digits accurately. If we look closely at the 5 image (the backwards Z from earlier in the article), we can see how even the human eye might mistake the digit for a 3.

Conclusion

Convolutional neural networks are a powerful way to classify images. The technology is currently used in artificial intelligence applications such as radiology and cancer-cell identification. I look forward to seeing the many advances in AI that tools like Flux in Julia make possible.