Neural networks and Deep Learning, Chapter 1

Introduction

This post is the first in what I hope will be a series, as I work through Michael Nielsen's free online book Neural Networks and Deep Learning. Nielsen provides Python scripts to implement the networks he describes in the text. I originally thought that I'd use this as a way to learn Python, but through laziness I've instead decided to stick with R and Rcpp; my rationalisation is that if I code everything from scratch, then I will be forced to learn the algorithms better than if I just stare at Nielsen's code. I don't know if R and Rcpp is a sensible strategy in the long run – R doesn't always cope well with even moderately large datasets, and I expect to encounter these in machine learning problems – but for the MNIST database of handwritten digits, which is the main dataset used in the book and which contains 60,000 28x28-pixel digits, everything seems to go OK.

The book feels to me (so far!) like it should be accessible in all its mathematical details to a 3rd or 4th year undergrad in a mathsy course. I was able to derive most of chapter 2 (albeit implementing it inefficiently when I first coded it up) before reading it, and it felt like a maths assignment at roughly that level. Since you can get the full details and explanations from Nielsen, I'll be pretty sparse with the theory in these posts. Some results of my first working neural network and some non-working ones are in the results section below.

Setup

A neural network consists of a series of layers, each layer containing some number of neurons. The first layer is the input; for the digit-recognition problem, there will be one input neuron for each pixel, so \( 28^2 = 784 \) neurons. The output layer is usually set by what we want the network to do – in this case, there will be 10 output neurons, one for each digit between 0 and 9. We would like an output neuron to fire if and only if the input image shows the corresponding digit.
In between the input and output layers is some number of so-called hidden layers. For this chapter there will only be one hidden layer; apparently it's a lot harder to train a network with more than one hidden layer, so we'll be starting simple. Given the large number of input neurons and small number of output neurons, I expect that the layers are monotone decreasing in their numbers of neurons, but I don't know for sure. (In the recommended setup for this chapter, there are 30 neurons in the hidden layer, in between the 784 inputs and 10 outputs.)

Each neuron in the input layer has some value, read from the current training image; let \( \bm{a}^0 \) be the vector of these values. (I'll use superscript indices to denote the relevant layer in the network; be careful if you're working through these notes closely though, since sometimes I deviate from Nielsen's indexing conventions. He probably uses standard conventions and mine should be considered anarchic.) The "activation" value of each neuron in the next layer (i.e., the hidden layer) \( a^1_j \) is a linear combination of those in the first layer, plus the neuron's "bias" \( b^1_j \), and then transformed to the interval \( (0, 1) \) by a sigmoid function, \( \sigma(z) = 1 / (1 + \exp(-z)) \). Allowing \( \sigma \) to work element-wise through a vector, and letting \( W^0 \) be an \( N_1 \times N_0 \) matrix, the vector of activation values of the hidden layer's neurons is

\begin{equation*} \bm{a}^1 = \sigma(W^0 \bm{a}^0 + \bm{b}^1). \end{equation*}
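To make the notation concrete, here is a small standalone C++ sketch of that layer computation – not the Rcpp code I'll show later, just the formula above, with \( W^0 \) stored as a flat row-major array:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// a1 = sigma(W0 * a0 + b1), with W0 stored row-major as an N1 x N0 matrix.
std::vector<double> layer_activations(const std::vector<double>& W0,
                                      const std::vector<double>& a0,
                                      const std::vector<double>& b1) {
    std::size_t N0 = a0.size();
    std::size_t N1 = b1.size();
    std::vector<double> a1(N1);
    for (std::size_t j = 0; j < N1; j++) {
        double z = b1[j];
        for (std::size_t k = 0; k < N0; k++) {
            z += W0[j * N0 + k] * a0[k];  // (W0 a0)_j
        }
        a1[j] = sigmoid(z);  // squash the weighted input into (0, 1)
    }
    return a1;
}
```

With all weights and biases zero, every activation comes out as \( \sigma(0) = 0.5 \), which is a handy sanity check.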

Subsequent layers of neurons are defined similarly. The goal of training the network is to set all the weights \( W^i_{jk} \), and all of the biases \( b^i_j \), so that the activation values of the output neurons agree with the true answer whenever a new image is input (for this chapter, all of these values are initialised to random standard Gaussian numbers). To this end, 50,000 of the MNIST images are used as a training set, and 10,000 as a test set (written by a different set of people from those in the training set, so that the testing is properly out of sample). In a short while, it will be useful to have defined

\begin{equation*} \bm{z}^1 = W^0 \bm{a}^0 + \bm{b}^1, \end{equation*}

so that, in general, \( \bm{a}^i = \sigma(\bm{z}^i) \).

The training method is surprisingly simple conceptually, at least for this introductory chapter, and the details of its derivation are also not particularly difficult – it involves repeated application of the chain rule with partial derivatives, and the biggest troubles come from getting lost in the indices (and from regular checks to make sure that you remembered which way round the dimensions of the \( W^i \) matrices are). Let \( L \) be the (superscript) index of the output layer. For a given training image, let \( \bm{y}^L \) be the desired outputs; this will be a vector with a 1 in the position of the handwritten digit and zeros elsewhere. I've written \( \bm{y}^L \) with a superscript index, to remind us that it applies to the output layer, but we won't have \( \bm{y} \)'s anywhere else. The actual output of the current network is \( \bm{a}^L \). Define the cost function \( C \) (a function of all the weights and biases) by

\begin{align}
\nonumber C &= \frac{1}{2} || \bm{a}^L - \bm{y}^L ||^2 \\
\label{eqn_cost} &= \frac{1}{2} \sum_j (a^L_j - y^L_j)^2.
\end{align}

I have sneakily omitted writing the start and end of the sum over \( j \) in the latter sum, mostly so that I leave worries about zero-indexing conventions to the code. It is a sum over the output neurons, running either from 1 to \( N_L \) or from 0 to \( (N_L - 1) \). The technique now is to use gradient descent, where the gradient \( \nabla \) has a component for every element of every weight matrix \( W^i_{jk} \) and every bias \( b^i_j \). The gradient \( \nabla C \) is a vector pointing in the direction of greatest increase of the scalar field \( C \); since the goal is to minimise the cost function, the procedure is to increment all weights and biases in the direction opposite \( \nabla C \).

Consider one of the components of the gradient, \( \partial C / \partial b^L_j \) (it is easiest to start with the output layer). Equation \eqref{eqn_cost} is written only in terms of the output activations \( a^L_j \), so apply the chain rule:

\begin{equation*} \frac{\partial C}{\partial b^L_j} = \sum_k \frac{\partial C}{\partial a^L_k}\frac{\partial a^L_k}{\partial b^L_j}. \end{equation*}

The partial derivative \( \partial C / \partial a^L_k \) is just \( a^L_k - y^L_k \). The key then to computing the gradient of the cost function is to find the partial derivatives of the output activations with respect to each of the biases and weights. For the output layer, this is straightforward, since

\begin{align*} \frac{\partial a^L_k}{\partial b^L_j} &= \frac{\partial}{\partial b^L_j} \sigma((W^{L-1}\bm{a}^{L-1})_k + b^L_k) \\ &= \frac{d\sigma(z^L_k)}{dz^L_k} \frac{\partial z^L_k}{\partial b^L_j} \\ &= \frac{d\sigma(z^L_k)}{dz^L_k} \delta_{jk}. \end{align*}
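If you don't trust a chain-rule calculation (I often don't trust mine), a finite-difference check on a single output neuron is cheap. In this standalone C++ sketch, sum_w stands for the fixed weighted input \( (W^{L-1}\bm{a}^{L-1})_j \), and the analytic derivative \( (a^L_j - y^L_j)\,\sigma'(z^L_j) \) is compared against \( (C(b+h) - C(b-h))/2h \):

```cpp
#include <cmath>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }
double sigmoid_prime(double z) { double s = sigmoid(z); return s * (1.0 - s); }

// Cost of a single output neuron: weighted input sum_w, bias b, target y.
double neuron_cost(double sum_w, double b, double y) {
    double a = sigmoid(sum_w + b);
    return 0.5 * (a - y) * (a - y);
}

// Analytic dC/db = (a - y) * sigma'(z), with z = sum_w + b.
double neuron_cost_grad_b(double sum_w, double b, double y) {
    double z = sum_w + b;
    return (sigmoid(z) - y) * sigmoid_prime(z);
}

// Central finite-difference approximation to dC/db.
double neuron_cost_grad_b_fd(double sum_w, double b, double y, double h) {
    return (neuron_cost(sum_w, b + h, y) - neuron_cost(sum_w, b - h, y))
           / (2.0 * h);
}
```

The two numbers agree to within the \( O(h^2) \) error of the central difference, which is reassuring when the index-juggling gets hairy.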

To save space, I'll write \( d\sigma(z^i_k) / dz^i_k \) as \( \sigma'^i_k \). This quantity is easy to compute, since the current \( \bm{z}^i \) values are known, and \( \sigma'(z) = \exp(-z) / (1 + \exp(-z))^2 \), which can also be written as \( \sigma(z)(1 - \sigma(z)) \). Working backwards through the layers, the procedure will still be to apply the chain rule, and the result will be a growing collection of \( W \)'s and \( \sigma' \)'s. The fortunate aspect is that the derivatives of the earlier layer's variables can be expressed in terms of the derivatives of the subsequent layer, with only one new \( W \) and one new \( \sigma' \). Summing over repeated indices that don't appear on the LHS (!):

\begin{align*} \frac{\partial a^L_k}{\partial b^{L-1}_j} &= \frac{\partial}{\partial b^{L-1}_j} \sigma((W^{L-1}\sigma(W^{L-2}\bm{a}^{L-2} + \bm{b}^{L-1}) + \bm{b}^L)_k) \\ &= \sigma'^L_k W^{L-1}_{kl} \frac{\partial}{\partial b^{L-1}_j} \sigma(W^{L-2}_{lm}a^{L-2}_m + b^{L-1}_l) \\ &= \sigma'^L_k W^{L-1}_{kl} \sigma'^{L-1}_l \delta_{jl} \\ &= \sigma'^L_k \sigma'^{L-1}_j W^{L-1}_{kj}. \end{align*}

The \( \sigma'^L_k \) at the end there is the same as for the derivative for the output layer; any derivatives for still earlier layers (not that we have any for this chapter!) end up with sums over repeated indices at the end, but it all stays as matrix multiplication and element-wise vector multiplication ("Hadamard products", Nielsen tells us). The partial derivative with respect to a weight matrix element \( W^i_{jk} \) is identical to that for \( b^{i+1}_j \) but for an extra factor of \( a^i_k \). (That last subscript \( k \) is not the same as the \( k \) above in \( \partial a^L_k \) etc.) Read Chapter 2 for a full treatment and presentation of this process, called "back-propagation" because of the way you start with the output layer and then compute the derivatives backwards through the layers (or get out a pen and paper).

The cost function is notionally defined over the whole training set, so all of the above work should be placed inside a loop over the training images. But the gradient defined over the whole training set is approximated quite well by a small subset, so the big loop is usually divided into small "batches", with the weights and biases being incremented opposite to the approximate gradient at the end of each batch. Once all batches in a training set have been trained on, that is called an epoch of training, and then you probably go back to the start and train for some more epochs. When using batches ("stochastic gradient descent"), the order of the training images is shuffled at the start of each epoch. There is nothing preventing the algorithm descending to a local rather than global minimum with this procedure, and I expect some sort of stochastic step in future chapters to overcome this limitation.

Data and import

The MNIST page separates the handwritten digits into a training set of 60,000 images and a test set of 10,000.
Nielsen makes a further separation of the 60,000 into 50,000 training images and 10,000 validation images, to be left unused until a later chapter. He provides a ready-made dataset in Python pickle format (which I was only able to load into Python 2.7, not 3.4, because of an encoding issue – something to watch out for if you know as little Python as I do and are trying to follow his procedures). Since I'm working in R/Rcpp, I wrote some boring code to read the MNIST files directly and re-create Nielsen's dataset for R. It also uses the EBImage package to write the images to file. Here is the (boring) code:
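(Incidentally, the 16 header bytes that the code below skips are four big-endian 32-bit integers: a magic number – 2051 for image files, 2049 for label files, whose header is only 8 bytes – then the image count, rows, and columns. A standalone C++ sketch of decoding one such integer from a raw buffer, separate from the R code I actually used:)

```cpp
#include <cstddef>
#include <cstdint>

// Read a big-endian 32-bit unsigned integer starting at buf[offset].
// MNIST IDX headers store the magic number and dimension sizes this way.
uint32_t read_be32(const unsigned char* buf, std::size_t offset) {
    return (uint32_t(buf[offset]) << 24) |
           (uint32_t(buf[offset + 1]) << 16) |
           (uint32_t(buf[offset + 2]) << 8) |
           uint32_t(buf[offset + 3]);
}
```

For the 60,000-image training file, the header decodes to magic 2051, then 60000, 28, 28.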

library(EBImage)

in_files = c("digit_data/t10k-images.idx3-ubyte",
             "digit_data/train-images.idx3-ubyte")
hdr_bytes = 16
images = list()
length(images) = 2
for (j in seq_along(in_files)) {
  in_file = in_files[j]
  file_size = file.size(in_file)
  file_con = file(in_file, "rb")
  binary_data = as.numeric(readBin(file_con, n=file_size, "raw")) / 255
  close(file_con)
  num_images = (file_size - hdr_bytes) / 784
  images[[j]] = list()
  length(images[[j]]) = num_images
  offset = 0
  if (j == 1) { out_folder = "test_images" }
  if (j == 2) { out_folder = "training_images" }
  for (i in 1:num_images) {
    if (i %% 100 == 0) {
      # Progress update:
      print(i)
    }
    start_i = (i - 1) * 784 + 17
    end_i = start_i + 783
    this_vector = binary_data[start_i:end_i]
    images[[j]][[i]] = this_vector
    # The data has white pen strokes on a black background, so flip for
    # the image file:
    this_image = Image(1 - matrix(this_vector, ncol=28, nrow=28))
    if (j == 2) {
      if (i == 50001) {
        offset = -50000
        out_folder = "validation_images"
      }
    }
    out_file = sprintf("digit_data/%s/%05d.png", out_folder, i + offset)
    writeImage(this_image, out_file)
  }
}

length(images) = 3
images[[3]] = list()
for (i in 50001:60000) {
  images[[3]][[i - 50000]] = images[[2]][[i]]
}
length(images[[2]]) = 50000
names(images) = c("test", "training", "validation")

# Now the labels
in_files = c("digit_data/t10k-labels.idx1-ubyte",
             "digit_data/train-labels.idx1-ubyte")
hdr_bytes = 8
digits = list()
length(digits) = 3
for (j in seq_along(in_files)) {
  in_file = in_files[j]
  file_size = file.size(in_file)
  file_con = file(in_file, "rb")
  digits[[j]] = as.numeric(readBin(file_con, n=file_size, "raw"))
  close(file_con)
  digits[[j]] = digits[[j]][(hdr_bytes + 1):file_size]
}
digits[[3]] = digits[[2]][50001:60000]
digits[[2]] = digits[[2]][1:50000]

# Convert the digits into vectors of length 10 where the digit's
# entry is set to 1 and all others to 0.
vec_digits = list()
length(vec_digits) = length(digits)
for (i in 1:length(vec_digits)) {
  vec_digits[[i]] = list()
  for (j in 1:length(digits[[i]])) {
    vec_digits[[i]][[j]] = numeric(10)
    vec_digits[[i]][[j]][digits[[i]][j] + 1] = 1
  }
}
names(digits) = c("test", "training", "validation")
names(vec_digits) = c("test", "training", "validation")

save(list=c("images", "digits", "vec_digits"),
     file="digit_data/r_images_nielsen.rda")

Here is what the first handful of test images look like, roughly (this is a screenshot from Ubuntu's file manager):

[Screenshot: thumbnails of the first few test-digit PNGs in Ubuntu's file manager.]

Code

I store the neural network as an R list, which is a convenient data structure that Rcpp handles after a fashion. Here is the R code, which calls some Rcpp functions that I'll show afterwards:

library(Rcpp)
library(ggplot2)
sourceCpp("train_test_network.cpp")
load("digit_data/r_images_nielsen.rda")

new_network = function(n) {
  # Initialises a neural network.
  # n is a vector containing the number of neurons in each layer.
  num_layers = length(n)
  if (num_layers < 2) {
    stop("Need longer input to new_network()")
  }
  network = list()
  length(network) = num_layers
  for (i in seq_along(n)) {
    this_n = n[i]
    # Parameter that I might play with sometimes:
    k = 1
    network[[i]] = list()
    network[[i]]$a = 0*rnorm(this_n)
    network[[i]]$bias = k*rnorm(this_n)
    if (i < num_layers) {
      next_n = n[i+1]
      network[[i]]$weight = matrix(k*rnorm(this_n*next_n),
                                   nrow=next_n, ncol=this_n)
    }
  }
  return(network)
}

start_network = new_network(c(784, 30, 10))
out_network = train_network(images$training, vec_digits$training, start_network,
                            images$test, vec_digits$test, 3, 10, 20)
test_answers = test_network(images$test, out_network)
test_answers_vec = sapply(test_answers, function(x) { which(x == max(x)) - 1 })
true_answers = digits$test

qplot(true_answers, test_answers_vec, alpha=.1, position=position_jitter()) +
  guides(alpha=FALSE)
print(length(which(true_answers == test_answers_vec)))
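As an aside, if you'd rather see the initialisation in plain C++ than R, here is a hypothetical sketch – my own Layer struct, using the standard library's generator instead of R's rnorm; it mirrors new_network() above but isn't part of the code I actually ran:

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct Layer {
    std::vector<double> a;                    // activations, filled at feed-forward time
    std::vector<double> bias;                 // one bias per neuron
    std::vector<std::vector<double>> weight;  // weight[j][k], next_n x this_n
};

// Build a network with Gaussian biases and weights (scaled by k),
// one weight matrix per layer connecting layer i to layer i+1.
std::vector<Layer> make_network(const std::vector<std::size_t>& n, double k,
                                unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<Layer> network(n.size());
    for (std::size_t i = 0; i < n.size(); i++) {
        network[i].a.assign(n[i], 0.0);
        network[i].bias.resize(n[i]);
        for (double& b : network[i].bias) b = k * gauss(rng);
        if (i + 1 < n.size()) {
            network[i].weight.assign(n[i + 1], std::vector<double>(n[i]));
            for (auto& row : network[i].weight)
                for (double& w : row) w = k * gauss(rng);
        }
    }
    return network;
}
```

The dimensions match the \( N_{i+1} \times N_i \) convention from the Setup section: the last layer gets no weight matrix.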

The bulk of the code is in Rcpp. I wrote almost everything in element-by-element for loops, though I suspect you could vectorise some of it, perhaps with RcppArmadillo. Editing the contents of a List structure is a little bit painful: you make a shallow copy of the part of the list you want to edit, and then fill that shallow copy in element by element. I couldn't get it to work by directly setting the shallow copy equal to, e.g., some other NumericVector. But I'm a very, very long way from an expert on Rcpp.

The important routines are feed_forward(), test_network(), and train_network(). Various parameters are passed to train_network(), including the learning rate \( \eta \), which defines how far to increment all variables in the direction opposite the gradient of the cost function; the size of the batches for the approximate gradient calculations; and the number of epochs to train for. As well as the training data, the test data is also passed to train_network(), so that the progress of the training can be tracked epoch by epoch.

The fiddliness with the List structures and my use of for loops rather than vectorised or matrix calculations means that my code is quite sprawling compared to Nielsen's Python. My code is also about 20% slower, which probably says something about the relative abilities of me and the people who made numpy.

#include <Rcpp.h>
using namespace Rcpp;

double sigmoid(double z) {
    return 1.0 / (1.0 + exp(-z));
}

NumericVector clone_vector(NumericVector input) {
    long N = input.size();
    NumericVector v(N);
    for (long i = 0; i < N; i++) {
        v[i] = input[i];
    }
    return v;
}

NumericMatrix clone_matrix(NumericMatrix input) {
    long N1 = input.nrow();
    long N2 = input.ncol();
    NumericMatrix M(N1, N2);
    for (long i = 0; i < N1; i++) {
        for (long j = 0; j < N2; j++) {
            M(i, j) = input(i, j);
        }
    }
    return M;
}

// http://gallery.rcpp.org/articles/stl-random-shuffle/
// wrapper around R's RNG such that we get a uniform distribution over
// [0,n) as required by the STL algorithm
inline int rand_wrapper(const int n) { return floor(unif_rand()*n); }

NumericVector shuffle(NumericVector a) {
    // clone a into b to leave a alone
    NumericVector b = clone(a);
    std::random_shuffle(b.begin(), b.end(), rand_wrapper);
    return b;
}

List feed_forward(NumericVector input, List network) {
    int num_layers = network.size();
    List this_list, prev_list;
    NumericVector a, prev_a, b;
    long i, j, k;
    double sum_z;
    for (i = 0; i < num_layers; i++) {
        this_list = network[i];
        a = this_list["a"];
        if (i == 0) {
            for (j = 0; j < input.size(); j++) {
                a[j] = input[j];
            }
        } else {
            prev_list = network[i - 1];
            prev_a = prev_list["a"];
            b = this_list["bias"];
            NumericMatrix W = prev_list["weight"];
            for (j = 0; j < a.size(); j++) {
                sum_z = 0;
                for (k = 0; k < prev_a.size(); k++) {
                    sum_z += W(j, k) * prev_a[k];
                }
                sum_z += b[j];
                a[j] = sigmoid(sum_z);
            }
        }
    }
    return network;
}

long max_i(NumericVector v) {
    long N = v.size();
    double max_v = max(v);
    for (long i = 0; i < N; i++) {
        if (v[i] == max_v) {
            return i;
        }
    }
    return -1;
}

// [[Rcpp::export]]
List test_network(List test_images, List network) {
    long N = test_images.size();
    long num_layers = network.size();
    List answers(N);
    List temp_list;
    temp_list = network[num_layers - 1];
    NumericVector a_out = clone_vector(temp_list["a"]);
    long num_out = a_out.size();
    for (long i = 0; i < N; i++) {
        NumericVector this_test = test_images[i];
        network = feed_forward(this_test, network);
        answers[i] = NumericVector(num_out);
        NumericVector v = answers[i];
        temp_list = network[num_layers - 1];
        NumericVector temp_v = clone_vector(temp_list["a"]);
        for (long j = 0; j < num_out; j++) {
            v[j] = temp_v[j];
        }
    }
    return answers;
}

// [[Rcpp::export]]
List train_network(List training_set, List training_outputs, List network,
                   List test_set, List test_outputs, double eta,
                   long batch_size, long num_epochs) {
    long i, j, k, train_ct, i_train, max_this_batch, i_epoch, i_layer;
    double this_deriv_part, this_deriv_term;
    NumericVector next_a_term, this_a_term;
    NumericMatrix aux_W;
    List aux_list;
    int num_layers = network.size();
    long num_training_images = training_set.size();
    long num_test_images = test_set.size();
    long num_test_correct;
    long num_neurons[num_layers];
    List delta_W(num_layers - 1);
    List delta_b(num_layers);
    List helper_delta_b(num_layers);
    List temp_list;
    for (i = 0; i < num_layers; i++) {
        // Rcpp doesn't seem to like combining the following three lines:
        temp_list = network[i];
        NumericVector this_layer = temp_list["a"];
        num_neurons[i] = this_layer.size();
        delta_b[i] = NumericVector(num_neurons[i]);
        if (i > 0) {
            delta_W[i - 1] = NumericMatrix(num_neurons[i], num_neurons[i - 1]);
        }
    }
    NumericVector order_images(num_training_images);
    for (i = 0; i < num_training_images; i++) {
        order_images[i] = i;
    }
    for (i_epoch = 0; i_epoch < num_epochs; i_epoch++) {
        if (i_epoch % 1 == 0) {
            Rprintf("Starting epoch %d\n", i_epoch);
        }
        order_images = shuffle(order_images);
        train_ct = 0;
        do {
            // Re-initialise the increment lists:
            for (i = 0; i < num_layers; i++) {
                delta_b[i] = NumericVector(num_neurons[i]);
                if (i > 0) {
                    delta_W[i - 1] = NumericMatrix(num_neurons[i],
                                                   num_neurons[i - 1]);
                }
            }
            max_this_batch = train_ct + batch_size;
            if (max_this_batch > num_training_images) {
                max_this_batch = num_training_images;
            }
            for (i_train = train_ct; i_train < max_this_batch; i_train++) {
                NumericVector this_training =
                    training_set[(long) order_images[i_train]];
                network = feed_forward(this_training, network);
                NumericVector this_training_output =
                    training_outputs[(long) order_images[i_train]];
                for (i = 0; i < num_layers; i++) {
                    helper_delta_b[i] = NumericVector(num_neurons[i]);
                }
                List temp_list = network[num_layers - 1];
                NumericVector output_a = clone_vector(temp_list["a"]);
                NumericVector output_d_sig = clone_vector(temp_list["a"]);
                // Start with the output layer.
                NumericVector this_helper = helper_delta_b[num_layers - 1];
                for (j = 0; j < num_neurons[num_layers - 1]; j++) {
                    output_d_sig[j] = output_d_sig[j] * (1.0 - output_d_sig[j]);
                    this_helper[j] = (output_a[j] - this_training_output[j])
                                     * output_d_sig[j];
                }
                // Back-propagate:
                for (i = num_layers - 2; i >= 1; i--) {
                    NumericVector this_helper = helper_delta_b[i];
                    NumericVector next_helper = helper_delta_b[i + 1];
                    List temp_list = network[i];
                    NumericMatrix W = temp_list["weight"];
                    NumericVector this_a = clone_vector(temp_list["a"]);
                    for (j = 0; j < num_neurons[i]; j++) {
                        this_helper[j] = 0.0;
                        for (k = 0; k < num_neurons[i + 1]; k++) {
                            this_helper[j] += next_helper[k] * W(k, j);
                        }
                        this_helper[j] *= this_a[j] * (1.0 - this_a[j]);
                    }
                }
                for (i = num_layers - 1; i >= 1; i--) {
                    temp_list = network[i - 1];
                    NumericVector prev_a = clone_vector(temp_list["a"]);
                    NumericVector this_helper = helper_delta_b[i];
                    NumericVector this_delta_b = delta_b[i];
                    NumericMatrix this_delta_W = delta_W[i - 1];
                    for (j = 0; j < num_neurons[i]; j++) {
                        for (k = 0; k < num_neurons[i - 1]; k++) {
                            this_delta_W(j, k) += this_helper[j] * prev_a[k];
                        }
                        this_delta_b[j] += this_helper[j];
                    }
                }
            }
            // Increment weights and biases:
            for (i = 0; i < num_layers - 1; i++) {
                List temp_list = network[i];
                NumericMatrix W = temp_list["weight"];
                List temp_list2 = network[i + 1];
                NumericVector b = temp_list2["bias"];
                NumericMatrix this_delta_W = delta_W[i];
                NumericVector this_delta_b = delta_b[i + 1];
                for (j = 0; j < num_neurons[i + 1]; j++) {
                    for (k = 0; k < num_neurons[i]; k++) {
                        W(j, k) -= eta*this_delta_W(j, k);
                    }
                    b[j] -= eta*this_delta_b[j];
                }
            }
            train_ct = max_this_batch;
        } while (train_ct < num_training_images - 1);
        // See how we're doing.
        List test_answers = test_network(test_set, network);
        num_test_correct = 0;
        for (i = 0; i < num_test_images; i++) {
            NumericVector this_ans = test_answers[i];
            NumericVector this_true = test_outputs[i];
            if (max_i(this_ans) == max_i(this_true)) {
                num_test_correct++;
            }
        }
        Rprintf("Score of %d out of %d\n", num_test_correct, num_test_images);
    }
    return network;
}