Neural networks can seem like a bit of a black box. But in some ways, a neural network is little more than several logistic regression models chained together.

In this post I will show you how to derive a neural network from scratch with just a few lines in R. If you don’t like mathematics, feel free to skip to the code chunks towards the end.

This blog post is partly inspired by Denny Britz’s article, Implementing a Neural Network from Scratch in Python, as well as this article by Sunil Ray.

Logistic regression What’s a logistic regression model? Suppose we want to build a machine that classifies objects in two groups, for example ‘hot dog’ and ‘not hot dog’. We will say \(Y_i = 1\) if object i is a hot dog, and \(Y_i = 0\) otherwise. A logistic regression model estimates these odds, \[ \operatorname{odds}(Y = 1) = \frac{\operatorname P(Y = 1)} {\operatorname P(Y = 0)} = \frac{\operatorname P(Y = 1)} {\operatorname 1 - \operatorname P(Y = 1)} \] and for mathematical and computational reasons we take the natural logarithm of the same—the log odds. As a generalised linear model, the response (the log odds) is a linear combination of the parameters and the covariates, \[\operatorname{log-odds}(Y = 1) = X \beta,\] where \(X\) is the design matrix and \(\beta\) is a vector of parameters to be found. Classical statistical inference involves looking for a parameter estimate, \(\hat\beta\), that maximises the likelihood of the observations given the parameters. For our hot dog classification problem, the likelihood function is \[\mathcal{L} = \prod_i \operatorname P(Y_i=1)^{y_i} \operatorname P(Y_i = 0)^{1 - y_i}\] or, taking logs, \[\log \mathcal{L} = \sum_i \Bigl[ y_i \log \operatorname P(Y_i = 1) + (1-y_i) \log \operatorname P(Y_i = 0) \Bigr]. \] Let’s imagine we have been given some two-dimensional data about a sample of objects, along with their labels. Our sample data set includes 200 observations. Here is a random sample of five: x1 x2 class 8.81 -6.19 not hot dog 10.00 0.84 not hot dog -1.78 4.48 hot dog 6.83 -0.02 hot dog 1.71 -10.29 not hot dog And a scatter plot of all of them: Fitting a logistic regression model in R is easy: logreg <- glm(class ~ x1 + x2, family = binomial, data = hotdogs) But as the decision boundary can only be linear, it doesn’t work too well at distinguishing our two classes. Logistic regression classifies 134 (67%) of our objects correctly. Nonlinearity is where neural networks are useful.

Artificial neural networks An artificial neural network is so called because once upon a time it was thought to be a good model for how neurons in the brain work. Apparently it isn’t, but the name has stuck. Suppose we have some inputs, \(\mathbf x\), and known outputs \(\mathbf y\). Then the aim of the game is to find a way of estimating \(\mathbf y\) based on \(\mathbf x\). In a way, then, a neural network is like any other regression or classification model. Neural networks (for classification, at least) are also known as multi-layer perceptrons. Like ogres and onions, they have layers—three to be exact. The output layer is our estimate of the probability that objects belong to each class. The input layer comprises the covariates and an intercept, as before. In the middle, there is a hidden layer, which is a transformation of the input space into \(h\) dimensions, where \(h\) is a number chosen by us. We then perform a logistic regression on this transformed space to estimate the classes. It works like this. Generate \(h\) different linear combinations of the input variables. Apply an ‘activation’ function, that for each observation, turns each hidden node ‘on’ or ‘off’. Fit a logistic regression model to these \(h\) transformed predictors, plus an intercept. Adjust the parameters of both the input and the output to maximise likelihood. Repeat, ad nauseum. If \(h = 1\) then there is only one linear combination of the predictors, which is effectively the same thing as having no hidden layer at all, i.e. ordinary logistic regression. So we run a kind of logistic regression model on the inputs to generate a transformation of the feature space, then once more to classify our actual objects. Each iteration of the fitting or ‘training’ process adjusts both the transformation and the regression parameters.

Hidden layers The input layer is a design matrix, \(\mathbf X = \begin{bmatrix}\mathbf 1 & \mathbf x\end{bmatrix}\). The output layer is a vector of estimated probabilities, \(\hat {\mathbf y}\). So far, this is exactly like our logistic regression model above. A neural network adds a hidden layer, which you might think of as an intermediate design matrix between the inputs and the outputs. Deep learning is just a neural network with multiple hidden layers. If there are \(d\) input variables then there are \(d+1\) input nodes—one for each covariate, plus an intercept, or ‘bias’ (because computer scientists like having separate names for everything). In general there are \(k-1\) output nodes, where \(k\) is the number of possible classes. The \(k^\text{th}\) node would be the same as ‘none of the above’, so it is redundant. In a binary classification problem like ours, there is just one output node, because \(\operatorname P(Y=0) = 1 - \operatorname P(Y=1)\). There are \(h\) nodes in the hidden layer, plus an intercept. Each of these is a linear combination of the inputs, passed through an activation function. The intercept does not depend on the nodes in the previous layer . The activation function is often (not always) chosen to be the sigmoid function, another name for the logistic function, \[\sigma(x) = \frac 1 {1 + e^{-x}},\] the inverse of the logit \[\operatorname{logit}(x) = \log \left( \frac{x}{1-x} \right).\] If \(x\) is a probability of an event, then \(\operatorname{logit}(x)\) is its log odds.

Forward propagation Starting with the inputs, we feed forward through the network as follows. Firstly, compute a linear combination of the covariates, using some weight matrix \(\mathbf W_\text{in} \in \mathbb R^{(d+1) \times h}\). \[ \mathbf z_1 = \mathbf{XW}_\text{in} = \begin{bmatrix}\mathbf 1 & \mathbf x\end{bmatrix} \mathbf W_\text{in} \] Next, apply an activation function to obtain the nodes in the hidden layer. The hidden layer \(\mathbf H\) might be thought of as a design matrix containing the output of a logistic regression classifying whether each node is ‘activated’ or not. \[\mathbf h = \sigma(\mathbf z_1)\] The intercept/bias is always activated, so it is fixed to be a vector of ones. \[\mathbf H = \begin{bmatrix} \mathbf 1 & \mathbf h \end{bmatrix} = \begin{bmatrix} \mathbf 1 & \sigma(\mathbf z_1) \end{bmatrix} = \begin{bmatrix} \mathbf 1 & \sigma(\mathbf {XW}_\text{in}) \end{bmatrix}\] For the output layer, compute a linear combination of the hidden variables, this time using another weight matrix, \(\mathbf{W}_\text{out} \in \mathbb R^{(h+1) \times (k-1)}\). \[\mathbf z_2 = \mathbf {HW}_\text{out} = \begin{bmatrix} \mathbf 1 & \mathbf h\end{bmatrix} \mathbf W_\text{out} \] Apply one more function to get the output \[\hat {\mathbf y} = \sigma (\mathbf z_2),\] which is a probability vector, \(\hat Y_i = \operatorname P(Y_i = 1)\). Putting it all together, \[\hat {\mathbf y} = \sigma \left( \mathbf {H W} _ \text{out} \right) = \sigma \bigl( \begin{bmatrix} \mathbf 1 & \sigma ( \mathbf {X W} _ \text{in} ) \end{bmatrix} \mathbf W _ \text{out} \bigr).\] It is straightforward to write a function to perform the forward propagation process in R. Just do feedforward <- function(x, w1, w2) { z1 <- cbind(1, x) %*% w1 h <- sigmoid(z1) z2 <- cbind(1, h) %*% w2 list(output = sigmoid(z2), h = h) } where sigmoid <- function(x) 1 / (1 + exp(-x)) Where do we get the weights from? On the first iteration, they can be random. Then we have to make them better.

Back propagation So far we have been taking the parameters, or weights, \(\mathbf W_\text{in}\) and \(\mathbf W_\text{out}\), for granted. Like parameters in a linear regression, we need to choose weights that make our model ‘better’ by some criterion. Neural network enthusiasts will say that we will train our multilayer perceptron by minimising the cross entropy loss. That’s a fancy way of saying we fit the model using maximum likelihood. The log-likelihood for a binary classifier is \[\ell = \sum_i \Bigl( y_i \log \hat y_i + (1 - y_i) \log (1 - \hat y_i) \Bigr).\] We can maximise this via gradient descent, a general-purpose optimisation algorithm. To minimise \(\ell = f(\mathbf W)\) via gradient descent, we iterate using the formula \[\mathbf W_{t+1} = \mathbf W_{t} - \gamma \cdot

abla f(\mathbf W_{t}),\] where \(\mathbf W_t\) is the weight matrix at time \(t\), \(

abla f\) is the gradient of \(f\) with respect to \(\mathbf W\) and \(\gamma\) is the ‘learning rate’. Choose a learning rate too high and the algorithm will leap around like a dog chasing a squirrel, going straight past the optimum; choose one too low and it will take forever, making mountains out of molehills. Using the chain rule, the gradient of the log-likehood with respect to the output weights is given by \[\frac {\partial\ell} {\partial \mathbf W_\text{out}} = \frac{\partial \ell}{\partial \hat {\mathbf y}} \frac{\partial \hat {\mathbf y} }{\partial \mathbf W_\text{out}}\] where \[\begin{aligned} \frac{\partial\ell}{\partial \hat{\mathbf y}} &= \frac{\mathbf y}{\hat{\mathbf y}} - \frac{1 - \mathbf y}{1 - \hat{\mathbf y}} = \frac{\hat{\mathbf y} - \mathbf y}{\hat{\mathbf y}(1 - \hat{\mathbf y})},\\ \frac{\partial\hat{\mathbf y}}{\partial \mathbf W_\text{out}} &= \mathbf{H}^T \sigma(\mathbf {HW}_\text{out})\bigl( 1 - \sigma(\mathbf{HW}_\text{out}) \bigr) \\ &= \mathbf H^T \hat {\mathbf y} (1 - \hat{\mathbf y}). \end{aligned}\] And the gradient with respect to the input weights is \[ \frac {\partial\ell} {\partial \mathbf W_\text{in}} = \frac{\partial \ell}{\partial \hat {\mathbf y} } \frac{\partial \hat {\mathbf y} }{\partial \mathbf H} \frac{\partial \mathbf H}{\partial \mathbf W_\text{in}} \] where \[ \begin{aligned} \frac{\partial \hat{\mathbf y}}{\partial \mathbf H} &= \sigma(\mathbf{HW}_\text{out}) \bigl( 1 - \sigma(\mathbf{HW}_\text{out}) \bigr) \mathbf W_\text{out}^T \\ &= \hat{\mathbf y} (1 - \hat{\mathbf y}) \mathbf W_\text{out}^T, \\ \frac{\partial \mathbf H}{\partial \mathbf W_\text{in}} &= \mathbf X^T \begin{bmatrix} \mathbf 0 & \sigma(\mathbf{XW}_\text{in})\bigl( 1 - \sigma(\mathbf{XW}_\text{in}) \bigr) \end{bmatrix}. \end{aligned} \] In the last step, we take for granted that the intercept column of \(\mathbf H\) doesn’t depend on \(\mathbf W_\text{in}\), but you could parametrise it differently (see footnotes). A simple R implementation is as follows. backpropagate <- function(x, y, y_hat, w1, w2, h, learn_rate) { dw2 <- t(cbind(1, h)) %*% (y_hat - y) dh <- (y_hat - y) %*% t(w2[-1, , drop = FALSE]) dw1 <- t(cbind(1, x)) %*% (h * (1 - h) * dh) w1 <- w1 - learn_rate * dw1 w2 <- w2 - learn_rate * dw2 list(w1 = w1, w2 = w2) }

Training the network Training is just a matter of initialising the weights, propagating forward to get an output estimate, then propagating the error backwards to update the weights towards a better solution. Then iteratively propagate forwards and backwards until the Earth has been swallowed by the Sun, or some other stopping criterion. Here is a quick implementation using the functions defined above. train <- function(x, y, hidden = 5, learn_rate = 1e-2, iterations = 1e4) { d <- ncol(x) + 1 w1 <- matrix(rnorm(d * hidden), d, hidden) w2 <- as.matrix(rnorm(hidden + 1)) for (i in 1:iterations) { ff <- feedforward(x, w1, w2) bp <- backpropagate(x, y, y_hat = ff$output, w1, w2, h = ff$h, learn_rate = learn_rate) w1 <- bp$w1; w2 <- bp$w2 } list(output = ff$output, w1 = w1, w2 = w2) } Let’s train a neural network with five hidden nodes (like in Figure 1) on the hot dogs data. x <- data.matrix(hotdogs[, c('x1', 'x2')]) y <- hotdogs$class == 'hot dog' nnet5 <- train(x, y, hidden = 5, iterations = 1e5) On my desktop PC, it takes about 12 seconds for the above code to run the 100,000 iterations. Not too bad for what sounds like a lot of runs, but what are the results like? We can calculate how many objects it classified correctly: mean((nnet5$output > .5) == y) ## [1] 0.76 That’s 76%, or 152 out of 200 objects in the right class. Let’s draw a picture to see what the decision boundaries look like. To do that, we firstly make a grid of points around the input space: grid <- expand.grid(x1 = seq(min(hotdogs$x1) - 1, max(hotdogs$x1) + 1, by = .25), x2 = seq(min(hotdogs$x2) - 1, max(hotdogs$x2) + 1, by = .25)) Then feed these points forward through our trained neural network. ff_grid <- feedforward(x = data.matrix(grid[, c('x1', 'x2')]), w1 = nnet5$w1, w2 = nnet5$w2) grid$class <- factor((ff_grid$output > .5) * 1, labels = levels(hotdogs$class)) Then, using ggplot2 , we plot the predicted classes on a grid behind the observed points. ggplot(hotdogs) + aes(x1, x2, colour = class) + geom_point(data = grid, size = .5) + geom_point() + labs(x = expression(x[1]), y = expression(x[2])) The regions the neural network has classified into ‘hot dog’ and ‘not hot dog’ can no longer be separated by a single straight line. The more nodes we add to the hidden layer, the more elaborate the decision boundaries can become, improving accuracy at the expense of computation time (as more weights must be calculated) and increased risk of over-fitting the data. How about 30 nodes? nnet30 <- train(x, y, hidden = 30, iterations = 1e5) After 100,000 iterations, accuracy is 100%. The decision boundary looks much smoother: And for completeness, let’s see what one hidden node (plus an intercept) gives us. nnet1 <- train(x, y, hidden = 1, iterations = 1e5) Much worse accuracy—68%—and the decision boundary looks linear.