Introduction

Recap

I’m presuming that you’ve read my introduction to neural networks, but I’ll do a quick recap here. A neural network consists of neurons arranged in layers, where every neuron in a layer is connected to every neuron in the next layer. A neuron multiplies the data passed into it by a number called a weight and then adds a number called a bias. These weights and biases are adjusted incrementally to try to decrease the loss (the average amount the network is wrong by across all the training data).

What we are going to be doing

Last time we looked at how neural networks process data; this post looks at the practical considerations you will need to know about when creating, training, and using neural nets. I use Keras for creating, training, and using neural nets, but I’ll try to keep the concepts discussed here as generic as possible.

Formatting training data

All data passed to a neural network must be numerical, as otherwise it can’t process the data (by doing multiplication and addition). So if you have any categorical data you should assign each category a number, e.g. for rock paper scissors: rock is 0, paper is 1 and scissors is 2 (most libraries expect category numbers to start at 0).
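As a minimal sketch of that in Python (the categories and numbering here are just an example):

categories = ["rock", "paper", "scissors"]  # example categories
category_to_number = {name: i for i, name in enumerate(categories)}
data = ["rock", "scissors", "paper", "rock"]
numeric_data = [category_to_number[d] for d in data]  # [0, 2, 1, 0]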

All data passed to a neural net must be of the same length, as the weights for each layer are a fixed size and need to be the same length as the input data, otherwise the dot product can’t be taken.
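If your inputs naturally vary in length, the usual fix is to pad them to a fixed length. A sketch using Keras’s pad_sequences helper, assuming lists of numbers and that zero-padding is acceptable for your data:

from tensorflow.keras.preprocessing.sequence import pad_sequences
raw = [[1, 2, 3], [4, 5], [6]]  # inputs of different lengths
padded = pad_sequences(raw, maxlen=3, padding="post")  # pad with zeros at the end
# padded is now [[1 2 3], [4 5 0], [6 0 0]] - every row the same length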

Categorical output data (and in some cases categorical input data) should be what is called one-hot encoded. To one-hot encode 4 (out of five categories) you would change it to 0,0,0,0,1; notice how the 4th column (counting 0th, 1st, 2nd, 3rd, 4th) has the 1 in it. Just to make it abundantly clear, 2 would be 0,0,1,0,0, as the one is in the second column.
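Keras has a helper for this, so you rarely do it by hand; a minimal sketch assuming five categories:

from tensorflow.keras.utils import to_categorical
one_hot = to_categorical([4, 2], num_classes=5)
# [[0. 0. 0. 0. 1.]  <- 4: the 1 is in the 4th column
#  [0. 0. 1. 0. 0.]] <- 2: the 1 is in the 2nd column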

Split your data into train and test data. The purpose of this is to let you spot overfitting (overfitting is when the network essentially memorizes which output it should produce for each input in the training set, meaning it can’t respond to any other data and so becomes useless, even though it has got the loss down to 0 and thinks it’s done a great job). Train data is, as the name suggests, used for training. The test data won’t be used to change parameters: the test loss will still be computed, but parameters will not be changed as a result of it. This allows us humans to compare the accuracy and loss on the train data to the test data so we can spot overfitting.
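A quick sketch of the split using scikit-learn (the 80/20 split and the dummy data here are just illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
x = np.random.rand(100, 10)            # 100 dummy inputs of length 10
y = np.random.randint(0, 3, size=100)  # 100 dummy labels (3 categories)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Train on (x_train, y_train); only ever measure loss/accuracy on (x_test, y_test)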


How many layers? How many neurons?

Contrary to what some people on Stack Overflow will tell you, there is no magic formula for the number of neurons you should have on each layer, nor for the number of layers you should have. But generally, the first layer has the most neurons and the number of neurons slowly decreases for each subsequent layer.

The more neurons you have the faster the network will overfit

With too few neurons, training will take forever and may never result in an accurate net.

Generally, you want your number of neurons to be equal to or greater than the length of each piece of input data

Generally you want either one or two hidden layers (giving either three or four layers in total) in most feedforward neural nets
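Putting those rules of thumb together, here is a sketch of a small Keras net for inputs of length 10 and 3 output categories (the layer sizes are illustrative, not a recipe; the activations are explained in the next section):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),  # first hidden layer has the most neurons
    Dense(16, activation="relu"),                     # and each subsequent layer has fewer
    Dense(3, activation="softmax"),                   # output layer: one neuron per category
])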

Activation functions

To avoid making the previous post too lengthy I omitted activation functions. The activation function is applied to a neuron’s output, after the weights and bias. The main ones that you should know about are sigmoid, softmax, tanh and relu.


Sigmoid squashes the number into the range 0–1 and is an s shape. It has the disadvantage that as you approach large inputs the curve flattens out, so the gradient vanishes; this vanishing gradient problem makes weight updates very small, causing the network to stop learning.

Softmax is a variation on sigmoid that ensures that all the neurons’ outputs add up to one. This is the activation that you should use on your last layer if you are outputting one-hot encoded data, as the number given for each column (each category) can then be seen as the probability of it being that category.

Tanh is basically sigmoid but squashing into -1 to 1 instead of 0 to 1. It is generally preferred over sigmoid as it has slightly less extreme gradients and can output negative values. It is seen as quite 1990s and not the best to use if you can avoid it, which brings me on to relu.

Relu is the most commonly used activation function and the most modern. Relu is popular as it is linear in the positive and 0 in the negative; the reason why this is advantageous is that it requires less computation. With all the others you have to deal with curves and complex maths, which takes a lot more computation than just doing if num > 0 { return num; } else { return 0; }
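For concreteness, here is a quick numpy sketch of all four (softmax works on a whole layer’s outputs at once, the others work on each number independently):

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes into 0 to 1
def tanh(x):
    return np.tanh(x)              # squashes into -1 to 1
def relu(x):
    return np.maximum(0, x)        # the input if positive, 0 if negative
def softmax(x):
    e = np.exp(x - np.max(x))      # subtracting the max avoids overflow
    return e / e.sum()             # outputs all sum to 1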

TLDR on activation functions: softmax is often the best pick for the last layer but should be avoided elsewhere; tanh is best avoided if possible but OK to use anywhere if you have to; relu is the best for all layers other than the last.

Loss functions

For one-hot encoded data use categorical_crossentropy, which is just a generalization of binary_crossentropy to any number of classes. The formula for categorical_crossentropy is:

loss = -1 * (t1 * ln(o1) + t2 * ln(o2) + … + tn * ln(on))

where t = target output, o = actual output, and the subscripts are the output neuron numbers.
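As a sketch, computing that loss in numpy for a single one-hot target:

import numpy as np
target = np.array([0, 0, 1, 0, 0])             # one-hot target: category 2
output = np.array([0.1, 0.1, 0.6, 0.1, 0.1])   # the network's softmax output
loss = -np.sum(target * np.log(output))        # -1 * ln(0.6), about 0.51
# The closer the output for category 2 gets to 1, the closer the loss gets to 0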

Optimizers

Use Adam always

Adam is just gradient descent with more maths. In essence (and I may well be wrong on this) it adds momentum, which stops parameter updates oscillating back and forth by smoothing them out a bit: it makes updates bigger than plain gradient descent says they should be if they are in the same direction as the last update to that parameter, and smaller than plain gradient descent says they should be if they are in the opposite direction to the last update to that parameter.

You will see academic papers proposing new, ever so slightly different optimizers that claim to have beaten Adam; ignore them, the difference in accuracy is normally small.

Although other optimizers can beat Adam when well-tuned, these require a lot of adjustment, whereas for most things Adam can do a fairly good job left to its defaults.
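In Keras, using Adam with its defaults is a one-liner when compiling; a sketch assuming the model from earlier and one-hot encoded targets:

model.compile(optimizer="adam",                  # Adam with its default settings
              loss="categorical_crossentropy",   # matches one-hot encoded targets
              metrics=["accuracy"])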

TLDR on Adam: it’s like a reliable car, once people start to modify their cars it definitely isn’t the best car but it does a good job most of the time without needing anything doing to it.

Batch size

Rather than calculating the loss, and by extension the parameter updates, for the whole dataset, it is common practice to split the dataset up into batches and calculate the loss and change parameters after each batch. This allows training to occur more quickly: the parameters will have already been adjusted (hopefully closer to the optimum values) when the next batch is processed, which then further fine-tunes the parameters, meaning you get several adjustments in each epoch (pass over the dataset).

Too low a batch size and each epoch will take ages, as parameter adjustments have to occur constantly, and your network may not learn anything, as the parameters will vary wildly as they overfit to each different batch.

Too high a batch size and you risk running out of memory and thereby causing your computer to crash. Furthermore, it will take more epochs to train the model to a reasonable standard.
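In Keras the batch size is just an argument to fit; a sketch continuing the earlier example (32 is a common default, and the test data is passed in so the test loss gets reported each epoch):

model.fit(x_train, y_train,
          batch_size=32,                      # update parameters after every 32 samples
          epochs=10,                          # 10 passes over the whole training set
          validation_data=(x_test, y_test))   # compute (but don't train on) the test loss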

Recommended reading/watching:

https://stats.stackexchange.com/a/220563 — more on Adam

https://www.youtube.com/watch?v=-7scQpJT7uo&t=532s — more on activation functions by a good YouTuber for beginners in this space

https://stats.stackexchange.com/a/211359 — more on pros and cons of activation functions

https://towardsdatascience.com/exploring-activation-functions-for-neural-networks-73498da59b02 — more on activation functions in text form and with experimental results

https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9 — more on epochs and batch sizes

Please share this post on social media if you enjoyed it/found it useful. If there are any inaccuracies in this article please let me know, so I can improve my knowledge and avoid giving people wrong information. Please feel free to leave feedback in the comments, so I know how to improve for the next post. If you want help with Keras generally, use the Keras Google group and I or someone else will help you.
