Neural networks are an exciting subject that I wanted to experiment with after taking up genetic algorithms. Here I recount my journey to implement a neural network in JavaScript, through a visual example that helps to better understand the notion of machine learning. You can find the complete code of this example and its neural net implementation on Github, as well as the full demo on JSFiddle. The article is divided into three parts:

1. Governing principle: a circle that learns to follow the mouse
2. How to make a model which performs well
3. JavaScript implementation

If you are new to the machine learning field, here are some video resources that I would recommend for learning about neural networks:

A circle that learns to follow the mouse

This is not a classification example, but rather a linear regression one. I drew a circle in an HTML5 canvas and I now want this circle to learn to follow the movements of the mouse until it completely copies its position. The input of my artificial intelligence is the position of the mouse, and its output is the position of my circle:

In other words, the neural network (or NN) must learn to output its own input. Mathematically, we try to approximate f([x, y]) = [x', y'] with [x', y'] as close as possible to [x, y], that is, the ‘identity function’. In the literature, this kind of neural network is a special case of auto-encoder (although our goal here is not to reduce dimensionality).

The advantage of this example is that we are sure the whole problem can be solved with two input neurons and two output neurons. However, we need at least one intermediate layer (hidden layer), otherwise the exercise has no interest. The goal of the game is to set up backpropagation and really understand, step by step, how a neural net modifies its weights to reach its goal.

Neural network training steps

1. The mouse changes its position
2. Feedforwarding: the NN calculates the new position of the circle
3. Backpropagation: the NN adjusts its weights to reduce the difference between the position of the circle and the position of the mouse
4. Restart
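The training steps listed here can be sketched as a small loop. This is only an illustration, not the code of the demo: createToyNetwork is a hypothetical stand-in with one weight per axis and no hidden layer, the mouse position is assumed to be already normalized, and I assume that feed() returns the outputs.

```javascript
// Hypothetical stand-in for the real network: one weight per axis, no hidden layer.
function createToyNetwork(lr) {
  const weights = [0, 0];
  let lastInputs = [0, 0];
  let lastOutputs = [0, 0];
  return {
    // Feedforward: output = weight * input on each axis
    feed(inputs) {
      lastInputs = inputs;
      lastOutputs = inputs.map((v, i) => weights[i] * v);
      return lastOutputs;
    },
    // Backpropagation: w = w + lr * (target - output) * input
    backpropagate(targets) {
      targets.forEach((t, i) => {
        weights[i] += lr * (t - lastOutputs[i]) * lastInputs[i];
      });
    }
  };
}

// One pass through the steps for a given (normalized) mouse position.
function trainingStep(network, mouse) {
  const inputs = [mouse.x, mouse.y];        // 1. the mouse has moved
  const outputs = network.feed(inputs);     // 2. feedforward
  network.backpropagate(inputs);            // 3. backpropagation, target = input
  return { x: outputs[0], y: outputs[1] };  // new position of the circle
}

const toy = createToyNetwork(0.5);
let circle;
for (let i = 0; i < 200; i++) {
  circle = trainingStep(toy, { x: 0.4, y: -0.3 }); // 4. restart
}
console.log(circle); // should end up very close to the mouse position
```

With a motionless mouse the toy model only learns that single point; the real demo needs mouse movement (or a dataset) to learn the whole plane.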

In principle, the position of the circle should tend towards the position of the mouse. The experiment makes it possible to visualize the learning process: as the NN converges towards its solution, the circle visually converges towards the location of the mouse. Here’s the result in video:

In this video, I discuss the neural net training in two parts: first a live training (the user has to move the mouse so that the neural network learns in real time), then a pre-training based on a dataset. In fact, the live training in the video is much slower than the current model. I invite you to try the experiment further down the page to see the difference in performance (or directly on JSFiddle). It was while I was writing about the importance of normalization (implementation part of this article) that I suddenly noticed that performance improvements were possible.



Let’s take a look at the neural net of the first part:

What’s interesting is to visualize how the network has finally “distributed” its two inputs to the intermediate layer, which then “reassembles” them to send them out. One can almost trace a path by looking at the most important weights:

Checking the numbers, we see that 0.4 * 0.9165 * 1.1456 ≈ 0.42 while -0.3 * 0.8663 * 1.2139 ≈ -0.32. What is interesting to see is how the NN modifies itself to focus on one weight instead of two, and how it disregards biases (a value between -1 and 1 added to the calculation of each neuron). The combination of these weights does not give exactly 0.4 or -0.3 because they also correct the noise generated by the other, insignificant weights.
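This check is easy to reproduce. The weights are the ones quoted in the text; pathOutput is just an illustrative helper name, and the computation deliberately ignores the small weights and the biases.

```javascript
// Multiply an input by the weights met along one path through the network.
function pathOutput(input, weights) {
  return weights.reduce((value, w) => value * w, input);
}

const xPath = pathOutput(0.4, [0.9165, 1.1456]);
const yPath = pathOutput(-0.3, [0.8663, 1.2139]);

// Both products land close to, but not exactly on, the inputs 0.4 and -0.3.
console.log(xPath.toFixed(3), yPath.toFixed(3));
```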

On the other hand, with more neurons and more layers (e.g. 2 • 6 • 6 • 6 • 2), it’s much harder to identify a given path, since the x and y information is much more scattered between the neurons and layers.

If we had to make an analogy, I would say that in our case the NN is a formidable stack of sieves through which grains of sand are passed: well separated into blue and green before the first sieve, the two types of grains mix in the intermediate layers but end up separated by color again at the exit of the last layer.

You can even try out the video example yourself using the JSFiddle below:

Note that it is possible to vary the network’s hyperparameters and change its structure.

Make a model which performs well

Let us now turn our attention to the development of our neural network. It’s possible to improve its learning and to influence how fast the circle converges towards the mouse by adjusting the hyperparameters:

The learning rate, which is an important factor in gradient calculations, at the heart of backpropagation;

The activation function, which “filters” the output value of a neuron;

The number of hidden layers, and the number of neurons on each layer.

These parameters make it possible to improve the performance of the model, but the most decisive factor is still the choice of input data and of what’s desired at the output.

The importance of normalization

In our case there isn’t much to think about regarding the nature of the input and output data: we chose the mouse’s coordinates as input and the circle’s coordinates as output because that’s directly what we have available (input) and what we want (output).

On the other hand, we can’t send the mouse’s coordinates directly without first normalizing them. The output error (mean squared error) is calculated as mse = 1/2 * (target - output)², with target being our wanted value, so (target - output)² has to stay bounded between 0 and 1 inclusive whatever the output, in order to prevent the error from growing out of control. For this, and knowing that our coordinate system is centered on the canvas, we divide the x-coordinates by half the width of the canvas and the y-coordinates by half its height.
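As a minimal sketch of this normalization, assuming coordinates measured from the center of the canvas: the 1200 × 800 canvas size and the function names are hypothetical, chosen only for the example.

```javascript
// Map pixel coordinates, measured from the canvas center, to [-1, 1].
function normalize(x, y, width, height) {
  return [x / (width / 2), y / (height / 2)];
}

// Error for one output neuron, as defined in the text: 1/2 * (target - output)²
function halfSquaredError(target, output) {
  return 0.5 * Math.pow(target - output, 2);
}

// Hypothetical 1200 x 800 canvas, mouse at (300, -200) from the center.
const [nx, ny] = normalize(300, -200, 1200, 800);

// Same situation (output stuck at 0), with and without normalization.
const rawError = halfSquaredError(300, 0);
const normalizedError = halfSquaredError(nx, 0);

console.log(nx, ny, rawError, normalizedError);
```

The raw error is several orders of magnitude larger than the normalized one for the same relative mistake, which is what ends up destabilizing the weight updates.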

I invite you to try normalizing by twice the norm on each axis (the full width and the full height): your values will then lie in the interval [-0.5, 0.5] and the neural net will learn much more slowly, because the variations are smaller and therefore weigh less in the squared error.

Without normalization, error calculations quickly become inconsistent during backpropagation: errors explode, so do the weights, and the network quickly saturates. Although the activation function in this example is linear, with many others (tanh, sigmoid, etc.) there’s practically no difference between two large values well above 1. For example, at coordinates (300, 200), that is (0.257, 0.171) once normalized, we have:

tanh(300) ≈ 1 while tanh(0.257) ≈ 0.251

tanh(200) ≈ 1 while tanh(0.171) ≈ 0.169

Hence the importance of normalizing the input values, but also of favoring output values between -1 and 1 to avoid propagating too large an error.
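The saturation is easy to observe with the built-in Math.tanh. The snippet below reuses the coordinates from the example; the variable names are mine, and the derivative 1 - tanh²(v) is the standard one for tanh.

```javascript
const raw = [300, 200];            // pixel coordinates
const normalized = [0.257, 0.171]; // same point, as normalized in the text

// Raw values are indistinguishable after tanh, normalized ones are not.
const saturated = raw.map(v => Math.tanh(v));
const usable = normalized.map(v => Math.tanh(v));

// Derivative of tanh: close to 0 on saturated values, so nothing is learned.
const tanhDerivative = v => 1 - Math.tanh(v) ** 2;

console.log(saturated, usable, tanhDerivative(300), tanhDerivative(0.257));
```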

Learning rate influence

Overall, I often vary the learning rate experimentally on a logarithmic scale: 0.5, 0.05, 0.005, etc.

It’s entirely possible to visualize the impact on learning of a learning rate that is too low (learning is much slower): the circle struggles to follow the mouse and the live training can take a really long time. With pre-training, however, the difference is hardly noticeable.

Similarly, a learning rate that is too large prevents the neural network from completing its learning, because the gradients in the backpropagation are too big. This happens especially when there are a lot of intermediaries between input and output (i.e. when there are many hidden layers and neurons).
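Both failure modes show up even on the smallest possible model. The sketch below is an assumption-laden toy, not the network of the demo: a single weight w has to learn w * x = x for x = 1, using the same update rule w = w + lr * error * input.

```javascript
// Train a single weight for `steps` iterations and return what is left of the error.
function remainingError(lr, steps) {
  let w = 0;
  const x = 1;
  const target = 1;
  for (let i = 0; i < steps; i++) {
    const output = w * x;
    w += lr * (target - output) * x;
  }
  return Math.abs(target - w * x);
}

console.log(remainingError(0.005, 20)); // too low: the error has barely moved
console.log(remainingError(0.5, 20));   // reasonable: the error is almost gone
console.log(remainingError(2.5, 20));   // too high: the error grows at every step
```

In this toy case each step multiplies the error by (1 - lr), which is why a rate above 2 diverges; a real network has no such simple threshold, but the behavior is similar.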

Choosing the activation function

I have to say that I have not set up many different activation functions on the intermediate layers (sigmoid, tanh and ReLU), and my tests have shown that, for this problem, the neural network works far better with a simple linear function. Tanh gave results, but not very interesting ones because learning takes much longer than with the linear function. Sigmoid and ReLU gave no results.

And this makes sense: sigmoid works badly because we work in the interval [-1, 1] while the range of the sigmoid is (0, 1). Similarly, ReLU discards our negative values. These activation functions probably work better on classification problems, but they constrain the values too much on a linear regression problem.
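To see these ranges concretely, here are the four functions written with their usual definitions, applied to a negative coordinate. This only illustrates what each function outputs; it does not by itself prove that a given function cannot work, and the activations object is a name I chose for the example.

```javascript
const activations = {
  linear:  v => v,
  tanh:    v => Math.tanh(v),
  sigmoid: v => 1 / (1 + Math.exp(-v)),
  relu:    v => Math.max(0, v)
};

// A negative normalized coordinate, like the y of the earlier example.
const coordinate = -0.3;

for (const [name, fn] of Object.entries(activations)) {
  console.log(name, fn(coordinate));
}
// linear keeps the value, tanh keeps the sign,
// sigmoid always stays positive and relu returns 0.
```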

Influence of the number of neurons

The ‘sieve analogy’ above is even more impressive (almost magical) when the number of layers and neurons increases. In concrete terms, given the simplicity of our example, increasing the number of intermediate layers brings nothing apart from computational complexity and slowness.

Nevertheless, it is always impressive to see how the network can still find a solution even with a slightly deeper topology such as 2 • 6 • 6 • 6 • 6 • 6 • 6 • 6 • 2. In this case, it’s no longer possible to train live as in the video, unless you have a lot of time… Don’t forget to activate backpropagation:



JavaScript Implementation

This whole example has been the guiding thread for carrying out my own implementation of a neural network in JavaScript. JavaScript was chosen simply to have something that works in the browser and to be able to create an interactive visualization.

Finally, this is the starting point of a reusable implementation (or mini library) for other projects, the full code of which you can find on Github.

Why not use an existing library?

I tested ConvNetJS, written by Andrej Karpathy, but could not make it work at first because of my lack of knowledge. I needed to get my hands dirty to really learn how to handle a neural network library. I haven’t tried that library or any other since; it would be interesting to compare results and performance with my implementation. But I also learned that every library (whatever the language or platform) makes its own algorithm implementation choices, and these lead to significant differences that make comparison difficult.

I wanted this implementation to be simple and focused on the basic idea of the main algorithms (feedforwarding, backpropagation, etc.) so that another beginner can quickly understand it by reading the code. On the other hand, setting up a web worker for the pre-training was necessary to avoid blocking the browser’s main thread, and thus to provide a better experience.

Data structure

Neurons and the network are prototype-based objects:

function Neuron(id, layer, biais) {
    this.id = id;
    this.layer = layer;
    this.biais = biais || 0;
    this.dropped = false;
    this.output = undefined;
    this.error = undefined;
    this.activation = undefined;
    this.derivative = undefined;
}

////////////////////////////////////////////

function Network(params) {
    // Required variables: lr, layers
    this.lr = undefined; // Learning rate
    this.layers = undefined;
    this.hiddenLayerFunction = undefined; // activation function for hidden layer
    this.neurons = undefined;
    this.weights = undefined;
    // ... load params
}

Weights and neurons are stored in two one-dimensional arrays for speed and flexibility: I don’t find multi-dimensional arrays always the easiest to handle, and operations on a one-dimensional array are generally better optimized by JS engines.
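With a flat array, the position of a weight has to be computed from the two neurons it connects. The library has a getWeightIndex() method for that, but its body is not shown in this article; what follows is one possible indexing scheme, assumed purely for illustration (buildIndexer, neuronOffsets and weightOffsets are hypothetical names).

```javascript
// Precompute, for a topology such as [2, 3, 2], where each layer starts
// in the flat neurons array and in the flat weights array.
function buildIndexer(layers) {
  const neuronOffsets = []; // id of the first neuron of each layer
  const weightOffsets = []; // index of the first weight leaving each layer
  let neurons = 0;
  let weights = 0;
  for (let l = 0; l < layers.length; l++) {
    neuronOffsets.push(neurons);
    weightOffsets.push(weights);
    neurons += layers[l];
    if (l + 1 < layers.length) weights += layers[l] * layers[l + 1];
  }
  return {
    nbNeurons: neurons,
    nbWeights: weights,
    // from and to are { id, layer } objects with to.layer === from.layer + 1
    weightIndex(from, to) {
      const i = from.id - neuronOffsets[from.layer]; // position inside its layer
      const j = to.id - neuronOffsets[to.layer];
      return weightOffsets[from.layer] + i * layers[to.layer] + j;
    }
  };
}

const indexer = buildIndexer([2, 3, 2]);
const firstWeight = indexer.weightIndex({ id: 0, layer: 0 }, { id: 2, layer: 1 });
const lastWeight = indexer.weightIndex({ id: 4, layer: 1 }, { id: 6, layer: 2 });
console.log(indexer.nbWeights, firstWeight, lastWeight);
```

Every connection gets a unique slot, so the weights can live in a single array (or a typed array) of length nbWeights.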

Random Selection

Weights and biases are initialized between -1 and 1. According to my tests, the initial choice of weights doesn’t have much influence on performance, because most of them converge towards a good value rather quickly (depending on the learning rate).

function randomBiais() {
    return Math.random() * 2 - 1;
}

function randomWeight() {
    return Math.random() * 2 - 1;
}

On the other hand, biases have to be treated like weights during backpropagation and adjusted so that the network can reach a solution: since the function to be approximated is the identity and requires no constant term in its equations, the biases have to be driven towards zero. For this example it’s therefore possible to disable biases altogether ( randomBiais = () => 0; ).

Feedforward and backpropagation

Feed-forwarding is simple. The only “optimization” implemented is to avoid fetching the array of neurons of the previous layer for each neuron: this array is kept until the current layer changes (the prev_neurons cache in the code below). I first handle input neurons as special cases, then feed-forward through the global neurons array (hence the interest of having a one-dimensional array) from the second layer:

// Input layer filling
for (index = 0; index < this.layers[0]; index++)
    this.neurons[index].output = inputs[index];

// Fetching neurons from second layer (even if curr_layer equals 0, it'll be changed directly)
for (index = this.layers[0]; index < this.nbNeurons; index++) {
    neuron = this.neurons[index];

    if (neuron.dropped)
        continue;

    // Update if necessary all previous layer neurons. It's a cache
    if (prev_neurons === undefined || neuron.layer !== curr_layer)
        prev_neurons = this.getNeuronsInLayer(curr_layer++);

    // Computing w1*x1 + ... + wn*xn
    for (sum = 0, n = 0, l = prev_neurons.length; n < l; n++) {
        if (!prev_neurons[n].dropped)
            sum += this.getWeight(prev_neurons[n], neuron) * prev_neurons[n].output;
    }

    // Updating output
    neuron.output = neuron.activation(sum + neuron.biais);
}

The backpropagation implementation is denser. Again, I prefer to first handle the particular neurons (the output ones) to calculate the error, then iterate over each neuron to recalculate each weight:

// Output layer error computing: err = (expected-obtained)
for (n = 0, l = outputs_neurons.length; n < l; n++) {
    neuron = outputs_neurons[n];
    grad = neuron.derivative(neuron.output);
    err = targets[n] - neuron.output;
    neuron.error = grad * err;
    output_error += Math.abs(neuron.error);

    // Update biais
    neuron.biais = neuron.biais + this.lr * neuron.error;
}

this.outputError = output_error;

// Fetching neurons from last layer
for (index = this.layersSum[curr_layer-1] - 1; index >= 0; index--) {
    neuron = this.neurons[index];

    // Dropping neuron is a technique to add dynamic into training
    if (neuron.dropped)
        continue;

    // Update if necessary all next layer neurons. It's a cache
    if (next_neurons === undefined || neuron.layer !== curr_layer)
        next_neurons = this.getNeuronsInLayer(curr_layer--);

    // Computing w1*e1 + ... + wn*en
    for (sum = 0, n = 0, l = next_neurons.length; n < l; n++) {
        if (!next_neurons[n].dropped)
            sum += this.getWeight(neuron, next_neurons[n]) * next_neurons[n].error;
    }

    // Updating error
    neuron.error = sum * neuron.derivative(neuron.output);
    this.globalError += Math.abs(neuron.error);

    // Update biais
    neuron.biais = neuron.biais + this.lr * neuron.error;

    // Updating weights w = w + lr * en * output
    for (n = 0, l = next_neurons.length; n < l; n++) {
        if (next_neurons[n].dropped)
            continue;

        weight_index = this.getWeightIndex(neuron, next_neurons[n]);

        // Update current weight
        weight = this.weightsTm1[weight_index] + this.lr * next_neurons[n].error * neuron.output;

        // Update maxWeight (for visualisation)
        max_weight = max_weight < Math.abs(weight) ? Math.abs(weight) : max_weight;

        // Finally update weights
        this.weights[weight_index] = weight;
    }
}

You can find the full implementation on Github at Network.prototype.feed() and Network.prototype.backpropagate()
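For readers who want to run something without cloning the repository, here is a stripped-down sketch of the same two algorithms on a fixed 2 • 2 • 2 topology. It is not the library code: the activation is linear (derivative equal to 1), biases are left out, weights live in small nested arrays instead of one flat array, the starting weights are arbitrary fixed values so that the run is reproducible, and createNetwork is a hypothetical name.

```javascript
// Minimal 2-2-2 network, linear activation, no biases.
function createNetwork(lr) {
  // w1[j][i]: input i -> hidden j, w2[k][j]: hidden j -> output k
  const w1 = [[0.5, 0.1], [-0.1, 0.5]];
  const w2 = [[0.5, -0.1], [0.1, 0.5]];
  let input = [0, 0];
  let hidden = [0, 0];
  let output = [0, 0];

  function feed(inputs) {
    input = inputs;
    hidden = w1.map(row => row[0] * input[0] + row[1] * input[1]);
    output = w2.map(row => row[0] * hidden[0] + row[1] * hidden[1]);
    return output;
  }

  function backpropagate(targets) {
    // Output error: err = expected - obtained (derivative of linear is 1)
    const outErr = output.map((o, k) => targets[k] - o);
    // Hidden error: w1*e1 + ... + wn*en, computed before the weights change
    const hidErr = hidden.map((_, j) => w2[0][j] * outErr[0] + w2[1][j] * outErr[1]);
    // w = w + lr * error of the target neuron * output of the source neuron
    for (let k = 0; k < 2; k++)
      for (let j = 0; j < 2; j++) w2[k][j] += lr * outErr[k] * hidden[j];
    for (let j = 0; j < 2; j++)
      for (let i = 0; i < 2; i++) w1[j][i] += lr * hidErr[j] * input[i];
  }

  return { feed, backpropagate };
}

// Train on a small grid of normalized positions, with target = input.
const net = createNetwork(0.1);
const grid = [-1, -0.5, 0, 0.5, 1];
for (let epoch = 0; epoch < 200; epoch++) {
  for (const x of grid) {
    for (const y of grid) {
      net.feed([x, y]);
      net.backpropagate([x, y]);
    }
  }
}

const [cx, cy] = net.feed([0.4, -0.3]);
console.log(cx, cy); // should be very close to the input position
```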

Training

As mentioned above, I used the Web Workers API to perform neural network training in a separate thread, with the goal of not blocking the page during processing. Since memory isn’t shared and I don’t use a SharedWorker, the entire neural network and training dataset have to be copied:

1. Copying parameters and training dataset to the worker
2. Recreating the neural network in the worker
3. Training on the dataset
4. Copying back the parameters (if modified), weights and biases
5. Updating the main neural network with the new parameters, weights and biases

////////////////////// Main thread:

// Start web worker with training data through epochs
worker.postMessage({
    params: this.exportParams(),
    weights: this.exportWeights(),
    biais: this.exportBiais(),
    training_data: training_data,
    epochs: epochs
});

////////////////////// Worker:

// Create copy of our current Network
var brain = new Network(e.data.params);
brain.weights = e.data.weights;
// ...

// Feedforward NN
for (curr_epoch = 0; curr_epoch < epochs; curr_epoch++) {

    for (sum = 0, i = 0; i < training_size; i++) {
        brain.feed(training_data[i].inputs);
        brain.backpropagate(training_data[i].targets);
        sum += brain.outputError;
    }

    global_sum += sum;
    mean = sum / training_size;
    global_mean = global_sum / ((curr_epoch+1) * training_size);

    // Send updates back to real thread
    self.postMessage({
        type: WORKER_TRAINING_PENDING,
        curr_epoch: curr_epoch,
        global_mean: global_mean,
    });
}

/////////////////////// Main thread:

// Training is over: we update our weights and biais
if (e.data.type === WORKER_TRAINING_OVER) {
    that.importWeights( e.data.weights );
    that.importBiais( e.data.biais );

    // Feeding and bping in order to have updated values (as error) into neurons or others
    that.feed( training_data[0].inputs );
    that.backpropagate( training_data[0].targets );
}

You can find the full implementation on Github at Network.prototype.train() and Network.prototype.workerHandler()
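The worker receives a training_data array whose items expose inputs and targets. How the demo builds this array is not shown here; one plausible way, assuming points drawn uniformly in the normalized square [-1, 1] × [-1, 1], would be the following (buildTrainingData is a hypothetical helper, and the random parameter only exists to make it easy to test).

```javascript
// Build `size` samples whose targets equal their inputs (identity function).
function buildTrainingData(size, random = Math.random) {
  const data = [];
  for (let i = 0; i < size; i++) {
    const x = random() * 2 - 1; // already normalized, in [-1, 1)
    const y = random() * 2 - 1;
    data.push({ inputs: [x, y], targets: [x, y] });
  }
  return data;
}

const training_data = buildTrainingData(1000);
console.log(training_data.length, training_data[0]);
```

Since the whole array is copied to the worker by postMessage, keeping the samples as plain arrays of numbers keeps that copy cheap.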

Next Step

As said at the beginning of the article, neural networks are for me a logical continuation of genetic algorithms, and it would be interesting to determine the optimal hyperparameters of a neural network by applying a genetic algorithm to it.

The next step is to modify this implementation to model a recurrent neural network, with an example to support it. Stay tuned!

