The type of learning that we are using here is called supervised learning. Supervised learning algorithms work by showing the network examples of data (inputs) and labels (desired outputs), and adjusting the network's weights and biases so that the network approximates the outputs for these inputs. This iterative process is called training. If the examples used in training represent the general data well, we hope that the network will be able to correctly approximate even instances that it has never seen.

But, how exactly do we adjust the weights? Surely not by blindly trying every possible combination; that is intractable. The most common method for teaching neural networks is gradient descent.

We begin by choosing weight values, usually at random, taking one example \(x\) from the training data, and calculating the network output \(a\). Since the weights are random, the output will typically be far away from the desired result \(y\) from the training data. If we pick a function that measures that distance nicely, we will be able to tell how far apart these two values are. In deep learning terminology, this function is called the cost function (also called the loss function, and sometimes the objective function).

\(C(w,b) \equiv \frac{1}{2n} \sum_x \lVert \mathbf{y}_x - \mathbf{a}_x \rVert ^2\)

We calculate the distance between the network output and the expected output for each training example \(x\), average over all examples, and get a number that measures the loss. This particular function is also called the quadratic cost, or mean squared error.
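The formula above can be sketched directly in code. This is a minimal illustration, assuming the outputs and labels are already collected into NumPy arrays with one row per training example:

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """Quadratic cost (mean squared error): average of the squared
    Euclidean distances between outputs a_x and labels y_x, halved."""
    n = len(outputs)
    return np.sum(np.linalg.norm(targets - outputs, axis=1) ** 2) / (2 * n)

a = np.array([[0.8, 0.1], [0.2, 0.9]])   # network outputs (toy values)
y = np.array([[1.0, 0.0], [0.0, 1.0]])   # desired outputs
print(quadratic_cost(a, y))              # 0.025
```

The factor \(\frac{1}{2}\) is a convenience: it cancels against the 2 that appears when we differentiate the squared norm.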

So, we know the right answer \(y = f(x)\), we initialized the network weights so that it calculates \(a = g(x)\), we can measure the distance between \(a\) and \(y\), and would like to change weights so that \(a\) becomes as close as possible to \(y\) on average.

In other words, we want to minimize the cost function. This is nice, since instead of "learning", an abstract and magical concept, we are talking about optimization, a subject that is rather well defined and has a clear mathematical background.

Some optimization problems can be solved analytically with precision, and some can be explicitly approximated within an error range, but neural networks are too large for that. For example, the minima of a function can be found where its first derivative is zero, but given the size of neural networks, solving that equation is intractable. The methods that work on more complex problems are usually iterative: they explore the space step by step, in each iteration taking some direction that, hopefully, leads to better solutions. What we usually can't say with these methods is whether we have reached the best solution.

Gradient descent is a method that, at each step, takes the direction down the steepest slope. That direction is given by the first derivative of the function, which defines the rate of change of the function at a particular point. Explaining this background in detail is out of the scope of this article, but if you need some, look for a nice text on calculus.

Here is a simplified depiction of gradient descent with a scalar function.

[Figure: gradient descent on a scalar function, stepping downhill from the starting point into a nearby minimum]

Note that it didn't find the global minimum to the left, but a local one closer to the starting point. There are ways to make this exploration more robust, but there is no way to guarantee the discovery of the global minimum. Luckily, this turns out not to be such a big problem in neural networks: in practice, the local minima that modern implementations find are good enough.

Since we are dealing with vectors instead of scalars, we use the gradient, which is a vector of the partial derivatives of \(C\), practically a kind of "vector derivative".

\(\nabla{C} \equiv (\frac{\partial{C}}{\partial{w_1}},\frac{\partial{C}}{\partial{w_2}},\dots)\)
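For intuition, each partial derivative can be approximated numerically by nudging one weight at a time. This is a toy sketch with an arbitrary two-variable cost (not a real network, where such finite differences would be far too slow):

```python
import numpy as np

def numerical_gradient(C, w, eps=1e-6):
    """Approximate the gradient of C at w with central finite differences:
    each component nudges one coordinate by ±eps and measures the change."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (C(w + step) - C(w - step)) / (2 * eps)
    return grad

C = lambda w: w[0] ** 2 + 3 * w[1] ** 2       # toy cost function
print(numerical_gradient(C, np.array([1.0, 2.0])))   # ≈ [2, 12]
```

Real implementations compute the gradient analytically with backpropagation; the finite-difference version is only useful here for building intuition (and, in practice, for checking gradients).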

So, we start from a point, move by \(\Delta{C}\approx \nabla{C} \cdot \Delta{w}\), and we want to make sure that we are going down (\(\Delta{C}\) is negative), taking the best route \(\Delta{w}\). I'll spare you the details (explore them in the recommended literature), but it can be shown that the best route is \(\Delta{w} = -\eta \nabla{C}\), which ensures that \(\Delta{C}\approx -\eta \nabla{C} \cdot \nabla{C} = -\eta \lVert \nabla{C} \rVert ^2 \leq 0\). Since \(\Delta{C} \leq 0\), the algorithm is guaranteed to go in a good direction.
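The whole algorithm then fits in a short loop. Here is a minimal sketch on a toy cost with a known gradient (the network and training data are abstracted away, but the update rule \(w \leftarrow w - \eta \nabla{C}\) is the one derived above):

```python
import numpy as np

def gradient_descent(grad, w, eta=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - eta * grad(w)."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

grad = lambda w: 2 * w                       # gradient of C(w) = ||w||^2
w = gradient_descent(grad, np.array([3.0, -4.0]))
print(w)                                     # very close to the minimum at the origin
```

In a real network, `grad` would be computed by backpropagation over the training examples, but the descent loop itself looks just like this.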

The size of the step we take, \(\eta\), is called the learning rate. Choosing a good learning rate is important. The gradient only shows the best direction at an infinitely small distance; the larger the step, the greater the chance that the gradient changes over that distance. Imagine descending from a mountain peak and trying to reach the valley. If we make superhuman jumps, we will jump over the valley and land on another peak. If the step is too small, we will need too many steps to reach the bottom of the valley. Like many other things in deep learning, such as deciding on the structure of the network, choosing the learning rate is a craft rather than a science: we don't have a formula, and rely on experience, trial, and error.
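The mountain analogy is easy to see numerically. On the one-dimensional cost \(C(w) = w^2\) (gradient \(2w\)), arbitrarily chosen learning rates behave very differently:

```python
def descend(eta, w=1.0, steps=20):
    """Run 20 gradient descent steps on C(w) = w^2 and return |w|."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return abs(w)

print(descend(0.01))   # tiny steps: still far from the minimum
print(descend(0.4))    # reasonable step: converges quickly
print(descend(1.1))    # superhuman jumps: overshoots and diverges
```

With \(\eta = 1.1\) each step multiplies \(w\) by \(-1.2\), so the iterates bounce from side to side of the minimum while growing, which is exactly the "jumping over the valley onto another peak" picture.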