So far, we have seen how a neural network is a series of linear transformations interleaved with non-linear activations.

Here's a simple example, the one-layer network we used in part 1, Representing Layers and Connections:

Take a look at the formula for the linear transformations that we defined in that article:

\(\mathbf{h} = W\mathbf{x}\)

Each \(h_i\) is a dot product of the respective row of \(W\) and the input.

\begin{gather*} h^{(1)}_1 = w_{11}\times{} x_1 + w_{12}\times{} x_2\\ h^{(1)}_2 = w_{21}\times{} x_1 + w_{22}\times{} x_2\\ h^{(1)}_3 = w_{31}\times{} x_1 + w_{32}\times{} x_2\\ h^{(1)}_4 = w_{41}\times{} x_1 + w_{42}\times{} x_2\\ \end{gather*}

Until now, we have set the initial weights and inputs to be in the range \([0, 1]\), as in the following example that you have seen many times by now.

(with-release [x (ge native-float 2 2 [0.3 0.9 0.3 0.9])
               y (ge native-float 1 2 [0.50 0.50])
               inference (inference-network
                          native-float 2
                          [(fully-connected 4 tanh)
                           (fully-connected 1 sigmoid)])
               inf-layers (layers inference)
               training (training-network inference x)]
  (transfer! [0.3 0.1 0.9 0.0 0.6 2.0 3.7 1.0] (weights (inf-layers 0)))
  (transfer! [0.7 0.2 1.1 2] (bias (inf-layers 0)))
  (transfer! [0.75 0.15 0.22 0.33] (weights (inf-layers 1)))
  (transfer! [0.3] (bias (inf-layers 1)))
  (sgd training y quadratic-cost! 2000 0.05)
  (transfer (inference x)))

nil
#RealGEMatrix[float, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.50    0.50
   ┗                       ┛

As all operands are between 0 and 1, the dot products \(h\) are likely to be in that range, or at least not much larger.

If any of the weights \(w\) or inputs \(x\) are large numbers, \(h\) is likely to be large, too. If the network didn't have non-linear activations, the inputs to the following layer would grow uncontrollably, and the growth would propagate further. Some activation functions, such as ReLU, are linear in the positive domain, so large values would pass through them all the way to the output. The sigmoid and hyperbolic tangent activation functions, on the other hand, saturate: both approach \(1\) at the upper bound, while at the lower bound the sigmoid approaches \(0\) and the hyperbolic tangent approaches \(-1\).
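To make the effect of the scale of the weights concrete, here is a minimal sketch in plain Clojure, without Neanderthal. It computes \(\mathbf{h} = W\mathbf{x}\) for the input \((0.3, 0.9)\), once with the first-layer weights from the example and once with the same weights multiplied by ten. The helper names (`dot`, `mv`, `scale`) are made up for this illustration, the bias is left out, and the arrangement of the eight weights into a \(4 \times 2\) matrix assumes that they fill the matrix column by column.

```clojure
;; Plain-Clojure sketch; no Neanderthal required.
;; ASSUMPTION: the eight weights from the example fill a 4x2 matrix
;; column by column, which gives the rows below.
(def x [0.3 0.9])

(def w-small [[0.3 0.6]
              [0.1 2.0]
              [0.9 3.7]
              [0.0 1.0]])

(defn scale
  "Multiplies every entry of the matrix w (a vector of rows) by k."
  [k w]
  (mapv (fn [row] (mapv #(* k %) row)) w))

(defn dot
  "Dot product of two equally long vectors."
  [a b]
  (reduce + (map * a b)))

(defn mv
  "Matrix-vector product: one dot product per row of w."
  [w x]
  (mapv #(dot % x) w))

;; Pre-activations of the hidden layer (bias omitted).
(def h-small (mv w-small x))
(def h-large (mv (scale 10.0 w-small) x))

(println "h, weights as given:   " h-small)
(println "h, weights times ten:  " h-large)
```

With the weights as given, the four dot products come out at roughly 0.63, 1.83, 3.6 and 0.9. Multiplying the weights by ten multiplies each of them by ten as well, which pushes them far into the flat part of the hyperbolic tangent.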


Although saturation confines the inputs to the next layer to the \([-1,1]\) range, it makes learning difficult: a saturated function is almost flat, so its derivative is close to zero and hardly any gradient is propagated backwards.
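The following plain-Clojure sketch shows how quickly the gradient disappears. It evaluates the derivative of the hyperbolic tangent, \(1 - \tanh^2(z)\), at a few inputs. The function name `tanh-prime` and the sample values of \(z\) are assumptions chosen for illustration; they are not taken from the network code in this article.

```clojure
(defn tanh-prime
  "Derivative of tanh at z: 1 - tanh(z)^2."
  [z]
  (let [t (Math/tanh z)]
    (- 1.0 (* t t))))

;; Illustrative inputs, from moderate to deep in the saturation zone.
(doseq [z [0.63 3.6 6.3 36.0]]
  (println "z =" z
           " tanh(z) =" (Math/tanh z)
           " tanh'(z) =" (tanh-prime z)))
```

For moderate inputs the derivative is a useful fraction of 1. A few units further out it is already tiny, and for the largest input it is indistinguishable from zero in floating point, so the weight updates computed from it are zero, too.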

We are still using a trivial example, which can easily illustrate this problem (that's why I'm still keeping it, despite it being silly). Just change the weights to be numbers larger than one. Even though we are just chasing one input/output example (\((0.3, 0.9) \mapsto 0.5\)) where there is nothing even remotely challenging to learn, our algorithm gets stuck in the saturation zone right away.

(with-release [x (ge native-float 2 2 [0.3 0.9 0.3 0.9])
               y (ge native-float 1 2 [0.50 0.50])
               inference (inference-network
                          native-float 2
                          [(fully-connected 4 tanh)
                           (fully-connected 1 sigmoid)])
               inf-layers (layers inference)
               training (training-network inference x)]
  (transfer! [3 1 9 0 6 20 37 10] (weights (inf-layers 0)))
  (transfer! [7 2 11 2] (bias (inf-layers 0)))
  (transfer! [75 15 22 33] (weights (inf-layers 1)))
  (transfer! [3] (bias (inf-layers 1)))
  (sgd training y quadratic-cost! 2000 0.05)
  (transfer (inference x)))

nil
#RealGEMatrix[float, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.00    0.00
   ┗                       ┛

It is obvious that we should keep the average absolute value of the weights below 1. But how small should they be? If the weights are too small, the signal will be feeble. A feeble signal might not cause problems in small networks, but when passed through a large number of layers, it would be dampened before reaching the output. I'd need a larger example to illustrate this, so for now I have to ask you to trust me.
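A toy model can at least hint at the dampening. The sketch below is an assumption-laden simplification rather than a real network: every layer has just one tanh neuron, all layers share the same small weight, and there are no biases. The function name `propagate` is made up for this illustration.

```clojure
(defn propagate
  "Passes the signal x through n one-neuron tanh layers
  that all share the weight w (no bias)."
  [w n x]
  (nth (iterate #(Math/tanh (* w %)) x) n))

;; How much of the input 0.9 survives when every weight is 0.1?
(doseq [n [1 2 5 10]]
  (println n "layers:" (propagate 0.1 n 0.9)))
```

Each layer shrinks the signal by about a factor of ten, so after ten layers practically nothing is left of it. Real layers sum over many inputs, which changes the numbers, but the compounding is the same.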

Let's say that 0.001 is neither too small nor too large. Why don't we pick such a universally good value and set all weights to it? The problem with that approach is that all neurons would behave in the same manner. We wouldn't have the variability among neurons that is needed for proper learning.
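The lack of variability is easy to see in a minimal plain-Clojure sketch. It builds a hidden layer of four tanh neurons with two inputs, sets every weight to 0.001, and feeds it the input \((0.3, 0.9)\). The helper names `dot` and `layer` are hypothetical, and the bias is omitted for brevity.

```clojure
(defn dot
  "Dot product of two equally long vectors."
  [a b]
  (reduce + (map * a b)))

(defn layer
  "Activations of a tanh layer with weight rows w and input x (bias omitted)."
  [w x]
  (mapv #(Math/tanh (dot % x)) w))

;; Four hidden neurons, two inputs, every weight set to the same value.
(def w-same (vec (repeat 4 [0.001 0.001])))

(def a (layer w-same [0.3 0.9]))

(println a)
(println "all neurons equal?" (apply = a))
```

All four neurons compute exactly the same activation. If the weights of the following layer were also identical, the four neurons would receive identical updates as well, so the layer would keep behaving like a single neuron.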

Although there is no universally best strategy for setting weights, a few things are certain enough: