The network returned a vector of ones because the output activation is the sigmoid function. Sigmoid can never produce values larger than 1, so when the expected values exceed 1, the output saturates. Sigmoid is often spotted in the output layer of neural networks in various tutorials. That's because most tutorials start with classification examples, often image classification. There, the network usually has as many output neurons as there are categories, and one neuron is expected to have a value close to one, while the others stay close to zero. Here, however, we are doing a different kind of task: regression.
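To see why this is a dead end, here is a small plain-Clojure sketch (the helper names below are mine, just for illustration, not the sigmoid used by the network): sigmoid squashes every input into the interval (0, 1), so it can never reach a target above 1, and its slope is nearly zero at both flat ends, which is exactly where the gradient vanishes.

(defn sigmoid-fn [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(map sigmoid-fn [-4 0 4 10])
;; => roughly (0.018 0.5 0.982 0.99995); never above 1

(defn sigmoid-prime-fn [z]
  (let [s (sigmoid-fn z)]
    (* s (- 1.0 s))))

(map sigmoid-prime-fn [-4 0 4 10])
;; => roughly (0.018 0.25 0.018 0.00005); tiny gradients at the flat ends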

In our case, there is only one neuron in the output, and it should directly return the value of the approximation. We do not want to mess with the signal at the output, and do not need any activation there. Since we still need to fit that functionality into the existing architecture, we create a kind of do-nothing activation, similar to Clojure's identity function. The derivative of this linear function is a constant one.

(deftype LinearActivation []
  Activation
  (activ [_ z a!]
    (copy! z a!))
  (prime [this z!]
    (entry! z! 1)))

(defn linear
  ([]
   (fn [_]
     (->LinearActivation)))
  ([z!]
   z!))
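As a quick sanity check, here is a REPL sketch (assuming the Activation protocol from the earlier parts and fv from uncomplicate.neanderthal.native are in scope): activ just copies the signal through unchanged, while prime overwrites its argument with ones.

(let [act (->LinearActivation)
      z (fv [0.3 -1.2 2.5])]
  [(activ act z (fv 3))    ;; a copy of z: 0.3, -1.2, 2.5
   (prime act (fv 3))])    ;; all ones: 1.0, 1.0, 1.0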

We fix the network and repeat the process.

(def inference
  (init! (inference-network
          native-float 4
          [(fully-connected 16 sigmoid)
           (fully-connected 64 tanh)
           (fully-connected 8 tanh)
           (fully-connected 1 linear)])))

(def training (training-network inference x-train))
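For a rough sense of scale, counting weights plus biases in these fully connected layers, the 4→16→64→8→1 architecture has:

(+ (* 4 16) 16
   (* 16 64) 64
   (* 64 8) 8
   (* 8 1) 1)
;; => 1697

That is 1697 parameters for sgd to adjust, which is tiny by today's standards, but still enough to make training non-trivial.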

Checking the inference on the untrained network, we, unsurprisingly, get useless answers.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       0.51    0.51    0.51    0.51    0.51
   ┗                                               ┛

One epoch later, we see that the cost is quite high.

(sgd training y-train quadratic-cost! 1 0.05)

1.3255932671876625

We repeat the inference, only to see that the network hasn't learned much, although it has changed.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       0.80    0.80    0.80    0.80    0.80
   ┗                                               ┛

Another epoch, and the cost decreased.

(sgd training y-train quadratic-cost! 1 0.05)

0.9166161838265136

As expected, the inference is still bad.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       1.05    1.05    1.05    1.05    1.05
   ┗                                               ┛

Is something like 10 epochs enough to see some improvement?

(sgd training y-train quadratic-cost! 10 0.05)

0.11156441768060976

Hooray, the loss has dropped by roughly an order of magnitude! How's the inference doing?

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       2.02    2.02    2.02    2.02    2.02
   ┗                                               ┛

It doesn't seem to be any better. Let's do 100 more epochs.

(sgd training y-train quadratic-cost! 100 0.05)

0.10893812269722111

The loss doesn't seem to go much lower. The inference is still bad.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       2.07    2.07    2.07    2.07    2.07
   ┗                                               ┛

Maybe the learning rate is too big. Let's decrease it a bit.

(sgd training y-train quadratic-cost! 100 0.03)

0.10892282792763508

The loss seems to stay at the same level, and the inference hasn't improved.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       2.07    2.07    2.07    2.07    2.07
   ┗                                               ┛

I'll try 1000 epochs, and an even lower learning rate.

(sgd training y-train quadratic-cost! 1000 0.01)

0.10887324749642284

It hasn't helped at all.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       2.07    2.07    2.07    2.07    2.07
   ┗                                               ┛

Maybe I need to vary the learning rate a bit. Let's try that.

(sgd training y-train quadratic-cost!
     [[100 0.03] [100 0.01] [100 0.005] [100 0.001]])

(0.10885866925871306 0.10885377446494822 0.10885131820990937 0.10885081984270364)

We can see that, as the learning progresses, the cost stays roughly the same, which means that the network just strolls around without making much progress.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       2.07    2.07    2.07    2.07    2.07
   ┗                                               ┛

Before throwing in the towel, let's remember that the task we are doing here is not classification, where it is enough for the network to learn to discriminate between a few discrete categories. Here we are doing regression, which is more difficult, since the network has to learn to approximate the actual real value of the function. Maybe I need to give it more time. Let's see what it can do with 40000 epochs.

(time (sgd training y-train quadratic-cost! 40000 0.05))

"Elapsed time: 116679.184528 msecs"
0.002820095415000833

Now the cost is significantly lower. Let's hope that the improvement shows when we test the inference.

(inference x-test)

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →       2.23    2.12    3.28    2.41    2.12
   ┗                                               ┛

Right! Much closer to the real values. We can never expect to get the exact floating point values that the real function is returning, especially not with the test observations that the network hasn't seen during the learning phase, but the difference is within an acceptable range.

(axpy! -1 y-test (inference x-test))

#RealGEMatrix[float, mxn:1x5, layout:column, offset:0]
   ▥       ↓       ↓       ↓       ↓       ↓       ┓
   →      -0.05   -0.03   -0.10   -0.11   -0.07
   ┗                                               ┛
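If we'd rather have a single number than eyeball the residuals, a sketch like this (assuming Neanderthal's asum and ncols work on this 1x5 matrix as they do on vectors) gives the mean absolute error on the test set:

(let [diff (axpy! -1 y-test (inference x-test))]
  (/ (asum diff) (ncols diff)))
;; => around 0.07 for the residuals shown above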

If we wanted to improve the approximation, we would probably train the network for longer. However, do not assume that more training always leads to a better approximation. As the learning progresses, the network will generally decrease the cost, but after some time it reaches some local optimum, and the cost may oscillate, or even start to increase. There is no guarantee when, or whether, the network will reach an optimal state.
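One practical way to keep an eye on this, sketched below with the same [epochs learning-rate] schedule form of sgd that we used earlier, is to train in smaller chunks and collect the per-chunk costs, so we can see whether the tail is still decreasing or has started to oscillate:

(def cost-history
  (sgd training y-train quadratic-cost!
       (vec (repeat 10 [1000 0.05]))))
;; cost-history is a sequence of ten costs, one per chunk of 1000 epochs;
;; if the tail flattens out or starts to climb, more of the same training
;; is unlikely to help.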

Fortunately, we do not even want to decrease the cost too much. In practice, that might indicate overfitting. A network that is optimized too much for the training data might work poorly on data that it hasn't seen during the learning process, and that is exactly the data that we want it to work well with.

These are high-level things to worry about. For now, it is enough to see that our network works, and to get a feel for how difficult the task of training is. We needed a huge number of epochs to get acceptable results, and may need even more to get something good. And this is with a really tight implementation, without much wasted work. Imagine how long it would take with something less optimized.