In the inference layer, we had no need for outputs once the signal passed through the layer. Now, we do. First, we need to keep \(z^l\) around. As the relevant equations suggest, we also need access to the output of the previous layer, \(a^{l-1}\), and to the error propagated back from the subsequent layers.
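Concretely, following the convention used in this series (bias subtracted, as in the forward pass implemented below), the quantities in play are

\[
z^l = W^l a^{l-1} - b^l, \qquad a^l = \sigma(z^l),
\]

and the backward pass needs \(z^l\) (to evaluate \(\sigma'\)), \(a^{l-1}\) (for the weight gradient), and the error term arriving from layer \(l+1\).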

We will work out the details in the following articles but, for now, we can see that each training layer should hold its own reference to z.

We create the Backprop protocol with methods for moving forward and backward, and for accessing the activation. Since, by design, all layers in the network operate on the same batch size, we can reuse the vector of ones by propagating it from each layer to the next during the construction.

```clojure
(defprotocol Backprop
  (forward [this])
  (backward [this]))
```

```clojure
(defprotocol Transfer
  (input [this])
  (output [this])
  (ones [this]))
```

The major novelty in FullyConnectedTraining is how we treat the input and output matrices. In the inference layer implementation, these were function arguments unrelated to the layer object. In the training layer, they become part of the layer.

Instead of implementing invoke, we implement forward; most of the other differences are bookkeeping.

```clojure
(deftype FullyConnectedTraining [w b a-1 z a ones-vctr activ-fn]
  Releaseable
  (release [_]
    (release w)
    (release b)
    (release a-1)
    (release z)
    (release a)
    (release ones-vctr))
  Parameters
  (weights [_] w)
  (bias [_] b)
  Transfer
  (input [_] a-1)
  (output [_] a)
  (ones [_] ones-vctr)
  Backprop
  (forward [_]
    (activ-fn (rk! -1.0 b ones-vctr (mm! 1.0 w a-1 0.0 z)) a))
  (backward [_]
    (throw (ex-info "TODO" {}))))
```

The activation function is not allowed to overwrite its argument, so we create a two-argument version. The input x (here, z) is left unchanged, while the output y (here, a) is overwritten with the result.

```clojure
(defn sigmoid!
  ([x]
   (linear-frac! 0.5 (tanh! (scal! 0.5 x)) 0.5))
  ([x y]
   (linear-frac! 0.5 (tanh! (scal! 0.5 (copy! x y))) 0.5)))
```
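As a quick usage sketch of the two-argument arity (assuming the native vector constructor fv and raw from Neanderthal, and that z and a have matching dimensions):

```clojure
(with-release [z (fv 0.0 1.0 -1.0)   ; stand-in for a layer's z
               a (raw z)]            ; uninitialized structure of the same shape
  ;; a receives the result; z still holds [0.0 1.0 -1.0]
  (sigmoid! z a))
```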

Constructors have some interesting details. I decided that the training layer is going to be wrapped around an inference layer, as a sort of "attachment" to it. Once the training layer has trained the parameters, we can dispose of it and continue using the lighter inference layer, without it knowing how these weights and biases were determined.

```clojure
(defn training-layer
  ([inference-layer input ones-vctr]
   (let-release [w (view (weights inference-layer))
                 b (view (bias inference-layer))
                 a-1 (view input)
                 z (ge w (mrows w) (dim ones-vctr))
                 a (raw z)
                 o (view ones-vctr)]
     (->FullyConnectedTraining w b a-1 z a o
                               (activation-fn inference-layer))))
  ([inference-layer previous-backprop]
   (training-layer inference-layer
                   (output previous-backprop)
                   (ones previous-backprop))))
```

w and b are just views of the same underlying memory from the inference layer. view is a polymorphic Neanderthal function that creates a default Neanderthal structure reusing the underlying memory of its argument. In this case, we use it to create additional matrix instances that operate on the same data while having a separate life-cycle. Releasing a view does not release the buffers of the "master" structure.
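A minimal sketch of these sharing semantics (assuming the native factory fge and the entry/entry! accessors from Neanderthal's core namespace):

```clojure
(let [m (fge 2 3 (range 6))]   ; "master" 2x3 matrix
  (with-release [v (view m)]   ; v shares m's underlying buffer
    (entry! v 0 0 100.0))      ; write through the view; v is released on exit
  (entry m 0 0))               ; m's buffer is still alive, and the write is visible
```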

The reference a-1 is a view that can read and write the data of a from the previous layer, but releasing a-1 does not affect the previous layer. Of course, if we released the previous layer, the layer at hand would raise an error when we tried to use its a-1.
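The two-arity constructor is what lets us chain training layers: each new layer picks up its input and its ones vector from the previous backprop layer. A hypothetical sketch, assuming layer-1 and layer-2 are already-constructed inference layers, x is the input batch, and ones-vctr is the batch-sized vector of ones:

```clojure
(let-release [training-1 (training-layer layer-1 x ones-vctr)
              training-2 (training-layer layer-2 training-1)]
  (forward training-1)   ; writes layer 1's activation into (output training-1)
  (forward training-2)   ; reads it through training-2's a-1 view
  (output training-2))   ; the network's final activation for this batch
```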