The network obtains a forward accuracy of 99.86% on the train set (containing 39,900 images) and 97.43% on the dev set (containing 2,100 images). These results are alright; on the MNIST dataset, you can achieve that easily. But where the network shines is in generating the digits back from the same set of weight values. Take a look:

(a) Original input images. (b) Images generated by the network for those digits

I feel that there should be a metric that scores how well the network generates the images in the backward direction; looking at the forward accuracy alone does not capture this.
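To make the idea concrete, here is a minimal sketch of one candidate metric, assuming we compare each input image with the image the network regenerates from its representation. The function name and the choice of per-pixel MSE are my assumptions, not something the AANN work prescribes:

```python
import numpy as np

def backward_generation_score(originals, regenerated):
    # Hypothetical metric (my assumption): mean per-pixel MSE between
    # the original images and the ones the network generates back
    # from their representations. Lower is better; 0 = perfect.
    originals = np.asarray(originals, dtype=np.float64).reshape(len(originals), -1)
    regenerated = np.asarray(regenerated, dtype=np.float64).reshape(len(regenerated), -1)
    return float(np.mean((originals - regenerated) ** 2))
```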

There are some more key observations here, but first take a look at this video of the digit visualisation that I made:

Visualization created to illustrate the process of digit generation

The first graph shows the vectors fed into the network, with their values linearly interpolated over the range [0–80]. The second graph shows the activations generated at the second-to-last layer in the forward direction, and the third shows the actual digit generated by the network.

Note how simple it is to feed representation vectors into the network to generate digits (no sampling from a random distribution required). Next, the generated digits morph smoothly from one form to another while remaining recognisable digits throughout. So the network has learnt a differentiable function from the representations to the images, which means these are not simple input-output mappings (as is the case for a plain forward fully connected network). Also, notice the representations at the second-to-last layer (the middle graph of the video): they correspond to the typical representations we obtain using a traditional AE (for feature extraction).
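As a rough sketch of what that interpolation looks like in code (the `generate_backward` function stands in for the network's reverse pass and is my placeholder, not the author's API):

```python
import numpy as np

def interpolate_digits(generate_backward, rep_a, rep_b, steps=80):
    # Plain linear interpolation between two representation vectors;
    # each intermediate vector is decoded directly into an image.
    # No sampling from a random distribution is involved.
    frames = []
    for t in np.linspace(0.0, 1.0, steps):
        rep = (1.0 - t) * rep_a + t * rep_b
        frames.append(generate_backward(rep))
    return frames
```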

This concludes the explanation of the AANN technique. I would like to mention a few more points about the activation function used for this network and to make some final comments regarding its future scope.

Was it so simple to derive?

Well, the answer is both yes and no. Defining the cost function was quite straightforward, given all the prior work on AEs; I had had the direction-magnitude trick for penalising the cost function of an AE in mind for a long time. Making this cost function work, however, was the difficult part.

This architecture did not work on the first try (in fact, I tried at least 25 different models before arriving at the one explained above). I realised quite soon that it was the activation function that kept the network from adjusting itself to minimise both costs. As it turns out, using the absolute-value function as the activation allows the network to optimise this hybrid objective function. (This is also why the activations stay in the positive real-number ranges mentioned above: we use the absolute function as the activation.) Hence the name: "Absolute Artificial Neural Network". This was the difficult part because, as I mentioned, abs is not one of the standard activation functions used for neural networks. I thought of this function while trying to create a symmetric ReLU.
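For concreteness, the activation is just the absolute-value function, and the "symmetric ReLU" intuition corresponds to the identity abs(x) = relu(x) + relu(−x). A minimal NumPy sketch (the function names are mine, not from the original code):

```python
import numpy as np

def abs_activation(x):
    # Absolute-value activation: every output is non-negative,
    # which keeps the activations in positive real ranges.
    return np.abs(x)

def symmetric_relu(x):
    # The same function expressed as a "symmetric ReLU":
    # abs(x) == relu(x) + relu(-x).
    relu = lambda v: np.maximum(v, 0.0)
    return relu(x) + relu(-x)
```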

Experimentation with the activation functions

These are some of the observations I made while trying out the available activation functions, and how I finally settled on the abs function:

(a) With ReLU (Rectified Linear Unit) as the activation for this architecture, all the activations shoot to NaN in the forward direction, leading to a proliferation of NaNs in the reverse direction as well (exploding gradients).

(b) With the linear activation function, the network performs poorly in the forward direction, giving very high classification error rates, while in the backward direction it converges to outputting the same structure, as shown in (b), for every possible representation vector.

(c) Activating the hidden neurons with ReLU in the forward direction and Abs in the reverse direction kills all the activations in the forward direction, i.e. the network outputs the zero vector for every input; in the backward direction, the network converges to the structure shown in (c).

(d) With Abs in the forward direction and ReLU in the backward direction, the network this time kills all the activations in the backward direction, as visualised in (d).

(e) The output in (e) of the figure above was obtained with the sigmoid activation function; the result is very similar to that of the linear activation, as in (b).
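Here is a toy sketch of how these combinations can be wired up, assuming (as a simplification of mine) a forward pass and a backward pass that reuse the same weight matrices with the activation swapped per direction; this is not the author's actual training code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, weights, act):
    # Image -> representation, with the chosen forward activation.
    for W in weights:
        x = act(x @ W)
    return x

def backward_pass(rep, weights, act):
    # Representation -> image, reusing the transposed weights
    # (my simplification of the shared-weight idea).
    for W in reversed(weights):
        rep = act(rep @ W.T)
    return rep

# The (forward, backward) activation pairs tried above:
trials = {
    "(a) ReLU both ways":     (relu, relu),
    "(b) linear both ways":   (lambda x: x, lambda x: x),
    "(c) ReLU fwd, Abs bwd":  (relu, np.abs),
    "(d) Abs fwd, ReLU bwd":  (np.abs, relu),
    "(e) sigmoid both ways":  (sigmoid, sigmoid),
    "Abs both ways (works)":  (np.abs, np.abs),
}
```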

I would like to especially highlight the case where we use ReLU in the forward direction and Abs in the backward direction, because this is what led me to using the absolute function everywhere. Firstly, using ReLU forward and linear backward generated some grey-coloured images, which I knew were caused by negative values. So I thought: how about using the abs function to visualise what gets generated in the backward direction? When I did this, the network converged (forward cost decreased and backward cost increased) to outputting zero vectors for all inputs in the forward direction. In fact, this convergence was so strong that when I then tried to train the network only in the forward direction with ReLU, the weights didn't move; all the gradients vanished. I am still trying to understand why this phenomenon occurs. Anyway, this led me to try Abs in the forward direction as well, and that was it. It worked!