Caveat: Some prior knowledge of CNNs is assumed for this post

There are many things that I don’t like about convolution. The biggest of them all: most of the weights, particularly in later layers, are quite close to zero. This suggests that most of these weights haven’t learned anything and don’t help the network process any new information.

So I wanted to modify the convolution op to solve this problem. This blog post highlights the experiments I did in that direction and the results.

Experiment #1

At its core, every 2D convolution operation is a matrix multiplication, where the matrix has dimensions (kernel_size x kernel_size x in_channels, out_channels), assuming the convolution weights have dimensions (kernel_size, kernel_size, in_channels, out_channels). For simplicity I will call this matrix the convolution matrix in this post, with dimensions (m, n).
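To make the matrix view concrete, here is a small NumPy sketch (my own illustrative code, not from the post’s repo) that performs a valid, stride-1 convolution as a single matrix multiplication:

```python
import numpy as np

def conv2d_as_matmul(x, w):
    # x: (H, W, C_in) input, w: (k, k, C_in, C_out) kernel; valid conv, stride 1
    k, _, cin, cout = w.shape
    H, W, _ = x.shape
    # Gather every k x k x C_in patch and flatten it into a row
    patches = np.stack([
        x[i:i + k, j:j + k, :].reshape(-1)
        for i in range(H - k + 1)
        for j in range(W - k + 1)
    ])                                     # (num_patches, k*k*cin)
    conv_matrix = w.reshape(-1, cout)      # (k*k*cin, cout)
    out = patches @ conv_matrix            # one matmul does the whole conv
    return out.reshape(H - k + 1, W - k + 1, cout)
```

Each output channel is a column of the convolution matrix, which is what makes the column-orthogonality idea below natural to state.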

If we are able to maintain that the columns of the convolution matrix are orthogonal (in a differentiable way), we can ensure that each channel in the output feature map captures information not present in any of the other channels. More importantly, this can help us create neural networks whose weights are more interpretable.

And yes, there are differentiable ways to ensure this using some tricks from linear algebra (one is Householder transformations, another is Givens rotations). I used Householder transformations because they turned out to be much faster on the GPU.

Code is in this file.

The idea is this:

Instead of keeping all m x n variables in the convolution filter and n in the bias trainable directly, you generate the filter and bias from another set of trainable variables in a differentiable way.

More concretely, for a convolution matrix of dimensions (m, n), create n vectors of dimension m each. The first vector (say v1) has m-n+1 trainable variables, padded with n-1 zeros at the beginning; the second (v2) has m-n+2, padded with n-2 zeros; the third (v3) has m-n+3, padded with n-3 zeros; and so on. Next, normalize all these vectors. Create n Householder matrices from these vectors and multiply the matrices in the order H1*H2*H3…*Hn. The resultant matrix has dimensions m x m and is orthogonal. Take its first n columns; the resulting matrix is of size (m, n) with orthogonal columns. Use it as the convolution matrix.
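The construction above can be sketched in NumPy (illustrative code of mine; the real version would live in a differentiable framework so that the free parameters stay trainable):

```python
import numpy as np

def householder(v):
    # Reflection matrix H = I - 2*v*v^T for the unit vector along v
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def conv_matrix_from_vectors(vs, m, n):
    # vs[i] holds the m-n+1+i free parameters of the i-th vector;
    # pad with n-1-i leading zeros so each vector has dimension m
    Q = np.eye(m)
    for i, free in enumerate(vs):
        v = np.concatenate([np.zeros(n - 1 - i), free])
        Q = Q @ householder(v)
    # Q is a product of reflections, hence orthogonal (m x m);
    # its first n columns form an (m, n) matrix with orthonormal columns
    return Q[:, :n]
```

Reshaping the returned (m, n) matrix back to (kernel_size, kernel_size, in_channels, out_channels) gives a filter whose output channels are mutually orthogonal directions.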

It may look like this operation is very time-consuming, but in reality, for a 3x3x64x128 convolution it amounts to 128 multiplications of 576x576 matrices. This isn’t much considering that such convolutions are applied to feature maps as large as 256x256 (which itself costs on the order of (256x256)x(3x3x64x128) flops).

If you have to create a bias along with the filter, run the above process with m replaced by m+1. From the resultant matrix of size (m+1, n), extract the topmost row and use it as the bias; use the remaining (m, n) matrix as the filter.
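The bias variant can be sketched the same way (again my own illustrative code, with random stand-ins for what would be trainable parameters, and toy sizes):

```python
import numpy as np

def householder(v):
    # Reflection matrix H = I - 2*v*v^T for the unit vector along v
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

m, n = 18, 3  # hypothetical sizes, e.g. a tiny 3x3x2 -> 3 convolution
rng = np.random.default_rng(0)

# Same construction as before, but in dimension m+1
Q = np.eye(m + 1)
for i in range(n):
    free = rng.normal(size=(m + 1) - n + 1 + i)  # stand-in for trainables
    v = np.concatenate([np.zeros(n - 1 - i), free])
    Q = Q @ householder(v)

W_full = Q[:, :n]      # (m+1, n) with orthonormal columns
bias = W_full[0, :]    # topmost row becomes the bias (length n)
filt = W_full[1:, :]   # remaining (m, n) matrix becomes the filter
```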

Note that this idea will not work if you are using batch normalization, because batch norm rescales each output channel, so the orthogonality-of-columns property no longer holds for the effective transform.

For these experiments, I used CIFAR-10 with a VGG-like architecture. Code is in the same repo.

The results were very disappointing.