Residual networks won the ImageNet challenge in 2015, and since then many researchers have developed techniques to train deeper networks successfully. In this post, I discuss some of these developments in deep neural network (DNN) architectures.

First of all, why do we need new architectures?

1.) To improve results and beat state-of-the-art accuracy, or

2.) To use parameters efficiently, achieving similar or better accuracy with fewer parameters than earlier models, or

3.) To understand how to train more complex (deeper) architectures successfully, without overfitting and the other issues faced while training deeper networks.

I feel that new architectures should focus on the last two points. In this post, I summarize some of the recent architectures published this year; I believe that, combined with Dense-Sparse-Dense (DSD) training, the accuracy of all of these models can be improved further. Most of the authors' figures are self-explanatory, but I will discuss the key findings and observations from their experiments.

1.) Residual Networks of Residual Networks (RoR): Multilevel Residual Networks

In this work, the authors add extra shortcut connections across groups of residual blocks and experiment on the CIFAR datasets to find the number of shortcut connections that gives the best results. They find that adding three identity connections and one parent (root) identity connection gives the best accuracy on CIFAR. Though these look like simple modifications, they relate to the more detailed discussion in the FractalNet paper. The authors also explore different variants of this structure and experiment with wider networks. The wider version of RoR has the lowest test error rate on the CIFAR-10 and CIFAR-100 datasets, with fewer parameters than the original wide residual networks.

The models in this work are trained with the stochastic depth (drop-path) method discussed below.

The key contribution of this work is improving the results with just a few extra connections, without increasing the number of parameters significantly.
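The nesting of shortcuts can be sketched with plain Python functions. This is a toy illustration rather than the authors' implementation: `residual_block`, `ror_group`, and `ror_network` are hypothetical names, the residual branch is a stand-in scalar function instead of a stack of convolutions, and the layout (three groups of blocks under one root shortcut) is assumed from the description above.

```python
from typing import Callable, List

Branch = Callable[[float], float]


def residual_block(x: float, branch: Branch) -> float:
    """Innermost shortcut: the usual residual block, x + F(x)."""
    return x + branch(x)


def ror_group(x: float, branches: List[Branch]) -> float:
    """Middle-level shortcut: an identity added around a whole group of blocks."""
    out = x
    for branch in branches:
        out = residual_block(out, branch)
    return out + x


def ror_network(x: float, groups: List[List[Branch]]) -> float:
    """Root-level shortcut: an identity added around all groups."""
    out = x
    for branches in groups:
        out = ror_group(out, branches)
    return out + x


def zero_branch(x: float) -> float:
    """Stand-in for a conv branch that has learned nothing (illustration only)."""
    return 0.0


# Three groups of two blocks each; the identity shortcuts add no parameters.
groups = [[zero_branch, zero_branch] for _ in range(3)]
print(ror_network(1.0, groups))  # 9.0: each group doubles its input, the root adds 1
```

Even when every residual branch outputs zero, the signal still reaches the output through the extra shortcuts, which is the sense in which they add paths without adding weights.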

2.) Densely Connected Convolutional Networks:

Here every layer is connected to all subsequent layers in its dense block, so a block with n convolutional layers has n*(n-1)/2 layer-to-layer connections (n*(n+1)/2 if the connections from the block input are counted too). The features from all previous layers are concatenated at the input of the current layer. The model uses a separate dense block for each feature-map resolution, because feature maps of different sizes cannot be concatenated.

The authors also emphasize that dense blocks reuse features, so there is relatively little redundancy in the parameters. They show that a DenseNet with fewer parameters outperforms ResNets that have more parameters. They also run experiments to check whether the concatenated features from earlier layers are actually used in the deeper layers.
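The bookkeeping behind a dense block is easy to check by hand. The sketch below counts connections under both conventions (layer-to-layer only, or also counting the block input as a source) and tracks how many channels each layer sees after concatenation. The function names are hypothetical, and `growth_rate` (the number of new feature maps each layer emits) is a term from the DenseNet paper rather than from this post.

```python
from typing import List


def layer_to_layer_connections(num_layers: int) -> int:
    """Connections among the layers only: every pair of layers is linked once."""
    return num_layers * (num_layers - 1) // 2


def connections_including_input(num_layers: int) -> int:
    """Layer i (1-based) reads the block input plus i - 1 earlier layers."""
    return sum(range(1, num_layers + 1))


def input_channels_per_layer(in_channels: int, growth_rate: int,
                             num_layers: int) -> List[int]:
    """Channels seen by each layer when all earlier outputs are concatenated."""
    return [in_channels + i * growth_rate for i in range(num_layers)]


print(layer_to_layer_connections(4))        # 6
print(connections_including_input(4))       # 10
print(input_channels_per_layer(16, 12, 4))  # [16, 28, 40, 52]
```

Because each layer only adds `growth_rate` new channels, the layers themselves can stay narrow, which is one way to read the parameter-efficiency claim.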

3.) Swapout: Learning an ensemble of deep architectures

Swapout can be viewed as a novel stochastic training method and a regularizer. In dropout, a neuron is dropped at random (its output is set to zero), and in stochastic depth, layers are skipped at random; swapout averages over a larger set of architectures that includes all the architectures used by dropout and stochastic depth. As is evident from the above figure, if x is the input to a layer and F(x) its output, the output of each unit in that layer can be any of the following:

1.) dropped (0),

2.) a feed-forward unit (F(x)),

3.) skipped (x), or

4.) a residual connection (x + F(x))

Thus swapout extends the set of architectures used in stochastic depth and dropout.
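The four cases fall out of a single expression, y = θ1·x + θ2·F(x), with two binary masks per unit, which is how the swapout paper writes it. The sketch below enumerates the masks for one scalar unit; the function names and the example values of x and F(x) are made up for illustration.

```python
import random

# Names for the four (theta1, theta2) mask combinations.
CASES = {
    (0, 0): "dropped",
    (0, 1): "feed-forward",
    (1, 0): "skipped",
    (1, 1): "residual",
}


def swapout_output(x: float, fx: float, theta1: int, theta2: int) -> float:
    """Combine the input x and the layer output F(x) using two binary masks."""
    return theta1 * x + theta2 * fx


def sample_swapout(x: float, fx: float, p1: float, p2: float,
                   rng: random.Random) -> float:
    """Draw both masks independently; p1 and p2 are the keep probabilities."""
    theta1 = 1 if rng.random() < p1 else 0
    theta2 = 1 if rng.random() < p2 else 0
    return swapout_output(x, fx, theta1, theta2)


x, fx = 3.0, 5.0  # arbitrary example values for x and F(x)
for masks, name in CASES.items():
    print(name, swapout_output(x, fx, *masks))
```

Fixing θ1 = 0 recovers dropout on a feed-forward unit, while tying the masks across a whole layer with θ1 = 1 recovers stochastic depth, so both appear as special cases.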

During inference, one can use either deterministic inference, in which the random variables are replaced with their expected values, or stochastic inference, in which several networks are sampled at random and their outputs are averaged. The authors show that stochastic inference needs only a few samples for a good estimate and outperforms deterministic inference. One also has to choose the stochastic parameters (θ1 and θ2 from the figure above) carefully. The authors' Linear(1, 0.5) schedule works best, where Linear(a, b) means linear interpolation from a to b from the first block to the last block in the network (see the figure below).
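A small sketch makes the two inference modes and the Linear(a, b) schedule concrete. It uses a single scalar unit followed by a ReLU, with F(x) = -2x as a made-up stand-in for the layer, so the names and numbers here are illustrative assumptions and not the authors' setup. The point is only that once a nonlinearity follows the unit, plugging in expected mask values is no longer the same as averaging sampled networks.

```python
import random
from typing import List


def linear_schedule(a: float, b: float, num_blocks: int) -> List[float]:
    """Interpolate linearly from a (first block) to b (last block)."""
    if num_blocks == 1:
        return [a]
    step = (b - a) / (num_blocks - 1)
    return [a + i * step for i in range(num_blocks)]


def unit(x: float, theta1: float, theta2: float) -> float:
    """Toy swapout unit followed by a ReLU; F(x) = -2x is an arbitrary stand-in."""
    fx = -2.0 * x
    return max(0.0, theta1 * x + theta2 * fx)


def deterministic_inference(x: float, p1: float, p2: float) -> float:
    """Replace each mask by its expected value."""
    return unit(x, p1, p2)


def stochastic_inference(x: float, p1: float, p2: float,
                         num_samples: int, rng: random.Random) -> float:
    """Average the outputs of several randomly sampled mask settings."""
    total = 0.0
    for _ in range(num_samples):
        theta1 = 1.0 if rng.random() < p1 else 0.0
        theta2 = 1.0 if rng.random() < p2 else 0.0
        total += unit(x, theta1, theta2)
    return total / num_samples


print(linear_schedule(1.0, 0.5, 3))            # [1.0, 0.75, 0.5]
print(deterministic_inference(1.0, 0.5, 0.5))  # 0.0
print(stochastic_inference(1.0, 0.5, 0.5, 2000, random.Random(0)))
```

In this toy case the deterministic pass returns 0, while the sampled average settles near 0.25, because only the "skipped" mask setting (one in four) produces a nonzero output.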

4.) FractalNet: Ultra-Deep Neural Networks without Residuals

In this work, the authors show that residuals are not essential for learning deep architectures; instead, to learn a deep architecture one needs to provide a shorter alternate path for gradient and information flow. They train using drop-path and thus train multiple networks simultaneously, similar to stochastic depth. The network trained here is therefore an ensemble of several deep and shallow networks. A fractal network can be described by its number of columns and its number of blocks. Different blocks produce feature maps of different dimensions via pooling, and each block is a fractal network with n columns.
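How columns and blocks determine depth can be worked out with a few lines of Python. The sketch assumes the expansion rule from the FractalNet paper as I understand it (a block with C+1 columns joins two stacked copies of the C-column block with a single new convolution); the function names are hypothetical.

```python
from typing import List


def fractal_column_depths(num_columns: int) -> List[int]:
    """Conv layers along each column of one block, shallowest column first."""
    if num_columns == 1:
        return [1]
    # Stacking two copies doubles every existing column; a new 1-conv column is added.
    return [1] + [2 * d for d in fractal_column_depths(num_columns - 1)]


def fractal_conv_count(num_columns: int) -> int:
    """Total conv layers in one block: two copies of the smaller block plus one."""
    if num_columns == 1:
        return 1
    return 2 * fractal_conv_count(num_columns - 1) + 1


def deepest_path(num_blocks: int, num_columns: int) -> int:
    """Conv layers on the longest path through the whole network."""
    return num_blocks * fractal_column_depths(num_columns)[-1]


print(fractal_column_depths(4))  # [1, 2, 4, 8]
print(fractal_conv_count(4))     # 15
print(deepest_path(5, 4))        # 40
```

The shallowest column has a single convolution per block, which is the short path for gradients, while the deepest column doubles in depth with every added column.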

During training, the authors use two types of drop-path: local, in which each input to a join can be dropped with a fixed probability (iterations 1 and 3 in the figure above), and global, in which the path is restricted to a single column, thus encouraging individual columns to be trained as independent networks (iterations 2 and 4 in the figure above). So for faster inference one can use just a shallow column, and if better results are required, the deeper columns can be used.
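The two sampling modes can be sketched as masks over the inputs of a join. This is a minimal illustration with hypothetical function names. Two details are taken from my reading of the FractalNet paper rather than from the text above: local drop-path always keeps at least one input alive, and a join averages its surviving inputs.

```python
import random
from typing import List


def local_drop_path(num_inputs: int, drop_prob: float,
                    rng: random.Random) -> List[bool]:
    """Drop each join input independently, but keep at least one alive."""
    keep = [rng.random() >= drop_prob for _ in range(num_inputs)]
    if not any(keep):
        keep[rng.randrange(num_inputs)] = True
    return keep


def global_drop_path(num_columns: int, rng: random.Random) -> List[bool]:
    """Keep exactly one column for the whole network."""
    chosen = rng.randrange(num_columns)
    return [i == chosen for i in range(num_columns)]


def join(inputs: List[float], keep: List[bool]) -> float:
    """Average the inputs that survived drop-path."""
    alive = [value for value, kept in zip(inputs, keep) if kept]
    return sum(alive) / len(alive)


rng = random.Random(0)
print(local_drop_path(3, 0.15, rng))
print(global_drop_path(4, rng))
print(join([2.0, 4.0, 6.0], [True, False, True]))  # 4.0
```

Because the join is a mean over whatever survives, any single column on its own still produces a correctly scaled output, which is what makes the shallow-column shortcut at inference time possible.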

The authors also show that this acts as a form of teacher-student training: a standalone network of the same size performs worse than the corresponding network trained inside the FractalNet.

5.) Convolutional Residual Memory Networks

This model implements another way to reuse features. The features are fed to an LSTM unit, which learns to remember the features that matter for the task. Though the improvement here comes at the cost of more parameters, it suggests a novel way to provide an alternate path for gradient and information flow, and it leaves room for researchers to explore more variants of this type.

6.) Resnet in Resnet: Generalizing Residual Architectures

In this work, the authors propose two streams: a residual stream, which is similar to a ResNet but takes additional input from the transient stream, and a transient stream, which is similar to a feedforward CNN with one-to-one connections. This is yet another solution that relies on feature reuse, and the authors compare their work to LSTMs.
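A scalar toy version of a two-stream block may help picture the data flow. It loosely follows the description above and is not the authors' code: `rir_block` and the four weights are hypothetical names, scalars stand in for convolutions, and both the cross connection from the residual stream into the transient stream (`w_rt`) and the placement of the ReLU after the sum are assumptions based on my reading of the paper.

```python
from typing import Tuple


def relu(v: float) -> float:
    return max(0.0, v)


def rir_block(r: float, t: float, w_rr: float, w_tr: float,
              w_rt: float, w_tt: float) -> Tuple[float, float]:
    """One two-stream step on scalars (stand-ins for conv feature maps)."""
    # Residual stream: same-stream term, cross term from t, plus an identity shortcut.
    r_next = relu(w_rr * r + w_tr * t + r)
    # Transient stream: same-stream term and cross term from r, with no shortcut.
    t_next = relu(w_rt * r + w_tt * t)
    return r_next, t_next


print(rir_block(1.0, 2.0, 0.0, 0.0, 0.0, 0.0))    # (1.0, 0.0)
print(rir_block(1.0, 2.0, 0.5, 0.25, 1.0, -1.0))  # (2.0, 0.0)
```

With all weights at zero the residual stream simply passes its input through, while the transient stream forgets it, which mirrors the contrast between the two streams drawn above.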

All the papers discussed above experimented on the CIFAR datasets and showed improvement over the previous state of the art (Wide ResNets). Since some of them were published recently and around the same time, they have not compared results with all of the others. As of now, DenseNet reports the best results on the CIFAR-10 and CIFAR-100 datasets.

Key takeaways:

Alternate pathways, not residuals, are the key to training deeper networks.

Methods such as drop-path (stochastic depth) and swapout learn ensembles over a large set of architectures and thus perform better.

Intelligent feature reuse leads to efficient use of parameters.

References:

1.) Residual Networks of Residual Networks: Multilevel Residual Networks. Ke Zhang, Miao Sun, Tony X. Han, Xingfang Yuan, Liru Guo, Tao Liu

2.) Densely Connected Convolutional Networks. Gao Huang, Zhuang Liu, Kilian Q. Weinberger

3.) Swapout: Learning an ensemble of deep architectures. Saurabh Singh, Derek Hoiem, David Forsyth

4.) Deep Networks with Stochastic Depth. Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Weinberger

5.) FractalNet: Ultra-Deep Neural Networks without Residuals. Gustav Larsson, Michael Maire, Gregory Shakhnarovich

6.) Convolutional Residual Memory Networks. Joel Moniz, Christopher Pal

7.) Resnet in Resnet: Generalizing Residual Architectures. Sasha Targ, Diogo Almeida, Kevin Lyman

PS: Although I mentioned earlier in this post that these networks may improve further when combined with DSD training, I realized that DSD has not been tried on networks trained with stochastic depth, drop-path, or swapout. Since these methods also reduce parameter redundancy, we cannot yet predict without experimentation whether combining them with DSD will show any improvement. Also, I might have missed some of the new DNN architectures; please share links in the comments so that I can add more to this list.