Today’s paper offers a new architecture for convolutional networks. It was written by He, Zhang, Ren, and Sun from Microsoft Research. I’ll warn you before we start: this paper is ancient. It was published in the dark ages of deep learning sometime at the end of 2015, which I’m pretty sure means its original format was papyrus; thankfully someone scanned it so that future generations could read it. But it is still worth blowing off the dust and flipping through it, because the architecture it proposes has been used time and time again, including in one of the papers we have previously read: Deep Networks with Stochastic Depth.

He et al. begin by noting a seemingly paradoxical situation: very deep networks perform more poorly than moderately deep ones. That is, while adding layers to a network generally improves its performance, past a certain depth the new layers begin to hurt it. They refer to this effect as degradation.

If you have been following our previous posts, this won’t surprise you: training issues like vanishing gradients become worse as networks get deeper, so you would expect more layers to make the network worse after some point. But the authors anticipate this line of reasoning and point out that other deep learning methods, like batch normalization (see our post for a summary), have essentially solved these training issues, and yet networks still perform increasingly poorly as their depth increases. For example, they compare 20- and 56-layer networks and find the 56-layer network performs far worse, with higher training error as well as test error, so the gap can’t be chalked up to simple overfitting; see the image below from their paper.
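To make that comparison concrete, here is a minimal sketch of the kind of "plain" (non-residual) convolutional stack whose depth you can dial up and down. This is my own toy illustration in PyTorch, not the paper's actual CIFAR-10 architecture; the layer widths and the `plain_cnn` helper are assumptions for the sake of the example.

```python
# A toy "plain" conv stack with configurable depth, to mimic the spirit of
# the paper's 20- vs 56-layer comparison. Not the authors' actual model.
import torch
import torch.nn as nn

def plain_cnn(num_layers: int, num_classes: int = 10) -> nn.Sequential:
    """Stack `num_layers` 3x3 conv + batch-norm + ReLU blocks, then classify."""
    layers = [nn.Conv2d(3, 16, kernel_size=3, padding=1),
              nn.BatchNorm2d(16), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(16, 16, kernel_size=3, padding=1),
                   nn.BatchNorm2d(16), nn.ReLU()]
    # Global average pooling followed by a linear classifier.
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes)]
    return nn.Sequential(*layers)

shallow, deep = plain_cnn(20), plain_cnn(56)
x = torch.randn(8, 3, 32, 32)          # a batch of CIFAR-10-sized images
print(shallow(x).shape, deep(x).shape)  # both: torch.Size([8, 10])
```

Despite batch norm in every block, training `deep` on something like CIFAR-10 with a standard setup tends to plateau at a worse training loss than `shallow`, which is exactly the degradation the authors are pointing at.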