Deep learning model performance has taken huge strides, allowing researchers to tackle tasks that were simply not possible for machines less than a decade ago. Nevertheless, the theoretical framework supporting these improvements hasn't advanced as much as the models' empirical performance, and fundamental questions remain, chief among them: what exactly happens inside a deep neural network during training? In the paper Opening the Black Box of Deep Neural Networks via Information, Shwartz-Ziv and Tishby leverage Information Theory to explore Deep Neural Network training.

Synced invited Joaquin Alori, a Machine Learning Research Engineer at Tryolabs with a focus on object tracking, pose estimation, and person re-id problems, to share his thoughts on this paper.

How would you describe this paper?

In Opening the Black Box of Deep Neural Networks via Information, Shwartz-Ziv and Tishby provide insights into the process of Deep Neural Network training by looking at it through the lens of Information Theory.

For their analysis they take small, fully connected neural networks and treat each whole layer as a single random variable. They then calculate each layer's mutual information with respect to the network's input data, and with respect to the label data the network is fitting. They plot these two numbers in a 2D diagram they call the information plane:
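To make this concrete, here is a minimal sketch of how such mutual information values can be estimated. The paper's setup uses discrete inputs and labels, so a simple approach is to bin each neuron's activations and treat the layer's joint bin pattern as one discrete symbol per example. This binning estimator is an illustration of the idea, not the authors' actual code; the function name and bin count are our own choices.

```python
import numpy as np
from collections import Counter

def mutual_information(activations, variable, n_bins=30):
    """Estimate I(T; V) between a layer's activations T (n_examples x
    n_neurons) and a discrete variable V (input pattern index or label).

    Sketch of a binning estimator: discretize each neuron into equal-width
    bins, treat the layer's joint bin pattern as one discrete symbol,
    then compute I(T; V) = H(T) - H(T | V) in bits.
    """
    activations = np.asarray(activations, dtype=float)
    span = activations.max() - activations.min() + 1e-12
    binned = np.floor((activations - activations.min()) / span * (n_bins - 1)).astype(int)
    # Each example's row of bin indices becomes a single discrete symbol.
    t = [tuple(row) for row in binned]

    def entropy(symbols):
        counts = np.array(list(Counter(symbols).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    h_t = entropy(t)
    # H(T | V): entropy of T within each value of V, weighted by p(V).
    h_t_given_v = 0.0
    for v in np.unique(variable):
        mask = np.asarray(variable) == v
        h_t_given_v += mask.mean() * entropy([t[i] for i in np.flatnonzero(mask)])
    return h_t - h_t_given_v
```

Computing this for every layer against both the inputs and the labels, once per epoch, yields the trajectories drawn on the information plane.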

The colors correspond to the layer each point belongs to in the first plot, and to the epoch each point belongs to in the second. From these plots the authors draw several striking insights.

First, there are two main distinct phases a neural network goes through during supervised training: an initial phase called Empirical Error Minimization and a subsequent phase called Representation Compression.

During Empirical Error Minimization, each layer increases its mutual information with respect to both the inputs and the labels. This seems quite intuitive, and the authors don't spend much time analyzing this phase. Once it ends, however, the network enters a new, much longer phase called Representation Compression, in which the layers continue to increase their mutual information with respect to the labels but begin to decrease their mutual information with respect to the inputs. This is quite striking: not only is it important for layers to be able to discard irrelevant information encoded in their inputs, but this compression of irrelevant data begins late in training and can be clearly observed by drawing simple plots.

Second, to gain more insight into the two training phases, the authors plot the normalized mean and standard deviation of the network’s gradients for every layer as a function of the training epochs:

Again, there are two clearly demarcated phases: an initial phase in which the gradient means are much larger than their standard deviations, indicating small gradient stochasticity, and a subsequent phase in which the gradient means are very small compared to their batch-to-batch fluctuations, with the gradients behaving like Gaussian noise with very small means. They call the initial phase the Drift Phase and the second the Diffusion Phase. Interestingly, the transition between these two phases corresponds to the transition between the Empirical Error Minimization and Representation Compression phases previously mentioned. The authors claim that the noise introduced in the Diffusion Phase is what drives the more compressed representations of the input data seen in each layer during Representation Compression.
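The diagnostic itself is simple to reproduce. Given the per-batch gradient vectors collected for one layer over an epoch, one compares the norm of the mean gradient against the batch-to-batch standard deviation. The sketch below is our own simplified take on the quantity the authors plot, not their exact code:

```python
import numpy as np

def gradient_phase_stats(batch_grads):
    """Given per-batch gradients for one layer over an epoch
    (shape: n_batches x n_params), return the norm of the mean
    gradient and the norm of the per-parameter batch-to-batch std.

    Drift Phase: mean >> std (gradients point consistently).
    Diffusion Phase: mean << std (gradients look like noise).
    """
    batch_grads = np.asarray(batch_grads, dtype=float)
    mean_norm = float(np.linalg.norm(batch_grads.mean(axis=0)))
    std_norm = float(np.linalg.norm(batch_grads.std(axis=0)))
    return mean_norm, std_norm
```

Tracking the ratio of these two numbers per layer across epochs reveals the drift-to-diffusion transition the authors describe.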

What impact might this research bring to the research community?

This new way of thinking about neural network training can be used to jumpstart several new areas of research. As a side note, in this paper the authors also deduce that:

Adding hidden layers dramatically reduces the number of training epochs needed for good generalization; with fewer layers, the representation compression phase takes much longer.

The compression phase of each layer is shorter when it starts from a previously compressed layer.

The compression occurs faster in the deeper layers.

Can you identify any bottlenecks in the research?

The authors tested their results on two very particular neural network architectures; it is still unknown whether the findings generalize to other architectures such as convnets, recurrent networks, or even non-DNN models, though it seems likely. Also, the authors only indirectly verified their findings on the MNIST dataset, so it remains to be confirmed whether they hold on larger datasets such as ImageNet, though again, this seems very likely.

Can you predict any potential future developments related to this research?

The most important future development for this area of research is determining the practical implications of the findings. The authors explain that they are currently working on new algorithms that incorporate their findings. They argue that SGD seems like overkill during the diffusion phase, which consumes most of the training epochs, and that much simpler optimization algorithms may be more efficient.

The paper Opening the Black Box of Deep Neural Networks via Information is on arXiv.