No one fully understands how neural networks work, but the ‘DeepMath’ group of mathematicians, physicists, neuroscientists and engineers is trying to decipher their powers.

The first DeepMath conference took place in New York in October. Ben Gierig Photography

It can be hard to avoid deep neural networks and their profound influence on the modern world. From software that can generate realistic videos of anyone you want to programs that beat the world’s best players game after game, deep networks can do things that even five years ago seemed impossible. But a major mystery remains: How do they work?

Originally inspired by biological neurons, deep networks are organized into layers of artificial neurons. Each layer combines information from the previous layer in new ways, so that it represents a more complex picture of its input while being simpler for the next layer to mine for a decision. For instance, early visual neurons may respond to bright or dark spots, but later visual neurons can put these together to make lines of bright spots and, eventually, with increasing detail, a face or other completed picture.

A key feature of neural networks is their ability to learn. They are given data, such as a collection of cat photos, and trained to produce a desired output, such as correctly identifying cats. During training, errors are sent back through the network, layer by layer, subtly adjusting its connections to minimize future mistakes. The process repeats step by step until the network provides mostly correct answers.

Despite neural networks’ rapid rise in computing, we still don’t understand the specifics of how they work. How does each layer transform information from the previous layer to eventually determine that green boats and blue boats and sailboats and motorboats are all boats? How do networks stretch and contort to fit new data?

Deep networks often have many more parameters (more knobs to fiddle with) than there are data points available. In theory, a network could simply memorize the data by assigning each knob to a different data point or image. Then it would easily classify the data it is trained on but perform horribly on any new data it has not already memorized. That doesn’t happen, though: Deep learning does generalize to new data, and no one is sure why.

At this point, we don’t even know the best way to tackle the ‘how it works’ questions: What type of math or what level of description should we use? Should we look at one neuron at a time? One layer at a time? Should we examine the statistical properties or distributions? Should we study them like computer scientists studying algorithms? Like physicists modeling systems? Or like engineers probing signals?
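
The training procedure described above (run the data forward, measure the error, send it back layer by layer and nudge each connection) can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular group's method: the task (fitting a sine curve), the network size, the tanh activation, the step size and the function name `train_tiny_network` are all assumptions chosen for the sketch.

```python
import numpy as np

def train_tiny_network(steps=500, lr=0.02, hidden=16, seed=0):
    """Fit y = sin(x) with a one-hidden-layer tanh network and plain gradient descent.

    All sizes and the toy task are illustrative assumptions.
    Returns the mean squared error recorded at every training step.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(-2.0, 2.0, 64).reshape(-1, 1)   # toy inputs
    y = np.sin(x)                                    # toy targets

    # The adjustable connections ("knobs") of the two layers.
    W1 = rng.normal(0.0, 1.0, size=(1, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
    b2 = np.zeros(1)

    losses = []
    for _ in range(steps):
        # Forward pass: each layer transforms the output of the previous one.
        h = np.tanh(x @ W1 + b1)
        pred = h @ W2 + b2
        err = pred - y
        losses.append(float(np.mean(err ** 2)))

        # Backward pass: the error is sent back through the layers.
        d_pred = 2.0 * err / len(x)
        dW2 = h.T @ d_pred
        db2 = d_pred.sum(axis=0)
        d_pre = (d_pred @ W2.T) * (1.0 - h ** 2)     # derivative of tanh
        dW1 = x.T @ d_pre
        db1 = d_pre.sum(axis=0)

        # Small adjustment of every connection to reduce the error.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return losses

losses = train_tiny_network()
print(f"error before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

This toy has fewer knobs (49) than data points (64), so it does not probe the memorization puzzle; it only shows the mechanics of the training loop.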

Last October, physicists, mathematicians, neuroscientists and computer scientists gathered in New York City to tackle some of these questions at a new conference called DeepMath. Organizers hope to encourage multidisciplinary approaches to uncovering the theoretical underpinnings of deep networks in order to better understand these systems.

A mathematical framework for how neural networks operate will provide clear benefits not just to machine learning researchers, who can use it to decide precisely how much data to use or what parameters to set for the best performance, but also to neuroscientists, who hope that the inner workings of artificial networks will shed light on how biological networks produce the vast array of behaviors we see every day. For all the differences between silicon and cells, brain networks and machine networks are trying to solve a similar problem: taking in input, transforming it layer by layer, synapse by synapse, and creating the correct output.

Deep Manifolds and Bottlenecks

Just as biological networks have inspired artificial neural networks, researchers are borrowing from neuroscience to decipher the brain’s computer counterpart. One of neuroscience’s basic approaches is to investigate the brain one area at a time, such as the retina, the visual cortex or the motor cortex. A natural first step to understanding deep networks is to take a similar approach and ask what is happening at each layer.

Haim Sompolinsky, a theoretical neuroscientist at the Hebrew University of Jerusalem and Harvard University, thinks that layers in deep networks slowly separate different concepts in the data so that it is easy for the network to distinguish objects that are essentially compositions of these concepts. For example, the network might learn that the concept of posture is a good way to group like animals; two people will likely have a more similar posture than a person and a lizard.

Images from the same conceptual categories (dogs and cats) are represented closer to each other early in a neural network and become more ‘separated’ in later layers. Cohen et al. Nature Communications 2020

By examining the responses of each layer to different concepts, Sompolinsky and colleagues derived an equation that relates the number of neurons in a layer to the maximum number of concepts that can be easily separated in that layer. By peering into these networks, they found that most learning takes place in the final layers of the network.

This layer-by-layer approach, which Sompolinsky described at the DeepMath conference and which was published in Nature Communications in February, allows researchers to understand the effect of different features added to a deep network, such as whether the outputs of one layer are ‘pooled’ together. They found that these features change how consistently each layer can identify different instances of the same object.

Naftali Tishby at Hebrew University is working on a complementary approach: how networks decide what information to keep and what to pass on at each layer. As networks process data, irrelevant information gets successively squeezed out, an idea known as the ‘information bottleneck.’ For an individual network, researchers can deduce equations that describe how much information to keep at each layer and how much to discard.

It is surprising that this happens. Networks could instead greedily retain all the information in the data, using the relevant information for the answer only at the very last layer. But it seems that “without any additional constraints, or tweaks or tricks,” standard deep neural networks can “lose all other unnecessary information, layer by layer,” Tishby says.

From Very Small to Very Large

In physics, it is common to stretch problems to their limits, making them either very small or very large. “Everyone says neural networks are complex, so where do you start?” asks Yasaman Bahri, a research scientist at Google Brain.
“Often you start by taking the limits and perturbing from there.”

Stanford University neuroscientist Surya Ganguli, an investigator with the Simons Collaboration on the Global Brain, and colleagues took the first approach: making the problem small. They studied a simple, single-layer network and asked how it learns. Does it learn each part of the data at the same time, slowly improving? Or do these networks learn certain things first?

The simplicity of the single-layer network allows them to use mathematical tools to show that the network learns patterns in the data sequentially. First, the network learns the pattern that explains the most variability in the data; then it learns the pattern that explains the second most variability; and so on. These patterns are termed the ‘singular modes’ of the data. It is as if the network is chasing after the next pattern, then the next.

The modes that explain the least about the data (the noise) are learned at the end, which explains a common problem during training: When trained too long, networks will overfit on the data and learn about the noise. Predictions of precisely when simple networks will learn each mode match up very well with training in more complex networks, validating the approach.

Bahri has taken the converse approach: examining networks that are extremely deep. Instead of a one-layer network, or even a 10-layer network, she asked if it is possible to train a 10,000-layer network. The neural network community has relied on specialized architectures to train these networks, with connections designed to pass information across long distances. Bahri and colleagues found that this architecture is unnecessary. They use a theoretical approach from physics known as ‘mean field theory’ to describe how deep networks propagate information throughout their layers. Using this approach, they came up with a way to choose the initial connections that allow these extremely deep networks to be trained.
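
The effect of the initial connections on signal flow can be illustrated numerically. The sketch below is not the mean field calculation itself; it simply pushes two nearly identical inputs through a deep, randomly initialized network and reports what survives at the last layer. The tanh activation, the absence of biases, the width, the depth, the three weight scales (`sigma_w`) and the function name `propagate` are assumptions made for illustration.

```python
import numpy as np

def propagate(sigma_w, depth=50, width=512, seed=0):
    """Send two similar inputs through a deep random tanh network (no biases).

    Weights in every layer are drawn with standard deviation sigma_w / sqrt(width).
    Returns (mean squared signal entering the last layer's activation,
             correlation between the two signals at that point).
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=width)
    b = a + 0.1 * rng.normal(size=width)      # a slightly perturbed copy of a
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        a_pre, b_pre = W @ a, W @ b           # signals before the nonlinearity
        a, b = np.tanh(a_pre), np.tanh(b_pre)
    q = float(np.mean(a_pre ** 2))
    corr = float(a_pre @ b_pre / (np.linalg.norm(a_pre) * np.linalg.norm(b_pre)))
    return q, corr

for sigma_w in (0.5, 1.0, 2.0):
    q, corr = propagate(sigma_w)
    print(f"sigma_w={sigma_w}: signal strength={q:.3e}, correlation={corr:.3f}")
```

In this setup, small initial weights make the signal fade away long before the last layer, while large ones scramble it until two similar inputs no longer look alike. Initializations between those extremes are the kind of choice a theory of signal propagation is meant to pinpoint.
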
The mean field approach also enables them to identify the maximum number of layers through which a signal can propagate, a fundamental limitation on the depth of these networks. This type of theory may turn out to be relevant to biological neural networks (such as the brain) that have many, many ‘layers’ that are ‘initialized’ by development encoded in the genome.

Bahri and collaborators have also studied networks with an infinite number of neurons in a layer. Though computers cannot simulate an infinite number of neurons, there are algorithms that can predict what a network with infinite neurons would do. With these infinite networks, they can come up with an equation that predicts how the output of the network should change at each step of the network’s training. These equations also hold in smaller, finite networks.

Sanjeev Arora, a computer scientist at Princeton University, has also been studying these infinitely wide networks. Training them on a set of images, he found that the networks were able to identify new images just as well as other machine learning methods. This seems puzzling: If the network has infinite parameters, why didn’t it immediately memorize the training data as well as the noise? Most machine learning methods require an extra constraint (termed ‘regularization’) that smooths out the answers to prevent the network from overfitting. But this smoothing comes automatically with these networks. It turns out that these infinitely wide deep networks are mathematically equivalent to a machine learning method with this extra constraint.

An important idea that both Arora and Bahri have found by studying these different extremes is that the traditional separation of the network’s objective (what it is trying to learn) from the training process (how the network adapts its connections to carry out that objective) is misleading. The network’s learning process and its final goal both emerge from the trajectory it follows during training.
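
The sequential learning that Ganguli and colleagues describe gives one concrete picture of a training trajectory. The sketch below assumes a linear network with one hidden layer, inputs that are already decorrelated, small random initial connections, and a target map whose three singular modes have strengths 4, 2 and 1. All of these, along with the function name `mode_learning_times`, are illustrative assumptions rather than details of the original work.

```python
import numpy as np

def mode_learning_times(singular_values=(4.0, 2.0, 1.0), hidden=8,
                        steps=2000, lr=0.01, init_scale=1e-2, seed=0):
    """Train a two-matrix linear network W2 @ W1 to reproduce a diagonal target map.

    Returns (half_times, strengths): the first training step at which each mode
    reaches half of its target strength, and the full history of mode strengths.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(singular_values)
    target = np.diag(s)                        # each diagonal entry is one mode
    n = len(s)
    W1 = init_scale * rng.normal(size=(hidden, n))
    W2 = init_scale * rng.normal(size=(n, hidden))

    strengths = np.empty((steps, n))
    for t in range(steps):
        product = W2 @ W1
        strengths[t] = np.diag(product)        # how much of each mode is learned
        error = target - product
        # Gradient descent on 0.5 * ||target - W2 @ W1||^2 (both updates use old values).
        dW1 = W2.T @ error
        dW2 = error @ W1.T
        W1 += lr * dW1
        W2 += lr * dW2

    half_times = [int(np.argmax(strengths[:, k] >= 0.5 * s[k])) for k in range(n)]
    return half_times, strengths

half_times, strengths = mode_learning_times()
for value, step in zip((4.0, 2.0, 1.0), half_times):
    print(f"mode with strength {value}: half learned at step {step}")
```

In this toy setting the strongest mode is picked up first and the weakest last, so stopping training early leaves the weakest patterns unlearned.
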
Indeed, explicitly studying the trajectory during training can reveal new and surprising ways to train networks. Most training protocols slow down the learning process over time, so that the network doesn’t take a step too far and accidentally ‘skip over’ the optimal solution. However, in a presentation at the DeepMath meeting and in a prior preprint, Arora and collaborator Zhiyuan Li showed an algorithm that does the opposite, exponentially increasing the signal that drives learning. The algorithm can still effectively train deep networks, despite the networks being driven more and more strongly over the course of training. This flies in the face of conventional wisdom about how these networks should be trained, and it exemplifies how identifying the appropriate mathematics (here, that of the learning trajectory) improves our understanding of how neural networks operate. What is important is not the objective but the trajectory of how you get there.
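
One way to see why an exponentially growing learning rate need not be destructive is a toy calculation. It assumes a loss that depends only on the direction of the weight vector and not on its length; this scale invariance is an assumption of the sketch, not something stated above. Under that assumption, gradient descent with a fixed learning rate plus weight decay, and gradient descent with no weight decay and a learning rate multiplied by a constant factor at every step, pass through the same sequence of weight directions. The loss, the dimensions and the names `loss_grad` and `compare_schedules` are hypothetical choices for illustration.

```python
import numpy as np

def loss_grad(w, u):
    """Gradient of the scale-invariant loss L(w) = 1 - cos(angle between w and u)."""
    norm = np.linalg.norm(w)
    w_hat = w / norm
    return -(u - (w_hat @ u) * w_hat) / norm

def compare_schedules(steps=100, lr=0.1, weight_decay=0.1, dim=5, seed=0):
    """Run both training rules from the same start and return the final directions."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)                     # the direction that minimizes the loss
    w_decay = rng.normal(size=dim)
    w_exp = w_decay.copy()
    alpha = 1.0 - lr * weight_decay            # per-step shrink factor from weight decay

    for t in range(steps):
        # (a) Fixed learning rate with weight decay.
        w_decay = alpha * w_decay - lr * loss_grad(w_decay, u)
        # (b) No weight decay; the learning rate grows by 1 / alpha**2 every step.
        lr_t = lr * alpha ** (-(2 * t + 1))
        w_exp = w_exp - lr_t * loss_grad(w_exp, u)

    dir_decay = w_decay / np.linalg.norm(w_decay)
    dir_exp = w_exp / np.linalg.norm(w_exp)
    final_lr = lr * alpha ** (-(2 * (steps - 1) + 1))
    return dir_decay, dir_exp, final_lr

dir_decay, dir_exp, final_lr = compare_schedules()
print("largest difference between final directions:", np.max(np.abs(dir_decay - dir_exp)))
print("learning rate at the last step of schedule (b):", final_lr)
```

The two runs differ only in the length of the weight vector, which a scale-invariant loss ignores. In this sketch the growing learning rate is matched by growing weights, so the effective step stays under control.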

Developing a theoretical framework for deep networks might aid in their use in neuroscience.