As recently pointed out (), the bitter lesson of AI is that flexible methods have so far always outperformed handcrafted domain knowledge in the long run. The search-based methods of Deep Blue beat strategies that attempted a deeper analytic understanding of the game, and DNNs consistently outperform the handcrafted features that were used for decades in computer vision. However, flexibility alone cannot be the silver bullet. Without the right (implicit) assumptions, generalization is impossible (). While the success of deep networks on narrowly defined perceptual tasks is a major leap forward, the range of generalization of these networks is still limited. The major challenge in building the next generation of intelligent systems is to find sources of good implicit biases that allow for strong generalization across varying data distributions and rapid learning of new tasks without forgetting previous ones. These biases will need to be specific to the problem domain. Because biological brains excel at so many relevant real-world problems, it is worthwhile to ponder how they can be used as a source of good inductive biases. In the following, we lay out a few insights and ideas in this direction.

With the help of deep networks, it is now possible to solve some perceptual tasks that are simple for humans but used to be very challenging for AI. The so-called ImageNet benchmark (), a classification task with 1,000 categories on photographic images downloaded from the internet, played an important role in demonstrating this. Besides being solved at human-level performance by deep networks (), ImageNet also turned out to be a surprisingly useful pre-training task: networks pre-trained on it are often beneficial for all kinds of other tasks (). In this approach, called “transfer learning,” a network trained on one task, such as object recognition, is reused in another task by removing the task-specific part (layers high up in the hierarchy) and keeping the nonlinear features computed by the hidden layers of the network. This makes it possible to use complex deep networks for tasks that would not have enough training data to train such a network de novo. In many computer vision tasks, this approach works much better than the handcrafted features that were state of the art for decades. In saliency prediction, for example, the use of pre-trained features has led to a dramatic improvement of the state of the art (). Similarly, transfer learning has proven extremely useful in the behavioral tracking of animals: using a pre-trained network and a small number of training images (200) for fine-tuning enables the resulting network to perform very close to human-level labeling accuracy ().
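To make this recipe concrete, the following is a minimal sketch of transfer learning in PyTorch. It is illustrative only: the choice of a ResNet-18 backbone, the five-class target task, the learning rate, and the weight enum (available in recent torchvision versions) are assumptions, not the pipeline used in the studies cited above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on ImageNet and reuse its hidden-layer features.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained feature extractor so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Remove the task-specific part (the original 1,000-way ImageNet classifier)
# and replace it with a new head for the target task (5 classes is a placeholder,
# e.g., body-part labels in animal tracking).
num_target_classes = 5
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)

# Only the new head's parameters are optimized ("fine-tuning" on few labeled images).
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tuning_step(images, labels):
    """One fine-tuning step on a small labeled batch from the new task."""
    logits = backbone(images)   # pre-trained features are reused, only the head is new
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```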

While machine learning had been studied only by a small crowd of academic researchers up until this decade, the success of deep learning in solving real-world problems has generated massive interest from industry and led to a complete paradigm shift in the field. Within a few years, machine learning has become the key technology used in virtually all AI applications. Importantly, the same learning approach that enabled AlphaGo to achieve superhuman performance in Go has also been used to learn other games like shogi or even win against some of the best chess programs like Stockfish. Because of the ability of this approach to generalize, it represents a much more profound leap in intelligence than Deep Blue.

The current state-of-the-art methods in machine learning are dominated by deep learning: multi-layer (deep) artificial neural networks (DNNs; Figure 1), which draw inspiration from the brain. Most fundamental is the idea of neurons as elementary adaptive nonlinear processing units (), which includes the notion of analog computation that is not well captured by the toolbox of formal logic (). Each artificial neuron aggregates inputs from other neurons using a weighted summation, analogous to the synaptic weights of real neurons, followed by a simple nonlinearity such as a rectifier (ReLU) or a sigmoid function (logistic function or tanh), analogous to the input-output nonlinearities of neurons. Deep networks arrange their neurons in several layers, where each layer provides the input to the neurons in the next layer, analogous to the multitude of hierarchically organized brain areas for processing visual information, for example. In some deep learning architectures, local competition implemented by a winner-take-all operation (max pooling) is reminiscent of local competitive inhibitory interactions in brain circuits. Despite these similarities, the elements of artificial neural networks strongly abstract from neurophysiological details. In convolutional networks, the linear summation coefficients are shared across space (i.e., there is a neuron with exactly the same linear receptive field at each spatial location), and during learning, the weights change for all locations at once. This massively reduces the number of parameters that need to be learned from data. All neurons with the same receptive field shape (but shifted to different locations) are assembled into a “feature channel,” and there can be many feature channels per layer in a neural network. Many of these ingredients have been around for several decades, but thanks to a combination of training on very large datasets, advances in computing hardware, the development of software libraries, and a lot of tuning of the training schemes, it is now possible to train very large neural networks.
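The basic ingredients described above fit in a few lines of code. The following sketch is purely illustrative (input sizes, channel counts, and random inputs are arbitrary choices): a single artificial neuron as a weighted sum followed by a ReLU, and one convolutional stage with weight sharing, ReLU, and max pooling.

```python
import numpy as np
import torch
import torch.nn as nn

# A single artificial neuron: weighted summation of its inputs followed by a
# simple nonlinearity (here a rectifier, ReLU), loosely analogous to synaptic
# integration and the input-output nonlinearity of a real neuron.
def artificial_neuron(x, w, b):
    return np.maximum(0.0, w @ x + b)   # ReLU(w^T x + b)

x = np.random.randn(100)          # activities of 100 presynaptic units
w = np.random.randn(100) * 0.1    # "synaptic" weights
print(artificial_neuron(x, w, 0.0))

# One convolutional stage: the same weight pattern (one feature channel) is
# applied at every spatial location (weight sharing), followed by ReLU and a
# local winner-take-all operation (max pooling). Stacking such stages yields a
# deep, hierarchically organized network.
conv_stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)
image = torch.randn(1, 3, 64, 64)   # a toy RGB input
features = conv_stage(image)        # 16 feature channels at 32 x 32 locations
print(features.shape)               # torch.Size([1, 16, 32, 32])
```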

Shown is the large-scale feedforward convolutional network architecture of GoogLeNet (). Detail illustrates the smaller-scale structure within each layer, which comprises a set of feature maps and their input and output synaptic connections. Neurons in these layers typically tile visual space with spatially shifted copies of the same input and output weights (one example pattern of input weights is shown in red). For visual processing, this produces a three-dimensional array of neurons: length × width × features. These neurons apply a simple nonlinear function to their pooled inputs.

The renaissance of AI is the result of a major shift of methods from classical symbolic AI to the connectionist models used by machine learning. The critical difference from rule-based AI is that connectionist models are “trained,” not “programmed.” Searching through the space of possible combinations of rules in symbolic AI is replaced by adapting the parameters of a flexible nonlinear function through optimization of an objective (goal) that depends on data. In artificial neural networks, this optimization is usually implemented by backpropagation, an algorithm developed by Paul Werbos in his PhD thesis in 1974 (). A considerable amount of effort in machine learning is devoted to figuring out how this training can be done most effectively, as judged by how well the learned concepts generalize and how many data points are needed to robustly learn a new concept (“sample complexity”).
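What “trained, not programmed” means operationally can be sketched in a short training loop: a flexible nonlinear function, an objective defined on data, and backpropagation to adapt the parameters. The network, data, and hyperparameters below are toy placeholders, assumed only for illustration.

```python
import torch
import torch.nn as nn

# A small flexible nonlinear function (a two-layer network) and an objective.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
objective = nn.CrossEntropyLoss()                       # the goal, defined on data
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy data standing in for a real training set.
inputs = torch.randn(64, 10)
targets = torch.randint(0, 2, (64,))

for step in range(100):
    loss = objective(model(inputs), targets)  # evaluate the objective on the data
    optimizer.zero_grad()
    loss.backward()                           # backpropagation: gradients of the
                                              # objective w.r.t. all weights
    optimizer.step()                          # adapt the parameters
```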

Our perception of what constitutes intelligent behavior and how we measure it has shifted over the years, as tasks that were considered hallmarks of human intelligence were solved by computers while tasks that appear trivial for humans and animals alike remained unsolved. Classical symbolic AI focused on reasoning with rules defined by experts, with little or no learning involved. The rule-based system of Deep Blue, which defeated Kasparov in chess in 1997, was entirely determined by the team of experts who programmed it. Unfortunately, it did not generalize well to other tasks. This failure and the challenge of artificial intelligence even today are summarized in “Moravec’s paradox” (): “it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.” While rules in symbolic AI provide a lot of structure for generalization in very narrowly defined tasks, we find ourselves unable to define rules for everyday tasks: tasks that seem trivial only because biological intelligence performs them so effortlessly.

The brain is an intricate system distinguished by its ability to learn to perform the complex computations underlying perception, cognition, and motor control, the defining features of intelligent behavior. For decades, scientists have attempted to mimic its abilities in artificial intelligence (AI) systems. These attempts had limited success until recent years, when AI applications began to pervade many aspects of our everyday life. Machine learning algorithms can now recognize objects and speech and have mastered games like chess and Go, even surpassing human performance (e.g., DeepMind’s AlphaGo Zero). AI systems promise an even more significant change to come: improving medical diagnoses, finding new cures for diseases, making scientific discoveries, predicting financial markets and geopolitical trends, and identifying useful patterns in many other kinds of data.

The authors compared the behavior of this BoF model to VGG-16, a widely used architecture in computer vision. First, they established that their BoF model achieves performance comparable to VGG-16 (). They subsequently showed that a number of key features of linear BoF models also hold true for VGG-16. First, the BoF model predicts that shuffling local features should not affect classification performance, as the histogram of local feature patches is unaffected (Figure 2C). They further corroborated this by demonstrating that the performance of VGG-16 drops only moderately on texturized images based on neural style transfer (Figure 2D) (). This suggests that, in stark contrast to humans, VGG-16 does not rely on global shape integration for perceptual discrimination but rather on statistical regularities in the histogram of local image features. Second, the linear classifier on top of the bag of features predicts that manipulations in separate parts of the image should not interact, which they also find to be true for VGG-16. Finally, they demonstrate that BoF models and VGG-16 make similar errors and, with the help of saliency techniques, show that VGG-16 uses image features for decision making that are very similar to those used by BoF models. This indicates why VGG-16 and similar networks generalize poorly: they extract local features and ignore informative large-scale structure in the input data.

A recent study tested this hypothesis by probing deep networks for the kind of information they use in decision making and found that they mostly rely on local features and largely ignore their spatial arrangement (). The key approach in this study was to build a network that had particular properties by design and to subsequently demonstrate that it behaves very similarly to standard deep architectures. To this end, the authors designed networks in which neurons in the last convolutional layer only looked at very small patches of the input image. The activity of the final layer was subsequently summed across space before it was fed into a linear classifier for object recognition. By construction, this network is invariant to the exact position of a particular patch in the image, which is why it was named the “Bag-of-Features” (BoF) network. In addition to this invariance property, the design of the network also allowed the authors to quantify how much each image patch contributes to the decision of the network.
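The design can be sketched in a few lines. The code below is not the published architecture but a minimal illustration of the idea under the same constraints: local feature extractors with a small receptive field, a linear classifier applied per patch, and pooling of class evidence across space. All layer sizes are arbitrary choices.

```python
import torch
import torch.nn as nn

class TinyBagOfFeaturesNet(nn.Module):
    """Minimal bag-of-features-style classifier (illustrative only).
    Units in the last feature layer only see small image patches; class
    evidence is computed linearly per patch and then pooled over space,
    so the spatial arrangement of patches cannot influence the decision."""

    def __init__(self, num_classes=1000):
        super().__init__()
        # Local feature extractor: two 3x3 convolutions without downsampling,
        # so the receptive field of the last feature layer stays small (5x5 pixels).
        self.local_features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Linear classifier applied to every patch (a 1x1 convolution).
        self.patch_classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        patch_logits = self.patch_classifier(self.local_features(x))
        # Pooling class evidence across space makes the prediction invariant to
        # where a patch occurs. Because the classifier is linear, averaging the
        # per-patch logits is equivalent to pooling features first and then
        # applying the linear classifier.
        return patch_logits.mean(dim=(2, 3))

logits = TinyBagOfFeaturesNet(num_classes=10)(torch.randn(1, 3, 224, 224))
print(logits.shape)   # torch.Size([1, 10])
```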

The lack of robustness to simple changes in the input statistics indicates that deep networks lack human-like scene understanding. In particular, they seem to lack integration of long-range dependencies between elements within images, such as different parts of an object.

A recent study demonstrated that humans generalize much better across different image distortions than deep networks, even though deep networks perform well on distortions they had access to at training time (Figure 2B). The study explored the effect of twelve different types of low-level noise on the object recognition performance of both humans and machines, where the machines were trained on either clean or distorted images (). When the networks were tested on the same domain on which they were trained (i.e., the same type of noise), they consistently outperformed human observers by a large margin, showing that the networks were able to “solve” the distortions under i.i.d. train-test conditions. However, when the noise distribution at test time differed from the noise seen during training, the performance of the networks was very low, even for small distortions.
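The train-test protocol behind this comparison can be made explicit with a small sketch. The two noise functions below are illustrative stand-ins (not the distortions used in the study), and the model is assumed to be any image classifier that maps a batch of images to class logits.

```python
import torch

def gaussian_noise(images, std=0.1):
    """Distortion seen at training time (illustrative choice)."""
    return (images + std * torch.randn_like(images)).clamp(0, 1)

def salt_and_pepper(images, p=0.1):
    """A different distortion, only seen at test time (illustrative choice)."""
    mask = torch.rand_like(images)
    out = images.clone()
    out[mask < p / 2] = 0.0
    out[mask > 1 - p / 2] = 1.0
    return out

def accuracy(model, images, labels, distortion):
    """Classification accuracy of `model` under a given input distortion."""
    preds = model(distortion(images)).argmax(dim=1)
    return (preds == labels).float().mean().item()

# i.i.d. condition: train with gaussian_noise, evaluate with gaussian_noise
#   -> deep networks typically reach very high (even superhuman) accuracy.
# o.o.d. condition: train with gaussian_noise, evaluate with salt_and_pepper
#   -> accuracy of standard deep networks collapses, while humans are barely affected.
```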

Domain adaptation is another striking example of the difference in generalization between biological and artificial vision systems and an opportunity for benchmarking the robustness of machine learning algorithms with direct practical relevance (). Humans generalize across a wide variety of changes in the input distribution, such as vast differences in illumination, changing scene contexts, or image distortions from snowflakes or rain. While humans are certainly exposed to a number of such distortions during their lifetime, there seems to be a fundamental difference in how the visual system generalizes to new inputs from a distribution that has not been previously experienced. The ability to generalize beyond the standard assumption of independent and identically distributed (i.i.d.) samples at test time would be highly desirable for machine learning algorithms, as many real-world applications involve such shifts in the input distribution. For instance, the recognition systems of self-driving cars should be robust against a large spectrum of weather phenomena that they might not have experienced at training time, such as ash falling from a nearby volcano. Thus, general robustness against input distortions from different types of noise can be used as one relevant case study to test generalization beyond the i.i.d. assumption in machines and humans.

One key problem in making networks less vulnerable to adversarial examples is the difficulty of reliably evaluating model robustness. It has been repeatedly shown () that virtually all defenses against adversarial examples proposed in the literature do not increase model robustness per se but merely prevent existing attacks from properly finding minimal adversarial examples. Until recently, the only defense considered effective () was a particular type of training explicitly designed to guard against adversarial attacks (). However, a recent paper () showed that the defending network does not learn more causal, human-like features but instead exploits the binary nature of the dataset (MNIST, a collection of handwritten digits) and is thus unlikely to generalize to natural images. Thus, current networks are not robust to adversarial examples, even on the simplest toy datasets of machine learning, such as MNIST. Understanding why the only existing robust systems, biological visual systems, are not vulnerable to adversarial perturbations could provide important guidance for the next generation of DNNs.

(D) Deep networks have a texture bias. When the shape and the texture of a class are put in conflict, deep networks tend to decide based on the texture while humans decide based on the shape ().

(C) Examples of original and texturized images using neural style transfer. A vanilla VGG-16 still reaches high accuracy on the texturized images, while humans suffer greatly from the loss of global shapes in many images ().

(B) When networks are trained on standard color images and tested on color images, they outperform humans (top). Similarly, when trained and tested on images with the same type of noise, the performance is superhuman (middle). However, when tested on a different type of noise than at training time, the performance is at chance level (bottom). Human observers have no trouble classifying the images correctly ().

(A) While DNNs trained on object recognition reach almost human-like performance on clean images (top), there exist minimal perturbations (center) that, if added to the image, can completely derail their prediction (bottom). Perceptually, humans can see almost no difference between the clean and the perturbed image even with close inspection ().

A particularly striking example of the gap between humans and machines is given by the “minimal adversarial perturbations” of input images discovered for computer vision networks (). Adversarial perturbations are virtually imperceptible to humans but can flip the prediction of DNNs to any desired target class (Figure 2A). This means that the decision boundaries of all classes are extremely close to any given input sample. To the best of our current knowledge, this is not the case for humans under normal viewing conditions (one study finds a small effect under time-limited viewing conditions []) and highlights that DNNs lack human-level scene understanding and do not rely on the same causal features as humans for visual perception.
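Such perturbations are easy to construct once one has gradient access to the network. The sketch below uses the simplest gradient-based attack, the fast gradient sign method; it is an illustrative stand-in rather than the attacks used in the cited work (which are typically stronger, iterative methods), and the perturbation budget and model are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_targeted(model, image, target_class, epsilon=2 / 255):
    """Compute a small perturbation that pushes the prediction toward a chosen
    target class (fast gradient sign method; illustrative only).

    `image` is a batched tensor with pixel values in [0, 1]; `target_class` is a
    tensor of class indices. `epsilon` bounds the per-pixel change, so the
    perturbed image is visually almost indistinguishable from the original."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), target_class)
    loss.backward()
    # Step *against* the gradient of the target-class loss, i.e., make the target
    # class more likely, while keeping pixel values in the valid range.
    adversarial = (image - epsilon * image.grad.sign()).clamp(0, 1)
    return adversarial.detach()
```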

Humans have impressive generalization capabilities, and behavioral neuroscience studies suggest that the ability to generalize categories and rules to novel situations and stimuli is also present in many other animals, including rodents (), birds (), and monkeys (). The exact meaning of “generalization beyond the training set” is much harder to define for animals that have had a lifetime of diverse visual experience with natural scene statistics () and generations of ancestors that were selected through evolutionary pressure to have a good architecture in that environment (). Nonetheless, it is clear that artificial networks lack several key generalization capabilities compared to biological brains.

The impressive, sometimes superhuman, performance of DNNs in many complex perceptual tasks might suggest that their sensory representations and decision making are similar to those of humans. Indeed, there seems to be an overlap between the sensory representations that DNNs trained on object recognition tasks create and representations measured in primate brains (). However, even though DNNs perform well when the conditions at training and test time do not differ too much, testing them outside of their training domain demonstrates that the nature of their generalization and decision making is qualitatively different from that of biological sensory systems.

3. Better Generalization through Constraints

In some sense, it is surprising that ImageNet can be solved to high accuracy using only bags of small visual words. This finding alone already suggests that DNNs trained on this task learn only the statistical regularities present in local image features, since the objective function used during training exerts no selective pressure to do otherwise. Learning to extract larger image features such as global object shapes, which are highly variable and each appear only a small number of times (at most the number of training images per class), is much more challenging than learning the statistical relationship between class identity and the thousands of local image features present in each sample. This inductive argument, as well as the additional evidence presented above, suggests that object recognition alone is insufficient to force DNNs to learn a more physical and causal representation of the world.

The shortcomings described above suggest that the next generation of intelligent algorithms will not be achieved by following the current strategy of making networks larger or deeper. Perhaps counterintuitively, it might be the exact opposite. We already know that networks have enough capacity to express most functions, because even the class of networks with only a single hidden layer of sigmoidal neurons can in theory approximate any continuous function provided there are enough neurons (). Even with a limited number of neurons, there is currently little evidence that deep networks are limited in their capacity to fit our current datasets. In fact, one of the first steps of practitioners is often to overfit the network on the training data to verify that it has adequate capacity for a particular dataset. Similarly, the study on noise robustness discussed above shows that networks can be trained on each single type of noise distortion, suggesting that network capacity is not the limiting factor. Thus, there is probably a very large set of networks, our visual system included, that can solve individual tasks such as ImageNet, but they might use vastly different solution strategies and exhibit quite different robustness and generalization properties. This implies that our current datasets, even though they contain millions of examples, simply do not provide enough constraints to direct us toward a solution that is similar enough to our visual system to exhibit its desirable robustness and generalization properties. Therefore, the challenge is to come up with learning strategies that single out the well-generalizing networks among the many networks that can fit a particular dataset. One way to do that is to constrain the class of networks so as to narrow it down to solutions that generalize well. In other words, we need to add more bias to the class of models.
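The capacity check mentioned above, deliberately overfitting a small, fixed batch, is a common practitioner sanity check and can be sketched as follows. The function name, step count, and threshold are illustrative choices; passing the check says nothing about generalization, only about expressive power.

```python
import torch
import torch.nn as nn

def can_overfit(model, images, labels, steps=500, lr=1e-3):
    """Sanity-check sketch: drive the training loss on a small, fixed batch
    toward zero. Success indicates the model has enough capacity to fit the
    data; it does not indicate that it will generalize."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loss = loss_fn(model(images), labels)
    for _ in range(steps):
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item() < 1e-3
```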

It is helpful to distinguish two types of bias, which we will call “model bias” and “inductive (or learning) bias.” Model bias works like a prior probability in Bayesian inference: given some input that is inevitably ambiguous, a “fixed” network will favor certain interpretations over others or may exclude some interpretations entirely. Inductive (or learning) bias determines which fixed network is picked by the learning algorithm from the class of models given the set of training data. By “class of models,” we mean a set of functions from inputs to predictions. A learning algorithm picks one function from that set of functions (also called the “hypothesis space”). For instance, for a given network architecture, all networks with different values for their synaptic weights constitute a model class. Once the weights are fixed, we get a single model from that class, with its own model bias, that is, its own way of interpreting new inputs. However, the model class could be much bigger and also include models with different network architectures. Which weights are learned (i.e., which inductive bias comes to bear) is affected by many aspects, such as the architecture, the learning rule or optimization procedure, the order in which data are presented, and the initial conditions of the system. A good learning system for a particular problem will have an inductive bias that chooses networks that generalize well. Importantly, the inductive bias is ultimately problem specific. Mathematically, there is no universal inductive bias that works well on all problems (). In the following, we mainly discuss ideas for how neuroscience can be used to influence the inductive bias of artificial systems.

Figure 3. Improving Deep Networks at Three Levels

(A) Instead of training networks on narrow tasks like object classification, it will be better to use “multi-task training,” where the network is rewarded for correct performance on diverse low- and high-level tasks involving latent variables across scales and complexity. Networks can also be trained to generate latent representations that are similar to those observed in functioning brains. Finally, networks can be endowed with biological structure at the implementational level, matching architectural and/or microcircuit features. These types of improvements relate to Marr’s three levels of analysis ().

(B) These different levels provide complementary constraints on the space of possible solutions. Many network architectures are so expressive that they can not only learn to provide natural images with appropriate labels but can even learn to match randomly permuted labels (). Such networks generalize weakly within their training set but perform poorly outside of that set. Multi-task training for the same network provides additional restrictions (blue). We get additional constraints by enforcing that hidden layers in artificial networks can predict neural responses, thereby pulling representations toward those of a successful strong generalization machine, the brain (red). Finally, network structures and operations can be constrained to mimic canonical operations measured in the brain (green). We expect that the intersection of these constraints will produce networks with stronger generalization performance.

Biological systems can provide a source for inductive biases in several ways (Figure 3). First, biological organisms need to learn continually with the same neural network and thus critically rely on generalization across different tasks and domains (). The more tasks that have to be solved with a single network, the fewer networks can solve all of them, and thus the stronger the resultant inductive bias on the class of models. The challenge is to define a selection of tasks that synergistically lead to a better bias and on which a single network can achieve high generalization performance (see () for a comparison of tasks in transfer learning). Because humans and other biological systems already solve a large number of tasks with one brain, they can be a good source of inspiration for selecting tasks. Second, neurophysiological data provide a window into the evolved representations of a strongly generalizing network: the brain. By constraining an artificial network to match those representations (for example, by predicting the neural responses), we may bias the network toward reproducing the encoded latent variables that facilitate brain-like generalization. Third, the structure of a specific network introduces a particular inductive bias.
This structure may be specified at a coarse scale, such as the number and size of hidden layers, the connectivity between them, or the extent of neuromodulation, and at a fine scale, such as the cell types, nonlinearities, canonical wiring rules in a local circuit, and local plasticity rules. One may also attempt to define structure at even smaller scales, looking at dendritic morphology or ion channel distributions (). This level of detail is, at present, only weakly constrained by available neuroscience data, and the benefits for machine learning remain unclear, so we will not consider this finest scale here. Instead, we will focus on nonlinear input-output relationships and patterns of synaptic weights in assemblies of neurons. In the next sections, we describe how constraints at these three levels (computational, representational, and implementational) could help create better machine learning models as well as better models of the brain.
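To illustrate how the computational and representational constraints of Figure 3 might be combined in practice, the following is a minimal sketch of a joint objective: a shared representation is trained both to perform a task and to predict recorded neural responses via a linear readout. The architecture, the MSE readout loss, and the loss weighting are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn as nn

class NeurallyConstrainedClassifier(nn.Module):
    """Sketch of a network trained under two complementary constraints:
    perform a task (computational level) and predict recorded neural
    responses from a shared hidden representation (representational level)."""

    def __init__(self, num_classes, num_recorded_neurons, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, hidden, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(hidden, num_classes)
        # Linear readout from the shared representation to recorded neurons.
        self.neural_readout = nn.Linear(hidden, num_recorded_neurons)

    def forward(self, images):
        h = self.features(images)
        return self.classifier(h), self.neural_readout(h)

def joint_loss(model, images, labels, neural_responses, weight=1.0):
    """Task loss plus a term pulling the representation toward measured brain responses."""
    logits, predicted_responses = model(images)
    task_loss = nn.functional.cross_entropy(logits, labels)
    neural_loss = nn.functional.mse_loss(predicted_responses, neural_responses)
    return task_loss + weight * neural_loss
```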