Discussion

Reading about the fast-paced advances OpenAI has made these past months, I feel a growing urge to talk about their work and share my thoughts on what I believe it, and the progress of the field of AI as a whole, tells us about how biological brains work. In particular, there is this growing idea that the cognitive functions human beings seem to share are not so much due to a shared structure that innately knows how to perform a task, but are instead the result of relatively similar naive structures that, confronted with the same environment, learn to perform similar tasks. The function is the product of a functionless structure that is only able to learn a specific task because of a specific environment, rather than of a structure that can perform the task natively and merely tweaks a couple of parameters to adapt to the environment.

Tasks versus configurations: a seemingly arbitrary definition

I must admit I do not understand why they chose to talk about different tasks the way they did. In the block stacking experiment, a task is defined as a set of strings representing the positions of blocks relative to each other: the number of elements in the set defines the number of stacks, and the number of characters defines the number of blocks that need to be arranged. A task, then, is an arrangement of blocks in stacks irrespective of the absolute positions of the stacks.

Some blocks might be on the table but not part of the task
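To make the distinction concrete, here is a minimal sketch of this representation in Python. The notation is my own illustration of the convention described above, not code from the paper:

```python
# Minimal sketch of the task/configuration distinction (my own
# illustrative notation, not the paper's actual code).

# A task: a set of strings, one string per stack, read bottom-to-top.
# {"ab", "cd"} means two stacks: block A under block B, C under D.
task = {"ab", "cd"}

num_stacks = len(task)                   # 2 stacks
num_blocks = sum(len(s) for s in task)   # 4 blocks to arrange

# A configuration: the absolute starting positions of the blocks on
# the table. Infinitely many configurations map to the same task.
configuration = {
    "a": (0.12, 0.80),
    "b": (0.55, 0.33),
    "c": (0.90, 0.10),
    "d": (0.40, 0.65),
}

print(num_stacks, num_blocks)  # -> 2 4
```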

Their choice of relative position and number of stacks as the criteria separating tasks seems arbitrary. Indeed, it could also make sense to talk about different tasks based on the absolute starting positions of the blocks (what they refer to as configurations). I believe the common nature of the problem is evident to them, but for clarity they prefer not to go into the details. It does make more sense to frame the policy learning as two types of generalization, the way they do later on:

Note that generalization is evaluated at multiple levels: the learned policy not only needs to generalize to new configurations and new demonstrations of tasks seen already, but also needs to generalize to new tasks.

Just replace “tasks” with “stack orderings”. To learn the task correctly, the agent must learn an embedding able to abstract away the positions of the cubes (configuration), but also their identity (task), the number of stacks (task), and the trajectory of the demonstration (introduced briefly in the quote), in order to produce a relevant motor response.

These generalizations seem contradictory: how can the same network abstract away the cubes’ initial configuration and identity, and yet recover their absolute positions for the motor response?

This explains the need for different cooperative subnetworks during learning, each receiving different inputs, and it explains why, in the context network, the abstract representation of the task is fed lower-order information, like the cubes’ absolute positions, before the descending command is produced.
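A toy sketch of that late fusion, assuming a simple concatenation of the abstract task embedding with the concrete state; the dimensions, weights, and fusion scheme are my assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration, not the paper's values).
D_TASK, D_STATE, D_HIDDEN, D_MOTOR = 64, 16, 32, 7

# Random weights stand in for trained parameters.
W_fuse = rng.normal(size=(D_HIDDEN, D_TASK + D_STATE)) * 0.1
W_motor = rng.normal(size=(D_MOTOR, D_HIDDEN)) * 0.1

def context_step(task_embedding, absolute_state):
    """Fuse the abstract task representation with low-order, concrete
    information (e.g. absolute cube positions) right before emitting
    the descending motor command."""
    fused = np.tanh(W_fuse @ np.concatenate([task_embedding, absolute_state]))
    return W_motor @ fused  # descending command

command = context_step(rng.normal(size=D_TASK), rng.normal(size=D_STATE))
print(command.shape)  # -> (7,)
```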

You might think commenting on this distinction between task and configuration is silly, but it is essential to understand that it is, in essence, the same process of abstraction at play on different objects (and this opens the following section).

There is no learning without invariance

Transfer learning is maybe the most fascinating concept in cognition, whether in silico or in vivo; it is a very hot topic for both AI researchers and neuroscientists, and it happens to be the subject of my PhD thesis. Note that closely related concepts were explored in many fields before machine learning, and this abstract, always partially defined concept has many names. Philosophers, anthropologists and sociologists might refer to it as (post-)structuralism (Claude Lévi-Strauss, Michel Foucault), linguists will talk about syntagmas and nested tree structures (Noam Chomsky), mathematicians will probably think of homeomorphisms or invariants, and education researchers or neuroscientists may refer to it as structural learning. You might also see related concepts in the field of machine learning, like representation learning and meta-learning, which depending on the author refer either to transfer learning itself or to the learning paradigm used to perform it. When talking about deep neural networks these differences are blurred, since in essence a neural net learns to embed a certain problem (representation learning) by modifying its structure (meta-learning), usually in a noisy environment, which implies a form of transfer learning.

AI researchers and cognitive scientists often have a very concrete definition of transfer learning: it is the process that allows a system to use the knowledge acquired in one task to perform another task sharing a common compositional structure (as described in the article). Cognitive science has this notion of near and far transfer, depending on how much the two tasks seem to differ. But from a more abstract perspective, in a noisy and complex environment, all learning is a form of transfer learning, and the difference between very near and very far transfer is only a matter of shared information: again a matter of scale, not of nature.

In controlled environments, efforts are made beforehand to build a hard-coded discretisation of reality, but in fact this discretisation reproduces procedurally what transfer learning does: it unites an infinite set of states found in reality under a common enclosing structure. In essence, transfer learning refers, directly or by extension, to the process through which learning agents use invariants to build models of the world. It is a process that uses similarities, repetitions, and variations of the same to form increasingly abstract and composed representations that structure ensembles over the variance spanned by the input. In a general sense it creates the basic operations through which we manipulate groups of information, much as unions and intersections do in mathematics. It allows identities; it explains our ability to categorise objects.

Josh Tenenbaum gives an example that really spoke to me: imagine you are teaching a two-year-old child to recognise a horse for the first time. You show him a couple of pictures of different horses, then you show him the picture of another horse and the picture of a house, and ask him to tell you which one is the horse. A child will do this task quite easily, but it is still something a computer cannot do well with so few inputs (one-shot learning).

How did the child do it?

Animal recognition has been studied in children and relates to our ability to deconstruct objects into relevant parts: the color range of the fur, the size of the neck, the overall shape, etc. This ability is also what allows you to open a door you have never seen before: you have learned a motor sequence that generalizes to any situation (domain generalisation). It is also what you use to build explanatory models that simplify the world. You might indeed be surprised at first by the sudden appearance of a cuckoo from a famous Swiss clock, but by the second appearance you will expect it. Finding invariance is how a neural network learns, and those models are built unconsciously; an example is how we learn intuitively about physics before ever having heard of mathematics and numbers.
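A cartoon of this kind of one-shot recognition, assuming the invariant feature extractor has already been learned; the fixed projection and Gaussian "images" below are obviously stand-ins of my own invention:

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(image):
    """Stand-in for a learned feature extractor: the invariant 'parts'
    representation (fur color, neck size, overall shape...) the child
    already possesses. Here: a fixed projection to 8 features."""
    P = np.eye(8, 100)  # toy projection, not a trained network
    return P @ image

# 'Teaching': a couple of horse pictures -> one prototype.
horse_pictures = [rng.normal(loc=1.0, size=100) for _ in range(3)]
prototype = np.mean([embed(p) for p in horse_pictures], axis=0)

def which_is_the_horse(image_1, image_2):
    """One-shot choice: whichever candidate is closer to the prototype."""
    d1 = np.linalg.norm(embed(image_1) - prototype)
    d2 = np.linalg.norm(embed(image_2) - prototype)
    return "first" if d1 < d2 else "second"

new_horse = rng.normal(loc=1.0, size=100)
house = rng.normal(loc=-1.0, size=100)
print(which_is_the_horse(new_horse, house))  # -> "first"
```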

One may ask, for example, how fast a child born in microgravity would adapt to Earth’s gravity and learn intuitively that objects fall to the ground when dropped.

We might hypothesize that infants and most animals will revise their model unconsciously, much like a dog that takes some time to adapt to the new information when you put socks on its paws.

But for a young child, a conscious interrogation and revision of his intuitive model will take place, driven by curiosity, through language, symbols and beliefs. Our ability to consciously interrogate and change our models is fascinating; as a side note, humans may be the only species able to verbalise the process, but other species may perform similar conscious revisions.

Invariance is an obligatory property of time: if everything were always new and in no way predictable, there would still remain this unique invariant, that everything is always new and unpredictable. It is impossible to imagine a world without invariance, since there could not be a world to refer to; without invariance life would be impossible and our brains useless. Life is a machine that works only by the predictable repetition of events, the repetition of causes and effects, the cyclic reintroduction of energy into the organism. And in Life’s quest to make better use of those necessary cycles, our brain is the ultimate tool. It is a prediction machine, an adaptive organ able to find repetition dynamically and use it to interact better with the world.

This method that life chose is extremely robust to slight changes in structure. What remains the same is the world, the statistical properties of the environment; the neural structure encountering it can vary, as long as it can embed the relevant information it evolved to treat. This explains why our brains can be so different from individual to individual, even in the primary cortices, and yet share the same functions.

Nervous systems are adaptive; they do not need evolution and slow genetic mutations to alter behavior in relevant ways. A simple nervous system, such as the one found in C. elegans, serves as an innate internal coordinator and external sensor: sense food and move towards it, flee from pain, reproduce. Those simple systems were initially rigid, performing extreme approximations of our highly noisy world in order to discretize it into a small set of possible states (food on the left, heat below, etc.). Our motor and sensory abilities evolved hand in hand with our nervous system’s predictive capabilities. As our sensors became more precise, the nervous system slowly became able to modify its structure to store information and learn from experience. Initially it learned to recognise certain categories of inputs, such as types of smells or light patterns, and it also became able to learn, through trial and error, to control an increasingly complex motor system. Note that the world is so complex that our brain naturally evolved toward a learning paradigm rather than an innate procedural approach. Computationally this makes perfect sense: a simple game of Go has a state space (2×10¹⁷⁰ legal positions) far larger than the number of atoms in the universe (10⁸⁰), and as organisms become more complex, trying to hard-code approximations of all the possible states they could be in rapidly becomes intractable due to combinatorial explosion.
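The arithmetic is easy to check; even a crude upper bound on Go's state space dwarfs the atom count:

```python
# Back-of-the-envelope combinatorics: each of the 19x19 Go
# intersections is empty, black, or white, giving an upper bound of
# 3^361 board states (the number of *legal* positions is the
# oft-quoted ~2x10^170).
upper_bound = 3 ** (19 * 19)
atoms_in_universe = 10 ** 80

print(len(str(upper_bound)))                 # -> 173 digits, i.e. ~10^172
print(upper_bound > atoms_in_universe ** 2)  # -> True: beyond atoms squared
```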

Some people might believe our brain is built in such a way that it innately represents the space it is going to evolve in, that somewhere in the DNA there is a gene for what constitutes a face, or for the temporal organisation of the sound waves that make up words; that this innate knowledge is encoded at birth. Others might believe, like my philosophy teacher in high school, that existence precedes essence, and that our brain is completely and solely defined by the encounter between the organism and the world. The reality is of course more complex: for most telencephalic systems studied so far, the brain does not innately encode the function it will perform but learns it, depending on the information contained in its inputs. If the input is too poor in relevant information, the capacity to learn in those structures may have an expiration date (e.g. amblyopia). But if the innate structure does not encode the final function, the brain does have a specific structure. This structure is preserved across individuals, and individuals of the same species share common functions and drives. DNA does put a certain structure in place, a structure not able to perform its final function innately, but able to learn the complexity of specific tasks based on individual experience. It is not surprising that evolution led to the appearance of a highly effective blood-brain barrier isolating the brain from the rest of the body, as well as the meninges and the hard bone shell protecting it from the outside world: unlike other organs, whose structure is encoded in the genome, the structure of a trained brain cannot be regenerated from an innately stored model. What is fascinating is that we see the same learning mechanisms arising by analogy through the development of increasingly complex deep networks performing increasingly complex tasks.

Compositional structures are hard to see but everywhere

As a side note, it is strange that even the authors do not recognize that their first task, target reaching, has a compositional structure.

The particle reaching tasks nicely demonstrates the challenges in generalization in a simplistic scenario. However, the tasks do not share a compositional structure, making the evaluation of generalization to new tasks challenging.

Although the structure is indeed lower-level than in block stacking, and not readily accessible to experimental manipulation, the task is indeed composed of a shared structure. Approximating the world as a plane, one compositional structure is that cube identity (color) is preserved under translation, and going from block A (or a random starting position) at position (Xa1, Ya1) to block B at position (Xb1, Yb1) is part of the same higher-order compositional structure as going from block A at position (Xa2, Ya2) to block B at position (Xb2, Yb2).
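A toy policy makes that shared structure explicit: if the agent computes only the relative displacement, every translated instance of the reaching problem collapses to the same input, which is exactly the invariance it should learn. A minimal sketch, my own construction rather than the paper's policy:

```python
import numpy as np

def reach_policy(hand, target):
    """A toy reaching policy that only ever sees the *relative*
    displacement: the compositional structure shared by all reaching
    instances, invariant to translation of the plane."""
    delta = np.asarray(target) - np.asarray(hand)
    return delta / (np.linalg.norm(delta) + 1e-8)  # unit step direction

# Two 'different' instances related by a pure translation...
step1 = reach_policy(hand=(0.0, 0.0), target=(1.0, 2.0))
step2 = reach_policy(hand=(5.0, 5.0), target=(6.0, 7.0))

# ...elicit exactly the same response: one shared structure.
print(np.allclose(step1, step2))  # -> True
```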

Interfaces between networks

Assemblies of neural networks treating inputs at different levels of abstraction will need interfaces, a domain in which I believe much is left to discover. Those interfaces can take many forms. They can, for example, be seen as a common language between two networks: as demonstrated in the article, a lower-level network armed with an attention system (the demonstration network) can translate a demonstration into a representation another network (the context network) can use to direct action, whatever the length or initial configuration of the demonstration.
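A minimal sketch of such an interface, assuming simple attention pooling over time; the width of the "language" and the pooling scheme are my assumptions, not the paper's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 32  # width of the 'common language' (an assumed value)

query = rng.normal(size=D) * 0.1  # learned query; random stand-in here

def demonstration_to_language(demo_frames):
    """Attention pooling: however long the demonstration is, the
    demonstration network emits a fixed-size vector the context
    network can consume, the shared 'surface' between the two."""
    X = np.stack(demo_frames)          # (T, D), any length T
    scores = X @ query                 # (T,) attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over time
    return weights @ X                 # (D,) fixed-size message

short_demo = [rng.normal(size=D) for _ in range(5)]
long_demo = [rng.normal(size=D) for _ in range(200)]
print(demonstration_to_language(short_demo).shape,
      demonstration_to_language(long_demo).shape)  # -> (32,) (32,)
```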

The surface of this language is here a plane, fixed in size, but one can imagine alterations that could improve communication between the networks. For example, the size of the surface could grow or shrink dynamically as the networks interact during learning, hence compressing or extending the complexity of the language. We could also imagine more dynamic interactions, through feedback for example. We could imagine facilitator networks that learn to smooth communication between networks, existing as a parallel network that learns to modulate the input of the first network based on the input and output of the second network (see the sketch below). We could imagine complex context networks acting as a tonic (slowly varying) influx to multiple more specialized networks… A fascinating future area of research!
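For the facilitator idea, a purely speculative sketch; nothing like this exists in the article, and the gating scheme is entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 32

W_gate = rng.normal(size=(D, 2 * D)) * 0.1  # untrained stand-in weights

def facilitator(message, receiver_output):
    """Hypothetical 'facilitator': a parallel network that watches the
    receiver's last output and learns to gate the sender's message,
    smoothing the communication channel between the two networks."""
    context = np.concatenate([message, receiver_output])
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ context)))  # sigmoid in [0, 1]
    return gate * message  # modulated message passed on to the receiver

modulated = facilitator(rng.normal(size=D), rng.normal(size=D))
print(modulated.shape)  # -> (32,)
```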

Failure cases hint at the possible roles new modules could have

It is worth noting that errors are often due to motor mistakes, and that the number of mistakes increases with the complexity of the task.

Motor function should not deteriorate merely because the number of targets increases; this is strong evidence that the way the reproduction network learns to talk to the motor network is too abstract. It is strange, because they say their tests show that the interface between the context network and the motor network is relatively concrete (position of the robot, position of the target).

A possible solution, since this is a modular architecture, could be to use different loss functions, or modular loss functions each representing a specific aspect of the task. It would also help to have an equivalent of the brain’s premotor areas, to ensure the demonstration and context networks can remain abstract without deteriorating the motor command. Premotor regions are necessary to better localize objects based on the goal (coming from the abstract networks) and the sensory inputs, in order to select the best motor command. It seems the context network is trying both to transfer the demonstration to a higher-level embedding and to prepare motor action in the current context at the same time. A premotor network’s role would be to learn to communicate with the motor system in a goal-oriented and adaptive manner, combining the functions of the premotor cortex and the cerebellum for motor learning and fast adaptation.
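A sketch of what such a modular loss could look like, with one term per sub-goal; the decomposition and the weights are illustrative assumptions of mine, not the paper's objective:

```python
import numpy as np

def modular_loss(pred, target, weights=(1.0, 1.0, 0.1)):
    """Hypothetical modular loss: one term per aspect of the task
    (reaching the cube, grasping it, final stack correctness) rather
    than a single monolithic imitation loss."""
    w_reach, w_grasp, w_stack = weights
    reach = np.mean((pred["gripper_xy"] - target["gripper_xy"]) ** 2)
    grasp = np.mean((pred["grip"] - target["grip"]) ** 2)
    stack = np.mean((pred["block_xy"] - target["block_xy"]) ** 2)
    return w_reach * reach + w_grasp * grasp + w_stack * stack

pred = {k: np.zeros(2) if k != "grip" else np.zeros(1)
        for k in ("gripper_xy", "grip", "block_xy")}
target = {k: np.ones(2) if k != "grip" else np.ones(1)
          for k in ("gripper_xy", "grip", "block_xy")}
print(modular_loss(pred, target))  # -> 2.1 (1.0*1 + 1.0*1 + 0.1*1)
```

The appeal of a weighted sum like this is that each module can be held accountable for the sub-goal it is meant to serve, instead of every module sharing blame for a single end-to-end error.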

There is an interesting theory, Moravec’s paradox, which predicts that it is not higher-level cognition that will be computationally taxing, but the processing of sensory inputs and motor outputs. This could indeed account for the large number of neurons in our cerebellum (more than in the rest of the brain combined) devoted to the adaptive control of motor action. The paradox was formulated at a time (the 1980s) when we still believed we could embed our own knowledge into a machine to perform complex tasks in uncontrolled, noisy environments. Of course the paradox makes sense if the machine is somehow able to represent the world as a discretized set of states; building higher-level functions on top of that would be easier. But I believe both will prove extremely taxing, and the internal representations used at the interfaces between networks will be far from anything resembling our own conscious representations.

Conclusion

By combining different neural networks, each in charge of a specific treatment of the problem, this article shows that by creating a task that inherently requires generalization, and by building an appropriate learning environment through domain randomisation, a neural network with access to a memory and an attention system can learn to generalize beyond simple reproduction. It can learn to discover a higher-order goal that has been demonstrated only once in a visual stream of information, and it performs computations in a generalized space to recover the appropriate actions able to reproduce that goal in a different context.

In the future we will see increasingly complex structures built upon these atomic building blocks, able to learn to generalize complex tasks and, more importantly, to perform several such tasks, in new environments, with less reliance on hard-coded methods such as input preprocessing or memory storage. Memory storage will be replaced by distributed representations across a memory network; attentional systems will be replaced by cyclic activity in real-time attentional networks. The question remains how we will be able to adapt a strongly serial technology (Turing machines) to our increasing reliance on distributed computing in embodied systems.