Learning and motivation are driven by internal and external rewards. Many of our day-to-day behaviours are guided by predicting, or anticipating, whether a given action will result in a positive (that is, rewarding) outcome. The study of how organisms learn from experience to correctly anticipate rewards has been a productive research field for well over a century, since Ivan Pavlov's seminal psychological work. In his most famous experiment, dogs were trained to expect food some time after a buzzer sounded. These dogs began salivating as soon as they heard the sound, before the food had arrived, indicating they'd learned to predict the reward. In the original experiment, Pavlov estimated the dogs’ anticipation by measuring the volume of saliva they produced. But in recent decades, scientists have begun to decipher the inner workings of how the brain learns these expectations. Meanwhile, in close contact with this study of reward learning in animals, computer scientists have developed algorithms for reinforcement learning in artificial systems. These algorithms enable AI systems to learn complex strategies without external instruction, guided instead by reward predictions.

The contribution of our new work, published in Nature, is the finding that a recent development in computer science – which yields significant improvements in performance on reinforcement learning problems – may provide a deep, parsimonious explanation for several previously unexplained features of reward learning in the brain. It also opens up new avenues of research into the brain’s dopamine system, with potential implications for learning and motivation disorders.

A chain of prediction: temporal difference learning

Reinforcement learning is one of the oldest and most powerful ideas linking neuroscience and AI. In the late 1980s, computer science researchers were trying to develop algorithms that could learn how to perform complex behaviours on their own, using only rewards and punishments as a teaching signal. These rewards would serve to reinforce whatever behaviours led to their acquisition. To solve a given problem, it’s necessary to understand how current actions result in future rewards. For example, a student might learn by reinforcement that studying for an exam leads to better scores on tests. In order to predict the total future reward that will result from an action, it's often necessary to reason many steps into the future.

An important breakthrough in solving the problem of reward prediction was the temporal difference learning (TD) algorithm. TD uses a mathematical trick to replace complex reasoning about the future with a very simple learning procedure that can produce the same results. This is the trick: instead of trying to calculate total future reward, TD simply tries to predict the combination of immediate reward and its own reward prediction at the next moment in time. Then, when the next moment comes, bearing new information, the new prediction is compared against what it was expected to be. If they’re different, the algorithm calculates how different they are, and uses this “temporal difference” to adjust the old prediction toward the new prediction. By always striving to bring these numbers closer together at every moment in time – matching expectations to reality – the entire chain of prediction gradually becomes more accurate.
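To make the trick concrete, here is a minimal sketch of a TD update in Python. It is an illustration only, not the implementation used in any particular study. The toy task (a five-step trial with a single reward at the end), the learning rate, the lack of discounting, and every name in the code are assumptions chosen for readability.

```python
# Hypothetical toy task: a trial of five time steps ending in a single reward.
N_STEPS = 5          # time steps in one trial (assumed)
REWARD_STEP = 4      # the reward arrives on leaving the last step (assumed)
ALPHA = 0.1          # learning rate (assumed)
GAMMA = 1.0          # no discounting, for simplicity (assumed)


def reward_at(step):
    """Reward received when leaving `step`: 1.0 at the end of the trial, else 0."""
    return 1.0 if step == REWARD_STEP else 0.0


def run_trial(values):
    """Run one trial, updating `values` in place; return the TD errors seen."""
    errors = []
    for step in range(N_STEPS):
        # Prediction at the next moment; nothing follows the final step.
        next_value = values[step + 1] if step + 1 < N_STEPS else 0.0
        # Target: immediate reward plus the prediction at the next moment.
        target = reward_at(step) + GAMMA * next_value
        # The "temporal difference" between the target and the old prediction.
        td_error = target - values[step]
        # Nudge the old prediction toward the target.
        values[step] += ALPHA * td_error
        errors.append(td_error)
    return errors


def total_future_reward(step):
    """Brute-force alternative: sum every reward from `step` to the end."""
    return sum(GAMMA ** (k - step) * reward_at(k) for k in range(step, N_STEPS))


values = [0.0] * N_STEPS          # all predictions start at zero
first_errors = run_trial(values)  # errors on the very first trial
for _ in range(999):
    last_errors = run_trial(values)

# Learned prediction next to the brute-force total for each time step.
for step in range(N_STEPS):
    print(step, round(values[step], 3), total_future_reward(step))
```

On the first trial, the only surprise is at the final step, where the reward arrives unpredicted. Over repeated trials the predictions at earlier steps are pulled toward those at later steps, so the learned values approach the totals that the brute-force sum computes directly, and the errors shrink toward zero.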

Around the same time, in the late 80s and early 90s, neuroscientists were struggling to understand the behaviour of dopamine neurons. Dopamine neurons are clustered in the midbrain, but send projections to many brain areas, potentially broadcasting some globally relevant message. It was clear that the firing of these neurons had some relationship to reward, but their responses also depended on sensory input, and changed as the animals became more experienced in a given task.

Fortuitously, some researchers were versed in the recent developments of both neuroscience and AI. These scientists noticed, in the mid-1990s, that responses in some dopamine neurons represented reward prediction errors – their firing signalled when the animal got more reward, or less reward, than it was trained to expect. These researchers therefore proposed that the brain uses a TD learning algorithm: a reward prediction error is calculated, broadcast to the brain via the dopamine signal, and used to drive learning. Since then, the reward prediction error theory of dopamine has been tested and validated in thousands of experiments, and has become one of the most successful quantitative theories in neuroscience.

Distributional reinforcement learning