Vocal control in songbirds is a powerful model system for examining sensorimotor learning of complex tasks (36). The phenomenology we aim to explain arises from experimental approaches to inducing song plasticity (33). Songbirds sing spontaneously and prolifically and use auditory feedback to shape their songs toward a “template” learned from an adult tutor during development. When sensory feedback is perturbed (see below) using headphones that shift the pitch (fundamental frequency) of auditory feedback (33), birds compensate by changing the pitch of their songs so that the pitch they hear is closer to the unperturbed one. As shown in Fig. 2A, both the speed of the compensation and its maximum value, measured as a fraction of the pitch shift and referred to hereafter as the magnitude of learning, decrease with increasing shift magnitude, so that a shift of three semitones results in near-zero fractional compensation. Crucially, the small compensation for large perturbations does not reflect limited plasticity of the adult brain, since imposing the perturbation gradually, rather than instantaneously, results in a large compensation (Fig. 2B).

We use experimental data collected in our previous work (8, 33) to develop our mathematical model of learning. As detailed in ref. 39, we used a virtual auditory feedback system (8, 40) to evoke sensorimotor learning in adult songbirds. For this, miniature headphones were custom fitted to each bird and used to provide online auditory feedback in which the pitch (fundamental frequency) of the bird’s vocalizations could be manipulated in real time, with a loop delay of roughly 10 ms. In addition to providing pitch-shifted feedback, the headphones largely blocked the airborne transmission of the bird’s song from reaching the ear canals, thereby effectively replacing the bird’s natural airborne auditory feedback with the manipulated version. Pitch shifts were introduced after a baseline period of at least 3 d in which birds sang while wearing headphones but without pitch shifts. All pitch shifts were implemented relative to the bird’s current vocal pitch and were therefore “correctable” in the sense that if the bird changed its vocal pitch to fully compensate for the imposed pitch shift, the pitch of auditory feedback heard through the headphones would be equal to its baseline value. All data were collected during undirected singing (i.e., no female bird was present).

Mathematical Model.

To describe the data, we introduce a dynamical Bayesian filter model (Fig. 1A). We focus on just one variable learned by the animal during repeated singing—the pitch of the song syllables. Even though the animal learns the motor command and not the pitch directly, we do not distinguish between the produced pitch $\phi$ and the motor command leading to it, because the latter is not known in behavioral experiments. We set the mean “baseline” pitch sung by the animal as $\phi = 0$, representing the “template” of the tutor’s song, or the scalar target memorized during development; nonzero values of $\phi$ denote deviations of the sung pitch from the target.

However, while an instantaneous output of the motor circuit in our model is a scalar value of the pitch, the state of the motor learning system at each time step is a probability distribution over motor commands that the animal expects can lead to the target motor behavior. This is in contrast to the more common assumption that the state of the learning system is a scalar, usually the mean behavior, which is then corrupted by downstream noise (34). Thus, at time $t$, the animal has access to the prior distribution over plausible motor commands, $p_{\rm prior}(\phi_t)$. We remain deliberately vague about how this distribution is stored and updated in the animal’s memory (e.g., as a set of moments, or values, or samples, or something else) and focus not on how the neural computation is performed, but on modeling which computation is performed by the animal. We assume that the bird randomly selects and produces the pitch from this distribution of plausibly correct motor commands. In other words, we suggest that the experimentally observed variability of sung pitches is dominated by deliberate exploration of plausible motor commands, rather than by noise in the motor system. This is supported by the experimental finding that the variance of pitch during singing directed at a female (performance) is significantly smaller than the variance during undirected singing (practice) (4, 41).
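As a minimal numerical sketch of this assumption (not part of the experiments; the sample-based representation and the 0.5-semitone belief width are arbitrary illustrative choices), the state can be stored as a distribution from which each vocalization is drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

# State of the learner: a distribution over plausible motor commands.
# Here it is represented by samples, but moments or other summaries
# would serve equally well; the width (0.5 semitones) is made up.
plausible_commands = rng.normal(loc=0.0, scale=0.5, size=10_000)

# Each vocalization: the bird draws a command at random from its belief,
# so sung-pitch variability reflects deliberate exploration
sung_pitches = rng.choice(plausible_commands, size=500)

# The spread of produced pitches tracks the width of the stored belief
print(round(float(np.std(sung_pitches)), 2))
```

On this picture, narrowing the stored distribution (as in directed singing) directly reduces the variance of produced pitches without any change to the mean command.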

After producing a vocalization, the bird then senses the pitch of the produced song syllable through various sensory pathways. Besides the normal airborne auditory feedback reaching the ears, which we can pitch shift, information about the sung pitch may be available through other, unmanipulated pathways. For example, efference copy may form an internal short-term memory of the specific motor command produced (42). Additionally, proprioceptive sensing presumably also provides unshifted information (43). Finally, unshifted acoustic vibrations might be transmitted through body tissue in addition to the air, as is thought to be the case in studies that use pitch shifts to perturb human vocal production (44, 45).

We denote all feedback signals as $s_t^{(i)}$, where the index $i$ denotes different sensory modalities. Because sensing is noisy, feedback is not absolutely accurate. We posit that the animal interprets it using Bayes’ formula. That is, the posterior probability of which motor commands would lead to the target with no error is updated by the observed sensory signals, $p_{\rm post}(\phi_t) \propto p_{\rm likelihood}(\{s_t^{(i)}\}\,|\,\phi_t)\, p_{\rm prior}(\phi_t)$, where $p_{\rm likelihood}$ represents the probability of observing a certain sensory feedback value given that the produced motor command $\phi_t$ was the correct one. In turn, the motor command is chosen from the prior distribution, $p_{\rm prior}$, which represents the a priori probability of the command resulting in no sensory error. In other words, if the sensory feedback indicates that the pitch was likely too high, then the posterior is shifted toward motor commands that have a higher probability of producing a lower pitch and hence no sensory error—similar to how an error would be corrected in a control-theoretic approach to the same problem. We discuss this in more detail below.
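A minimal numerical sketch of this posterior update (the Gaussian shapes, widths, and the 0.5-semitone error here are illustrative choices, not fitted to data): on a grid of candidate motor commands, the update is a pointwise product of likelihood and prior, followed by normalization.

```python
import numpy as np

# Grid of candidate motor commands (deviations from target, in semitones)
phi = np.linspace(-3, 3, 601)
dphi = phi[1] - phi[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Prior: the bird initially believes commands near 0 hit the target
prior = gaussian(phi, 0.0, 0.5)
prior /= prior.sum() * dphi

# Feedback indicates the pitch came out ~0.5 semitones too high, so
# commands near -0.5 are the most likely to have been "correct"
# (note the sign flip: a positive error favors negative commands)
likelihood = gaussian(phi, -0.5, 0.4)

posterior = likelihood * prior
posterior /= posterior.sum() * dphi

mean_post = float(np.sum(phi * posterior) * dphi)
print(round(mean_post, 2))  # shifted from 0 toward -0.5
```

The posterior mean lands between the prior center and the error-correcting command, weighted by the relative widths of the two distributions, exactly as in a Kalman-style update.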

Finally, the animal expects that the motor command needed to produce the target pitch with no error may change with time because of slow random changes in the motor plant. In other words, in the absence of new sensory information, the animal must increase its uncertainty over time about which command to produce (a direct analogue of the growth of uncertainty in a Kalman filter without new measurements). This increase in uncertainty is given by $p_{\rm prop}(\phi_{t+\delta t}\,|\,\phi_t)$, the propagator of statistical field theories (46). Overall, this results in the distribution of motor outputs after one cycle of the model
$$p_{\rm prior}(\phi_{t+\delta t}) = \frac{1}{Z}\int p_{\rm prop}(\phi_{t+\delta t}\,|\,\phi_t)\, p_{\rm likelihood}(\{s_t^{(i)}\}\,|\,\phi_t)\, p_{\rm prior}(\phi_t)\, d\phi_t, \qquad [1]$$
where $Z$ is the normalization constant.
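The role of the propagator can be sketched numerically: without new feedback, one model cycle reduces to convolving the belief with the propagator, which broadens it. The Gaussian form and the drift scale below are illustrative assumptions, not fitted values.

```python
import numpy as np

phi = np.linspace(-3, 3, 601)
dphi = phi[1] - phi[0]

# Current belief about the correct motor command (illustrative width)
p = np.exp(-0.5 * (phi / 0.3) ** 2)
p /= p.sum() * dphi

# Gaussian propagator: slow random drift of the motor plant; the drift
# scale (0.2 semitones per time step) is an arbitrary illustration
sigma_drift = 0.2
kernel = np.exp(-0.5 * (phi / sigma_drift) ** 2)
kernel /= kernel.sum() * dphi

# Prediction step: convolve the belief with the propagator
p_next = np.convolve(p, kernel, mode="same") * dphi
p_next /= p_next.sum() * dphi

def variance(q):
    m = np.sum(phi * q) * dphi
    return float(np.sum((phi - m) ** 2 * q) * dphi)

# Without new sensory data, uncertainty grows by the drift variance
print(variance(p) < variance(p_next))
```

For Gaussian beliefs this reproduces the familiar additive variance growth of the Kalman prediction step: the variance after one cycle is the prior variance plus the drift variance.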

We choose $\delta t$ to be 1 d in our implementation of the model and lump together all vocalizations (which we record) and all sensory feedback (which is unknown) within one time period. That is, we look at timescales of changes across days, rather than faster fluctuations on timescales of minutes or hours. This matches the temporal dynamics of the learning curves (Fig. 2 A and B). Since the bird sings hundreds of song bouts daily, we now use the law of large numbers and replace the unknown sensory feedback for individual vocalizations by its expectation value, $s_t^{(i)} \to \bar{s}_t^{(i)}$. For simplicity, we focus on just two sensory modalities, the first affected by the headphones and the second not affected, and we remain agnostic about the exact nature of this second modality among the possibilities noted above. Thus, the expectation values of the feedbacks are the shifted and the unshifted versions of the expected value of the sung pitch, $\bar{s}_t^{(1)} = \bar{\phi}_t - \Delta$ and $\bar{s}_t^{(2)} = \bar{\phi}_t$, where $-\Delta$ is the experimentally induced shift (more on the minus sign below). Note that since $\phi_t$ is the motor command that the animal expects to produce the target pitch, the term $p_{\rm likelihood}(s_t^{(i)}\,|\,\phi_t)$ should be viewed as the probability of generating the feedback $s_t^{(i)}$ given that $\phi_t$ was the correct motor command, or as the likelihood of $\phi_t$ being the correct command given the observed $s_t^{(i)}$. This introduces a negative sign, the compensation, into the analysis—for a positive $s_t^{(i)}$, the most likely $\phi_t$ to lead to the target is negative, and vice versa. While potentially confusing, this is the same convention used in all filtering applications—a positive sensory signal means the need to compensate and lower the motor command, and a negative signal leads to the opposite. In other words, the bird uses the sensory feedback to determine what it should have sung, not only what it sang.
With that, we refer to the conditional probability distributions $p_{\rm likelihood}(s_t^{(i)}\,|\,\phi_t)$ for each sensory modality $i$ as the likelihood functions $L_i(\phi_t)$ for a certain motor command being the target given the observed sensory feedback. Thus, assuming that both sensory inputs are independent measurements of the motor output, we rewrite Eq. 1 as
$$p_{\rm prior}(\phi_{t+\delta t}) = \frac{1}{Z}\int p_{\rm prop}(\phi_{t+\delta t}\,|\,\phi_t)\, L_1(\phi_t;\Delta)\, L_2(\phi_t;0)\, p_{\rm prior}(\phi_t)\, d\phi_t, \qquad [2]$$
where $0$ and $\Delta$ represent the centers of the likelihoods (their maximum-likelihood values). This explains our choice of denoting the experimental shift $-\Delta$: the compensation by the animal is then $+\Delta$, and $L_1$ is centered on $+\Delta$ as well. Note that the likelihoods for the shifted and unshifted modalities are centered around $\Delta$ and 0, respectively, and bias the learning of what should be sung toward these centers irrespective of the current value of $\phi_t$. We emphasize again that, in this formalism, we do not distinguish motor noise from sensory noise and assume that both are smaller than the deliberate exploratory variance (which is supported by the substantial variance reduction in directed vs. undirected song). This is consistent with not distinguishing individual vocalizations and focusing on time steps of 1 d in the Bayesian update equation above.
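One full cycle of Eq. 2 can be sketched on a grid. The Gaussian likelihood and propagator shapes and all the widths below are illustrative assumptions (the paper itself argues for non-Gaussian likelihoods); the sketch only shows the mechanics of the update.

```python
import numpy as np

phi = np.linspace(-4, 4, 801)
dphi = phi[1] - phi[0]

def gaussian(x, mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dphi)

def one_day_update(prior, delta, sig1=0.5, sig2=0.5, sig_prop=0.1):
    # One cycle of Eq. 2: multiply the prior by the two likelihoods
    # (shifted channel centered at +delta, unshifted channel at 0),
    # then convolve with the propagator and renormalize
    L1 = gaussian(phi, delta, sig1)
    L2 = gaussian(phi, 0.0, sig2)
    post = L1 * L2 * prior
    kernel = gaussian(phi, 0.0, sig_prop)
    new_prior = np.convolve(post, kernel, mode="same") * dphi
    return new_prior / (new_prior.sum() * dphi)

prior = gaussian(phi, 0.0, 0.5)
delta = 1.0  # compensation target implied by a -1 semitone shift
for _ in range(5):
    prior = one_day_update(prior, delta)

# With Gaussian choices, the belief settles between 0 and delta,
# pulled toward delta/2 by the two equal-width likelihoods
mean_pitch = float(np.sum(phi * prior) * dphi)
print(round(mean_pitch, 2))
```

With equal-width Gaussian channels the two likelihoods always average, so the compensation fraction is independent of the shift size; this is the linear-in-error behavior discussed next.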

As illustrated in Fig. 1B, such Bayesian filtering behaves differently for Gaussian and heavy-tailed likelihoods and propagators. Indeed, if the two likelihoods are Gaussian, their product is also a Gaussian, centered between them. In this case, the learning speed of the animal is linear in the error $\Delta$, no matter how large this error is, which conflicts with the experimental results in songbirds and other species (5, 8, 22, 36). If the two likelihoods instead have long tails, then when the error is small their product is still a single-peaked distribution, as in the Gaussian case. However, when the error size $\Delta$ is large, the product of such long-tailed likelihoods is bimodal, with evidence peaks at the shifted and the unshifted values and a valley between them. Since the prior expectations of the animal are developed before the sensory perturbation is turned on, they peak near the unshifted value. Multiplying the prior by the likelihood then suppresses the shifted peak and hence large error signals in animal learning.
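This qualitative difference can be checked numerically. The sketch below uses Cauchy likelihoods as a stand-in for a generic heavy-tailed choice; the scale parameter, prior width, and the two shift sizes are arbitrary.

```python
import numpy as np

phi = np.linspace(-8, 8, 1601)
dphi = phi[1] - phi[0]

def cauchy(x, mu, gamma):
    # heavy-tailed ("long-tailed") likelihood, unnormalized
    return 1.0 / (1.0 + ((x - mu) / gamma) ** 2)

def n_modes(p):
    # count interior local maxima of the density on the grid
    return int(np.sum((p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])))

def frac_compensation(delta, gamma=0.5, prior_sigma=1.0):
    # posterior over "correct" commands: prior (peaked at 0) times the
    # two likelihoods centered at the shifted (delta) and unshifted (0)
    # values; return the posterior mean as a fraction of delta
    prior = np.exp(-0.5 * (phi / prior_sigma) ** 2)
    post = cauchy(phi, delta, gamma) * cauchy(phi, 0, gamma) * prior
    post /= post.sum() * dphi
    return float(np.sum(phi * post) * dphi) / delta

gamma = 0.5
prod_small = cauchy(phi, 0.5, gamma) * cauchy(phi, 0, gamma)
prod_large = cauchy(phi, 5.0, gamma) * cauchy(phi, 0, gamma)
print(n_modes(prod_small), n_modes(prod_large))        # unimodal vs bimodal
print(frac_compensation(0.5) > frac_compensation(5.0))  # large shifts suppressed
```

For the small shift the likelihood product is single peaked and the posterior tracks a sizable fraction of the error, while for the large shift the prior crushes the evidence peak at the shifted value, leaving near-zero fractional compensation, in line with Fig. 2A.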