These results demonstrate an inherent tradeoff between adaptability (i.e., fast response to changes in the environment) and precision (i.e., correct estimation of reward probabilities) during learning and choice under uncertainty. Tackling this tradeoff requires the brain to adjust the learning rate either over time or across environments, because the optimal learning rate can vary substantially with the levels of uncertainty and volatility.

To detect the better option within each block of trials in the PRL task, the subject has to continuously update estimates of the reward values of the two options based on reward feedback. For models with constant learning rates, such as a simple RL model with one learning parameter (RL(1)), the optimal learning rate depends on the levels of uncertainty and volatility in the environment (Figures 1B and 1C). In a relatively stable environment with reward probabilities far from 0.5 (0.8/0.2 schedule with L = 80), a moderate learning rate produces optimal performance (Figure 1B). However, in a more volatile environment (0.8/0.2 schedule with L = 20), a higher learning rate is more desirable. Finally, in an environment with greater uncertainty (0.6/0.4 schedule with L = 80), a lower learning rate is required to obtain a better estimate of reward value. Overall, the optimal learning rate of RL(1) increases as the environment becomes more volatile and slightly decreases as the environment becomes more uncertain (Figure 1C).
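To make this dependence concrete, the following minimal sketch simulates an RL(1) learner in a PRL-style environment and sweeps its single learning rate. The delta-rule update is the standard RL(1) formulation; the softmax choice rule, its inverse temperature beta, and the simplified reward draw are illustrative assumptions rather than the exact procedures specified in STAR Methods.

```python
import numpy as np

def simulate_rl1(alpha, p_better=0.8, block_len=80, n_blocks=50, beta=5.0, seed=0):
    """Fraction of better-option choices for an RL(1) model with a single learning rate."""
    rng = np.random.default_rng(seed)
    v = np.zeros(2)                     # value estimates for the two options
    better = 0                          # index of the currently better option
    correct = 0
    n_trials = n_blocks * block_len
    for t in range(n_trials):
        if t > 0 and t % block_len == 0:
            better = 1 - better         # reversal: the better option switches
        p_choose_0 = 1.0 / (1.0 + np.exp(-beta * (v[0] - v[1])))  # softmax over the two values
        choice = 0 if rng.random() < p_choose_0 else 1
        correct += int(choice == better)
        # reward drawn with the chosen option's probability (a simplification of the
        # task's reward-assignment scheme described in the text)
        p_rew = p_better if choice == better else 1.0 - p_better
        reward = float(rng.random() < p_rew)
        v[choice] += alpha * (reward - v[choice])   # RL(1) delta-rule update
    return correct / n_trials

# sweep the single learning rate within one environment (0.8/0.2 schedule, L = 80)
for alpha in (0.05, 0.1, 0.2, 0.4, 0.8):
    print(alpha, round(simulate_rl1(alpha), 3))
```

Running the same sweep for different environments (e.g., L = 80 versus L = 20, or a 0.8/0.2 versus a 0.6/0.4 schedule) should reproduce the qualitative pattern in Figures 1B and 1C: the best-performing learning rate shifts upward with volatility and slightly downward with uncertainty.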

As a platform to study learning and choice under reward uncertainty, we focused on behavior during a PRL task. In this task, the subject selects between two alternative options (e.g., colored targets), which deliver reward probabilistically (Figure 1A). The probabilities of reward on the green and red options are complementary, for example, 0.8 on the green and 0.2 on the red target. However, unknown to the subject, these probabilities switch after a certain number of trials, referred to as the block length, L. The combination of the reward probabilities on the better (more rewarding) and worse (less rewarding) options, p(B) and p(W), and the block length defines an environment in this task (e.g., 0.8/0.2 schedule with L = 80). Performing this task requires selection of the better option within a given block of trials, which is complicated by two factors: (1) the probabilistic nature of reward assignment, or expected “uncertainty”; and (2) switches in reward probabilities between blocks of trials (reversals), resulting in unexpected uncertainty, also referred to as “volatility.”
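For reference, a reward schedule of the kind shown in Figure 1A can be generated with a few lines of code. The sketch below is a hypothetical implementation that assigns reward to exactly one of the two targets on every trial and flips which target is the better one every L trials; the function name and the default values are illustrative.

```python
import numpy as np

def prl_schedule(p_better=0.8, block_len=20, n_blocks=6, seed=1):
    """Assign reward to exactly one of the two targets on every trial; the better
    target alternates every `block_len` trials (a reversal)."""
    rng = np.random.default_rng(seed)
    rewarded_target = []
    for block in range(n_blocks):
        better = block % 2                        # 0 = green better, 1 = red better
        for _ in range(block_len):
            on_better = rng.random() < p_better   # does reward land on the better target?
            rewarded_target.append(better if on_better else 1 - better)
    return np.array(rewarded_target)

schedule = prl_schedule()
print(schedule[:20])   # which target carries the reward on the first 20 trials
```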

(C) The optimal learning rate for the RL(1) model in different environments, each defined by the reward probabilities on the better and worse options and the block length, L. The optimal learning rate was smaller for more stable and, to a lesser extent, for more uncertain environments. White squares indicate sample environments chosen for further tests.

(A) Timeline of the PRL task and an example reward schedule. Subjects select between two options (e.g., red and green targets) and receive reward feedback on every trial. Reward is assigned probabilistically to one of the two targets, and the better target changes between blocks of trials. In the example shown, the probability of reward on the green target (p R (g)) alternates between 0.8 and 0.2 every 20 trials. Each cross shows the reward assignment on a given trial.

We simulated the behavior of the model in a few environments with different levels of uncertainty and volatility, using the same set of parameters. Toward the end of each block in the stable environment, synapses associated with the better option increasingly occupied more stable strong meta-states, while a small fraction of them occupied the most unstable weak meta-state (W1; Figure 2B). The volatile environment, on the other hand, made these synapses occupy mainly unstable strong meta-states or the most unstable weak meta-states (Figure 2C). This happened because, in the volatile environment, there was not enough time for synapses undergoing potentiation to reach stable strong meta-states, whereas this was possible in the stable environment. In addition to the block length, the fractions of synapses in different meta-states were also influenced by the reward probabilities (and hence uncertainty). As the reward probabilities approached 0.5, thus increasing uncertainty, more synapses occupied the more unstable weak meta-states, and the fraction of synapses in strong meta-states decreased monotonically with the stability of those meta-states (Figure 2D). This happened because, with more variable reward assignment, synapses associated with the better option were less likely to transition to stable (deep) meta-states. Overall, these results indicate that metaplastic synapses can adjust to the reward statistics of the environment, in terms of both volatility and uncertainty.

Metaplastic synapses can change their states stochastically depending on the choice and reward outcome at the end of each trial. We assumed that synapses associated with the chosen option are potentiated on rewarded trials and depressed on unrewarded trials (Figure 2A; see STAR Methods). Because reward is assigned to only one of the two options on each trial of the PRL task, we also assumed that synapses associated with the unchosen option are depressed on rewarded trials and potentiated on unrewarded trials. Due to the stochastic nature of synaptic transitions, a potentiation or depression event may or may not change the efficacy of any particular synapse, but at the population level, these events result in a well-defined learning rule (Equations 3 and 4 in STAR Methods).
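The population-level consequence of these stochastic transitions can be illustrated with a short sketch. The code below tracks the fractions of synapses in each meta-state and applies one plausible reading of the architecture in Figure 2A: a potentiation event moves synapses from weak meta-state W i to the shallowest strong meta-state with probability q i and pushes synapses in strong meta-states one level deeper with probability p i, and a depression event mirrors this. The numerical values of q i and p i are illustrative only; the exact transition probabilities are set by Equation 2 in STAR Methods, which is not reproduced here.

```python
import numpy as np

m = 4                                   # meta-states per efficacy level: W1..W4, S1..S4
q = np.array([0.40, 0.20, 0.10, 0.05])  # illustrative values; q1 > q2 > q3 > q4
p = np.array([0.20, 0.10, 0.05])        # illustrative values; p1 > p2 > p3

def potentiate(f_weak, f_strong):
    """Apply one potentiation event to the population (fractions, shallow to deep)."""
    crossing = f_weak * q               # Wi -> S1 with probability q_i
    deepening = f_strong[:-1] * p       # Si -> S_(i+1) with probability p_i
    new_weak = f_weak - crossing
    new_strong = f_strong.copy()
    new_strong[0] += crossing.sum()
    new_strong[:-1] -= deepening
    new_strong[1:] += deepening
    return new_weak, new_strong

def depress(f_weak, f_strong):
    """Depression mirrors potentiation: strong -> weak crossings, weak states deepen."""
    new_strong, new_weak = potentiate(f_strong, f_weak)
    return new_weak, new_strong

# start with all synapses in the shallowest weak meta-state, then apply potentiation events
f_weak, f_strong = np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(m)
for _ in range(10):
    f_weak, f_strong = potentiate(f_weak, f_strong)
print("synaptic strength F =", round(f_strong.sum(), 3))   # fraction in strong meta-states
```

Because the transition probabilities shrink for deeper meta-states, repeated potentiation leaves fewer and fewer synapses that can still move, which is the population-level origin of the time-dependent effective learning rates described below.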

(B) For synapses associated with the green target, the average (over many blocks) fractions of synapses in different strong (top) and weak (bottom) meta-states are plotted over time in the stable environment (0.8/0.2 schedule with L = 80). The color of the x axis indicates the better option within a given block, and the inset shows the steady-state fraction of synapses in each of the four meta-states (computed by averaging over the last 2 trials within each block).

(A) Schematic of metaplastic synapses. Metaplastic synapses have multiple meta-states associated with each of the two levels of synaptic efficacy: weak (W) and strong (S). Potentiation and depression events result in stochastic transitions between meta-states with different levels of stability; these transitions are indicated by arrows (gold for potentiation, cyan for depression) and quantified by different transition probabilities (q 1 > q 2 > q 3 > q 4 and p 1 > p 2 > p 3). We also refer to more unstable and more stable meta-states as “shallower” and “deeper” meta-states, respectively.

Here, we suggest that RDMP can provide a plausible mechanism for learning and choice under uncertainty, and we simulate the behavior of an example model based on RDMP during the PRL task. We compare the behavior of this model with three sets of models that rely on different mechanisms to deal with uncertainty and volatility in the PRL task (see STAR Methods). In our model, neurons encoding the reward value of different options (value-encoding neurons) receive their inputs via metaplastic synapses that undergo a stochastic RDMP learning rule. These metaplastic synapses have multiple meta-states with different levels of stability, associated with two levels of synaptic efficacy: weak and strong. The output of value-encoding neurons associated with a given option reflects the overall synaptic efficacy of the metaplastic synapses onto them. Because there are only two levels of synaptic efficacy, the overall efficacy of each set of metaplastic synapses can be quantified as the fraction of synapses in strong meta-states, which we refer to as the “synaptic strength.” Importantly, the RDMP learning rule enables the synaptic strengths to estimate the probability of reward for the alternative options, and the model selects between the two options stochastically, with a choice probability determined by the difference between the two synaptic strengths (see STAR Methods).
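The decision stage can be sketched as follows. A sigmoid of the difference in synaptic strengths is assumed here, with a placeholder slope parameter sigma; this stands in for, but is not necessarily identical to, Equation 1 in STAR Methods.

```python
import numpy as np

def choice_prob_green(f_green, f_red, sigma=0.1):
    """Probability of choosing the green option given the two synaptic strengths.
    The sigmoid form and the slope parameter `sigma` are illustrative assumptions."""
    return 1.0 / (1.0 + np.exp(-(f_green - f_red) / sigma))

# synaptic strengths favoring green bias the choice toward green
print(round(choice_prob_green(0.65, 0.45), 3))
```

With a rule of this form, a larger separation between the two synaptic strengths makes choice more deterministic, so the same decision stage yields near-chance behavior right after a reversal and a strong preference for the better option once the strengths have separated.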

Overall, these results show that metaplastic synapses adjust to the reward statistics of the environment. This gives rise to time-dependent learning rates that differ between synapses associated with the two alternative options (i.e., learning rates are choice specific) and between rewarded and unrewarded trials. For simplicity, here we used a specific implementation of RDMP (Equation 2 in STAR Methods), which guarantees that, at the steady state, the effective learning rate for reward assignment on the better option is larger than the one for the worse option. Nevertheless, we found that most RDMP models with different formulations of the transition probabilities exhibit such behavior, as long as there is an ordering of the transitions between meta-states that results in shallow and deep meta-states (data not shown).

Although our model based on RDMP suggests that learning rates depend on whether the reward outcome supports the better or worse choice alternative, these rates are often estimated in empirical studies based on reward outcomes alone, independently of the choice alternatives. To estimate such learning rates in our model, we computed the effective learning rates on rewarded and unrewarded trials, K rew (t) and K unr (t), by averaging the effective learning rates according to the reward outcome on a given trial (Equation 8 in STAR Methods). We found that K rew (t) was smaller than K unr (t) at the beginning of each block (Figure S1). However, as the model spent more time in a block, K rew (t) increased, whereas K unr (t) decreased, such that K rew (t) became larger than K unr (t) later in the block. These changes resulted in an overall larger learning rate for rewarded than for unrewarded trials, as observed in previous experiments. In addition to changes in the learning rates over time (i.e., from trial to trial), our model also predicts that the difference between the overall learning rates on rewarded and unrewarded trials should decrease with the uncertainty in the environment (Figure S1F). This prediction can be tested in future experiments.
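One plausible way to write this outcome-based averaging (the exact weights are given by Equation 8 in STAR Methods, which is not reproduced here, and the notation K rew, K unr, and w is ours) is as a mixture of the assignment-specific rates, weighted by how often rewarded trials arise from choices of the better versus the worse option:

$$K_{\mathrm{rew}}(t) \;\approx\; w(t)\,K_{B+}(t) \;+\; \bigl(1 - w(t)\bigr)\,K_{B-}(t),$$

where w(t) is the fraction of rewarded trials at trial t after a reversal on which the better option was chosen; K unr (t) is obtained analogously from the unrewarded trials.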

The change in the effective learning rates over time, as well as the difference between the two learning rates, was sensitive to the reward uncertainty and volatility in the environment. Specifically, the difference between K B+ (t) and K B- (t) was more pronounced in a more certain environment than in an uncertain one (compare solid and dashed curves in Figure 3A). Moreover, K B+ (t) rose to a higher value while K B- (t) fell to a lower value in the stable than in the volatile environment (compare solid curves in Figure 3A and its inset). This time-dependent adjustment to reward statistics was not specific to the example environments and was observed over a large set of environments with different levels of uncertainty and volatility. At the beginning of each block, K B+ (t) was smaller than K B- (t), and this difference was stronger (more negative) for more certain or more stable environments (Figure 3B). Over time, K B+ (t) increased while K B- (t) decreased, such that (K B+ − K B- ) became positive later in the block (Figures 3C and 3D). The difference between the steady-state values of the two learning rates increased as uncertainty and/or volatility decreased.

Immediately after each reversal in the stable environment, synapses associated with the better option in the new block were mainly in stable weak meta-states or unstable strong meta-states, because these synapses had been associated with the worse option in the previous block (Figure 2B). On trials when reward was assigned to the better option, synapses in the stable weak meta-states slowly transitioned to strong meta-states, resulting in a small effective learning rate at the beginning of each block (solid gold curve, Figure 3A). In contrast, on trials when reward was assigned to the worse option, synapses in the unstable strong meta-states quickly transitioned to weak meta-states, resulting in a large effective learning rate on those trials. Both of these effective learning rates changed over time as the distribution of synapses across meta-states adjusted to the recent reward statistics (Figures 2B–2D).

We found that the effective learning rates changed over time and depended on whether the reward was assigned to the better or the worse option. For synapses associated with the better option, the effective learning rate on trials when the worse option was assigned reward, K B- (t), monotonically decreased over time after a reversal (solid cyan curve in Figure 3A). At the same time, the effective learning rate on trials when the better option was assigned reward, K B+ (t), monotonically increased. The magnitude of these changes in the effective learning rates depended on the uncertainty and volatility, such that the changes were larger for more certain and more stable environments (see below).

(F–H) The overall change in the synaptic strength at three time points after a reversal in different environments. The model’s response to reward feedback was stronger for more certain and/or volatile environments right after reversals and this difference slowly decreased over time.

(E) Changes in the model’s response to reward feedback over time. Plotted are the changes in the synaptic strength in response to reward assignment on the better (ΔF B+ ) or worse (ΔF B- ) option, as well as the overall change in the synaptic strength (ΔF), as a function of the trial number after a reversal in the stable and uncertain environments.

(B–D) The difference between the effective learning rates at three time points after a reversal in different environments. Overall, K B+ increased while K B- decreased and their difference was larger for more certain and/or stable environments.

(A) The time course of the effective learning rate for when the reward was assigned to the better (K B+ ) or worse (K B- ) option during a given block in the stable (0.8/0.2 schedule with L = 80) and uncertain (0.6/0.4 schedule with L = 80) environments. The inset shows the results for the volatile environment (0.8/0.2 reward schedule with L = 20).

The fractions of synapses in different meta-states show how metaplastic synapses adjust to the reward statistics of a given environment. Because different meta-states have different transition probabilities, these fractions also determine the speed of learning at a given point in time. To illustrate this, we calculated the “effective” learning rates as a function of the trial number after a reversal for the two possible outcomes of reward assignment. At any point during a block, the effective learning rates provide a single set of learning rates that capture the total change in efficacy across all meta-states (see STAR Methods). By definition, the product of the effective learning rate and the fraction of synapses in weak (strong) meta-states is equal to the increase (decrease) in synaptic strength.
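In symbols, writing F(t) for the synaptic strength of the synapses associated with the better option and ΔF B± (t) for the changes plotted in Figure 3E, this definition can be restated as follows (our paraphrase of the definition above, not the exact expression in STAR Methods):

$$\Delta F_{B+}(t) \;=\; K_{B+}(t)\,\bigl(1 - F(t)\bigr), \qquad \Delta F_{B-}(t) \;=\; -\,K_{B-}(t)\,F(t),$$

where 1 − F(t) is the fraction of synapses in weak meta-states (those available for potentiation) and F(t) is the fraction in strong meta-states (those available for depression).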

Adjustments of the Model’s Response to Reward Uncertainty and Volatility

We next examined how the model endowed with metaplasticity can adjust its response according to the uncertainty and volatility in the environment. To do so, we computed changes in the model’s response to reward feedback over time. Because choice behavior is determined by the synaptic strengths (Equation 1 in STAR Methods), we first computed changes in the synaptic strengths due to the two types of reward feedback at different time points within a block of the PRL task.

We found that the change in the synaptic strength when reward was assigned to the better option, ΔF B+ (t), was large immediately after a reversal, but then it slowly decreased over time (red curves in Figure 3E). This happens because, immediately after a reversal, a large fraction of synapses associated with the currently better (previously worse) option is in weak meta-states, and the transition of these synapses due to a potentiation event results in a large ΔF B+ (t). ΔF B+ (t) then gradually decreases as fewer synapses remain in weak meta-states. On the other hand, the change in the synaptic strength when reward was assigned to the worse option, ΔF B- (t), became stronger (more negative) over the span of about ten trials after a reversal, and later gradually became weaker (blue curves in Figure 3E). Importantly, the starting points of ΔF B+ (t) and ΔF B- (t) were farther from zero but changed less over time in the uncertain compared to the stable environment (compare dashed and solid curves in Figure 3E). In contrast, we observed larger changes in response to reward feedback over trials within each block of the volatile environment (data not shown).

To measure the model’s overall response to both types of reward feedback, we also computed a weighted average of ΔF B+ (t) and ΔF B- (t) based on the reward probabilities in a given block (see STAR Methods). In the stable environment, this overall change, ΔF(t), slowly decreased to zero after each reversal as the model reached a steady state within each block (solid black curve in Figure 3E). In the uncertain environment, however, ΔF(t) was initially lower and decreased to zero more slowly (dashed black curve in Figure 3E). These results demonstrate the overall ability of our model to adjust its response based on the reward uncertainty in the environment.
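A natural reading of this weighted average (the exact expression is given in STAR Methods and not reproduced here) weights each assignment-specific change by how often that assignment occurs within the block:

$$\Delta F(t) \;\approx\; p(B)\,\Delta F_{B+}(t) \;+\; p(W)\,\Delta F_{B-}(t),$$

where p(B) and p(W) are the reward probabilities on the better and worse options (e.g., 0.8 and 0.2 in the stable environment). When p(B) and p(W) are nearly equal, the two opposing terms largely cancel, consistent with the smaller overall ΔF(t) observed in the uncertain environment.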

To further examine adjustments due to metaplasticity, we simulated our model in various environments with different levels of uncertainty and volatility using one set of parameters. Immediately after each reversal, the model showed the greatest overall response to reward feedback; this response was larger for more certain and/or volatile environments (Figure 3F). As the overall response to reward feedback gradually approached zero, it still remained sensitive to the level of uncertainty in the environment (Figures 3G and 3H). Our model’s response to reward feedback was different from that of the RL models with constant learning rates. For example, in the RL(1) model, the change in the reward value due to reward feedback was similar in the stable and volatile environments (Figure S2A). Thus, unlike our model, the RL(1) model with a constant learning rate cannot adjust to the volatility in the environment.

Considering the adjustments observed in our model, we then compared our model’s overall response to both types of reward feedback, obtained with one set of parameters, to that of the RL(1) model with the optimal learning rate in each environment (Figures S2B–S2D). The dependence on uncertainty and volatility was qualitatively similar between the two models (compare Figures 3F–3H with Figures S2B–S2D). These results show that metaplasticity enables our model to adjust its behavior to uncertainty and volatility in a way consistent with the RL model with the optimal learning rate; that is, to increase its response to reward feedback in more certain or more volatile environments. Our model’s behavior, however, is not optimal and therefore shows deviations from what is prescribed by the optimal RL(1) model.

To achieve optimality, the RL(1) model prescribes smaller learning rates for more stable and, to a lesser extent, for more uncertain environments (Figure 1C). Such adjustment of the learning rate across environments is very different from how our model adjusts learning. First, our model naturally adopts two different time-dependent learning rates for reward assignments on the better and worse options. Second, the adjustment of these learning rates over time is qualitatively similar for more uncertain and more volatile environments (Figures 3B–3D), whereas the RL(1) model prescribes that the optimal learning rate should increase with larger volatility but slightly decrease with higher uncertainty (Figure 1C). Nevertheless, this opposite adjustment for uncertainty only weakly affects the performance of our model. This is because the smaller differences between the fractions of synapses in weak and strong meta-states in more uncertain environments produce smaller responses to reward feedback than in more certain environments, irrespective of the learning rates (Figure 3F). Similar behavior occurs in the RL models due to a smaller reward prediction error in uncertain compared to more certain environments.

Our proposed RDMP model relies on an ordered architecture for transitions such that there are “shallow” and “deep” meta-states in the model. This architecture predicts that the model should be sensitive to the exact sequence of reward assignment. We found that after a sequence of consecutive reward assignments on the better option, the model responded very differently to another reward on the better option (congruent trial) versus reward on the worse option (incongruent trial), depending on the volatility of the environment (Data S1 and Figure S3). Importantly, these responses were qualitatively different from those of the RL and hierarchical Bayesian models.

To summarize, we show that reward-dependent metaplasticity offers a plausible solution for the integration of reward in environments with different levels of uncertainty and/or volatility. The RDMP model predicts that the learning rates change over time and are different depending on whether the reward is assigned to the better or worse option. Moreover, the model predicts a specific pattern of response after congruent and incongruent sequences of reward assignment, which is qualitatively different from those of alternative models.