As we saw above, Bayesian methods provide a powerful framework for studying human behavior, and adaptive processes in particular. For instance, [55] first defined a multi-layered generative model for sequences of input stimuli. By inverting this stochastic forward process, they could extract relevant descriptors at the different levels of the model and fit these parameters to the recorded behavior. Here, we use a similar approach, focusing specifically on the binary switching generative model defined in Eq 1. To begin, we define as a control a first ideal observer, the leaky integrator (or forgetful agent), which has an exponentially-decaying memory for the events that occurred in past trials. This agent can equivalently be described as one which assumes that volatility is stationary, with a fixed characteristic frequency of switches. Then, we extend this model to an agent which assumes the existence of (randomly occurring) switches, that is, an agent equipped with the prior knowledge that the value of the probability bias may change at specific (yet randomly drawn) trials, as defined by the forward probabilistic model in Eq 1.

The leaky integrator ideal observer represents a classical, widespread and realistic model of how trial history shapes adaptive processes in human behavior [59]. It is also well suited to model motion expectation in the direction-biased experiment which leads to anticipatory pursuit. In this model, given the sequence of observations $x_0, \ldots, x_t$ (where $x_t = 1$ codes for a rightward target motion and $x_t = 0$ for a leftward one), the expectation $\hat{p}_{t+1}$ of the probability for the next trial's direction can be computed with a simple heuristic [59]: this probability is the weighted average of the previously predicted probability, $\hat{p}_t$, with the new information $x_t$, where the weight corresponds to a leak term (or discount) equal to $1 - h$, with $h \in [0, 1]$. At trial $t$, this model can be expressed with the following equation:

$\hat{p}_{t+1} = (1 - h) \cdot \hat{p}_t + h \cdot x_t$ (2)

where $\hat{p}_0$ is equal to some prior value (0.5 in the unbiased case), corresponding to the best guess at $t = 0$ (prior to the observation of any data).
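As an illustration, the recursion in Eq 2 amounts to a one-line update per trial. The following minimal Python sketch (our own illustration; the function and variable names are ours, not the authors') implements this forgetful agent:

```python
import numpy as np

def leaky_integrator(x, h, p0=0.5):
    """Eq 2: exponentially-weighted moving average of binary outcomes.

    x  : sequence of binary observations (1 = rightward, 0 = leftward)
    h  : hazard rate, trading reactivity (h ~ 1) against stability (h ~ 0)
    p0 : prior belief before any observation
    """
    p_hat = np.empty(len(x) + 1)
    p_hat[0] = p0
    for t, x_t in enumerate(x):
        # weighted average of the previous prediction and the new datum
        p_hat[t + 1] = (1 - h) * p_hat[t] + h * x_t
    return p_hat
```

For instance, with $h = 0.05$ (that is, $\tau = 20$ trials), the prediction tracks slow changes in the probability bias while smoothing out trial-to-trial noise.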

In other words, the predicted probability $\hat{p}_{t+1}$ is computed from the integration of previous instances with a progressive discount of past information. The value of the scalar $h$ represents a compromise between responding rapidly to changes in the environment ($h \approx 1$) and not prematurely discarding information still of value in slowly changing contexts ($h \approx 0$). For that reason, we call this scalar the hazard rate, by analogy with that defined for the binary switching generative model presented above (see Eq 1). Moreover, one can define $\tau = 1/h$ as a characteristic time (in units of number of trials) for the temporal integration of information. Looking more closely at this expression, the “forgetful agent” computed in Eq 2 consists of an exponentially-weighted moving average (see Appendix). It may thus be equivalently written in the form of a time-weighted average:

$\hat{p}_{t+1} = (1 - h)^{t+1} \cdot \hat{p}_0 + h \cdot \sum_{i=0}^{t} (1 - h)^{i} \cdot x_{t-i}$ (3)

The first term corresponds to the discounted effect of the prior value, which tends to 0 as $t$ increases. More importantly, as $1 - h < 1$, the second term corresponds to the leaky integration of novel observations. Inversely, let us now assume that the true probability bias for direction changes randomly with a mean rate of once every $\tau$ trials: $h = 1/\tau$. As a consequence, the probability that the bias does not change at a given trial is $1 - h$. Assuming independence of these occurrences, the probability that the bias has not changed during the last $i$ trials is $(1 - h)^i$, and the predicted probability is thus proportional to the sum of the past observations weighted by this belief, exactly as defined by the second term of the right-hand side in Eq 3. This shows that, assuming that changes occur at a constant rate $h$ but ignoring more precise information on the temporal occurrence of the switch, the optimal solution to this inference problem is the ideal observer defined in Eq 3, which admits an online recursive solution in Eq 2. We have therefore proved that the heuristic of the leaky integrator is an exact inversion of the two-layered generative model under the assumption of a constant rate of switches of the probability bias.
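To make the equivalence between the recursive form (Eq 2) and the explicit time-weighted average (Eq 3) concrete, one can check numerically that both computations agree. A short sketch, reusing the hypothetical leaky_integrator function defined above:

```python
import numpy as np

rng = np.random.default_rng(42)
h, t_max = 0.05, 100
x = (rng.random(t_max) < 0.75).astype(float)  # biased binary sequence

# Eq 3: explicit time-weighted average, evaluated at the final trial
i = np.arange(t_max)
p_eq3 = (1 - h) ** t_max * 0.5 + h * np.sum((1 - h) ** i * x[::-1])

# Eq 2: online recursion
p_eq2 = leaky_integrator(x, h)[-1]

assert np.isclose(p_eq2, p_eq3)  # both forms yield the same prediction
```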

The correspondence that we proved between the weighted moving average heuristic and the forgetful agent model, as an ideal solution to that generative model, leads us to several interim conclusions. First, the time series of inferred values $\hat{p}_t$ can serve as a regressor for behavioral data to test whether human observers follow a similar strategy. In particular, the free parameter of the model, $h$, may be fitted to the behavioral dataset. Testing different hypotheses for the value of $h$ thus allows us to infer the agents' most likely belief in the (fixed) weight decay. Now that we have defined a first generative model and the corresponding ideal observer (the forgetful agent), we next define a more complex model in order to overcome some of the limits of the leaky integrator. Indeed, a first criticism could be that this model is too rigid and does not sufficiently account for the dynamics of contextual changes [60], as the weight decay corresponds to assuming a priori a constant precision in the data sequence, contrary to more elaborate Bayesian models [61]. It seems plausible that the memory size (or history length) used by the brain to infer any event probability can vary, and that this variation could be related to an estimate of environmental volatility as inferred from past data. The model presented in Eq 3 uses a constant weight profile for all trials, while the actual precision of each trial could be evaluated and used for a precision-weighted estimation of the probability bias. To address this hypothesis, our next model is inspired by the Bayesian Change-Point detection model [58] of an ideal agent inferring the trajectory in time of the probability bias $p_t$, while also predicting the probability of the occurrence of switches.
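Before turning to that model, here is a minimal sketch of how the free parameter $h$ could be fitted to behavioral data, as suggested above. This grid search over the Bernoulli log-likelihood of binary responses is our own illustration, not the study's actual fitting procedure, and it reuses the hypothetical leaky_integrator defined earlier:

```python
import numpy as np

def fit_hazard_rate(x, responses, h_grid=np.linspace(0.01, 0.99, 99)):
    """Select the hazard rate maximizing the Bernoulli log-likelihood
    of the observed binary responses under the Eq 2 predictions."""
    best_h, best_ll = None, -np.inf
    for h in h_grid:
        p = leaky_integrator(x, h)[:-1]  # prediction available before each trial
        eps = 1e-9  # guard against log(0)
        ll = np.sum(responses * np.log(p + eps)
                    + (1 - responses) * np.log(1 - p + eps))
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h
```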

Binary Bayesian Change-Point (BBCP) detection model.

There is a crucial difference between the forgetful agent presented above and an ideal agent which would invert the (generative) Binary Switching model (see Eq 1). Indeed, at any trial during the experiment, the agent may infer beliefs about the volatility, which itself drives the trajectory of the probability bias $p_t$. Knowing that the latter is piece-wise constant, an agent may hold a belief over the number of trials since the last switch. This number, called the run-length $r_t$ [58], is useful in two ways. First, it allows the agent to restrict the prediction of $\hat{p}_{t+1}$ to only those samples produced since the last switch, from $t - r_t$ until $t$. Indeed, the samples which occurred before the last switch were drawn independently of the present true value and thus cannot help in estimating the latter. As a consequence, the run-length is a latent variable that captures, at any given trial, the full set of hypotheses about when the last switch occurred. Second, it is known that the precision (that is, the inverse of the variance) of this estimate grows linearly with the number of samples: the longer the run-length, the sharper the corresponding (probabilistic) belief. We have designed an agent inverting the binary switching generative model by extending the Bayesian Change-Point (BCP) detection model [58]. That model defines the agent as an inversion of a switching generative model for which the observed data (input) is Gaussian. We present here an exact solution for the case of the Binary Switching model, that is, for which the input is binary (here, left or right).
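The linear growth of precision with sample count can be checked directly: for a beta-distributed belief over the bias, the posterior variance after $n$ Bernoulli samples shrinks approximately as $1/n$. A quick illustration (our own, with an assumed flat prior):

```python
from scipy.stats import beta

p_true = 0.75
for n in [4, 16, 64, 256]:
    # posterior after n samples at rate p_true, with a flat Beta(1, 1) prior
    a, b = 1 + p_true * n, 1 + (1 - p_true) * n
    print(n, 1 / beta(a, b).var())  # precision grows roughly linearly in n
```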

In order to define the change-point (switch) detection model in all generality, we initially describe the fundamental steps leading to its construction, while providing the full algorithmic details in Appendix. The goal of predictive processing at trial $t$ is to infer the probability of the next datum $x_{t+1}$ knowing what has been observed until that trial (which we denote by $x_{0:t}$). This prediction uses the agent's prior knowledge that the data is the output of a given (stochastic) generative model (here, the Binary Switching model). To derive a Bayesian predictive model, we introduce the run-length as a latent variable which gives the agent the possibility to represent different hypotheses about the input. We therefore draw a computational graph (see Fig 2A) where, at any trial, a hypothesis is formed on as many “nodes” as there are run-lengths. Note that run-lengths are limited by the total number of trials $t$. As a readout, we can use this knowledge of the predictive probability conditioned on the run-length to compute the marginal predictive distribution:

$\Pr(x_{t+1} \mid x_{0:t}) = \sum_{r_t} \Pr(x_{t+1} \mid r_t, x_{0:t}) \cdot \Pr(r_t \mid x_{0:t})$ (4)

where $\Pr(x_{t+1} \mid r_t, x_{0:t})$ is the probability of the Bernoulli trial modeling the outcome of a future datum $x_{t+1}$, conditioned on the run-length, and $\Pr(r_t \mid x_{0:t})$ is the probability of each possible run-length given the observed data. Note that, at any trial, there is a single true value of the variable $r_t$; $\Pr(r_t \mid x_{0:t})$ thus represents the agent's inferred probability distribution over the run-length. As a consequence, it is scaled such that $\sum_{r_t} \Pr(r_t \mid x_{0:t}) = 1$.
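In code, the readout of Eq 4 is a dot product between the per-node predictions and the run-length posterior. A minimal sketch (our own variable names):

```python
import numpy as np

def marginal_prediction(mu, r_posterior):
    """Eq 4: marginalize the per-run-length predictions over run-lengths.

    mu          : mu[r] = predicted Pr(x_{t+1} = 1) under run-length r
    r_posterior : r_posterior[r] = Pr(r_t = r | x_{0:t}), summing to 1
    """
    assert np.isclose(r_posterior.sum(), 1.0)
    return float(np.dot(mu, r_posterior))
```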

Fig 2. Binary Bayesian Change-Point (BBCP) detection model. (A) This plot shows a synthesized sequence of 13 events, either a leftward or rightward movement of the target (TD). Run-length estimates are expressed as hypotheses about the length of an epoch over which the probability bias was constant, that is, the number of trials since the last switch. Here, the true probability bias switched from a value of .5 to .9 at trial 7, as can be seen from the trajectory of the true run-length (blue line). The BBCP model tries to capture the occurrence of a switch by inferring the probability of the different possible run-lengths. At each new datum (trial), this defines a Hidden Markov Model as a graph (trellis), where edges indicate that a message is being passed to update each node's probability (as represented by arrows from trial 13 to 14). Black arrows denote a progression of the run-length at the next step (no switch), while gray lines stand for the possibility that a switch happened: in this case the run-length would fall back to zero. The probability of each node is represented in grey scale (darker grey denotes higher probability) and the distribution is shown in the inset for two representative trials, 5 and 11. Overall, this graph shows how the model integrates information to accurately identify a switch and produce a prediction for the next trial (e.g. for t = 14). (B) On a longer sequence of 200 trials, representative of a trial block of our experimental sequence (see Fig 1A), we show the actual events observed by the agent (TD), along with the (hidden) dynamics of the true probability bias $P_{\text{true}}$ (blue line), the value inferred by a leaky integrator ($P_{\text{leaky}}$, orange line) and the results of the BBCP model in estimating the probability bias ($P_{\text{BBCP}}$, green line), along with the .05 and .95 quantiles (shaded area). This shows that the accuracy of the predicted value of the probability bias is higher for the BBCP model than for the leaky integrator. Below, we show the belief (in grayscale) over the different possible run-lengths. The green and orange lines correspond to the mean run-length inferred, respectively, by the BBCP and leaky models. Note that in the BBCP, while it takes some trials to detect switches, they are in general correctly identified (transitions between diagonal lines) and integration is thus faster than for the leaky integrator, as illustrated by the inferred value of the probability bias. https://doi.org/10.1371/journal.pcbi.1007438.g002

With these premises, we define the BBCP as a prediction/update cycle which connects the nodes of the previous trial to those of the current trial. Indeed, we will predict the probability at each node, knowing either an initial prior or its value at the previous trial. In particular, at the occurrence of the first trial, we know for certain that there is a switch, and initial beliefs are thus set to $\Pr(r_0 = 0) = 1$ and, $\forall r > 0$, $\Pr(r_0 = r) = 0$. Then, at any trial $t > 0$, as we observe a new datum $x_t$, we use the knowledge of $\Pr(r_{t-1} \mid x_{0:t-1})$ at trial $t - 1$, the likelihood $\Pr(x_t \mid r_{t-1}, x_{0:t-1})$ and the transition probabilities $\Pr(r_t \mid r_{t-1})$ defined by the generative model to predict the beliefs over all nodes:

$\Pr(r_t \mid x_{0:t}) \propto \sum_{r_{t-1}} \Pr(x_t \mid r_{t-1}, x_{0:t-1}) \cdot \Pr(r_t \mid r_{t-1}) \cdot \Pr(r_{t-1} \mid x_{0:t-1})$ (5)

In the computational graph, Eq 5 corresponds to a message passing from the nodes at time $t - 1$ to those at time $t$. We will now detail how to compute the transition probabilities and the likelihood.

First, knowing that the data is generated by the Binary Switching model (see Eq 1), the run-length is either null at the moment of a switch, or incremented by 1 if no switch occurred:

$r_t = \begin{cases} 0 & \text{if a switch occurred at trial } t \\ r_{t-1} + 1 & \text{otherwise} \end{cases}$ (6)

This may be illustrated by a graph in which information is represented at the different nodes for each trial $t$. Eq 6 defines the transition matrix $\Pr(r_t \mid r_{t-1})$ as a partition into two exclusive possibilities: either there was a switch or not. It allows us to compute the growth probability for each run-length. On the one hand, the belief of an increment of the run-length at the next trial is:

$\Pr(r_t = r + 1 \mid x_{0:t}) = \frac{1}{B} \cdot \Pr(x_t \mid r_{t-1} = r, x_{0:t-1}) \cdot (1 - h) \cdot \Pr(r_{t-1} = r \mid x_{0:t-1})$ (7)

where $h$ is the scalar defining the hazard rate. On the other hand, it also allows us to express the change-point probability as:

$\Pr(r_t = 0 \mid x_{0:t}) = \frac{1}{B} \cdot \sum_{r_{t-1}} \Pr(x_t \mid r_{t-1}, x_{0:t-1}) \cdot h \cdot \Pr(r_{t-1} \mid x_{0:t-1})$ (8)

with $B$ such that $\sum_{r_t} \Pr(r_t \mid x_{0:t}) = 1$. Note that $\Pr(r_t = r_{t-1} + 1 \mid r_{t-1}) + \Pr(r_t = 0 \mid r_{t-1}) = (1 - h) + h = 1$ and thus the transition matrix is properly normalized. Knowing these transition probabilities and the previous value of the prediction, we can therefore make a prediction for our belief about the probability bias at the next trial $t + 1$, prior to the observation of a new datum, and resume the prediction/update cycle (see Eqs 4, 7 and 8).
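In code, the two branches of Eqs 7 and 8 translate into a single pass over the run-length distribution: each node either grows by one (weighted by $1 - h$ and the likelihood of the new datum) or collapses onto the node $r_t = 0$ (weighted by $h$). A compact sketch under our notational assumptions:

```python
import numpy as np

def update_run_length_posterior(r_post, likelihood, h):
    """One update of the run-length beliefs (Eqs 5-8).

    r_post     : r_post[r] = Pr(r_{t-1} = r | x_{0:t-1})
    likelihood : likelihood[r] = Pr(x_t | r_{t-1} = r, x_{0:t-1})
    h          : hazard rate
    """
    new_post = np.zeros(len(r_post) + 1)
    # growth (Eq 7): no switch, each run-length increments by one
    new_post[1:] = likelihood * (1 - h) * r_post
    # change-point (Eq 8): a switch funnels all mass onto r_t = 0
    new_post[0] = np.sum(likelihood * h * r_post)
    return new_post / new_post.sum()  # normalization constant B
```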

Within this cycle, we update beliefs at all nodes by computing the likelihood of the current datum given the current belief at each node, that is, based on observations from trials 0 to $t - 1$. A major algorithmic difference with the BCP model [58] is that here, the observed data is a Bernoulli trial and not a Gaussian random variable. The random variable to be inferred is the probability bias $p_t$ used to generate the sequence of events $x_{0:t}$. We will infer it for all different hypotheses on $r_t$, that is, knowing there was a sequence of $r_t$ Bernoulli trials with a fixed probability bias in that epoch. Each such hypothesis allows us to compute the corresponding distribution through a simple parameterization. Mathematically, a belief about the random variable $p_t$ is represented by the conjugate probability distribution of the binomial distribution, that is, by the beta-distribution. It is parameterized here by its sufficient statistics, the mean $\mu$ and sample size $\nu$ (see Appendix for our choice of parameterization). First, at the occurrence of a switch (for the node $r_t = 0$), beliefs are set to prior values (before observing any datum): $\mu_t^{(0)} = \mu_{\text{prior}}$ and $\nu_t^{(0)} = \nu_{\text{prior}}$. By recurrence, one can show that at any trial $t > 0$, the sample size can be updated from the previous trial following:

$\nu_t^{(r+1)} = \nu_{t-1}^{(r)} + 1$ (9)

As a consequence, $\nu_t^{(r)}$ is the sample size corrected by the initial condition, that is, $\nu_t^{(r)} = r + \nu_{\text{prior}}$. For the mean, the series defined by $\mu_t^{(r)}$ is the average at trial $t$ over the $r + 1$ last samples, which can also be written in a recursive fashion:

$\mu_t^{(r+1)} = \mu_{t-1}^{(r)} + \frac{x_t - \mu_{t-1}^{(r)}}{\nu_{t-1}^{(r)} + 1}$ (10)

This updates, for each node, the sufficient statistics of the probability density function at the current trial.
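These updates amount to incrementing each node's sample size and moving each node's mean toward the new datum; a sketch under our parameterization:

```python
import numpy as np

def update_sufficient_statistics(mu, nu, x_t, mu_prior=0.5, nu_prior=1.0):
    """Eqs 9-10: update the beta-distribution statistics at every node.

    mu, nu : per-node mean and sample size at trial t - 1
    x_t    : new binary observation
    """
    new_mu = np.empty(len(mu) + 1)
    new_nu = np.empty(len(nu) + 1)
    # node r_t = 0: a switch just occurred, reset to the prior values
    new_mu[0], new_nu[0] = mu_prior, nu_prior
    # Eq 9: the sample size increments along each surviving run-length
    new_nu[1:] = nu + 1
    # Eq 10: recursive running average over the r + 1 last samples
    new_mu[1:] = mu + (x_t - mu) / (nu + 1)
    return new_mu, new_nu
```

Note that mu_prior = 0.5 and nu_prior = 1.0 are assumed defaults for this sketch; the authors' choice of prior is detailed in their Appendix.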

We can now detail the computation of the likelihood of the current datum with respect to the current beliefs, $\Pr(x_t \mid r_{t-1}, x_{0:t-1})$. This scalar is returned by a binary function $\ell(o; p, r)$ which evaluates, at each node $r$, the likelihood of the parameters of that node whenever we observe a counterfactual alternative outcome $o = 1$ or $o = 0$ (respectively right or left), knowing a mean bias $p$ and a sample size $r$. For each outcome, the likelihood of observing an occurrence of $o$ is the probability of a binomial random variable knowing an updated probability bias of $\frac{p \cdot r + o}{r + 1}$, a number $p \cdot r + o$ of trials going to the right and a number $(1 - p) \cdot r + 1 - o$ of trials going to the left. After some algebra, this defines the likelihood as:

$\ell(o; p, r) = \frac{1}{Z} \cdot \left( \frac{p \cdot r + o}{r + 1} \right)^{p \cdot r + o} \cdot \left( \frac{(1 - p) \cdot r + 1 - o}{r + 1} \right)^{(1 - p) \cdot r + 1 - o}$ (11)

with $Z$ such that $\ell(o = 0; p, r) + \ell(o = 1; p, r) = 1$. The full derivation of this function is detailed in Appendix. This provides us with the likelihood function and finally, at each node, the scalar value $\Pr(x_t \mid r_{t-1}, x_{0:t-1}) = \ell(o = x_t; p, r)$.
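Following the reconstruction above, the likelihood can be evaluated for both counterfactual outcomes at once and then normalized. A sketch (our own; here each node's sample size $\nu$ stands in for $r$, an assumption of this illustration):

```python
import numpy as np

def binary_likelihood(mu, nu, x_t):
    """Eq 11: per-node likelihood of the observed datum x_t.

    mu, nu : per-node mean bias and sample size (arrays over nodes)
    x_t    : observed outcome (0 or 1)
    """
    def unnormalized(o):
        right = mu * nu + o            # trials to the right, counting o
        left = (1 - mu) * nu + 1 - o   # trials to the left
        n = nu + 1
        return (right / n) ** right * (left / n) ** left

    l1, l0 = unnormalized(1), unnormalized(0)
    return (l1 if x_t == 1 else l0) / (l1 + l0)  # Z normalizes over o
```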

Finally, the agent infers at each trial the belief and parameters at each node and uses the marginal predictive probability (see Eq 4) as a readout. The probability bias is best predicted by its expected value, marginalized over all run-lengths:

$\hat{p}_{t+1} = \sum_{r_t} \mu_t^{(r_t)} \cdot \Pr(r_t \mid x_{0:t})$ (12)

Interestingly, it can be proven that if, instead of updating beliefs over run-lengths with Eqs 7 and 8, we fix the nodes' beliefs to a constant vector (independent of the observations), then the marginal probability is equal to that obtained with the leaky integrator (see Eq 2). This highlights again that, contrary to the leaky integrator, the BBCP model uses a dynamical model for the estimation of the volatility. Still, as for the latter, there is only one free parameter, the hazard rate $h = 1/\tau$, which informs the BBCP model that the probability bias switches on average every $\tau$ trials. Moreover, note that the resulting operations (see Eqs 4, 7, 8, 11 and 12) which constitute the BBCP algorithm can be implemented online, that is, the state at trial $t$ and the new datum $x_t$ are sufficient to predict all probabilities for the next trial. In summary, this prediction/update cycle exactly inverts the binary switching generative model and constitutes the Binary Bayesian Change-Point (BBCP) detection model.
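Putting the pieces together, the full prediction/update cycle runs online over trials. A condensed sketch assembling the hypothetical helper functions sketched above (an illustration under our assumptions, not the authors' reference implementation):

```python
import numpy as np

def bbcp(x, h, mu_prior=0.5, nu_prior=1.0):
    """Online BBCP: return the marginal prediction p_hat per trial (Eq 12)."""
    r_post = np.array([1.0])    # at t = 0, a switch is certain (r_0 = 0)
    mu = np.array([mu_prior])
    nu = np.array([nu_prior])
    p_hat = []
    for x_t in x:
        p_hat.append(marginal_prediction(mu, r_post))          # Eqs 4 and 12
        lik = binary_likelihood(mu, nu, x_t)                   # Eq 11
        r_post = update_run_length_posterior(r_post, lik, h)   # Eqs 5-8
        mu, nu = update_sufficient_statistics(mu, nu, x_t,
                                              mu_prior, nu_prior)  # Eqs 9-10
    return np.array(p_hat)
```

Because the per-node arrays grow by one element per trial, memory grows linearly with the number of trials; in practice, the run-length distribution can be truncated at low-probability nodes without noticeably changing the prediction.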