All the previous steps measure a relative goodness of fit. Does model A fit better than model B? However, before interpreting any results from a model, it is essential to ensure that the model actually usefully captures the data in an absolute sense. This step is called model validation, and should never be skipped: it is possible to fit a model, get high fit measures, and nevertheless completely miss the essence of the behavior.

One method for model validation is computing the average trial likelihood as an absolute measure of fit. Although this measure has some nice properties — for example, the best possible value is one when the model predicts behavior perfectly — it offers limited value when choices are actually stochastic (which may be the case in many situations; Drugowitsch et al., 2016) or the environment is complex. In these cases, the best possible likelihood per trial is less than 1, but it is not known what the best possible likelihood per trial could be. For this reason, although the likelihood per trial can be a useful tool for model validation (Leong et al., 2017), interpreting it as an absolute measure of model fit is of limited value.

A better method to validate a model is to simulate it with the fit parameter values (Palminteri et al., 2017; Nassar and Frank, 2016; Navarro, 2019), a procedure long performed by statisticians as part of the ‘posterior predictive check’ (Roecker, 1991; Gelman et al., 1996). You should then analyze the simulated data in the same way that you analyzed the empirical data, to verify that all important behavioral effects are qualitatively and quantitatively captured by the simulations with the fit parameters. For example, if you observe a qualitative difference between two conditions empirically, the model should reproduce it. Likewise, if a learning curve reaches a quantitative asymptote of 0.7, simulations shouldn’t reach a vastly different one.

Some researchers analyze the posterior prediction of the model conditioned on the past history, instead of simulated data. In our previous notation, they evaluate the likelihood of choice c t given past data, d 1 : t - 1 , where the past data includes choices made by the subject, not choices made by the model, p ⁢ ( c t | d 1 : t - 1 , s t , 𝜽 m , m ) . In some cases, this approach leads to very similar results to simulations, because simulations sample choices on the basis of a very similar probability, where the past data, d 1 : t - 1 , include choices made by the model. However, it can also be dramatically different if the path of actions sampled by the participant is widely different from the paths likely to be selected by the model (leading to very different past histories).

Palminteri and colleagues (Palminteri et al., 2017) offer a striking example of this effect, where Model A fits better than Model B by any quantitative measure of model comparison, but is completely unable to capture the essence of the behavior. In their example, data are generated with a reinforcement learning agent (which takes the place of the subject) on a reversal learning task (where a choice that was previously good becomes bad, and reciprocally). These data are then fit with either a win-stay lose-shift model (model B), or a simplistic choice kernel model, which assumes that previous choices tend to be repeated (model A). Because of the autocorrelation in the choices made by the reinforcement learning agent, model A, which tends to repeat previous actions, fits better than model B, whose win-stay-lose-shift choices only depend on the action and outcome from the last trial. However, model A is completely insensitive to reward, and thus is unable to generate a reversal behavior when it is simulated with the fit model parameters. Thus, in this case, model A should be discarded, despite a greater quantitative fit. Nevertheless, the fact that the best validating model B captures less variance than model A should serve as a warning that model B is missing crucial components of the data and that a better model probably exists. This should incite the researcher to go back to the drawing board to develop a better model, for example one that combines elements of both models or a different model entirely, and perhaps a better experiment to test it.

More generally, if your validation step fails, you should go back to the drawing board! This may involve looking for a better model, as well as redesigning the task. Be careful interpreting results from a model that is not well validated! Of course, exactly what it means for a model to ‘fail’ the validation step is not well defined: no model is perfect, and there is no rule of thumb to tell us when a model is good enough. The most important aspect of validation is for you (and your readers) to be aware of its limitations, and in which ways they may influence any downstream results.