Participants

Participants were recruited from a student population via mailing lists of local universities. Twenty-one participants took part in the behavioral study (15 female; mean ± standard deviation (SD): age = 25.8 ± 4.3 years). Thirty participants were recruited for the fMRI study. Two had to be excluded (one due to technical problems with the scanner and one due to excessive motion >4 mm), resulting in a final sample of 28 participants (13 female; age = 23.5 ± 3.9 years) for the fMRI study. Due to time constraints, one participant in the fMRI sample performed only eight of the ten sessions and another only nine. Participants were paid a show-up fee (behavioral sample: CHF 20; fMRI sample: CHF 50) plus a variable amount (see Instructions and task). Both groups performed exactly the same task; the only difference was that one group underwent MR scanning.

The study was conducted in accord with the Declaration of Helsinki and approved by the governmental research ethics committee (Kantonale Ethikkommission Zürich, KEK-ZH-Nr. 2013-0328). All participants gave written informed consent using a form approved by the ethics committee.

Mathematical framework of virtual foraging task

To probe sequential decision-making, we propose a toy scenario and a corresponding Markov decision process (MDP) for a hunter-gatherer or any foraging agent that aims at dynamically maintaining homeostasis over time. See Fig. 1 for an example trial and Fig. 2 for the corresponding transition matrix and the associated state transition diagram. See Table 1 and Table 2 for lists of all variables described in this and the following sections. The decision-making agent has to keep its internal energy state s above zero, i.e., the agent “dies from starvation” upon reaching energy state zero at any time step. Here, the energy state can take discrete values from one to five energy points (but our model easily extends to other numbers of energy states without loss of generality). At each time step t, or “day,” the agent can choose to “wait” and incur a sure loss c_w (of one energy point), or it can “forage,” in which case the agent probabilistically gains an amount g (of zero to four energy points) or incurs a cost for foraging c_f (of two energy points). We denote the probability of foraging success as p (i.e., the probability of gaining points during foraging). The probability of unsuccessful foraging, and thus of incurring c_f, is q = 1−p. The maximum energy is capped; in the highest energy state the agent cannot gain more but simply stays in the highest state if foraging is successful.

The agent “lives” within a given “forest” in which all relevant variables are specified (presented to participants during the “forest phase,” see Fig. 1). We included good and bad environmental conditions, denoted as “good and bad weather types,” that each occur with a probability of 0.5 on a given time step in a given forest. That is, per forest there are a total of 12 states in the MDP: 6 (energy states) × 2 (weather types).

A core component of an MDP is the transition matrix between these different states (see Fig. 2 for one example). Starvation, i.e., the state of having zero energy, is absorbing (and thus the entry in the transition matrix leading from energy state zero to energy state zero is 1). Waiting leads to a sure loss c_w of one point, which is formalized by the associated probabilities w in the off-diagonal below the main diagonal. Foraging can lead to either success (with probability f·p) or failure (with probability f·q). These probabilities constitute the entries in the associated off-diagonals (depending on the magnitudes of g and c_f).
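
To make this structure concrete, here is a minimal Python sketch of such per-action transition matrices over the six energy states for a single weather type (the original analyses were carried out in MATLAB; the function name `transition_matrices` and its defaults are ours):

```python
# Hypothetical sketch: energy states 0..5, state 0 absorbing, waiting costs
# c_w = 1, foraging gains g with probability p or costs c_f = 2 otherwise.
def transition_matrices(p, g, c_f=2, c_w=1, n_states=6):
    """Return (T_wait, T_forage) as row-stochastic nested lists."""
    wait = [[0.0] * n_states for _ in range(n_states)]
    forage = [[0.0] * n_states for _ in range(n_states)]
    wait[0][0] = forage[0][0] = 1.0          # starvation is absorbing
    for s in range(1, n_states):
        wait[s][max(s - c_w, 0)] = 1.0       # sure loss of one point
        # "+=" guards against gain and loss mapping to the same state
        forage[s][min(s + g, n_states - 1)] += p   # success, capped at max
        forage[s][max(s - c_f, 0)] += 1 - p        # foraging cost
    return wait, forage
```

Note how the `min(s + g, n_states - 1)` term implements the energy cap: in the highest state a successful forage simply leaves the agent there.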

What should the agent’s optimal policy look like? The agent should minimize the probability of starvation p_starve(n,s,t); i.e., it should minimize the probability of reaching zero energy points within a fixed and finite time horizon of n days when starting with the internal energy state s. In our case, the instructed finite time horizon n is always 5 days, and the starting energy state s on the first day can be 2, 3, or 4 energy points. Participants were incentivized accordingly, i.e., they received a monetary payoff (for a random subset of forests) if their energy was above zero on day 5 and nothing otherwise. This corresponds to a simple implementation of a reward function within the framework of MDPs: all transitions to zero are associated with a “reward” of −1 and all other transitions with a reward of 0.

The starvation probability p_starve(n,s,t), and thus the optimal policy, depends on the agent’s choice at the current time step t and at the next n−1 time steps, in addition to its dependence on n and s. In our finite-horizon scenario, the starvation probability thus depends on the number of time steps t, i.e., days within a forest.

A rough intuition for a prototypical optimal policy under such circumstances is that in many types of forests the agent should forage when the weather is good and wait when the weather is bad, unless waiting leads to sure death on the next time step. When the energy state is high enough that starvation is impossible within the remaining number of days (e.g., the energy state is 4 and only 1 day is left of the 5 days within a forest), the agent is indifferent between the two choice options.

Derivation of optimal policy in Markov decision process

We derived the optimal policies analytically for a large number of different forests, i.e., for all combinations of p (from 0.1 to 0.9 in steps of 0.1) and gains g (from 0 to 4 in steps of 1) for a finite time horizon of n = 5 days. In our MDP, the possible policies consist of the probabilities of foraging f in each energy state s, environmental condition (“weather”), and each time step t (with the probabilities of waiting w = 1−f). To derive optimal policies in our finite-horizon scenario, we used backward induction: we started from the final time step (i.e., day 5) and calculated the values of the two choice options (i.e., foraging or waiting) for each state at that final time step. These values depend on the possible transitions from the respective states (see the example transition matrix in Fig. 2a and the corresponding example state transition diagram in Fig. 2b). If the value of foraging is higher than the value of waiting, foraging is the deterministically better option in that state and at that time step, or vice versa. If both choice options have the same value, the optimal choice is indifferent between the two options. We then used the maximum of the two choice options’ values to calculate the values for the second-to-last time step and determined optimal choices. This procedure was repeated until arriving at the first time step (i.e., day 1).
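
The backward-induction scheme can be sketched in a few lines of Python (the original derivations were done in MATLAB; the function name `optimal_policy`, the state encoding, and the explicit equiprobable-weather parameterization are ours). Values here are survival probabilities, which is equivalent to the −1/0 reward formulation up to an affine transformation:

```python
def optimal_policy(weathers, horizon=5, c_f=2, c_w=1, n_states=6):
    """Backward induction on survival probability.
    weathers: list of (p_success, gain) pairs, each occurring with
    equal probability on every day."""
    # terminal values: alive (energy > 0) at the horizon counts as survival
    V = [1.0 if s > 0 else 0.0 for s in range(n_states)]
    policy = {}                            # (day, s, weather) -> action
    for day in range(horizon, 0, -1):      # final day first, day 1 last
        V_new = [0.0] * n_states           # state 0 stays at value 0
        for s in range(1, n_states):
            for w, (p, g) in enumerate(weathers):
                q_wait = V[max(s - c_w, 0)]
                q_forage = (p * V[min(s + g, n_states - 1)]
                            + (1 - p) * V[max(s - c_f, 0)])
                if q_forage > q_wait:
                    act = 'forage'
                elif q_wait > q_forage:
                    act = 'wait'
                else:
                    act = 'indifferent'
                policy[(day, s, w)] = act
                # value before weather is revealed: average over weathers
                V_new[s] += max(q_wait, q_forage) / len(weathers)
        V = V_new
    return policy, V
```

Here `policy[(day, s, w)]` gives the prescribed action (with `day = horizon` being the final day), and `V[s]` the survival probability when starting day 1 with energy s.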

The optimal policy thus depends on the time horizon considered. Our task was designed such that participants should use a time horizon of 5 time steps, and participants were instructed and incentivized accordingly. Nevertheless, participants might use a different time horizon. Therefore, we calculated optimal policies for the following time horizons: n = 7, n = 6, …, n = 1. Participants may employ a longer (i.e., 7 or 6 days) or shorter (i.e., 4, 3, 2, or 1 days) time horizon than instructed. Time horizons are flexible in the sense that participants are assumed to start in a new forest with a given time horizon n and then to reduce their horizon by one on each day. Once the horizon has reached one (i.e., n = 1), it remains one. Supplementary Fig. 10 provides an illustration of how the proportions of actions prescribed by the optimal policy change according to different finite horizons.
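
The assumed day-by-day shrinking of the horizon, floored at one, can be written as a small hypothetical helper (Python sketch; the name `effective_horizon` is ours):

```python
def effective_horizon(n, day):
    """Horizon assumed at forest entry is n; it shrinks by one per day
    (day counted from 1) and never drops below one."""
    return max(n - (day - 1), 1)
```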

Since the optimal policies binarize the value differences between the two choice options (either foraging or waiting is better, or they are exactly the same), they do not allow for variability in the decision process (i.e., in some cases waiting and foraging entail large value differences, whereas in other cases the two choice options have quite similar values). We therefore used the continuous value differences between the two choice options as predictors of participants’ choices, RTs, and fMRI data. For brevity, we often use the term “optimal policy” to refer to the value differences between foraging and waiting (according to the normative horizon of 5 steps in our task).

All calculations were carried out in MATLAB.

Instructions and task

Participants received detailed written task instructions, in which the different task components were introduced step by step (Supplementary Note 1). For each mini-block of trials, called a “forest,” participants were first presented with a screen that depicted the foraging options for the two weather types (3.5 s; Fig. 1). Foraging options were illustrated as a grid with ten subfields. Each of the ten subfields had a probability of 0.1 of becoming realized. The number of colored dots within the subfields indicated the foraging costs (always 2 points) or foraging gains (ranging from 0 to 4). Sides were counterbalanced for the two weather types. An energy bar at the bottom of the screen depicted the starting energy state (which was 2, 3, or 4 points). After a fixation period (3.5 s), participants had to choose between one of the two foraging options (which represented either good or bad weather) and waiting, which was symbolized by a single dot depicting the sure energy loss of one point. Sides were counterbalanced for the two options. Participants had a maximum of 2 s to make their choice, which was then indicated by an asterisk. If they failed to respond or pressed a wrong button, the words “Too slow or wrong button” appeared. After an interval of 1 s, participants saw the outcome of their choice (1 s): if they had foraged, one of the ten subfields turned yellow and the corresponding energy points were added to or subtracted from their energy bar. If they had waited, one point was subtracted from the energy bar. After a variable fixation interval (between 0.5 and 3.8 s), a new day or a new forest was depicted. Participants were not explicitly cued about the current day within a forest, but they knew that they always remained a maximum of five days within a forest (so they could count down the number of days when entering a new forest).
If participants had died from starvation, an empty energy bar was depicted and they were asked to press one of the two buttons to elicit a motor response. The task was presented using the MATLAB toolbox Cogent (www.vislab.ucl.ac.uk).

Participants performed a short first training session of four forests with five days. To ensure that participants could get accustomed to looking five days ahead, they performed a second training session, in which they remained for five days in each of 24 forests. Participants performed ten sessions of the actual experimental task either on a PC (behavioral sample) or in the MR scanner (fMRI sample).

Participants were incentivized to avoid starvation over five days in a forest. To strike a balance between design efficiency and the requirement to make participants integrate over a number of future time steps, the number of days in a forest followed an exponential distribution with a mean of 2.5 resulting in 40 days in 24 forests per session. Importantly, participants were instructed to always consider staying five days in a forest and were monetarily incentivized accordingly. In the end, we randomly selected one of the 24 forests for each of the ten sessions. To determine participants’ payment, they were again presented with the two weather types and their current energy state and completed the 5 days within these forests. For each forest in which they survived (regardless of the precise energy state) they received additional CHF 1.50.

Analyses of choice data and model comparison procedures

Of the total possible number of 400 days, participants had reached the starvation state on 22.2 ± 6.0 trials (behavioral sample) and 21.1 ± 5.9 trials (fMRI sample). Of the remaining trials, they did not make a response within the time limit on 8.6 ± 10.7 trials (behavioral sample) and 6.4 ± 8.6 trials (fMRI sample). This left 369.1 ± 13.0 trials (behavioral sample) and 368.3 ± 18.8 trials (fMRI sample) for analyses.

We used logistic regression models (implemented in the MATLAB function mnrfit) of the following form:

$$p_{{\mathrm{forage}}} = \frac{1}{{1 + e^{ - {\rm DV}}}},$$

with the following form of the decision variable DV:

$${\rm DV} = \beta _0 + \beta _1 \ast {\rm policy}.$$
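
The original fits used MATLAB’s mnrfit; as a sketch, the logistic link that turns the decision variable into a foraging probability looks as follows in Python (coefficient values would come from the maximum-likelihood fit):

```python
import math

def p_forage(policy_value, beta0, beta1):
    """Logistic choice rule: DV = beta0 + beta1 * policy,
    p_forage = 1 / (1 + exp(-DV))."""
    dv = beta0 + beta1 * policy_value
    return 1.0 / (1.0 + math.exp(-dv))
```

A positive beta1 makes foraging more likely as the policy variable increases; at DV = 0 the model predicts indifference (p = 0.5).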

As policy, we first considered the following potential optimal or heuristic policies (see Table 1 and Table 2 for annotated lists): (1) the optimal policy according to a time horizon of 5 days (h-5). In contrast to the optimal policy, the following variables should be regarded as heuristics since they do not necessitate integration over the relevant future time points: (2) momentary probability of foraging success p; (3) momentary magnitude of foraging gain g; (4) expected value (EV) of foraging; (5) current continuous energy state s (ranging from 1 to 5 points); and (6) binary energy state (indicating whether waiting would lead to sure death or not). Since waiting leads to sure death when the continuous energy state is one, the optimal policy never prescribes waiting in these situations (i.e., even when the probability of foraging success is small, the optimal policy favors the non-zero survival probability of foraging over certain starvation). Therefore, at a continuous energy state of one, the prescriptions of the binary energy state variable are always in line with the optimal policy. We also included the following variables: (7) current weather type; (8) number of days passed in a forest; (9) change between past and current energy states (within and across forests); and (10) a type of “win-stay-lose-shift” (WSLS) strategy, which prescribes foraging if the energy state increased on the past trial (i.e., after a foraging win) and waiting if the energy state decreased (i.e., after waiting and after a foraging loss). In other words, WSLS is a binarized version of the change between past and current energy states. Additionally, we tested for a myopic optimal policy: (11) the optimal policy according to a time horizon of 1 day (h-1).

Since the probability of foraging success emerged as the best single predictor of participants’ choices, we tested whether any other decision variable explained remaining variance in models that included the probability of foraging success, p, as the first predictor and one of the other policies as the second predictor (Supplementary Tables 1, 2):

$${\rm DV} = \beta _0 + \beta _1 \ast p + \beta _2 \ast {\rm policy}.$$

According to the same logic, we tested all possible models with two predictors (Supplementary Tables 3, 4) and additionally models with three predictors (Supplementary Tables 1, 2). We also tested interaction models of the general form (Supplementary Tables 1, 2):

$${\rm DV} = \beta _0 + \beta _1 \ast p + \beta _2 \ast {\rm policy} + \beta _3 \ast p \ast {\rm policy}.$$

For each model we approximated model evidence by calculating the Bayesian Information Criterion (BIC), which penalizes model complexity. We performed both fixed-effects and random-effects analyses. The latter assume that different participants may use different models. We used the Bayesian model selection (BMS) procedure implemented in SPM12 (http://www.fil.ion.ucl.ac.uk/spm/) to calculate protected exceedance probabilities, which measure how likely it is that any given model is more frequent than all other models in the population53.
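
As a sketch of the approximation used, the BIC combines a model’s maximized log-likelihood with a complexity penalty that grows with the number of parameters k and observations n (the function name `bic` is ours):

```python
import math

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L).
    Lower values indicate better (complexity-penalized) model evidence."""
    return k * math.log(n) - 2.0 * log_likelihood
```

For two models with the same log-likelihood, the one with fewer parameters thus receives the lower (better) BIC.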

Calculation of variables related to participants’ choices

Both RTs and brain activity may also relate to the choice uncertainties of the employed decision variables. Choice uncertainties can be quantified by the logistic functions that relate participants’ empirical choices to the decision variables (cf. Fig. 3; Supplementary Fig. 3). That is, choice uncertainty is higher at the inflection point of the logistic function (i.e., at the point at which participants are indifferent between the two choice options) than at the ranges of the decision variable where the values of the logistic function are close to zero (prescribing waiting) or close to one (prescribing foraging). As metrics of choice uncertainties, we therefore took the derivatives of the logistic functions related to the two variables included in the winning model of participants’ choices: the probabilities of foraging success and the value differences according to the optimal policy (with a horizon of 5 days). To base the quantification of choice uncertainties for the fMRI sample on independent data, we used the mean parameter estimates of the choice models from the behavioral sample to derive the logistic functions.
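
Concretely, this uncertainty metric is the slope of the fitted logistic with respect to the decision variable, which peaks at the indifference point (a Python sketch with hypothetical coefficients; the original computations were in MATLAB):

```python
import math

def choice_uncertainty(x, beta0, beta1):
    """Slope of the fitted logistic w.r.t. the decision variable x:
    beta1 * p * (1 - p), maximal at the inflection (indifference) point."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return beta1 * p * (1.0 - p)
```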

To investigate the interplay between the two variables of the winning choice model, we were specifically interested in the discrepancies in the prescriptions of the two decision variables. Therefore, we took the absolute differences between the predicted choices under the heuristic of using the probabilities of foraging success and under the value differences according to the optimal policy. That is, we calculated the absolute differences between the points on the logistic functions derived from the mean parameter estimates of the choice models from the behavioral sample. We refer to this metric as “discrepancy” between the two policies.
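
A minimal sketch of this discrepancy metric (Python; the coefficient pairs stand in for the mean parameter estimates from the behavioral sample):

```python
import math

def logistic(x, b0, b1):
    """Logistic function with intercept b0 and slope b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def discrepancy(p_trial, value_diff, coefs_p, coefs_opt):
    """Absolute difference between the choice probabilities predicted from
    the foraging-success heuristic and from the optimal-policy values."""
    return abs(logistic(p_trial, *coefs_p) - logistic(value_diff, *coefs_opt))
```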

We included choice uncertainties and discrepancies as trial-by-trial estimates in RT and fMRI models (see below). Additionally, we explored whether choice uncertainties and discrepancy guided the use of the decision variables included in the winning model. Specifically, we tested the following interaction models (Supplementary Tables 1, 2):

$${\rm DV} = \beta _0 + \beta _1 \ast p \ast (1 - {\rm choice\;uncertainty\;of}\;p) + \beta _2 \ast {\rm optimal\;policy}$$

$${\rm DV} = \, \beta _0 + \beta _1 \ast p + \beta _2 \ast {\rm optimal\;policy} \\ \ast (1 - {\rm choice\;uncertainty\;of\;optimal\;policy)}$$

$${\rm DV} = \, \beta _0 + \beta _1 \ast p \ast (1 - {\rm choice\;uncertainty\;of}\;p) + \beta _2 \ast {\rm optimal\;policy} \\ \ast (1 - {\rm choice\;uncertainty\;of\;optimal\;policy)}$$

$${\rm DV} = \beta _0 + \beta _1 \ast p \ast {\rm discrepancy} + \beta _2 \ast {\rm optimal\;policy}$$

$${\rm DV} = \beta _0 + \beta _1 \ast p + \beta _2 \ast {\rm optimal\;policy \ast discrepancy}.$$

For additional analyses, we used the relevant logistic functions derived from mean parameter estimates (from the behavioral sample) to select trials in which the heuristic policies of using the probability of foraging success and of using the EV made opposite prescriptions for choice. That is, we binarized the prescriptions of the two policies by splitting them into those above and below the midpoint of 0.5. Logistic functions were derived from the mean parameter estimates of the behavioral sample (Supplementary Note 2).
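
The trial-selection rule described above can be sketched as follows (Python; trial values and coefficients are hypothetical):

```python
import math

def logistic(x, b0, b1):
    """Logistic function with intercept b0 and slope b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def opposing_trials(trials, coefs_a, coefs_b):
    """Indices of trials where the two policies' binarized prescriptions
    (split at the 0.5 midpoint) disagree; trials: list of (value_a, value_b)."""
    return [i for i, (a, b) in enumerate(trials)
            if (logistic(a, *coefs_a) > 0.5) != (logistic(b, *coefs_b) > 0.5)]
```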

To analyze subsets of trials, in which different horizons of the optimal policy made opposing prescriptions, we simply selected trials in which the (a priori computable) optimal policy for a horizon of 5 time steps prescribed foraging and the optimal policy for another time horizon prescribed waiting; or vice versa (Supplementary Table 5; Supplementary Fig. 10).

Models of RTs

We analyzed log-transformed RTs using linear mixed effects models as implemented in the R package lme4 (function lmer)54 (http://cran.r-project.org/web/packages/lme4/index.html). The independent variables in the mixed effects model were the probabilities of foraging success, the value differences of foraging vs. waiting according to the optimal policy (with a horizon of five days), the choice uncertainties of the two policies, and the discrepancies between the prescriptions of the two policies. Random effects for participants included a random intercept and random slopes for all variables. Since we had no hypotheses for interactions, we did not include any interaction terms as fixed- or random-effects. Significance levels of the fixed effects were determined by performing likelihood-ratio tests, which compared the full model to models without the respective factor (Supplementary Table 6). Additionally, we performed the same analyses using untransformed RTs, which yielded qualitatively similar results (mean ± SD of the skewness of the RT distributions: behavioral sample: untransformed 0.751 ± 0.404; log-transformed: −0.067 ± 0.928; fMRI sample: untransformed: 0.873 ± 0.335; log-transformed −0.038 ± 0.843).

To compare RTs in a subset of trials in which the heuristic policy of using the probabilities of foraging success and the value differences according to the optimal policy made opposing prescriptions, we binarized the prescriptions of the two policies by splitting them into those above and below the midpoint of 0.5. Logistic functions were derived from the mean parameter estimates of the behavioral sample. We then selected trials in which the two policies differed (mean ± SD of resulting number of trials per participant: fMRI sample: 70.4 ± 8.1; behavioral sample: 70.4 ± 6.8; the minimum number of trials included per participant was 47 trials).

MRI data acquisition

Data were recorded in a 3 T MRI scanner (Philips Achieva, Best, The Netherlands) using a 32-channel head coil. Functional images were recorded using a T2*-weighted echo-planar imaging (EPI) sequence (TR 2.1 s; TE 30 ms; flip angle 80°). A total of 37 axial slices were sampled for whole-brain coverage (matrix size 96 × 96; in-plane resolution 2.5 × 2.5 mm²; slice thickness 2.8 mm; 0.5 mm gap between slices; slice tilt 0°). Imaging data were acquired in ten runs of 170 volumes each. The first five volumes of each run were discarded to obtain steady-state longitudinal magnetization. Field maps were acquired with a double-echo gradient-echo field map sequence, using 32 slices covering the whole head (TR 349.11 ms; TE 4.099 and 7.099 ms; matrix size 80 × 80; in-plane resolution 3 × 3 mm²; slice thickness 3 mm; 0.5 mm gap between slices; slice tilt 0°). Anatomical images were acquired using a T1-weighted scan (FoV 255 × 255 × 180 mm; voxel size 1 × 1 × 1 mm³).

Preprocessing and nuisance regressors

All fMRI analyses were performed in SPM12. The FieldMap toolbox was used to correct for geometric distortions caused by susceptibility-induced field inhomogeneities. Preprocessing of EPI data included rigid-body realignment to correct for head movement, unwarping, and slice time correction. EPI images were then coregistered to the individual’s T1-weighted image using a 12-parameter affine transformation and normalized to the Montreal Neurological Institute (MNI) T1 reference brain template. Normalized images were smoothed with an isotropic 8 mm full width at half-maximum Gaussian kernel.

We corrected for physiological noise using RETROICOR as implemented in the MATLAB PhysIO toolbox55, 56 (open source code available as part of the TAPAS software collection: http://www.translationalneuromodeling.org/tapas/). We collected electrocardiogram (ECG), pulse oximeter, and breathing belt data during scanning. After quality checks, we used ECG and breathing belt data for noise correction in 19 participants, pulse oximeter and breathing belt data in two participants, and no physiological noise correction in six participants. The corresponding confound regressors as well as the six motion correction parameters estimated from the realignment procedure were entered as covariates of no interest. Regressors were convolved with the canonical HRF, and low-frequency drifts were excluded using a high-pass filter with a 128 s cutoff.

General linear models

The three distinct phases of the task (forest, choice, and outcome phases; see Fig. 1) were entered as events with a duration of 0 s (i.e., as stick functions) into the GLMs. Choice and outcome phases in which participants had starved or in which they did not respond were not explicitly modeled.

We were mostly interested in the choice phase and ran a main GLM with a combination of variables that emerged in our analyses of behavioral and RT data. Specifically, the following variables were entered on a trial-by-trial basis as parametric modulators of the choice phase: the probabilities of foraging success, the value differences of foraging vs. waiting according to the optimal policy (with a horizon of 5 days), the choice uncertainties of the two policies, the discrepancies between the prescriptions of the two policies, and RTs. A second GLM included participants’ choices (as a binary variable) as a parametric modulator in addition to the variables entered in the main GLM. A third GLM only included participants’ choices as a single parametric modulator. A fourth GLM included parametric modulators in terms of chosen (and unchosen) options (and not in terms of the presented options as in the first three GLMs). Specifically, this fourth GLM contained a parametric modulator for the chosen values of the employed heuristic (i.e., the current probability of foraging success for trials in which the foraging option was chosen and a probability of zero when the waiting option was chosen) and a parametric modulator for the chosen values according to the optimal policy (i.e., the current values of the foraging or the waiting options). It also contained parametric modulators for the corresponding unchosen values, along with parametric modulators for participants’ choices and RTs.

In all four GLMs, the forest phase was parametrically modulated by the current energy state and the overall starvation probability across five days. The outcome phase was parametrically modulated by the change in energy state.

We report analyses in which regressors competed for variance (i.e., without serial orthogonalization). In the main GLM, the maximum average shared variance was below 0.5 for all combinations of regressors.

We performed a second-level one-sample t-test on contrast images from all participants. All reported clusters are family-wise error (FWE) corrected for multiple comparisons at p < 0.05 using the SPM random field theory based approach. The cluster-defining threshold was p < 0.001. At this voxel-inclusion threshold the random-field theory approach in SPM correctly controls the false positive rate57.

Data availability

The behavioral data that support the findings of this study are publicly available at https://figshare.com/s/1a4d75cb4176a3fef040. The neuroimaging data that support the findings of this study are publicly available at https://neurovault.org/collections/3242/.