Participants

Experimental procedures were approved by the University of New South Wales Human Research Ethics Committee (HREC#: HC12030). All methods in this study were performed in accordance with the guidelines and regulations of the Australian National Statement on Ethical Conduct in Human Research (https://www.nhmrc.gov.au/guidelines-publications/e72). All participants gave informed consent to participate in the experiment. For the fMRI experiment, we tested 14 participants (9 females, aged 29.1 ± 1.1 years old, mean ± SEM). We selected the sample size based on both estimations of effect sizes and the number of participants used in previous studies employing decoding to track brain signals predictive of subsequent decisions7,8,9. These studies tested between 8 and 14 participants; we therefore used the upper bound of this range to maximize the reliability of the results. We also performed power analyses, based on effect size estimations using G*Power 346, to corroborate that this number of participants was adequate to achieve a power of at least 0.8. Soon et al.'s study on the pre-volitional determinants of decision making9 tested 14 participants, achieving a power of 0.812 in the time-resolved decoding analysis, while Bannert and Bartels' study on perception-imagery cross-decoding generalization tested 8 participants30. Post hoc effect size analysis revealed that the latter study would have needed 12 participants to achieve a power of 0.8. For the behavioral free decision and cued imagery priming task, we invited all 14 original participants to take part in this psychophysics experiment. Only 8 participants (4 females, aged 29.3 ± 0.5 years old) were able to come back to complete this new experiment.
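
As an illustration, a power calculation of this kind can be sketched in MATLAB with sampsizepwr (Statistics and Machine Learning Toolbox); this is not the G*Power analysis used in the study, and the effect-size values below (mean accuracy, SD) are hypothetical placeholders:

    % Power analysis sketch (MATLAB Statistics Toolbox; the study used G*Power 3).
    % Hypothetical effect: mean decoding accuracy of 55% vs. chance (50%), SD = 6%.
    mu0   = 50;   % chance-level accuracy (%)
    sigma = 6;    % assumed across-participant SD (%), illustrative only
    mu1   = 55;   % assumed true mean accuracy (%), illustrative only

    % Power achieved with n = 14 participants (one-tailed one-sample t-test)
    pwr = sampsizepwr('t', [mu0 sigma], mu1, [], 14, 'Tail', 'right');

    % Smallest n needed to reach a power of 0.8
    n80 = sampsizepwr('t', [mu0 sigma], mu1, 0.8, [], 'Tail', 'right');
    fprintf('power(n = 14) = %.3f, n for 0.8 power = %d\n', pwr, n80);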

fMRI free decision visual imagery task

We instructed participants to choose between two predefined gratings (horizontal green/vertical red or vertical green/horizontal red, counterbalanced across participants), which were familiar to the participants from prior training sessions. We asked the participants to refrain from following preconceived decision schemes. In the scanner, participants were provided with two dual-button boxes, one held in each hand. Each trial started with a prompt reading: “take your time to choose – press right button” for 2 seconds (Fig. 1). After this, a screen containing a fixation point was shown while the participant decided what to imagine. This period is referred to as the “pre-imagery time” and was limited to 20 seconds. Participants were instructed to settle their mind before deciding. Participants pressed a button with the right hand as soon as they decided which grating to imagine. Participants reported that in some trials they felt in control of their decision, whereas in other trials one of the gratings just “popped out” in their mind. Importantly, participants were instructed to press the button as soon as possible once they reached the decision or a grating appeared in their mind. After the button press, the fixation point became brighter for 100 ms, indicating to the participants that the imagery onset time had been recorded. During the imagery period (10 seconds), participants were instructed to imagine the chosen pattern as vividly as possible, trying, if possible, to project it onto the screen. At the end of the imagery period, a question appeared on the screen: “what did you imagine? – Left for vertical green/red – Right for horizontal red/green” (depending on the pre-assigned patterns for the participant). After giving the answer, a second question appeared: “how vivid was it? – 1 (low) to 4 (high)”, to which participants answered using 4 different buttons. After each trial, there was a blank interval of 10 seconds during which we instructed the participants to relax and try not to think about the gratings or any subsequent decisions. Systematic post-experiment interviews revealed that some participants (n = 4) could not help thinking about the gratings in some trials during the inter-trial interval. They reported different strategies to avoid these thoughts, such as ignoring them, replacing them with another image/thought, or choosing the other grating when the decision came. The remaining participants (n = 10) reported not having any thoughts or mental images about the gratings during the rest period. We tested whether the effects we found could be explained by the former group of participants, who could not refrain from thinking about the gratings. We thus repeated the analysis using only data from the participants who did not think about/imagine gratings outside the imagery period (n = 10). Fig. S10 shows the results of this control. Results are comparable to those shown in Fig. 2, ruling out the possibility that the effects we report were driven by the 4 participants who had thoughts about the gratings in the rest period. We delivered the task in runs of 5 minutes, during which the participants completed as many trials as possible. Participants chose to imagine horizontal and vertical gratings with similar probability (50.44% versus 49.56% for vertical and horizontal gratings respectively, mean Shannon entropy = 0.997 ± 0.001 SEM) and showed an average probability of switching gratings from one trial to the next of 58.59% ± 2.81 SEM.
Participants completed on average 7.07 runs each, with each run containing an average of 9.2 trials.
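
For illustration, the choice entropy and switch probability reported above can be computed as in the following MATLAB sketch; the choices vector is hypothetical example data, not data from the study:

    % Choice statistics sketch: Shannon entropy of grating choices and
    % trial-to-trial switch probability for one participant. 'choices' is a
    % hypothetical vector of 0/1 values (0 = horizontal, 1 = vertical).
    choices = [1 0 0 1 1 0 1 0 1 1];          % example data, illustrative only

    p = mean(choices);                         % proportion of vertical choices
    H = -p*log2(p) - (1-p)*log2(1-p);          % Shannon entropy (1 = maximally random)

    switchProb = mean(diff(choices) ~= 0);     % probability of switching gratings
    fprintf('entropy = %.3f, switch probability = %.2f%%\n', H, 100*switchProb);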

Behavioral imagery onset reliability experiment

Since the self-report of the onset of decisions has been criticized as unreliable and of unknown variance17, we developed a novel independent psychophysics experiment to test its reliability. We objectively measured imagery strength as a function of time for a subset of the participants from the fMRI experiment. Importantly, the results of this experiment revealed that the reported onsets of decisions are indeed reliable relative to the temporal resolution of fMRI (Fig. 3).

We employed two conditions: free decision (freely chosen imagined stimulus and imagery onset) and cued (i.e., imposed imagined stimulus and imagery onset); see Fig. 3A for a schematic of the paradigm. We used binocular rivalry priming as a means to objectively measure sensory imagery strength18,47,48. When one of the two competing rivalry stimuli is imagined prior to a binocular rivalry presentation, rivalry perception is biased towards the imagined stimulus, with greater levels of priming as imagery time increases18; see refs 18,28 for discussion of why this is an objective measure of imagery strength rather than of visual attention, binocular rivalry control, or response bias. We asked participants to imagine one of the rivalry gratings for different durations and then measured rivalry priming as a function of these imagery durations (Fig. 3B). We reasoned that if participants reported the onset of imagery a few seconds after they actually started imagining, this would be detected as an increase in priming compared to the condition in which the onset of imagery is controlled by the experimenter. Thus, in the free decision condition, participants freely chose to imagine one of the two predefined gratings (horizontal green/vertical red or vertical green/horizontal red, counterbalanced across participants). In the cued condition, participants were presented with a cue indicating which grating to imagine, thus imposing the onset of imagery as well as the grating to be imagined. Each trial started with the instruction “press spacebar to start the trial” (Fig. 3A). Then, either the instruction “CHOOSE” or a cue indicating which grating to imagine (e.g., “horizontal red”) was presented for 1 second. In the free decision condition, the imagery time started after the participant chose the grating to imagine, which they indicated by pressing a key on the computer keyboard (Fig. 3A). In the cued imagery condition, the imagery time started right after the cue disappeared (i.e., there was no decision time). We tested 3 imagery times (3.33, 6.67 and 10 seconds). After the imagery time, a high-pitched sound was delivered (200 ms) and both gratings were presented through red/green stereo glasses at fixation for 700 ms. Participants then had to report which grating was dominant (i.e., horizontal red, vertical green, or mixed if no grating was dominant) by pressing different keys. After this, they had to report which grating they had imagined (in both free decision and cued trials). Participants then rated their imagery vividness from 1 (low) to 4 (high) by pressing one of 4 keys. Free decision and cued trials, as well as imagery times, were pseudo-randomized within blocks of 30 trials. We added catch trials (20%), in which the gratings were physically fused and equally dominant, to check the reliability of the self-reports18,49. We tested 120 trials for each of the free decision and cued imagery conditions (40 trials per time point), plus 48 catch trials evenly divided among time points.

Raw priming values were calculated as the number of trials in which the dominant grating in binocular rivalry was congruent with the imagined grating (e.g., imagined vertical led to vertical dominance in binocular rivalry) divided by the total number of trials, excluding mixed binocular dominance (piecemeal) trials, for each time point and condition independently. Raw vividness values were calculated as the average per time point and condition, excluding mixed perception trials. Priming and vividness were normalized as z-scores within participants and across time points and conditions, to account for baseline differences across participants while preserving relative differences among conditions and time points. The reliability of rivalry dominance self-reports was verified with fake rivalry catch trials, in which the gratings were physically fused and equally dominant; these were reported as mixed above chance level (83.8%, p = 0.002, one-sample t-test against baseline). Priming and vividness z-scores were subjected to a one-way ANOVA to detect main effects of condition. We also performed post-hoc two-sample t-tests to verify that priming and vividness scores differed significantly between time points (Fig. 3).
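
The following MATLAB sketch illustrates the priming calculation and within-participant z-scoring for one participant; the variables imagined, dominant, and t are hypothetical inputs, not the study’s actual data structures:

    % Priming score sketch for one participant. Hypothetical inputs:
    %   imagined: imagined grating per trial (1 or 2)
    %   dominant: dominant rivalry grating per trial (1, 2, or 0 for mixed)
    %   t:        time-point index per trial (1 = 3.33 s, 2 = 6.67 s, 3 = 10 s)
    valid   = dominant ~= 0;                   % exclude mixed (piecemeal) trials
    priming = zeros(1, 3);
    for tp = 1:3
        sel         = valid & (t == tp);
        priming(tp) = mean(imagined(sel) == dominant(sel));  % congruent / total
    end

    % z-score within participant, across time points (and conditions, in the
    % full analysis), preserving relative differences between them
    primingZ = (priming - mean(priming)) ./ std(priming);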

We tested this independent behavioral experiment on 8 participants from the fMRI experiment (all 14 original participants were invited but only 8 were able to come back), who had extensive experience as subjects in psychophysics experiments. We further sought to test whether these results would generalize to completely inexperienced participants who did not take part in the fMRI experiment (N = 10). We did not, however, find a significant increase in priming or vividness as a function of time in this group (Fig. S11), suggesting that this is a highly demanding task and that experience in psychophysics might be important to perform it properly (i.e., being able to hold the mental image for the duration of the imagery time).

Functional and structural MRI parameters

Scans were performed at the Neuroscience Research Australia (NeuRA) facility, Sydney, Australia, in a Philips 3T Achieva TX MRI scanner using a 32-channel head coil. Structural images were acquired using a turbo field echo (TFE) sequence consisting of 256 T1-weighted sagittal slices covering the whole brain (flip angle = 8 deg, matrix size = 256 × 256, voxel size = 1 mm isotropic). Functional T2*-weighted images were acquired using an echo planar imaging (EPI) sequence with 31 slices (flip angle = 90 deg, matrix size = 240 × 240, voxel size = 3 mm isotropic, TR = 2000 ms, TE = 40 ms).

fMRI perception condition

We presented counter-phase flickering gratings at 4.167 Hz (70% contrast, ~0.5 degrees of visual angle per cycle). They were presented in their respective predefined colors and orientations (horizontal green/vertical red or vertical green/horizontal red). The gratings were convolved with a Gaussian-like 2D kernel to obtain smooth-edged circular gratings. Gratings were presented inside a rectangle (the same as used in the imagery task, Fig. 1) and a fixation point was drawn at the center (as in the imagery task). Within a run of 3 minutes, we presented the flickering patterns in a block design, interleaved with fixation periods (15 seconds each). Importantly, participants performed an attention task consisting of detecting a change in fixation-point brightness (+70% for 200 ms). Between 1 and 4 fixation changes were randomly allocated within each run. Participants were instructed to press any of the 4 buttons as soon as they detected a change. Participants showed high performance in the detection task (d-prime = 3.33 ± 0.13 SEM).
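
A smooth-edged grating of this kind can be sketched in MATLAB by windowing a sinusoid with a Gaussian-like envelope, a common construction; the image size, spatial frequency in pixels, and envelope width below are illustrative values, not the exact stimulus parameters:

    % Smooth-edged grating sketch (illustrative parameters). A sinusoidal
    % grating is windowed by a Gaussian-like envelope to soften the edges;
    % counter-phase flicker is obtained by inverting the contrast at 4.167 Hz.
    sz       = 256;                          % image size in pixels, illustrative
    cyclePix = 32;                           % pixels per cycle, illustrative
    [x, y]   = meshgrid(linspace(-1, 1, sz));

    grating  = sin(2*pi*(sz/cyclePix/2) * y);    % horizontal grating
    envelope = exp(-(x.^2 + y.^2) / 0.3);        % Gaussian-like circular window
    stim     = 0.7 * grating .* envelope;        % 70% contrast pattern

    % One flicker cycle: the pattern and its contrast-inverted counterpart,
    % alternated at 4.167 Hz (e.g., via Screen('Flip') in Psychtoolbox)
    frameA = stim;  frameB = -stim;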

Functional mapping of retinotopic visual areas

To functionally determine the boundaries of visual areas from V1 to V4 independently for each participant, we used the phase-encoding method50,51. Double wedges containing dynamic colored patterns cycled through 10 rotations in 10 min (retinotopic stimulation frequency = 0.033 Hz). To ensure deployment of attention to the stimulus during the mapping, participants performed a detection task: pressing a button upon seeing a gray dot anywhere on the wedges.

Experimental procedures

We performed the 3 experiments in a single scanning session lasting about 1.5 h. Stimuli were delivered using an 18” MRI-compatible LCD screen (Philips ERD-2, 60 Hz refresh rate) located at the end of the bore. All stimuli were delivered and responses gathered using Psychtoolbox 352,53 for MATLAB (The MathWorks Inc., Natick, MA, USA) with in-house scripts. Participants’ heads were restrained using foam pads and adhesive tape. Each session followed the same structure: first the structural scan, followed by retinotopic mapping. The perception task was then alternated with the imagery task until 3 runs of the perception task were completed, after which the imagery task was repeated until 7 or 8 runs in total (depending on the participant) were completed. Pauses were allowed between runs. The first 4 volumes of each functional run were discarded to allow the magnetization to reach equilibrium, and each functional run started with 10 seconds of fixation.

Phase-encoded retinotopic mapping analysis

Functional MRI retinotopic mapping data were analyzed using the Fast Fourier Transform (FFT) in MATLAB. The FFT was applied voxel-wise across time points. The complex output of the FFT contains both the amplitude and the phase of the sinusoidal components of the BOLD signal. Phase information at the frequency of stimulation (0.033 Hz) was extracted, thresholded by its amplitude (SNR ≥ 2), and overlaid on each participant’s cortical surface reconstruction obtained using Freesurfer54,55. We manually delineated the boundaries between retinotopic areas on the flattened surface around the occipital pole by identifying voxels showing phase reversals in the polar angle map, representing the horizontal and vertical visual meridians. In all participants, we clearly defined five distinct visual areas: V1, V2, V3d, V3v and V4; throughout this paper, we merge V3d and V3v and label them as V3. All four retinotopic labels were then intersected with the voxels activated by the perceptual blocks (grating > fixation, p < 0.001, FDR corrected), thus restricting each ROI to the foveal representation of the corresponding visual area.
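
The core of this analysis can be sketched in MATLAB as below; data is a hypothetical voxels-by-time matrix for one 10-cycle mapping run, and the SNR definition shown (stimulus-frequency amplitude relative to the mean amplitude of the other frequency bins) is one common choice, not necessarily the exact one used here:

    % Phase-encoded mapping sketch. 'data' is a hypothetical voxels-by-time
    % matrix (one 10-min run, TR = 2 s, 300 volumes, 10 stimulus cycles).
    nVols   = size(data, 2);
    F       = fft(data, [], 2);                 % voxel-wise FFT across time
    stimBin = 10 + 1;                           % 10 cycles per run -> 11th bin

    amp   = abs(F(:, stimBin));                 % amplitude at 0.033 Hz
    phase = angle(F(:, stimBin));               % polar-angle phase per voxel

    % SNR: stimulus-frequency amplitude vs. mean amplitude of the other bins
    otherBins = setdiff(2:floor(nVols/2), stimBin);
    snr = amp ./ mean(abs(F(:, otherBins)), 2);
    phase(snr < 2) = NaN;                       % keep voxels with SNR >= 2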

Functional MRI signal processing

All data were analyzed using SPM12 (Statistical Parametric Mapping; Wellcome Trust Centre for Neuroimaging, London, UK). We realigned functional images to the first functional volume and high-pass filtered them (128-second cutoff) to remove low-frequency drifts in the signal, with no additional spatial smoothing. To estimate the hemodynamic response function (HRF), we generated regressors for each grating (horizontal green/vertical red or vertical green/horizontal red) for each run and experiment (perception and imagery) independently. We used a finite impulse response (FIR) basis set, which makes no assumptions about the shape of the HRF; this is important for the analysis of the free decision imagery data9. We employed a 14th-order FIR basis encompassing 28 seconds (bins centered from −13 to +13 seconds relative to imagery onset), thus obtaining 14 bins of one TR each. For the perception condition, we employed a 1st-order FIR basis spanning each perceptual block from onset to offset (15 seconds). We also employed 1st-order FIR bases for the sanity-check imagery decoding (from 0 to 10 s, Fig. S2) and the before-after decision perception-imagery generalization (−10 to 0 s and 0 to 10 s relative to the imagery decision, Fig. 5). For the vividness analysis, we split the trials into low vividness (ratings 1 and 2) and high vividness (ratings 3 and 4), and then obtained the regressors for both gratings as explained above.
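
The following MATLAB sketch illustrates how an FIR design matrix assigns one regressor per time bin around each onset (SPM builds the equivalent design internally via its FIR basis option); onsets is a hypothetical vector of imagery-onset volumes:

    % FIR regressor sketch: a 14th-order FIR basis assigns one regressor per
    % 2-s bin around each imagery onset, making no assumption about HRF shape.
    nVols    = 150;                       % volumes in a run, illustrative
    nBins    = 14;                        % one bin per TR, covering 28 s
    firstBin = -7;                        % bin 1 starts 7 TRs (14 s) before onset
    X = zeros(nVols, nBins);
    for o = onsets(:)'                    % e.g., onsets = [20 45 78 110]
        for b = 1:nBins
            v = o + firstBin + (b - 1);   % volume covered by bin b
            if v >= 1 && v <= nVols
                X(v, b) = 1;              % one column (regressor) per time bin
            end
        end
    end
    % Regressing each voxel's time series on X yields one beta per bin,
    % i.e., an estimate of the response time course around imagery onset.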

Multi-voxel pattern analysis (MVPA)

We used a well-established decoding approach to extract information related to each grating contained in the pattern of activation across voxels of a given participant (in their “native” anatomical space), using The Decoding Toolbox (TDT)56. Using a leave-one-run-out cross-validation scheme, we trained an L2-norm regularized linear support vector machine (SVM, as implemented in LIBSVM) on beta values from all but one run and then tested it on the remaining run. No additional scaling (normalization) was performed on the data, as beta values represent a scaled version of the data relative to the run mean. Training and testing were repeated until every run had been used as the test set, and the results were then averaged across validations (7- or 8-fold, depending on the participant). We performed leave-one-run-out cross-validation for every temporal bin independently.
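
A minimal MATLAB sketch of the leave-one-run-out scheme is shown below, using fitcsvm (Statistics and Machine Learning Toolbox) in place of the TDT/LIBSVM implementation actually used; betas, labels, and run are hypothetical inputs for one time bin:

    % Leave-one-run-out decoding sketch. 'betas' is a hypothetical
    % trials-by-voxels matrix of beta values for one time bin; 'labels'
    % (1/2) codes the imagined grating and 'run' the run of each trial.
    runs = unique(run);
    acc  = zeros(numel(runs), 1);
    for r = 1:numel(runs)
        test = (run == runs(r));                        % held-out run
        mdl  = fitcsvm(betas(~test, :), labels(~test), ...
                       'KernelFunction', 'linear');     % linear SVM
        pred   = predict(mdl, betas(test, :));
        acc(r) = mean(pred == labels(test));            % fold accuracy
    end
    meanAccuracy = mean(acc);                           % averaged across folds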

We also employed cross-classification to generalize information between the perception and imagery tasks (“perception-imagery generalization”).

For the perception-imagery cross-classification, we trained on the ensemble of perception runs and tested on the ensemble of imagery runs. In each perception run, green and red gratings were shown pseudorandomly in 6 blocks of 15 s each. Perceptual blocks (15 s) were modeled with a 1st-order FIR basis, yielding regressors for red and green perceptual gratings, as explained in the previous section. Imagery trials were pre-processed exactly as in the imagery decoding, yielding time-resolved (2 s) or block (10 s) regressors (see previous section for details). Thus, classifiers trained on the perceptual runs (e.g., perceptual vertical-green vs perceptual horizontal-red) were tested on the imagery data (e.g., imagined vertical-green vs imagined horizontal-red). Accuracy was calculated as in the imagery decoding (e.g., percentage of correct vertical-green vs horizontal-red classifications), except that the training-testing procedure was performed only once (i.e., all perceptual data were used to train the classifiers and all imagery data to test them), since cross-validation is not necessary in such cross-classification schemes: the training and testing data are already different and independent (as opposed to the imagery decoding condition, where a fraction of the data was used for training and another for testing).
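
In contrast to the cross-validated imagery decoding, the cross-classification thus reduces to a single train-test pass, as in this sketch (percBetas, percLabels, imagBetas, and imagLabels are hypothetical variables, following the decoding sketch above):

    % Cross-classification sketch: train once on all perception betas, test
    % once on all imagery betas (no cross-validation needed, since training
    % and testing data come from different, independent tasks).
    mdl  = fitcsvm(percBetas, percLabels, 'KernelFunction', 'linear');
    pred = predict(mdl, imagBetas);
    crossAccuracy = mean(pred == imagLabels);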

We employed 2 different decoding approaches: searchlight and region-of-interest (ROI). We used a spherical searchlight with a radius of 3 voxels and obtained volumes in which a decoding accuracy value was assigned to each voxel. We normalized the decoding accuracy volumes into MNI space and applied spatial smoothing of 8 mm FWHM, which has been found to be optimal for accounting for anatomical idiosyncrasies across participants57. We then performed a one-tailed one-sample t-test against 50% (chance level) across participants for every voxel. We corrected for multiple comparisons using cluster-extent based thresholding employing Gaussian Random Field theory58,59, as implemented in FSL60. We used a primary threshold of p < 0.001 at the voxel level, as recommended in previous studies61, and a cluster-level threshold of p < 0.05, in every time-point volume independently. Importantly, these thresholds have been shown to keep false positives within nominal rates62.
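
As an illustration of the searchlight geometry and the group-level test, the following MATLAB sketch builds the voxel offsets for a sphere of radius 3 voxels (123 voxels) and indicates the per-voxel group t-test; accMaps is a hypothetical participants-by-voxels matrix of accuracy values:

    % Searchlight sketch: voxel offsets for a sphere of radius 3 voxels. For
    % each voxel, decoding is run on the voxels inside the sphere centered
    % on it, and the resulting accuracy is written back to the center voxel.
    r = 3;
    [dx, dy, dz] = ndgrid(-r:r, -r:r, -r:r);
    inSphere = sqrt(dx.^2 + dy.^2 + dz.^2) <= r;            % logical sphere mask
    offsets  = [dx(inSphere), dy(inSphere), dz(inSphere)];  % 123 voxel offsets

    % Group level: one-tailed one-sample t-test against 50% for every voxel
    % [h, p] = ttest(accMaps, 50, 'Tail', 'right');  % accMaps: subjects x voxels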

ROI decoding was used to test information content specifically in visual areas. We defined the boundaries of visual areas from V1 to V4, whose volumes were used as ROIs. Note that because the visual ROIs were defined on the cortical surface (see phase-encoded retinotopic analysis for details), only gray-matter voxels were considered, as opposed to the searchlight approach, which also considers voxels that do not contain gray matter, potentially explaining differences in sensitivity between these approaches.

We tested whether there was a difference in the average BOLD response between stimuli (i.e., a univariate difference). We did not find any significant differences (p > 0.05, uncorrected) in the average BOLD response (Fig. S9), thus ruling out the possibility that the results could be explained by differences in the average level of activity across conditions.

Permutation test

In order to validate the use of standard parametric statistics, we performed a permutation test and thus empirically determined the distribution of decoding accuracies under the null hypothesis63. Previous reports have highlighted the possibility of obtaining skewed decoding distributions, which would invalidate the use of standard parametric statistical tests29. We thus randomly shuffled the labels (i.e., horizontal red/vertical green) among trials and within runs (i.e., the number of red/green imagined trials was conserved within a run but trial labels were shuffled) for each participant and condition (imagery and generalization) to generate empirical data under the null hypothesis. After reshuffling the labels, we generated regressors for each stimulus and performed the decoding following the same procedure described above. We repeated this procedure 1000 times and obtained the empirical distribution under the null hypothesis. At each iteration, the second-level analysis (across participants) consisted of averaging the results across participants (exactly as performed on the original data), from which we obtained confidence intervals for each decoding time point and area (Figs S4 and S6) using the percentile method63. Our results show that the decoding null distribution followed a normal distribution (Table S2) and, importantly, significant results using permutation-test confidence intervals were comparable to the results using standard parametric tests (compare significant points in Figs 2 and 3 with Figs S4 and S6). This analysis thus validates the use of standard statistical tests to assess significance in our dataset.
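
The within-run label shuffling at the heart of this procedure can be sketched in MATLAB as follows; runDecoding stands for a hypothetical helper wrapping the regressor-generation and decoding pipeline described above, and the two-sided percentile interval shown is illustrative:

    % Permutation sketch: shuffle grating labels within each run, preserving
    % the number of trials per label, then repeat the full decoding pipeline.
    nPerm   = 1000;
    nullAcc = zeros(nPerm, 1);
    for i = 1:nPerm
        shuffled = labels;
        for r = unique(run)'
            idx           = find(run == r);
            shuffled(idx) = shuffled(idx(randperm(numel(idx))));  % within-run shuffle
        end
        nullAcc(i) = runDecoding(betas, shuffled, run);  % hypothetical helper
    end
    ci = prctile(nullAcc, [2.5 97.5]);   % percentile-method confidence interval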

Across time-points family-wise error rate (FWER) control

We estimated the probability of obtaining a given number n of significantly above-chance decoding time points (p < 0.05, one-tailed t-test) under the null hypothesis. To do this, we employed the data from the null distribution obtained with the permutation test (randomly shuffled labels, 1000 iterations; see the previous section for details). Fig. S3 shows the results of this analysis. Insets show the family-wise error rate for the empirically observed number of above-chance decoding time points in each area.
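
This count can be sketched in MATLAB as below; nullP is a hypothetical permutations-by-timepoints matrix of p-values obtained from the permutation test, and k is an illustrative observed count:

    % FWER sketch: under the null (shuffled labels), count how often k or
    % more of the 14 time points come out significant by chance.
    nSig = sum(nullP < 0.05, 2);   % significant time points per permutation
    k    = 4;                      % observed count, illustrative only
    fwer = mean(nSig >= k);        % P(>= k significant time points | null)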

Spillover effect (N-1) decoding control

We conducted a control analysis to directly test whether the searchlight results could be explained by any spillover from the previous trial, as performed in a previous study (Soon et al.19). To do this, we shifted the labels by one trial (N-1). Briefly, the rationale behind this control is the following: if there were spillover from the previous trial, this analysis should show higher decoding accuracy in the pre-imagery period, as effects from the previous trial would spill over into the next trial (for a comprehensive explanation of the rationale please refer to Soon et al.19). All decoding details were otherwise identical to those described in the section “Multi-voxel pattern analysis (MVPA)”, except that the first trial of each run was not considered, as there was no N-1 trial in that case. Analogously, for the perception-imagery generalization, training was performed on perception data and testing on imagery trials labeled as N-1.
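
The label shift at the core of this control can be sketched in MATLAB as follows (labels and run as in the decoding sketch above; runDecoding is the same hypothetical helper):

    % N-1 control sketch: within each run, relabel every trial with the label
    % of the preceding trial, drop the first trial, and rerun the decoding.
    shifted = nan(size(labels));
    for r = unique(run)'
        idx                 = find(run == r);
        shifted(idx(2:end)) = labels(idx(1:end-1));   % trial n takes label n-1
    end
    keep = ~isnan(shifted);                           % first trial of each run dropped
    % accNminus1 = runDecoding(betas(keep, :), shifted(keep), run(keep));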