Participants

Twenty-one right-handed native Spanish speakers took part in the first experiment and twenty-six in the second (aged between 19 and 40, mean 24; and 20 to 39, mean 25 respectively). From the first experiment one participant was excluded due to excessive noise in the recordings, and from the second, one was excluded due to lack of compliance with experimental instructions, two due to excessive noise and artifacts, and two due to technical problems with the audio system detected after the experiment. This left 20 participants in the first experiment, and 21 in the second. Their vision was normal or corrected to normal and they had no history of neurological disease. All participants provided informed consent in accord with the Declaration of Helsinki before starting the experiment and received €10 in exchange for their collaboration. The present study was approved by the BCBL Ethics Committee.

Stimuli and procedure

A set of 20 pictures depicting fricative- and stop-consonant-initial words in Spanish was selected from the Bank of Standardized Stimuli (BOSS46). The color pictures obtained were transformed to grayscale, trimmed, resized to a target diagonal of 500 pixels, and placed on a 550 by 550 pixel square with a medium-gray background using the ImageMagick software package. The pictures were balanced across conditions with respect to several ratings from the BOSS database (name-, object-, and viewpoint-agreement, familiarity, subjective complexity and manipulability), as well as several lower level image properties (contour complexity, number of pixels, and brightness). These properties were evaluated with the Matlab 2012b and ImageMagick software packages. The corresponding words were 5–6 letters long and were balanced across conditions on frequency (obtained from the esPal database47), length in syllables, and semantic category (natural vs artifact). All words started with a consonant followed by a vowel (see Table 3 for full stimuli list). Words were also balanced on a set of phonological dimensions: number of phonemes (plosives: 5.6, SD: 0.52; fricatives: 5.4, SD: 0.90; p = 0.62), number of syllables (plosives: 2.7, SD: 0.48; fricatives: 2.5, SD: 0.52; p = 0.45), position of the accented syllable (plosives: 1.9, SD: 0.32; fricatives: 1.7, SD: 0.66; p = 0.45), and number of phonological neighbors (none for either group). An additional non-predictive condition was included in the experimental materials, by using scrambled pictures as cues.
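The resizing step described above can be illustrated with a short Python sketch. It is not the ImageMagick command used in the study; it only shows the geometry under two assumptions: that the "target diagonal" refers to the diagonal of the trimmed image's bounding box, and that the resized image is centered on the 550 by 550 pixel canvas. The function name `fit_to_diagonal` is hypothetical.

```python
import math

def fit_to_diagonal(width, height, target_diagonal=500, canvas=550):
    """Scale a trimmed image to a target diagonal and centre it on a square canvas.

    Assumption: the diagonal is that of the trimmed bounding box, and the
    image is centred on the canvas (the source does not state either detail).
    """
    # Uniform scale factor that brings the current diagonal to the target.
    scale = target_diagonal / math.hypot(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Top-left offset that centres the resized image on the canvas.
    offset = ((canvas - new_w) // 2, (canvas - new_h) // 2)
    return (new_w, new_h), offset

# Example: a 600 x 800 px trimmed picture has a 1000 px diagonal.
size, offset = fit_to_diagonal(600, 800)
```

Scaling by the diagonal rather than by width or height keeps pictures of different aspect ratios at a comparable overall size.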

Table 3 Items in each phoneme condition. English translations provided within parentheses.

In total, 360 response trials were generated (120 in each condition: plosive-predictive, fricative-predictive, non-predictive). Trial order was pseudo-randomized, avoiding more than three repetitions in a row of the same image cue.
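One way to obtain such a constrained trial order is sketched below in Python. This is an illustrative procedure, not the one used in the study: it builds the order one trial at a time, never picking a cue that would extend the current run beyond the limit, and restarts if it reaches a dead end. The names `pseudo_randomize`, `longest_run` and the toy trial list are hypothetical.

```python
import random

def pseudo_randomize(trials, cue_of, max_run=3, rng=None, max_restarts=1000):
    """Order `trials` so that no cue occurs more than `max_run` times in a row."""
    rng = rng or random.Random()
    for _ in range(max_restarts):
        remaining = list(trials)
        order, run_cue, run_len = [], None, 0
        while remaining:
            # Indices of trials that would not push the current run past max_run.
            allowed = [i for i, t in enumerate(remaining)
                       if not (cue_of(t) == run_cue and run_len >= max_run)]
            if not allowed:
                break  # dead end: start again from scratch
            trial = remaining.pop(rng.choice(allowed))
            cue = cue_of(trial)
            run_len = run_len + 1 if cue == run_cue else 1
            run_cue = cue
            order.append(trial)
        if not remaining:
            return order
    raise RuntimeError("no valid order found")

def longest_run(order, cue_of):
    """Length of the longest streak of identical consecutive cues."""
    best, run_cue, run_len = 0, None, 0
    for trial in order:
        cue = cue_of(trial)
        run_len = run_len + 1 if cue == run_cue else 1
        run_cue = cue
        best = max(best, run_len)
    return best

# Toy example: 5 hypothetical cues, 8 repetitions each.
toy_trials = [(cue, rep) for cue in "ABCDE" for rep in range(8)]
toy_order = pseudo_randomize(toy_trials, cue_of=lambda t: t[0],
                             rng=random.Random(0))
```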

The picture was followed by an auditory word that was correctly pronounced in 50% of the trials, and incorrectly otherwise. In the latter case, the vowel following the first consonant was substituted by another vowel, always creating a pseudo-word. Three different mispronunciations were generated for each actual word when possible (for some words only one pseudo-word could be generated).

The auditory stimuli were created by recording a female native Spanish speaker reading aloud the target words (both the correct and mispronounced versions) in a sound-proof cabin. For each item multiple versions of the word were recorded following a random order, and one exemplar of each was chosen manually. The recordings were cut and equalized to 70 dB using Praat48.

Participants were instructed to evaluate whether the word was correctly pronounced or not and were informed that the incorrectly pronounced words had one phoneme replaced. They were encouraged to pay attention to the preceding images, explaining that these would always give valid cues as to the upcoming words. Participants responded with a left/right index button press, with yes/no response side counterbalanced across participants. Each trial started with a blank screen presented for a variable interval from 700 to 1200 ms, followed by the image-cue for 250 ms, and after a fixed (1750 ms: experiment 2) or variable (1250 to 2250 ms: experiment 1) interval, the word was presented auditorily. Participants had a maximum of 500 ms to give a response, and visual feedback was provided after incorrect trials (red cross in the center of the screen for 100 ms).
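The trial timeline can be summarized in a short Python sketch. It rests on several assumptions: the variable intervals are drawn from a uniform distribution (the text only says "variable"), the cue-to-word interval is measured from image offset and taken as 1250 to 2250 ms in experiment 1 (consistent with the 1500 ms minimum trial length used for the time-frequency window), and the function name `trial_timeline` is hypothetical.

```python
import random

def trial_timeline(experiment, rng):
    """Return event times (ms from trial start) for a single trial."""
    blank = rng.uniform(700, 1200)       # variable blank screen (assumed uniform)
    cue_duration = 250                   # image cue
    if experiment == 1:
        delay = rng.uniform(1250, 2250)  # variable cue-to-word interval
    elif experiment == 2:
        delay = 1750                     # fixed cue-to-word interval
    else:
        raise ValueError("experiment must be 1 or 2")
    cue_onset = blank
    cue_offset = cue_onset + cue_duration
    word_onset = cue_offset + delay      # interval counted from image offset
    return {"cue_onset": cue_onset,
            "cue_offset": cue_offset,
            "word_onset": word_onset}

example = trial_timeline(1, random.Random(1))
```

Under these assumptions the word never starts earlier than 1500 ms after image onset in experiment 1, and always starts 2000 ms after image onset in experiment 2.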

In addition to the response trials, there were 40 catch trials. These were identical to the experimental trials, but the image presented was the original color-version and participants were instructed not to respond to the word in these cases. These were included to make sure that participants were attending the image-cue, and not just waiting for the word.

Auditory stimuli were presented through plastic tubes and silicone earpieces to participants’ ears. Visual stimuli were presented within a 550 × 550 px medium-gray square against a black background, on a back-projection screen situated 150 cm away from the participant. The experimental block lasted approximately 30 minutes. Participant-controlled pauses were provided every 10 trials, in addition to two experimenter-controlled ones. The experimental session included another 30-minute block with a similar paradigm in which target words were presented in written form rather than auditorily. However, in the present paper we report the results for the auditory block only. Order of presentation of these two blocks was counterbalanced across participants. Between them, 10 minutes of resting state were recorded. Overall, the recording session lasted approximately two hours.

MEG data acquisition

Brain activity was recorded in a magnetically shielded room using a whole head MEG system (Vectorview, Elekta/Neuromag) with 306 sensors arranged in triplets comprised of one magnetometer and two orthogonal planar gradiometers. Participants were screened for magnetic interference prior to data collection and instructed to limit head and face movements as much as possible, and to fixate on the center of the screen. Data were acquired with a 1000 Hz sampling rate and filtered during recording with a high-pass cutoff at 0.1 Hz and a low-pass cutoff at 330 Hz via the Elekta acquisition software. Head movements were monitored continuously using five head position indicator coils attached to the participant’s head. Their location relative to fiducials (nasion and left and right pre-auricular points) was recorded at the beginning of the session using an Isotrak 3-D digitizer (Fastrak Polhemus, USA). In addition, head shape was digitized to allow for alignment to each subject’s structural MRI for subsequent source localization. Eye movements and heartbeats were monitored using vertical and horizontal bipolar electro-oculograms (EOG) and electrocardiogram (ECG).

MRI data acquisition

Participants’ high-resolution 3D structural MRIs (T1-weighted MPRAGE sequence) were acquired with a 3 T Trio scanner (Siemens, Munich, Germany) and a 32-channel head coil. To limit head movement, the area between participants’ heads and the head coil was padded with foam and participants were asked to remain as still as possible. Snugly fitting headphones were used to dampen background scanner noise and to enable communication with experimenters while in the scanner.

MEG data preprocessing

MEG data were initially preprocessed using Elekta’s MaxFilter 2.2 software, including head movement compensation, down-sampling to 250 Hz, and noise reduction using signal space separation method49 and its temporal extension (tSSS) for removing nearby artifacts50. Manually-tagged bad channels were substituted by interpolated values.

Subsequent data analysis was carried out in Matlab 2012b, using the FieldTrip toolbox51. The recordings were segmented from −1000 ms to 4000 ms (experiment 1) or 3500 ms (experiment 2) relative to the presentation of the picture, and low-pass filtered at 100 Hz. Since in experiment 1 the delay between the cue and the target was variable, trials were then trimmed to exclude the response to the actual word. This allowed the generation of an image-locked average containing only preparatory activity. In addition, trials for this experiment were re-segmented time-locked to word presentation, in order to allow examination of the event-related response to the actual word.

Eye movement, blink and electrocardiographic artifacts were linearly subtracted from recordings using independent component analysis (ICA)52. ICA components reflecting eye movements, blinks and heartbeats were identified by calculating correlation values between each component signal and the activity of the VEOG/HEOG and ECG channels. The data were then visually inspected to remove any epochs with remaining artifacts. Fifteen percent of the trials were rejected in experiment 1, and 9% in experiment 2. There were significant differences in trial rejection between experiments (F(1,39) = 11.0, p = 0.002) but not between phoneme expectation conditions (F(1,39) = 1.2, p = 0.3). Also, phoneme expectation conditions and experiments did not interact for the number of rejected trials (F(1,39) = 0.9, p = 0.3). Further sensor-data analysis was performed using gradiometer data only, but both magnetometer and gradiometer data were used for source localization.
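The correlation-based screening of ICA components can be sketched as follows in Python. This is an illustration rather than the FieldTrip code used in the study. The correlation threshold of 0.3 is a hypothetical value chosen for the example (the text reports no threshold), and the function names are likewise assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat signal carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def flag_artifact_components(components, reference_channels, threshold=0.3):
    """Return indices of components correlating with any EOG/ECG channel.

    components: list of component time courses.
    reference_channels: dict mapping channel name (e.g. "VEOG") to its signal.
    threshold: hypothetical cut-off on |r|, not taken from the study.
    """
    flagged = []
    for idx, comp in enumerate(components):
        strongest = max(abs(pearson_r(comp, ref))
                        for ref in reference_channels.values())
        if strongest >= threshold:
            flagged.append(idx)
    return flagged
```

Flagged components would still be checked by eye before removal, in line with the visual inspection step described above.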

Experimental design

Before conducting analysis of the neural response we examined the effect of our experimental manipulations on reaction times to the actual words to establish the presence of predictability and mismatch effects using mixed models with crossed random effects for items and subjects53.

We conducted analysis of the MEG data in two steps: firstly, statistical inference was carried out on sensor level data to establish the presence of the effects of interest. Secondly, identified effects were localized in the brain using a whole-brain source reconstruction approach.

Statistical inference was carried out using cluster-based permutation tests. Two different comparisons were performed. Firstly, the presence of a phoneme expectation effect (fricatives vs plosives) was assessed using the 41 subjects of both experiments pooled together. Secondly, the presence of an interaction between phoneme expectation and temporal predictability was evaluated by contrasting the difference between fricatives and plosives for subjects in experiment 1 (variable interval, N = 20) to those in experiment 2 (fixed interval, N = 21).

Behavioral analysis

Behavioral data were analyzed using the free software statistical package R54, and the lme455 library. Given the controversy behind calculating degrees of freedom and corresponding p-values in these types of models, we evaluated the significance of predictors using the normal approximation (|t| > 2).

Trials with incorrect responses and reaction times (RT) under 0.2 s were removed before model fitting. The resulting RTs served as the dependent variable against which mixed-effects multiple regression models were built. Our independent variables of interest were the following binary categorical variables: Phoneme (fricative or plosive), Image cue (predictive or nonpredictive), Pronunciation (correct or incorrect) and Experiment (1 or 2). These variables were coded using treatment coding, with plosive, predictive, correct, Experiment 1 trials as the reference level.
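The trial exclusion and treatment coding described above can be sketched in Python as follows. The models themselves were fitted in R with lme4, so this only illustrates the data preparation. The field names (`rt`, `accurate`, and so on) and the function name `prepare_trials` are hypothetical.

```python
# Reference level of each factor, as stated in the text.
REFERENCE = {
    "phoneme": "plosive",
    "cue": "predictive",
    "pronunciation": "correct",
    "experiment": 1,
}

def prepare_trials(trials, min_rt=0.2):
    """Drop inaccurate or too-fast trials and dummy-code the four factors.

    Each trial is a dict with hypothetical keys: "rt" (seconds), "accurate"
    (bool), "phoneme", "cue", "pronunciation" and "experiment".
    """
    coded = []
    for trial in trials:
        if not trial["accurate"] or trial["rt"] < min_rt:
            continue  # excluded before model fitting
        row = {"rt": trial["rt"]}
        for factor, reference in REFERENCE.items():
            # Treatment coding: 0 for the reference level, 1 otherwise.
            row[factor] = 0 if trial[factor] == reference else 1
        coded.append(row)
    return coded
```

With this coding the model intercept estimates the RT for plosive, predictive, correctly pronounced trials in experiment 1, and each coefficient expresses a difference from that baseline.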

Sensor level analysis

Event related fields

The word-locked epochs for each experiment were low pass filtered at 35 Hz. The uncombined gradiometer signals were then averaged according to experimental condition, and baseline corrected using a 500 ms window prior to image onset (in order to preserve phoneme-expectation differences that were hypothesized to be present before word onset). Finally, the orthogonal directions of each gradiometer pair were combined using the Euclidean norm.
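The order of operations matters here: baseline correction is applied to each gradiometer separately, and only then are the two orthogonal signals of a pair combined. A minimal Python sketch of both steps is given below. The function names are hypothetical, and times are assumed to be expressed in seconds relative to image onset.

```python
import math

def baseline_correct(signal, times, start=-0.5, end=0.0):
    """Subtract the mean of the pre-image window [start, end) from a signal."""
    baseline = [v for v, t in zip(signal, times) if start <= t < end]
    mean = sum(baseline) / len(baseline)
    return [v - mean for v in signal]

def combine_planar(grad_a, grad_b):
    """Combine two orthogonal planar gradiometers with the Euclidean norm."""
    return [math.hypot(a, b) for a, b in zip(grad_a, grad_b)]

# Toy example with four samples per sensor.
times = [-0.5, -0.25, 0.0, 0.25]
grad_a = baseline_correct([1.0, 3.0, 5.0, 7.0], times)
grad_b = baseline_correct([0.0, 0.0, 4.0, 0.0], times)
combined = combine_planar(grad_a, grad_b)
```

Because the Euclidean norm is always non-negative, the combined signal discards the sign of the field gradient, which is why the correction has to precede the combination.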

Time-frequency

Time-frequency representations over 3–30 Hz were obtained for the image-locked epochs in each experiment using a frequency-independent Hanning taper and a 500 ms sliding window advancing in 40-ms steps, giving rise to a 2-Hz frequency resolution. Power estimates were then separated into the two phonological expectation conditions and averaged over trials. For each gradiometer pair, power was averaged across the two sensors, resulting in 102 time-frequency power maps. Power was then normalized by its baseline value (450 to 250 ms prior to picture presentation).
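The 2-Hz resolution follows directly from the window length (1 / 0.5 s = 2 Hz). The sliding-window estimate can be sketched in plain Python as shown below. This is not the FieldTrip implementation used in the study. It assumes the 250 Hz sampling rate of the preprocessed data and, for the example, center frequencies of 4 to 30 Hz in 2-Hz steps (the exact frequencies of interest are not listed in the text). The function name `hanning_tfr` is hypothetical.

```python
import cmath
import math

def hanning_tfr(signal, fs, freqs, win_s=0.5, step_s=0.04):
    """Sliding-window power estimate with a fixed-length Hanning taper.

    Returns window centre times (s) and one power spectrum per window.
    """
    n_win = int(round(win_s * fs))    # 125 samples at 250 Hz
    n_step = int(round(step_s * fs))  # 10 samples at 250 Hz
    taper = [0.5 - 0.5 * math.cos(2 * math.pi * k / (n_win - 1))
             for k in range(n_win)]
    times, power = [], []
    for start in range(0, len(signal) - n_win + 1, n_step):
        segment = [signal[start + k] * taper[k] for k in range(n_win)]
        spectrum = []
        for f in freqs:
            # Discrete Fourier coefficient of the tapered segment at f.
            coef = sum(segment[k] * cmath.exp(-2j * math.pi * f * k / fs)
                       for k in range(n_win))
            spectrum.append(abs(coef) ** 2)
        times.append((start + n_win / 2) / fs)
        power.append(spectrum)
    return times, power

# Toy example: 2 s of a 10 Hz sine wave sampled at 250 Hz.
fs = 250
sine = [math.sin(2 * math.pi * 10 * k / fs) for k in range(2 * fs)]
freqs = list(range(4, 31, 2))
times, power = hanning_tfr(sine, fs, freqs)
```

Because the same 500 ms window is used at every frequency, temporal smoothing is constant across the spectrum, at the cost of fewer cycles per window at the lowest frequencies.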

Statistical analysis

Differences between conditions were assessed using cluster-based permutation tests56. This analysis controls for multiple comparisons whilst maintaining sensitivity by taking into account the temporal, spatial and, for time-frequency data only, frequency dependency of neighbouring samples. First, the data were clustered by performing pairwise comparisons (t-tests) between each sample (time-frequency-sensor or time-sensor point) in two conditions. Contiguous values exceeding a p = 0.05 threshold were grouped in clusters, and a cluster-based statistic was derived by adding the t-values within each cluster. Then, a null distribution assuming full exchangeability (i.e. no difference between conditions) was approximated by drawing 3000 random permutations of the observed data and calculating the cluster-level statistics for each randomization. Finally, the cluster-level statistics observed in the actual data were evaluated under this null distribution.

Dependent-samples t-tests were employed for the sample-pairwise comparisons in the phoneme effect analysis (fricatives vs plosives, within-subjects contrast over 41 subjects), whereas independent-samples t-tests were used in the analysis for the interaction between phoneme and temporal uncertainty (experiment 1 vs experiment 2, between-subjects contrast, 20 and 21 subjects respectively).
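The logic of the dependent-samples test can be illustrated with a deliberately simplified Python sketch. It clusters over a single dimension (time) only, whereas the actual analysis also clusters over sensors and, for time-frequency data, over frequencies. Swapping the two condition labels within a subject is implemented as a sign flip of that subject's difference. The critical t-value is passed in by the caller, and all function names are hypothetical.

```python
import math
import random

def paired_t(diffs):
    """One-sample t-statistic of within-subject differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n) if var > 0 else 0.0

def cluster_sums(tvals, t_crit):
    """Sum t-values over runs of contiguous supra-threshold samples."""
    sums, current, sign = [], 0.0, 0
    for t in tvals:
        s = 1 if t > t_crit else -1 if t < -t_crit else 0
        if sign != 0 and s != sign:   # the running cluster ends here
            sums.append(current)
            current = 0.0
        if s != 0:
            current += t
        sign = s
    if sign != 0:
        sums.append(current)
    return sums

def cluster_permutation_test(cond_a, cond_b, t_crit, n_perm=3000, seed=0):
    """cond_a, cond_b: one list of time samples per subject (same order)."""
    rng = random.Random(seed)
    diffs = [[a - b for a, b in zip(sa, sb)] for sa, sb in zip(cond_a, cond_b)]
    n_samples = len(diffs[0])

    def clusters(signs):
        tvals = [paired_t([s * d[i] for s, d in zip(signs, diffs)])
                 for i in range(n_samples)]
        return cluster_sums(tvals, t_crit)

    observed = clusters([1] * len(diffs))
    # Null distribution of the largest absolute cluster statistic.
    null = []
    for _ in range(n_perm):
        signs = [rng.choice((1, -1)) for _ in diffs]
        null.append(max((abs(c) for c in clusters(signs)), default=0.0))
    p_values = [(sum(m >= abs(c) for m in null) + 1) / (n_perm + 1)
                for c in observed]
    return observed, p_values
```

For the between-subjects interaction contrast, the same scheme would apply with an independent-samples t-statistic and with subjects shuffled between the two experiments instead of sign flips.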

For the event-related fields, a 500 ms window centered around word onset was statistically analyzed, in order to include early responses to the word, and activity just prior to it. For the time-frequency data, the statistical analysis was performed in a time window ranging from image offset (250 ms) to the minimum trial length in experiment 1 (1500 ms).

Source reconstruction

Source reconstruction was carried out to identify the brain areas underpinning the experimental effects detected at the sensor level, both for the event-related fields and the time-frequency data.

Different source reconstruction approaches were implemented for each kind of data: a minimum-norm estimate (MNE57) for the event-related fields, and a linearly constrained minimum variance beamformer (LCMVB58) for the time-frequency data. Although theoretically there are no reasons to prefer different models for each type of data, from a practical perspective beamforming may work better for sustained brain responses (such as induced modulations of ongoing oscillations) and MNE for shorter-lived responses as evoked by a stimulus59.

Participants’ high-resolution 3D structural MRIs (T1-weighted) were segmented using Freesurfer software57,60,61. The MRI and MEG coordinate systems were co-registered using the three anatomical fiducial points for initial estimation and the head-surface points for manual adjustment of the surface co-registration. We then computed realistic head-models (one-shell boundary element) using these segmented T1 images. The MRI was missing for one participant and we therefore used a template head model in this case. The forward model was computed for three orthogonal source orientations, placed on a 5 mm grid covering the whole brain using MNE suite62 (Martinos Centre for Biomedical Imaging). Each source was then reduced to its two principal components of highest singular value, which closely correspond to sources tangential to the skull. Both planar gradiometers and magnetometers were used for inverse modeling after dividing each sensor signal (and the corresponding forward-model coefficients) by its noise standard deviation. The noise variance was estimated from the 500-ms baseline prior to picture onsets in all conditions.
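Dividing each sensor by its noise standard deviation places magnetometers and planar gradiometers, which have different units and amplitude ranges, on a common unitless scale before inversion. A minimal Python sketch of this scaling is shown below. The function name `whiten_by_noise_sd` and the list-of-lists data layout are assumptions made for the example, and the population standard deviation is used for simplicity.

```python
import statistics

def whiten_by_noise_sd(data, leadfield, baseline):
    """Scale sensor data and forward-model rows by each sensor's noise SD.

    data:      sensors x samples
    leadfield: sensors x sources (forward-model coefficients)
    baseline:  sensors x baseline samples used to estimate the noise
    """
    noise_sd = [statistics.pstdev(channel) for channel in baseline]
    data_w = [[v / sd for v in channel]
              for channel, sd in zip(data, noise_sd)]
    leadfield_w = [[v / sd for v in row]
                   for row, sd in zip(leadfield, noise_sd)]
    return data_w, leadfield_w, noise_sd
```

Scaling the data and the forward model by the same factor leaves the relation between sources and measurements unchanged while equalizing the noise level across sensor types.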

Event related fields

Windows for source localization were selected a priori, to reflect activity just before and just after word presentation (250 ms each). An MNE inverse solution57 was used to project the ERF data into source space, using a noise co-variance matrix estimated from a 500 ms pre-image baseline. Source power in the two windows of interest (250 ms before and 250 ms after word presentation) was then averaged and normalized by the power in the baseline. In order to allow for subsequent group level analysis, the individual MRIs were mapped to the standard Montreal Neurological Institute (MNI) brain through a non-linear transformation using the spatial-normalization algorithm implemented in Statistical Parametric Mapping (SPM863,64), and the ensuing spatial transformations were applied to individual maps.

We located regions of peak activity with respect to baseline and restricted between-condition comparisons to those sites (see65). Non-parametric permutation tests26 were used to identify regions of significant change with respect to baseline, using both predictive conditions. Within these regions, we identified the coordinates of local maxima (sets of contiguous voxels displaying higher power than all other neighboring voxels). Any maxima located in deep brain structures were discarded due to their probable artifactual nature. In practice, subject- and group-level baseline power maps were computed as done for the maps reflecting trial-dependent activity. Group-level difference maps were obtained by subtracting f-transformed trial and baseline group-level power maps. Under the null hypothesis that power maps are the same whatever the experimental condition, the labeling trial and baseline are exchangeable at the subject-level prior to group-level difference map computation. To reject this hypothesis and to compute a threshold of statistical significance for the correctly labeled difference map, the permutation distribution of the maximum of the difference map’s absolute value was computed for all possible 4096 (= 2^12) permutations. The threshold at p < 0.05 was computed as the 95th percentile of the permutation distribution26. All supra-threshold local coherence maxima were interpreted as indicative of brain regions showing statistically significant activity elicited by our experimental manipulation. In order to estimate which of the identified cortical sources contributed more to the phoneme effects detected at sensor level, we then calculated the t-statistic for the phoneme condition contrast (fricatives vs. plosives) at each source and ranked the sources accordingly.
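The maximum-statistic threshold can be illustrated with a simplified Python sketch. It makes two assumptions that go beyond the text: the group-level difference map is taken as the plain mean of the subject-level differences (the transformation of the power maps is omitted), and exchanging the trial and baseline labels of a subject is implemented as a sign flip of that subject's difference. With twelve exchangeable label pairs, exhaustive enumeration yields 2^12 = 4096 permutations. The function name `max_stat_threshold` is hypothetical.

```python
import itertools
import math

def max_stat_threshold(trial_maps, baseline_maps, alpha=0.05):
    """Exhaustive permutation threshold on the maximum absolute difference.

    trial_maps, baseline_maps: one list of voxel values per subject.
    Returns the (1 - alpha) percentile of the permutation distribution and
    the number of permutations enumerated.
    """
    diffs = [[t - b for t, b in zip(tm, bm)]
             for tm, bm in zip(trial_maps, baseline_maps)]
    n_sub, n_vox = len(diffs), len(diffs[0])
    maxima = []
    # Every combination of label swaps, including the correct labelling.
    for signs in itertools.product((1, -1), repeat=n_sub):
        group = [sum(s * d[v] for s, d in zip(signs, diffs)) / n_sub
                 for v in range(n_vox)]
        maxima.append(max(abs(g) for g in group))
    maxima.sort()
    index = math.ceil((1 - alpha) * len(maxima)) - 1
    return maxima[index], len(maxima)
```

Taking the maximum over all voxels in each permutation is what controls the family-wise error rate: any voxel of the correctly labeled map that exceeds the threshold does so against the largest value expected anywhere in the brain under the null hypothesis.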

Time-frequency

Power and cross-spectral density (CSD) matrices were estimated in the time-frequency windows selected on the basis of the sensor-level statistical analysis, and in an equally-sized baseline period prior to image onset. The real part of the combined CSD matrices from both the baseline and the time window of interest in all conditions was used to compute an LCMVB58 beamformer.

Subsequent group-level analysis proceeded as with the ERF data. In this case, permutation tests were performed on log-transformed power ratios (activity of interest to baseline). Peak coordinates of power-ratio were then identified. The relative contribution of each source to the phoneme effect was explored by examining the standardized mean difference between fricative and plosive expectation conditions at each peak source.

Finally, in order to explore the temporal evolution of power at the identified sources over the whole trial, we extracted the time courses for these peaks. We used a forward solution restricted to these locations and CSD matrix covering the frequencies of interest and the whole trial length to project the time domain data to these sources. We then used the same parameters for spectral decomposition employed in the sensor level analysis, to obtain the time-course for each frequency band of interest and each phoneme expectation condition at each peak activation location.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.