For all experiments, 128-channel EEG data (plus two mastoid channels) were acquired at a rate of 512 Hz using an ActiveTwo system (BioSemi). Triggers indicating the start of each trial were sent by the stimulus presentation computer and included in the EEG recordings to ensure synchronization. Offline, the data were band-pass filtered between 1 and 8 Hz, downsampled to 128 Hz, and re-referenced to the average of the mastoid channels in MATLAB. To identify channels with excessive noise, the time series were visually inspected and the SD of each channel was compared with that of the surrounding channels. Channels contaminated by noise were recalculated by spline interpolating the surrounding clean channels in EEGLAB [].

For the multisensory experiment, the stimuli were drawn from a set of videos of a male speaking American English in a conversational manner. Fifteen 60 s videos were rendered into 1280 × 720-pixel movies at 30 frames/s and exported in audio-only (A), visual-only (V), and AV format in VideoPad Video Editor (NCH Software). The soundtracks were sampled at 48 kHz, underwent dynamic range compression, were matched in RMS intensity (see []), and were mixed with spectrally matched stationary noise to ensure consistent masking across stimuli []. The noise stimuli were generated in MATLAB (The MathWorks) using a 50th-order forward linear predictive model estimated from the original speech recording. The prediction order was calculated based on the sampling rate of the soundtracks []. The data analyzed here were from the A and AV conditions only. Note that the presentation order of the A and AV repetitions was randomized across the 15 videos and across subjects. The data from all 21 subjects (aged 21–35 years; 13 male) in this experiment have been published previously [].
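The noise-generation step can be illustrated as follows: linear-prediction coefficients are estimated from the speech, white noise is passed through the resulting all-pole synthesis filter, and the output is RMS-matched. This is a hedged NumPy/SciPy sketch (the original used MATLAB); the function names and the autocorrelation (Yule-Walker) estimation route are our choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order):
    """Forward linear-prediction coefficients via the Yule-Walker equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

def matched_noise(speech, order=50, seed=0):
    """Stationary noise spectrally matched to `speech` via a 50th-order
    forward linear predictive model, as in the text."""
    a = lpc(speech, order)
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(len(speech))
    # Shape white noise with the all-pole synthesis filter 1/A(z),
    # where A(z) = 1 - sum_k a_k z^-k.
    noise = lfilter([1.0], np.concatenate(([1.0], -a)), white)
    # Match RMS intensity to the original speech.
    return noise * np.sqrt(np.mean(speech ** 2) / np.mean(noise ** 2))
```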

In the cocktail party experiment, 33 subjects (aged 23–38 years; 27 male) undertook 30 trials, each 60 s in length, in which they were presented with 2 classic works of fiction: one to the left ear, and the other to the right ear. Each story was read by a different male speaker. Subjects were divided into 2 groups of 17 and 16 (+1 excluded subject), with each group instructed to attend to the story in either the left or right ear throughout all 30 trials. After each trial, subjects were required to answer between 4 and 6 multiple-choice questions on both stories. Each question had 4 possible answers. We used a between-subjects design because we wanted each subject to follow just one story, keeping the experiment as natural as possible, and because we wished to avoid any repeated presentation of stimuli. For both stories, each trial began where the story ended on the previous trial. Stimulus amplitudes in each audio stream within each trial were normalized to have the same root mean squared (RMS) intensity. To minimize the possibility of the unattended stream capturing the subjects’ attention during silent periods in the attended stream, silent gaps exceeding 0.5 s were truncated to 0.5 s in duration. Stimuli were presented using Sennheiser HD650 headphones and Presentation software from Neurobehavioral Systems (http://www.neurobs.com). Subjects were instructed to maintain visual fixation on a crosshair centered on the screen for the duration of each trial, and to minimize eye blinking and all other motor activities. The data from all subjects in this experiment have been published previously [].
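The RMS normalization and silent-gap truncation described above can be sketched as follows; note that the amplitude threshold used here to detect silence is an illustrative assumption, as the text does not state how silence was identified.

```python
import numpy as np

def normalize_rms(x, target_rms=0.1):
    """Scale a waveform so its root mean squared intensity equals target_rms."""
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))

def truncate_silences(x, fs, thresh=1e-3, max_gap=0.5):
    """Cut silent runs (samples with |x| < thresh) down to max_gap seconds.

    thresh is an illustrative silence criterion, not one from the paper.
    """
    max_len = int(max_gap * fs)
    keep = np.ones(len(x), dtype=bool)
    run = 0
    for i, sample in enumerate(x):
        run = run + 1 if abs(sample) < thresh else 0
        if run > max_len:
            keep[i] = False   # drop samples beyond the 0.5 s allowance
    return x[keep]
```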

In the N400 experiment, subjects read 300 sentences presented word-by-word on a screen, half of which ended with a word that was congruent (high cloze probability) with the rest of the sentence and half of which ended with an incongruent (low cloze probability) word. N400s were then determined by subtracting the event-related potential to the congruent words from that to the incongruent words. Using the EEG data recorded during the story, we then derived a semantic dissimilarity TRF for each subject as before.
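The subtraction defining the N400 difference wave can be written as a short sketch (array shapes and names are ours):

```python
import numpy as np

def n400_difference_wave(incongruent_epochs, congruent_epochs):
    """Difference wave: mean ERP to incongruent endings minus mean ERP
    to congruent endings.

    *_epochs : (n_trials, n_channels, n_times) epoched EEG arrays.
    """
    return incongruent_epochs.mean(axis=0) - congruent_epochs.mean(axis=0)
```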

In the first experiment, subjects undertook 20 trials, each of the same length (just under 180 s), where they were presented with a professional audio-book version of a popular mid-20th century American work of fiction written in an economical and understated style and read by a single male American speaker. The trials preserved the storyline, with neither repetitions nor discontinuities. The average speech rate was ∼210 words/min. Similarly, the second experiment involved the presentation of the same trials in the same order, but with each of the 28 speech segments played in reverse. All stimuli were presented monophonically at a sampling rate of 44.1 kHz using Sennheiser HD650 headphones and Presentation software from Neurobehavioral Systems ( http://www.neurobs.com ). Testing was carried out in a dark, sound-attenuated room and subjects were instructed to maintain visual fixation on a crosshair centered on the screen for the duration of each trial, and to minimize eye blinking and all other motor activities. Data from 10 of the subjects (aged 23–38 years; 7 male) who participated in the first experiment and all of the subjects (aged 21–32 years; 7 male) who participated in the second experiment have been published previously []. Data from an additional 9 subjects (aged 19–32 years, 6 male) for the first experiment were collected for the current study.

Computational Model and Regression

Semantic vectors for content words were derived using the state-of-the-art word2vec algorithm []. The “continuous bag of words” implementation built in [] was selected because it was trained on British English corpora (ukWaC, the English Wikipedia, and the British National Corpus combined), which are both large and probably more reflective of the language exposure of the participants (in Dublin) than US corpora. In addition, the word vectors are freely downloadable (see []). Word2vec uses an artificial neural network to embody the “distributional hypothesis” that words with similar meanings occur in similar contexts.
Practically, the approach involves sliding a fixed window of words (11 in this case, though this is a parameter set by the experimenter) over a text corpus and training a neural network to predict the word in the center of that window. Word identity (as opposed to semantics) is uniquely encoded as a single bit set to one in a long vector of zeros (the vector length is the number of words in the vocabulary). These long vectors form the basis of the input and output to the neural network. The input corresponds to the sum of the 10 word vectors in the window; the output is the central word. Because word order is lost in this summation, the input is analogous to an unordered bag of words. The network contains an internal hidden layer of 400 dimensions, fully connected to both the input and the output. It is in fact the weights on the connections between the input and hidden layers that are ultimately harvested to form the semantic model (the weights form a vocabulary-size × 400 floating-point matrix); the remainder of the network is discarded. The weights are initialized randomly and subsequently optimized so as to reduce the error between predicted and target output. Intuitively, because words that frequently appear together in the same context window also predict similar central words, the weights on these words are tuned to similar internal representations reflecting common contexts. For more details on the training procedure see [] and [] (note, the choice of 400 dimensions for the internal layer was arbitrary and, as described in the next paragraph, these 400-dimensional vectors are reduced to a single correlation measure).
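A minimal NumPy sketch of this training procedure follows. It uses a full softmax output rather than the efficiency tricks of the published implementation, and all hyperparameter defaults are illustrative (a window of 5 words on each side yields the 10 context words mentioned above; the published vectors used 400 hidden dimensions).

```python
import numpy as np

def train_cbow(corpus_ids, vocab_size, dim=400, window=5, lr=0.05, epochs=5, seed=0):
    """Minimal CBOW word2vec trainer with a full softmax output.

    corpus_ids : list of integer word indices. Returns W_in, the
    (vocab_size, dim) embedding matrix; as described in the text, the
    hidden-to-output weights are discarded after training.
    """
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input -> hidden
    W_out = rng.normal(scale=0.1, size=(dim, vocab_size))  # hidden -> output
    for _ in range(epochs):
        for c in range(window, len(corpus_ids) - window):
            context = corpus_ids[c - window:c] + corpus_ids[c + 1:c + window + 1]
            target = corpus_ids[c]
            # Unordered bag of words: sum the context word vectors.
            h = W_in[context].sum(axis=0)
            scores = h @ W_out
            scores -= scores.max()                     # numerical stability
            p = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
            err = p.copy()
            err[target] -= 1.0                         # cross-entropy gradient wrt scores
            grad_h = W_out @ err
            W_out -= lr * np.outer(h, err)
            # Shared gradient for every context word (handles duplicates).
            np.subtract.at(W_in, context, lr * grad_h)
    return W_in
```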

Having obtained a vector for each word, we then quantified how semantically dissimilar each particular word was to the preceding words in the corresponding sentence. We did this by calculating a Pearson’s correlation between the word’s 400-dimensional vector and the average of the vectors corresponding to all the preceding words in that particular sentence, and subtracting this correlation from 1 (where a specific word was the first word in a sentence, we calculated the correlation between that word and the average of all word vectors in the previous sentence, before, again, subtracting the correlation from 1). It should be noted that this kind of simple feature-wise averaging/summation of word-level semantic vectors has proven to be an effective and enduring method of modeling semantic composition in computational linguistics (e.g., []). It has also proven to be a successful method for predicting fMRI activation patterns associated with the meanings of sentences (e.g., []). However, it should also be noted that the approach is a gross oversimplification of the complexities of semantic composition in the brain (and does not take into account the effects of word order or syntax; see discussion in []). In any case, our approach produced a single semantic dissimilarity measure for each word, with a value between 0 and 2. We then created a “semantic dissimilarity vector” at the same sampling rate as our EEG data (128 Hz), which consisted of time-aligned impulses at the onset of each word, scaled according to the value of that word’s semantic dissimilarity. The word onset times were determined by performing forced alignment of the speech files with the corresponding orthographic transcription using the Prosodylab-Aligner, which has been shown to produce alignments with median precision (misalignment) on the order of 10 ms [].
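The dissimilarity computation and the construction of the impulse vector can be sketched as follows (function names are ours):

```python
import numpy as np

def semantic_dissimilarity(word_vec, preceding_vecs):
    """1 minus the Pearson correlation between a word's vector and the
    average of the preceding words' vectors, yielding a value in [0, 2]."""
    context = np.mean(preceding_vecs, axis=0)
    r = np.corrcoef(word_vec, context)[0, 1]
    return 1.0 - r

def dissimilarity_vector(onsets_s, values, duration_s, fs=128):
    """Impulse train at each word onset (in seconds), scaled by that
    word's semantic dissimilarity, sampled at the EEG rate."""
    out = np.zeros(int(round(duration_s * fs)))
    for onset, value in zip(onsets_s, values):
        out[int(round(onset * fs))] = value
    return out
```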

A system identification technique was used to compute a channel-specific mapping between the semantic dissimilarity vector and the recorded EEG data, commonly referred to as a temporal response function (TRF). A TRF can be interpreted as a filter that describes the brain’s linear transformation of a stimulus feature, S(t), to the continuous neural response, R(t), over a series of specified time lags: R(t) = TRF ∗ S(t), where ‘∗’ represents the convolution operator. Specifically, estimation of the TRF weights was performed using regularized linear regression, wherein a regularization (ridge) parameter was tuned to control overfitting (see [] for a detailed description of this step).
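The TRF estimation can be sketched as follows: a lagged design matrix is built from S(t), and ridge regression maps it to R(t). The lag range and regularization value here are illustrative; in the actual analysis the ridge parameter was tuned as described in the text.

```python
import numpy as np

def estimate_trf(stimulus, response, fs, tmin_s=-0.1, tmax_s=0.6, lam=1.0):
    """Ridge-regression TRF mapping a stimulus feature S(t) to EEG R(t).

    stimulus : (n_samples,) feature vector (e.g., semantic dissimilarity)
    response : (n_samples, n_channels) EEG
    Returns (n_lags, n_channels) TRF weights over lags tmin_s..tmax_s.
    """
    lags = np.arange(int(np.floor(tmin_s * fs)), int(np.ceil(tmax_s * fs)) + 1)
    n = len(stimulus)
    # Lagged (Toeplitz-style) design matrix: column k is S shifted by lags[k].
    X = np.zeros((n, len(lags)))
    for k, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, k] = stimulus[:n - lag]
        else:
            X[:n + lag, k] = stimulus[-lag:]
    # Ridge solution: (X'X + lam * I)^-1 X'R
    XtX = X.T @ X
    return np.linalg.solve(XtX + lam * np.eye(len(lags)), X.T @ response)
```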

In previous work, we attempted to express our TRFs with μV as their unit of measure. However, this relies on a decision to normalize the input stimulus values between some limits and, as such, has been somewhat arbitrary. In the present work, and in line with previous work from other groups, the EEG data on each channel were z-scored prior to estimating the TRF, meaning that the TRFs are ultimately presented in arbitrary units. The colors in the TRF topographic plots can be interpreted as follows: red at a particular latency indicates that, at that poststimulus lag, the EEG voltage is driven in a positive direction by a unit change in semantic dissimilarity; blue means the EEG voltage at that poststimulus lag is driven in a negative direction by a similar change. Thus, given the same normalization strategy for the various speech stimuli used in this study, the TRF responses can be compared in terms of their amplitudes, despite their description in arbitrary units.