Semantic vectors

The notion of a semantic vector was introduced with Latent Semantic Analysis (ref. 2) in 1997, but the most commonly used types only became available recently. We first compared all available types of semantic vectors with regard to how well they predicted human judgments on behavioral tasks, published in ref. 4; see also ref. 3. Two of the semantic vector representations were superior to the others: word2vec (ref. 46) and GloVe (ref. 27). Both methods generate 300-dimensional vectors to represent each word by exploiting the statistics of word co-occurrence in a certain context (typically the span of a sentence, or a window of 5–10 words), tabulated over corpora with tens of billions of words. In word2vec, the representation is learned such that the identity of a word can be predicted given the average semantic vector of the other words in its context (e.g., the 5 words before and the 5 words after). In GloVe, the representation is derived directly from a normalized version of the matrix of word co-occurrences in context. We opted for GloVe for practical reasons, such as the homogeneity of value ranges across dimensions and the vocabulary size, but decoding performance was similar with word2vec vectors. A number of other semantic representations have been put forward recently, but improvements in behavioral or decoding performance have been marginal at best.

Generating vectors to represent sentences is typically done by averaging the vectors of the content words (after dereferencing pronouns; ref. 5), as we did in this paper. Only one method for generating sentence vectors directly is in widespread use: skip-thought vectors. This method aims to represent a sentence in a vector containing enough information to reconstruct the sentences preceding and succeeding it in the training text where it occurs. Skip-thought vectors differ from GloVe in their higher dimensionality (4800 dimensions rather than 300) and their heterogeneity, since dimensions have very different distributions of values instead of being roughly Gaussian distributed. From an engineering standpoint, they are more complicated to decode from imaging data, since ridge regression models are not appropriate for many of the dimensions. Comparing decoded and text-derived vectors is also more involved, since some dimensions are much more influential than others in the computation of typical measures such as correlation or Euclidean distance. We nevertheless reproduced our analysis using skip-thought vectors, with results shown in Supplementary Figure 3; using the same decoder and procedures as with GloVe vectors, the results were virtually the same.
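As a concrete illustration, the averaging step can be sketched in a few lines of Python. The 4-dimensional vectors below are invented for the example; real GloVe vectors have 300 dimensions and would be loaded from the published embedding files.

```python
import numpy as np

# Toy word vectors; values are invented for the example (real GloVe vectors
# are 300-dimensional and loaded from the published embedding files).
glove = {
    "dog":    np.array([0.5, -0.1, 0.3, 0.8]),
    "chased": np.array([0.2,  0.4, -0.6, 0.1]),
    "ball":   np.array([0.7,  0.0, 0.2, -0.3]),
}

def sentence_vector(tokens, vectors):
    """Average the vectors of the content words found in `vectors`;
    function words and out-of-vocabulary tokens are simply skipped."""
    content = [vectors[t] for t in tokens if t in vectors]
    return np.mean(content, axis=0)

v = sentence_vector(["the", "dog", "chased", "the", "ball"], glove)
```

Note that "the" contributes nothing here: only the three content words are averaged.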

Spectral clustering of semantic vectors

In order to select words to use as stimuli, we started with GloVe semantic vectors for the 29,805 basic vocabulary words in ref. 28 for which vectors were available. We then carried out spectral clustering (ref. 29) of those vectors to identify 200 regions (clusters) of related words. We used spectral clustering for two reasons. The first is that traditional clustering algorithms such as k-means, by using Euclidean distance between vectors, implicitly assume a multivariate Gaussian distribution for each cluster, with the same covariance structure across clusters. It is unclear whether this assumption is reasonable here, and our early attempts with k-means produced many clusters that were not readily interpretable. The second reason is that the experiments described in refs. 3,4 showed that cosine similarity (or correlation) between semantic vectors reflects semantic relatedness. Spectral clustering uses the cosine similarity between words to place them in a new space, where words with the same profile of (dis)similarity to all other words are close together, and then performs k-means in that space. This is, intuitively, more robust than simply comparing the vectors of two words when deciding whether to group them together, and the results subjectively bore this out (in terms of the interpretability of the resulting clusters).

The spectral clustering procedure consisted of these steps:

(1) Calculate a 29,805 × 29,805 matrix C containing the cosine similarity for each pair of vectors.
(2) Normalize C to fall in the 0–1 range \(\left( C \leftarrow \frac{C + 1}{2} \right)\) and zero out the diagonal.
(3) Normalize C so that each row adds up to 1.
(4) Compute a 100-dimensional eigendecomposition D of C.
(5) Run k-means (k = 200) on D using squared Euclidean distance and the k-means++ algorithm (a more stable version of k-means, used by default in MATLAB).
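For readers who prefer code, the procedure above can be sketched as follows, assuming a NumPy environment and toy sizes (60 vectors, 10 dimensions, 5 eigenvectors, k = 3) in place of the actual 29,805 × 300 matrix, 100 eigenvectors, and k = 200; for brevity, the k-means++ seeding is replaced by plain random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 29,805 x 300 GloVe matrix: 60 random 10-d vectors.
V = rng.standard_normal((60, 10))

# (1) cosine similarity between every pair of vectors
U = V / np.linalg.norm(V, axis=1, keepdims=True)
C = U @ U.T
# (2) map similarities from [-1, 1] to [0, 1] and zero the diagonal
C = (C + 1) / 2
np.fill_diagonal(C, 0)
# (3) row-normalize so each row sums to 1
C = C / C.sum(axis=1, keepdims=True)
# (4) keep the eigenvectors with the largest-magnitude eigenvalues
w, E = np.linalg.eig(C)
D = np.real(E[:, np.argsort(-np.abs(w))[:5]])
# (5) k-means on D (plain Lloyd iterations; the paper uses k-means++)
k = 3
centers = D[rng.choice(len(D), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((D[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([D[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
```

The cluster assignments end up in `labels`; with the real vocabulary, each label corresponds to one of the 200 clusters of related words.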

Stimulus selection and semantic space coverage

The number of clusters sought was determined by the maximum number of stimuli that could be fit into a single scanning session, with up to 6 repetitions (median cluster size 150 words, range 81–516). As discussed in the main text, almost all the 200 clusters were easy to interpret. A small percentage (~10%) were harder to interpret; these tended to contain (i) infrequent, unrelated words or (ii) extremely frequent, uninformative words. Excluding these left us with 180 clusters (see the “Data Availability” statement for online access to the clusters and stimulus materials). Finally, for each cluster we manually identified a key representative word, which was either the intuitive “name” of the group of words or a prototypical member. The resulting 180 target words consisted of 128 nouns, 22 verbs, 23 adjectives, 6 adverbs, and 1 function word. We selected five additional words from among the 20 most frequent cluster members, based on their prototypicality, to be used in generating the experimental materials.

To quantify the degree to which each dimension was spanned by the 180 key words, we defined a measure of dimension usage (Supplementary Figure 4). We considered a dimension "represented" by a word if the absolute value of that dimension in the word's vector was in the top 10% of magnitudes across the 29,805-word vocabulary. Every dimension was represented by at least 5 words, and most dimensions were represented by 10 to 20 words (median = 16).
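A minimal sketch of this dimension-usage measure, with a toy vocabulary standing in for the 29,805 GloVe vectors (sizes and key-word choices are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version: vocabulary of 1000 words with 20-d vectors, and 50 randomly
# chosen "key words" (the paper uses 29,805 words, 300 dimensions, and
# 180 hand-picked key words).
vocab = rng.standard_normal((1000, 20))
key_idx = rng.choice(1000, 50, replace=False)

# A word "represents" a dimension if its absolute value on that dimension
# is in the top 10% of magnitudes across the whole vocabulary.
threshold = np.quantile(np.abs(vocab), 0.9, axis=0)   # per-dimension cutoff
represented = np.abs(vocab[key_idx]) >= threshold     # 50 x 20 boolean
usage = represented.sum(axis=0)                       # key words per dimension
```

`usage` is the quantity plotted per dimension in Supplementary Figure 4.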

Design of fMRI experiment 1 on words

For each of the 180 target words, we created 6 sentences, 4–11 words long (mean = 6.85, st.dev. = 1.22), each containing the target word used in the intended sense. Almost all sentences (1001/1080; 92.7%) contained at least one other representative word from the same cluster. We further selected, for each of the 180 words, 6 images from the Google Image database. Thus, in the first two paradigms, each word appeared in a different sentence or with a different image across its repetitions (Fig. 3). For the third paradigm, we selected five representative words from the cluster; each word appeared with the same five words across repetitions, placed in a word cloud configuration around it. Please see the Data Availability statement for online access to the materials.

In the sentence paradigm, participants were asked to read the sentences and think about the meaning of the target word (in bold) in the context in which it was used. In the picture paradigm, participants were asked to read the word and think about its meaning in the context of the accompanying image. In the third paradigm, participants were asked to read the target word (bolded, in the center of the word cloud) and think about its meaning in the context of the accompanying words.

Within each scanning session, the 180 words were divided into two sets of 90 (done separately for each participant and paradigm) and distributed across two runs. Thus, it took two runs to get a single repetition of the full set of 180 words. Each participant saw between 4 and 6 repetitions for each of the three paradigms (Supplementary Table 5).

Across paradigms, each stimulus was presented for 3 s followed by a 2 s fixation period. Each run further included three 10 s fixation periods: at the beginning and end of the run, and after the first 45 trials. Each run thus took 8 min. Please see the Data Availability statement for online access to the presentation scripts.

Design of fMRI experiments 2 and 3 on sentences

Experiment 2 used 96 passages, each consisting of 4 sentences about a particular concept, spanning a broad range of content areas from 24 broad topics (e.g., professions, clothing, birds, musical instruments, natural disasters, crimes, etc.), with 4 passages per topic (e.g., clarinet, accordion, piano, and violin for musical instruments; Supplementary Figure 1). All passages were Wikipedia-style texts that provided basic information about the relevant concept. Experiment 3 used 72 passages, each consisting of 3 or 4 sentences about a particular concept. As in experiment 2, the passages spanned a broad range of content areas from 24 broad topics, unrelated to the topics in experiment 2 (e.g., skiing, dreams, opera, bone fractures, etc.), with 3 passages per topic (Supplementary Figure 1). The materials included Wikipedia-style passages (n = 48) and first-/third-person narratives (n = 24). The two experiments were comparable in their within- and between-passage/topic semantic similarities (Fig. 5).

The sentences were 7–18 words long (mean = 11.8, st.dev. = 2.1) in experiment 2, and 5–20 words long (mean = 13.15, st.dev. = 2.92) in experiment 3. The passages consisted of 4 sentences in experiment 2, and of 3 or 4 sentences in experiment 3 (mean = 3.375, st.dev. = 0.49). Sentences were presented in PsychToolbox font size 10 (variable width; an average line on our display fits approximately 60 characters). If a sentence was longer than 50 characters, it was presented on 2 or, occasionally, 3 lines. The set of lines was always centered, both horizontally and vertically.

In both experiments, participants were asked to attentively read the passages, presented one sentence at a time. The passages were divided into 8 sets of 12 (experiment 2) or 9 (experiment 3), corresponding to 8 runs. Thus, for each experiment, it took 8 runs to get a single repetition of the full set of 96/72 passages. Each participant did 3 repetitions (i.e., 24 runs, distributed across 3 scanning sessions). The division of passages into runs and the order was randomized for each participant and scanning session.

Each sentence was shown for 4 s followed by a 4 s fixation period. Each passage was further followed by another 4 s fixation period. Thus, each passage took 28 s (3-sentence passages) or 36 s (4-sentence passages). Each run further included a 10 s fixation at the beginning and end. The runs in experiment 2 were thus 452 s (7 min 32 s). Given that texts differed in length in experiment 3, and to make runs similar in duration, the texts were semi-randomly assigned to runs, so that the first 5 runs consisted of 6 three-sentence passages and 3 four-sentence passages for a total run duration of 296 s (4 min 56 s); and the last 3 runs consisted of 5 three-sentence passages and 4 four-sentence passages for a total run duration of 304 s (5 min 4 s).
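The durations above follow directly from the trial structure; a small sanity check of the arithmetic (8 s per sentence, a 4 s end-of-passage fixation, and 10 s fixations at the start and end of each run):

```python
# Each sentence takes 4 s presentation + 4 s fixation; each passage adds a
# final 4 s fixation; each run adds 10 s fixation at the beginning and end.
def passage_s(n_sentences):
    return n_sentences * (4 + 4) + 4

def run_s(passages):
    return sum(passage_s(n) for n in passages) + 2 * 10

exp2_run = run_s([4] * 12)             # experiment 2: 12 four-sentence passages
exp3_early = run_s([3] * 6 + [4] * 3)  # experiment 3, first 5 runs
exp3_late = run_s([3] * 5 + [4] * 4)   # experiment 3, last 3 runs
```

This reproduces the stated run durations of 452 s, 296 s, and 304 s.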

Please see the Data Availability statement for online access to the materials and presentation scripts.

Participants

Sixteen participants (mean age 27.7, range 21–50, 7 females), all fluent speakers of English (15 native speakers, 1 bilingual with native-like fluency), participated for payment. Participants gave informed consent in accordance with the requirements of the Committee on the Use of Humans as Experimental Subjects (MIT) or Research and Integrity Assurance (Princeton). All 16 participants performed experiment 1 (three 2 h sessions), and the decoding results on their data were used to prioritize subjects for scanning in the other two experiments. Eight of the 16 participants performed experiment 2 (three 2 h sessions), and 6 of the 16 performed experiment 3 (three 2 h sessions). Four additional participants were scanned but excluded from the analyses due to excessive motion and/or sleepiness during the experiment 1 sessions. For two of these (scanned at Princeton), an incorrect echo-planar imaging (EPI) sequence had been used, with no prospective motion correction; the other two were novice subjects (at MIT). These participants were excluded based on (i) visual inspection of framewise displacement (ref. 47) over time, to evaluate whether it exceeded 0.5 mm multiple times over the sentence or picture sessions (since at least those two were needed to train a decoder), and (ii) self-reports of sleepiness/difficulty paying attention.

fMRI data acquisition and processing

Structural and functional data were collected on a whole-body 3-Tesla Siemens Trio scanner with a 32-channel head coil, at the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research at MIT or at the Scully Center for the Neuroscience of Mind and Behavior at Princeton University. The same scanning protocol was used at both sites. T1-weighted structural images were collected in 128 axial slices with 1 mm isotropic voxels (repetition time (TR) = 2530 ms, echo time (TE) = 3.48 ms). Functional, blood oxygenation level-dependent data were acquired using an EPI sequence (with a 90° flip angle and using GRAPPA with an acceleration factor of 2), with the following acquisition parameters: 31 near-axial slices, 4 mm thick, acquired in an interleaved order with a 10% distance factor; 2.1 mm × 2.1 mm in-plane resolution; field of view of 200 mm in the anterior-to-posterior (A>P) phase-encoding direction; matrix size of 96 × 96 voxels; TR of 2000 ms; and TE of 30 ms. Further, prospective acquisition correction (ref. 51) was used to adjust the positions of the gradients based on the participant's motion one TR back. The first 10 s of each run were excluded to allow for steady-state magnetization.

MRI data were analyzed using FSL (http://fsl.fmrib.ox.ac.uk/fsl/; ref. 48) and custom MATLAB scripts. For each participant, we picked the structural scan from the sentence session as a reference and estimated a rigid registration of the structural scans from the other sessions to it. The functional data from the runs in each scanning session were corrected for slice timing, motion, and bias field inhomogeneity, and high-pass filtered (100 s cutoff). They were then registered to the structural scan in their own session, and thence to the reference structural scan (combining the two transformation matrices), and finally resampled into 2 mm isotropic voxels. The reference structural scan was registered to the MNI template (affine registration + nonlinear warp), and the resulting transformation was inverted to generate subject-specific versions of the various atlases and parcellations used. The response to each stimulus was estimated using a general linear model (GLM) in which each stimulus presentation (sentence/word+picture/word+word cloud in experiment 1, and sentence in experiments 2–3) was modeled with a boxcar function convolved with the canonical haemodynamic response function (HRF).

Decoding methodology

The decoder operates by predicting a semantic vector given imaging data. Each dimension is predicted using a separate ridge regression, with the regularization parameter estimated from training data. More formally, given an imaging data matrix X (number of examples × number of voxels; the training set) and the corresponding semantic vector matrix Z (number of examples × number of dimensions), we learn regression coefficients b (a vector with as many entries as voxels) and b0 (a constant, expanded to a vector) that minimize

$$\left\| {\boldsymbol{Xb}} + b_0 - {\boldsymbol{z}} \right\|_2^2 + \lambda \left\| {\boldsymbol{b}} \right\|_2^2$$

for each column z—a semantic dimension—of the Z matrix.

The regularization parameter λ is set separately for each dimension using generalized cross-validation within the training set (ref. 49). Each voxel was mean-normalized across the training stimuli in each imaging experiment, as was each semantic vector dimension.
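A minimal NumPy sketch of this per-dimension ridge decoder (toy sizes; the actual analysis used MATLAB, and the per-dimension λ values would come from generalized cross-validation rather than being fixed at 1 as here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: n training stimuli, v voxels, d semantic dimensions.
n, v, d = 40, 25, 6
X = rng.standard_normal((n, v))       # imaging data (examples x voxels)
Z = rng.standard_normal((n, d))       # semantic vectors (examples x dims)

# Mean-normalize voxels and semantic dimensions across training stimuli;
# with centered data the intercept b0 is zero.
X = X - X.mean(axis=0)
Z = Z - Z.mean(axis=0)

def ridge_fit(X, z, lam):
    """Closed-form ridge solution b = (X'X + lam*I)^-1 X'z."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ z)

# One regression per semantic dimension; lambdas fixed at 1 here as a
# placeholder for the per-dimension generalized cross-validation estimates.
lambdas = np.ones(d)
B = np.column_stack([ridge_fit(X, Z[:, j], lambdas[j]) for j in range(d)])

Z_pred = X @ B                        # decoded semantic vectors
```

At test time, the learned coefficients are applied to held-out (similarly normalized) brain images to produce one decoded vector per stimulus.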

In experiment 1, the decoder was trained within a leave-10-words-out cross-validation procedure. In each fold, the regression parameters for each dimension were learned from the brain images for 170 words, and predicted semantic vectors were generated from the brain images for the 10 left-out words. The voxelwise normalization was carried out using a mean image derived from the training set, which was also subtracted from the test set. The cross-validation procedure was carried out on the data of each of the three paradigms separately, and on the dataset resulting from averaging them; we report results for all of these. This yielded 180 decoded semantic vectors. In experiments 2 and 3, the decoder was trained on brain images for the 180 words from experiment 1 and applied to brain images of 384 and 243 sentences, respectively, resulting in 384 and 243 decoded semantic vectors.

Each decoder was trained on images reduced to a subset of 5000 voxels, approximately 10% of the number remaining after applying a cortical mask. We picked this number as a conservative upper bound on the number of informative voxels, as determined in previous studies (ref. 20). Voxels were selected by the degree to which they were informative about the text-derived semantic vectors for the training set images, as measured by ridge regressions on each voxel and its adjacent three-dimensional (3D) neighbors. Specifically, we learned ridge regression models (regularization parameter set to 1) to predict each semantic dimension from the imaging data of each voxel and its 26 adjacent neighbors in 3D, in cross-validation within the training set. This yielded predicted values for each semantic dimension, which were then correlated with the values in the true semantic vectors. The informativeness score of a voxel was the maximum such correlation across dimensions.
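The voxel-scoring idea can be sketched as follows; for brevity the sketch places voxels on a 1-D line (a voxel and its two immediate neighbors, rather than the 26 adjacent voxels in 3-D), fixes λ = 1 as in the text, and omits the within-training-set cross-validation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: n stimuli, v voxels (on a 1-D line), d semantic dimensions.
n, v, d = 30, 20, 4
X = rng.standard_normal((n, v))
Z = rng.standard_normal((n, d))

def neighborhood(X, i):
    """The voxel and its immediate neighbors (1-D stand-in for the 3-D case)."""
    lo, hi = max(0, i - 1), min(X.shape[1], i + 2)
    return X[:, lo:hi]

def informativeness(X, Z, i, lam=1.0):
    """Max over semantic dimensions of the correlation between ridge
    predictions from the voxel neighborhood and the true values."""
    Xi = neighborhood(X, i)
    B = np.linalg.solve(Xi.T @ Xi + lam * np.eye(Xi.shape[1]), Xi.T @ Z)
    pred = Xi @ B
    r = [np.corrcoef(pred[:, j], Z[:, j])[0, 1] for j in range(Z.shape[1])]
    return max(r)

scores = np.array([informativeness(X, Z, i) for i in range(v)])
top = np.argsort(-scores)[:5]         # keep the most informative voxels
```

In the actual procedure, the 5000 voxels with the highest cross-validated scores are retained for decoder training.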

In experiment 1, voxel selection was done separately for each of the 18 cross-validation folds, with a nested cross-validation inside each 170-concept training set (any two training sets share ~95% of the concepts). In experiments 2 and 3, voxel selection was done using the entire 180-concept dataset used to train the decoder.

Statistical testing of decoding results

To obtain the decoding results reported in Fig. 4a, we carried out pairwise classification of decoded vectors against the corresponding text-derived semantic vectors. The set of pairs considered varies across experiments and classification tasks. For each pair, we calculated the correlations between the two decoded vectors and the two text-derived semantic vectors. Classification was deemed correct if the highest correlation was between a decoded vector and its corresponding text-derived vector. The accuracy values reported are the fractions of correctly classified pairs. For experiment 1, we compared every possible pair out of the 180 words. For experiments 2 and 3, we compared every possible pair of sentences where the sentences came from (i) different topics, (ii) different passages within the same topic, and (iii) the same passage.
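A sketch of the pairwise classification rule, with invented toy vectors (the decoded vectors are simulated as noisy copies of the text-derived ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_correct(dec_a, dec_b, txt_a, txt_b):
    """A pair is classified correctly when the highest of the four
    decoded/text correlations links a decoded vector to its own text vector."""
    corrs = {
        ("a", "a"): np.corrcoef(dec_a, txt_a)[0, 1],
        ("a", "b"): np.corrcoef(dec_a, txt_b)[0, 1],
        ("b", "a"): np.corrcoef(dec_b, txt_a)[0, 1],
        ("b", "b"): np.corrcoef(dec_b, txt_b)[0, 1],
    }
    best = max(corrs, key=corrs.get)
    return best in {("a", "a"), ("b", "b")}

# Toy example: two 50-d text-derived vectors and noisy "decoded" versions.
txt = rng.standard_normal((2, 50))
dec = txt + 0.1 * rng.standard_normal((2, 50))
ok = pair_correct(dec[0], dec[1], txt[0], txt[1])
```

The reported accuracy is simply the fraction of pairs for which this check succeeds.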

All accuracy values were tested with a binomial test (ref. 38), which calculates:

$$P\left( {X \ge {\mathrm{number}}\,{\mathrm{of}}\,{\mathrm{correct}}\,{\mathrm{pairs}} \mid H_0:{\mathrm{classifier}}\,{\mathrm{at}}\,{\mathrm{chance}}\,{\mathrm{level}}} \right)$$

and requires specification of the number of independent pairs being tested. As results are correlated across all pairs that involve the same concept or sentence (they all share the same decoded vector), we used extremely conservative values. For experiment 1, we used 180, the number of distinct words. For experiments 2 and 3, we used (i) the number of passage pairs across different topics, multiplied by the minimum number of sentences per passage, (ii) the number of passage pairs within the same topic, multiplied by number of topics and the minimum number of sentences per passage, and (iii) the number of sentences.

The rank accuracy results reported in Fig. 4b were obtained by comparing each decoded vector to a decoding range, a set of candidate vectors corresponding to all possible stimuli in experiments 1 (180 words), 2 (384 sentences), and 3 (243 sentences). We then calculate the rank of the correct stimulus in that range and average it across all decoded vectors. This average rank is normalized into a rank accuracy score, \(1 - \frac{\left\langle {\mathrm{rank}} \right\rangle - 1}{\#{\mathrm{vectors}}\,{\mathrm{in}}\,{\mathrm{range}} - 1}\).

The rank accuracy score lies in [0,1], with 1 corresponding to the correct stimulus being at the top of the ranking and 0 to it being at the bottom; chance-level performance is 0.5. The rank accuracy score is commonly used in information retrieval in situations where several elements in a range are similar to the correct one; it reduces to the usual accuracy when there are only two elements in the range. The null model for chance-level performance treats each ranking as a multinomial outcome, which is then normalized to a rank accuracy score. By the Central Limit Theorem, the average of the scores has a normal distribution, with mean 0.5 and variance \(\frac{\#{\mathrm{vectors}}\,{\mathrm{in}}\,{\mathrm{range}} + 1}{12\left( \#{\mathrm{vectors}}\,{\mathrm{in}}\,{\mathrm{range}} - 1 \right)\#{\mathrm{tests}}}\). Again, as the outcomes for different sentences in the same passage could be correlated, we used a conservative value for the number of tests (the number of passages rather than the number of sentences).
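Both the rank accuracy score and the variance of its null distribution are simple to compute; a minimal sketch (as a sanity check, with 2 vectors in the range and 1 test the null variance reduces to 0.25, the variance of a fair coin flip):

```python
import numpy as np

def rank_accuracy(ranks, n_vectors):
    """Normalize an average rank (1 = best) into a [0, 1] score; 0.5 is chance."""
    return 1 - (np.mean(ranks) - 1) / (n_vectors - 1)

def null_variance(n_vectors, n_tests):
    """Variance of the average score under uniformly random rankings."""
    return (n_vectors + 1) / (12 * (n_vectors - 1) * n_tests)
```

For example, `rank_accuracy([1], 180)` is 1.0 (correct stimulus ranked first) and `rank_accuracy([180], 180)` is 0.0 (ranked last).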

The word-spotting results were obtained by comparing each decoded vector to the vectors for all words in a basic vocabulary of ~30 K words, calculating the rank accuracy of each word in the corresponding stimulus sentence, and taking the maximum of those as the word-spotting score. Given that, for different subjects, the sentences with high scores might differ (as might the words within each sentence), we opted to compare the distribution of scores in each subject against the distribution under a null model. We obtained the null model in this case by simulating as many multinomial draws as there are words in each sentence, normalizing them, and calculating the word-spotting score of the sentence by taking the highest value. Each simulation run yields a histogram of results, and we average these bin-wise across 1000 runs to obtain the distribution under the null model (by the Central Limit Theorem, this suffices to get a precise estimate of bin counts). This distribution is compared with the average of the subject histograms in Fig. 4c. To produce p-values under the null hypothesis, we tested the histogram of each subject against the null-model histogram using a two-sided Kolmogorov–Smirnov test. Finally, for all three measures we applied a Bonferroni multiple-comparison correction, accounting for the number of subjects and classification tasks.
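The null model for the word-spotting score can be simulated directly; a minimal sketch with illustrative sizes (a 30,000-word vocabulary and a 12-word sentence):

```python
import numpy as np

rng = np.random.default_rng(0)

def null_word_spotting(n_words, vocab=30000, runs=1000):
    """Word-spotting scores for one sentence under the null model: draw a
    uniformly random vocabulary rank for each word in the sentence, convert
    to rank accuracies, and keep the highest per simulated sentence."""
    ranks = rng.integers(1, vocab + 1, size=(runs, n_words))
    scores = 1 - (ranks - 1) / (vocab - 1)
    return scores.max(axis=1)         # one score per simulated sentence

sim = null_word_spotting(n_words=12)
```

Because the score is a maximum over the words in the sentence, its null distribution is concentrated well above 0.5, which is why the observed subject histograms are compared against it rather than against 0.5 directly.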

For the comparison between results obtained by selecting voxels from anywhere in the brain and results obtained when restricting selection to different networks, we used two separate tests. For pairwise accuracy, we used a one-sided paired t-test, with each sample containing the average accuracy for pairs involving each sentence, for all subjects (sample sizes were 3072 and 1458 for experiments 2 and 3, respectively). For the rank accuracy results, we used a simple sign test with samples containing the rank accuracy results for all subjects. When testing for significance, we applied Bonferroni correction taking into account the number of classification tasks (3 pairwise accuracy + 1 rank accuracy), networks (5), and experiments (2).

Model building approaches

In order to provide a more structured perspective on the commonalities and differences between related studies and ours, we highlight the key study characteristics in Supplementary Tables 6 and 7.

Supplementary Table 6 contrasts the studies in terms of (i) the type of model being learned, (ii) what is being predicted, and (iii) the task used to quantitatively evaluate the prediction. In all studies, the model-building (training) stage consists of learning a relationship between the representations of input stimuli and the imaging data. This relationship is then used to make a prediction about new stimuli during the test stage. The prediction is typically either of (i) the imaging data (on a voxel-by-voxel basis) in response to test stimuli, or (ii) the representation of the stimulus presented when the test imaging data were acquired. Both types of prediction lend themselves naturally to pairwise classification as an evaluation task, especially in cross-validation approaches, as it is straightforward to leave out two test stimuli (and the corresponding imaging data) and build a model from the remainder. This task is also appealing because the statistical testing of its results is well understood, given enough precautions to ensure the independence of training and test data (ref. 50). The studies that, like ours, extract a representation of the stimulus allow for a wider range of evaluation strategies beyond pairwise classification, including, for example, comparing the extracted semantic vector to those of hundreds of candidate concepts and sentences using rank accuracy, or generating an approximate reconstruction of the stimulus. For ease of comparison across studies, we focus on classification tasks in Supplementary Table 6.

Supplementary Table 7 contrasts the studies in terms of the type and range of stimuli used. For evaluating the ability of the decoder to generalize to new stimuli, the training and testing data ideally come from separate experiments, and even different experimental paradigms. If this is not feasible, then the most common approach is cross-validation, where a portion of the data is used as training data, and the remainder of the data (from the same experiment) as test data. The latter is the strategy used in all the studies that have included a single experiment. A handful of studies have included two experiments (one for training and one for testing). In contrast, our study consisted of five experiments (three used in combination for training the model and two separate test experiments). For ease of comparison with prior work, we included a cross-validated evaluation within our single-word experiment (experiment 1), in addition to reporting the results for the two separate test experiments (experiments 2 and 3). To our knowledge, our study is the first to show generalization from individual words to sentence stimuli across scanning sessions.

Code availability

The stimulus presentation code is available via the paper website (https://osf.io/crwz7). The only custom code used to produce results was (i) code for training the decoder from imaging data (implementing a well-known regression approach; ref. 51) and (ii) code for identifying informative voxels (a similar regression model applied to voxels and their spatial neighbors). Both functions are available on the paper website.

Data availability

The imaging datasets used for experiments 1, 2, and 3 are publicly available on the paper website (https://osf.io/crwz7). Given the size of the datasets (~150 2 h imaging sessions), we provide MATLAB data files containing only the deconvolved images and the labels for each stimulus (concept or sentence). The raw and processed NIFTI imaging datasets, as well as the associated event files, will be shared via a repository (http://www.openfmri.org) after re-processing. Any updates to the data and scripts will be posted on the paper website (https://osf.io/crwz7).