Participants

Eighteen native Finnish-speaking, right-handed individuals with no history of developmental or acquired language disorders or other neurological disorders participated in the study. The participants were recruited through student mailing lists at Aalto University. One participant chose not to complete all measurement runs and was therefore excluded from data analysis. Thus, the final sample consisted of 17 individuals (mean age = 20.9 years, SD = 3.3 years, min = 18 years, max = 31 years; mean education = 12.4 years, SD = 1.5 years, min = 12 years, max = 18 years; ten identified as female and seven as male). All participants gave written informed consent before participating in the study. The study was approved by the Aalto University Research Ethics Committee.

Stimuli

The stimuli consisted of 540 brief verbal descriptions of 60 target objects in Finnish (9–29 characters including spaces, mean = 17.5, SD = 3.6). Fifty-eight target objects were selected from the CSLB property data set25. We additionally included two target objects that were not part of the CSLB data [forklift (Finnish: ‘trukki’) and metro (subway) (Finnish: ‘metro’)]. One fourth (n = 15) of the target objects fell into each of the following semantic categories: animal, fruit/vegetable, tool and vehicle. We created nine clues (i.e., descriptions) for each target object by translating and adapting semantic features from the CSLB data. For the two objects not included in the CSLB data set, we selected six features from that set that applied to the target object and additionally created three new, highly distinctive features. We also created 29 new clues (5.3% of all clues) in cases where the CSLB data set did not contain enough suitable clues. The first, second and third clues were matched on length across the four semantic categories (pairwise t-tests: all p > 0.59).

The nine clues assigned to each target object were further divided into three clue triplets. When feasible, the clues within a triplet were ordered such that the first clue was the least distinctive (e.g., ‘has four legs’) and the following two clues were increasingly distinctive (e.g., ‘is found in the savannah’ > ‘has a trunk’), based on the CSLB feature norm data25. The purpose of this approach was to ensure that the participants would guess the target object at approximately the same stage (i.e., at the third clue).

Each individual clue was presented at least twice in the fMRI experiment, once in Set 1 and once in Set 2, with the two sets presented on different days. The clue combinations were rearranged such that each clue’s position in a triplet was retained, i.e., the first clue of a triplet in Set 1 was always the first clue of a triplet in Set 2, but the clues it was grouped with were not identical in the two sets. This procedure resulted in six unique clue triplets for each target object, which were presented in six separate blocks. The order of sets (across measurement days) and blocks (within a set) was balanced across subjects. The full list of clues (Set 1) is available at https://aaltoimaginglanguage.github.io/guess/.
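
As an illustration of this regrouping, the sketch below shows how Set 2 triplets could be formed so that each clue keeps its within-triplet position but is combined with different clues than in Set 1. The clue texts are hypothetical examples in the spirit of the elephant clues mentioned above, not the actual stimuli.

```python
# Hypothetical sketch of the Set 2 regrouping (not the actual stimulus lists):
# each clue keeps its within-triplet position (index 0 = least distinctive)
# but is combined with different clues than in Set 1.
set1 = [("has four legs",  "is found in the savannah", "has a trunk"),
        ("is large",       "is gray",                  "has tusks"),
        ("has thick skin", "lives in herds",           "has big ears")]

# Rotate positions 2 and 3 across triplets so that no Set 1 triplet recurs.
set2 = [(set1[i][0], set1[(i + 1) % 3][1], set1[(i + 2) % 3][2])
        for i in range(3)]
```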

Procedure

The fMRI experiment was conducted over two days, with three measurement sessions (i.e., blocks) on each day. We divided the data acquisition into two separate days to ensure that the participants would be able to sustain attention throughout the experiment. The two measurement days were on average 10 days apart (mean = 9.9 days, SD = 7.9, min = 6, max = 35), with each fMRI measurement lasting ca. 45 min in total. Each trial started with a fixation cross (‘+’, duration: 300 ms), after which the clues were presented one after another. Each clue was shown for 1000 ms, and the first two clues were followed by a blank screen for 200 ms. The third clue was followed by a jittered interval (mean = 8.0 s, min = 4.0 s, max = 11.8 s), after which a string of hash characters (‘#################’) was presented for 1000 ms, prompting the participant to overtly name the target object (Fig. 4). The interval between the final clue and the naming prompt was relatively long in order to minimize the overlap between the peaks of the BOLD signals. The naming prompt was followed by a jittered interval (mean = 4.0 s, min = 2.3 s, max = 6.2 s), after which the next trial started. The jittering was generated using efMRI version 9 (Chris Rorden, Columbia, SC, USA, www.mricro.com). The stimuli were presented as black text on a gray background. There were two 18-s rest periods in each measurement session. The rest trials were signaled by a pair of hyphens (‘--’) on which the participant was asked to fixate while remaining still.

Fig. 4 Examples of stimuli and experimental design in fMRI. Three clues were shown one at a time, after which the participants were asked to guess which object they described (e.g., here: an elephant, a banana, an anchor, an airplane). A string of hash characters prompted the participant to utter the name of the target object. The target object itself was never presented to the participants before or during the experiment, either pictorially or as a word, and no feedback regarding correct or incorrect answers was provided.

Functional MRI data acquisition

Participants were scanned with a Siemens Magnetom Skyra 3 T MRI scanner using a custom 30-channel receiver head coil. We acquired echo-planar imaging (EPI) volumes at an axial oblique angle using an acquisition matrix of 64 × 64 with a voxel size of 3.1 mm × 3.1 mm × 3.1 mm. The following acquisition parameters were used: TE = 32 ms, TR = 2.4 s, flip angle = 90°, slices = 41, FOV = 200 mm, phase resolution = 100%. A structural T1-weighted MPRAGE volume was also acquired (TE = 3.3 ms, TR = 1.1 s, slices = 176, FOV = 256 mm, phase resolution = 100%).

The stimuli were controlled using Presentation® 15.0 software (www.neurobs.com) running on a Dell Optiplex 960 PC. The stimuli were projected onto a mirror mounted on the head coil using a Panasonic PT-DZ110XEJ projector with a resolution of 1920 × 1200 and a refresh rate of 60 Hz. Participants’ verbal responses were recorded using an OptoAcoustics (Or-Yehuda, Israel) FOMRI-III optical microphone with OptoActive noise control. The microphone was mounted on the head coil.

Semantic space from text corpus data

The model of the semantic space used in the decoding was estimated from a 1.5-billion-token Internet-derived text corpus of lemmatized Finnish22. The semantic space was built using a word2vec skip-gram model with a maximum context of 5 + 5 words (5 words before and 5 words after the word of interest)22. The skip-gram model is a fast and efficient method for learning dense vector representations of words from large amounts of unstructured text; its objective is to find vector representations that are useful for predicting the surrounding words in a sentence given a target word23,44. The code is available online at https://code.google.com/archive/p/word2vec, and the word vector data set used is available online at http://bionlp-www.utu.fi/fin-vector-space-models/fin-word2vec-lemma.bin. The word vectors of the model have a dimensionality of 300, and they were used in the machine learning analyses and the RSA46. Note that the individual dimensions of the semantic space are not interpretable.
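
For orientation, the published word vectors can be loaded and queried, for example, with the gensim library. This is a minimal sketch; the use of gensim and the example lemmas are illustrative assumptions, not part of the authors' pipeline.

```python
# Minimal sketch: load the lemmatized Finnish word2vec model and retrieve
# a 300-dimensional semantic coordinate for a single lemma.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "fin-word2vec-lemma.bin", binary=True)

coord = vectors["norsu"]                       # 'norsu' (elephant); shape (300,)
print(vectors.similarity("norsu", "kirahvi"))  # cosine similarity to 'kirahvi' (giraffe)
```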

Word2vec was used to acquire altogether six sets of semantic space coordinates: (1) the last clue of the triplet, which was used as the onset for the fMRI response (Clue 3); (2) the sum of the first, second and third clues of the triplet that were used to probe a given target word (Clue 1 + 2 + 3); (3) the target word alone (Target word); and (4) the sum of the semantic coordinates of all features for a given target object available in the CSLB data set (All available features)25, including features that were never presented to the participant. In addition, we generated two models that excluded the clues used to probe the target concept: (5) in one of these models (Mixed clues), we mixed the clue sets across blocks such that the semantic coordinates of a given trial were constructed using the clue features of another trial with the same target item. This way, the clues used to predict the brain activation patterns were not the same as those that had been presented to the participant (e.g., for a trial in which elephant was probed using the clues “has legs”, “is thick-skinned” and “has a long trunk”, we decoded the brain activation patterns using the features “gray”, “herd” and “tusk”, i.e., clues from another block). In the final model (6, All nonclues), we calculated the sum of the semantic coordinates of all features of a given target concept that were not presented in the guessing game task (akin to the All available features model but excluding the nine clue features used in the guessing game task).

The semantic coordinates were built in the following way. For each trial, we used word2vec to extract semantic coordinates for the implied target word, as well as for all corresponding CSLB features and the clues used in the task. In cases where a clue/feature consisted of more than one word, we selected and lemmatized one key word (e.g., has legs → leg) and extracted the corresponding semantic coordinate from the corpus. For models that combined several features (i.e., Clue 1 + 2 + 3, Mixed clues, All nonclues and All available features), we extracted the semantic coordinates of all features’ key words and then calculated the sum of the resulting semantic coordinates (see Fig. 5). Thus, all models used in the single-trial analyses resulted in a 360-by-300 matrix (i.e., number of trials × number of dimensions of the semantic space). In the analysis with averaged data, we used a 60-by-300 matrix (i.e., number of target objects × number of dimensions of the semantic space).
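
A minimal sketch of this construction is given below, assuming the gensim vectors loaded in the previous sketch; the key words are hypothetical examples.

```python
import numpy as np

def semantic_coordinate(key_words, vectors):
    """Sum the 300-dimensional word2vec vectors of the lemmatized key words."""
    return np.sum([vectors[w] for w in key_words], axis=0)

# Hypothetical key words (lemmas) for one elephant trial: leg, savannah, trunk.
coord = semantic_coordinate(["jalka", "savanni", "kärsä"], vectors)  # shape (300,)

# Stacking one such coordinate per trial yields the 360 x 300 model matrix;
# stacking one per target object yields the 60 x 300 matrix for averaged data.
```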

Fig. 5 Examples of how the different models were constructed. a The key word whose semantic coordinates were built using word2vec is shown in boldface. The semantic coordinate was based either on one word (i.e., Clue 3 and Target) or on several words (i.e., Clue 1 + 2 + 3 and All available clues), in which case the final semantic coordinate was the sum of the semantic coordinates of all words in the respective model. b The sum of the resulting semantic coordinates (i.e., one 300-dimensional vector per item) was entered into the zero-shot decoding analysis.

FMRI data preprocessing

The preprocessing was performed using SPM8 software (Wellcome Trust Centre for Neuroimaging, University College London, UK) running on MATLAB (MATLAB 2014a, The MathWorks, Inc., Natick, MA). The EPI volumes were first corrected for slice timing and head motion and coregistered to the structural volume of the same participant. We used a general linear model approach, in which the model contained the head motion and session parameters as nuisance regressors, and high-pass filtering was applied. Each of the target objects in each of the six blocks was modeled by convolving a canonical hemodynamic response function with the onset of the last clue of the triplet. All analyses were run on native-space unsmoothed data. For visualization purposes, the data were coregistered to Montreal Neurological Institute (MNI) reference space60. Anatomical labeling was based on the AAL atlas61 unless otherwise cited.

Zero-shot decoding analyses

The machine learning analyses were run in Python 3 (www.python.org) using the Anaconda3 distribution and the scikit-learn module62. The machine learning models implemented in this study evaluated the contributions of the brain activation patterns to each of the 300 dimensions of the semantic space (Fig. 1). The aim of these analyses was to test whether a statistically significant mapping could be established between the brain activation patterns and the word2vec semantic coordinates.

The models were trained using a subset (n = 58) of the 60 targets and the respective multi-dimensional semantic coordinates such that, in the end, each semantic dimension was associated with a particular weighted activation pattern. For this, we used multiple regression with regularization. The trained model can be used to predict the brain activation patterns of any novel concept outside the training set for which corpus-derived semantic coordinates are available.
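
The training step could be sketched as follows, written in the decoding direction used for the evaluation (voxels to semantic dimensions) and assuming ridge regression as the regularized multiple regression; the authors' exact estimator and regularization settings are not specified here.

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_zero_shot_model(brain_patterns, semantic_coords, alpha=1.0):
    """brain_patterns: (58, n_voxels); semantic_coords: (58, 300).

    One regularized regression is fitted per semantic dimension, so that each
    dimension becomes associated with a weighted activation pattern."""
    model = Ridge(alpha=alpha)            # assumed regularized regression
    model.fit(brain_patterns, semantic_coords)
    return model

# model = train_zero_shot_model(X_train, Y_train)
# predicted = model.predict(X_leftout)    # (2, 300) predicted semantic coordinates
```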

After training, the model was evaluated by comparing the predicted semantic coordinates of the two left-out objects with the original corpus-derived (‘true’) semantic coordinates. The classification outcome was determined using cosine distance. The level of statistical significance was evaluated using a permutation test with 1000 iterations, with randomly selected subjects and the order of the semantic coordinates randomly shuffled across the target objects.
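
One common way to score such a leave-two-out comparison with cosine distance is shown below; this is an assumed reading of the evaluation rule, not the authors' exact code.

```python
from scipy.spatial.distance import cosine

def pair_correct(pred_a, pred_b, true_a, true_b):
    """True if matching each prediction to its own target gives a smaller
    total cosine distance than the swapped assignment (chance level 0.5)."""
    matched = cosine(pred_a, true_a) + cosine(pred_b, true_b)
    swapped = cosine(pred_a, true_b) + cosine(pred_b, true_a)
    return matched < swapped

# Decoding accuracy is the proportion of left-out pairs for which
# pair_correct(...) returns True.
```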

Zero-shot decoding on averaged data

In this analysis, the six repetitions with unique clue triplets for a given target object were averaged into a single BOLD activation map, and voxels were selected using stability selection as described below. The zero-shot decoding model was trained using 58 of the target items, and the trained model was used to predict the semantic coordinates of the two left-out target items. The training and evaluation process was iterated 1770 times to cover all leave-two-out combinations.

We focused the machine learning analysis of averaged data on a specific subset of voxels that showed a consistent activation pattern across the six trials of each target object32,33. First, we masked the native-space beta images using an individual gray matter mask extracted from the SPM segmentation. We then extracted beta values for each voxel of each repeated trial (n = 6) of each object (n = 58, i.e., excluding the two left-out objects at each iteration). Next, we calculated pairwise Pearson correlations across the six repetitions of each target object and averaged the correlations over the 58 target objects in the training set. Finally, the 500 most stable voxels, i.e., those with the highest average correlation, were selected for further analyses.
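
One possible implementation of this stability selection is sketched below (an assumption about the exact computation): for each voxel, the responses to the 58 training objects are correlated between every pair of the six repetitions, and the 500 voxels with the highest mean correlation are retained.

```python
import numpy as np
from itertools import combinations

def select_stable_voxels(betas, n_keep=500):
    """betas: array of shape (n_objects, n_repetitions, n_voxels)."""
    n_obj, n_rep, n_vox = betas.shape
    pairs = list(combinations(range(n_rep), 2))          # 15 repetition pairs
    stability = np.zeros(n_vox)
    for v in range(n_vox):
        corrs = [np.corrcoef(betas[:, i, v], betas[:, j, v])[0, 1]
                 for i, j in pairs]
        stability[v] = np.mean(corrs)
    return np.argsort(stability)[::-1][:n_keep]          # most stable voxels first
```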

Single-trial zero-shot decoding

In the single-trial analysis, no averaging was performed over the six trials of the same target object; instead, each trial using a unique clue triplet was treated as a separate item. The brain activation patterns related to each trial were then used to predict the semantic coordinates (for details, see section: Semantic space from text corpus data). First, the test pair of trials was selected, after which the remaining 5 + 5 trials corresponding to the same two target concepts were removed from the training set. Thus, the zero-shot decoding model was trained on 348 trials, i.e., all 12 trials representing the two targets we tried to predict were excluded from the training set. Note that we did not use stability selection in the single-trial analysis, since there were no repeated trials over which stability selection could sensibly have been performed. Furthermore, as each trial had a different set of clues, we did not want to risk removing this variability.
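
The bookkeeping for this exclusion could look as follows; the sketch and its variable names are illustrative, not the authors' code.

```python
import numpy as np

def single_trial_split(trial_targets, test_targets):
    """trial_targets: length-360 array giving each trial's target object id;
    test_targets: the two target ids of the current leave-two-out pair."""
    trial_targets = np.asarray(trial_targets)
    is_test_target = np.isin(trial_targets, test_targets)
    train_idx = np.where(~is_test_target)[0]   # 348 training trials
    held_out = np.where(is_test_target)[0]     # 12 trials (2 targets x 6 trials)
    return train_idx, held_out
```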

In the last step, we tested which of the trained models (i.e., using the different sets of semantic coordinates described above) provided the best mapping to the observed brain activation patterns. To this end, the decoding accuracies of the different models were compared using pairwise t-tests with Bonferroni correction.

Visualization of the zero-shot results

To demonstrate the mapping between the brain and the semantic space learned by the zero-shot decoding algorithm, we created an interactive visualization (https://aaltoimaginglanguage.github.io/guess/) that shows, for each target object, its coordinates in the semantic space and the corresponding BOLD activation pattern averaged across the six trials. T-distributed stochastic neighbor embedding (t-SNE)63 was used to obtain a two-dimensional visualization of the semantic space, and pycortex64 was used to visualize the BOLD activation pattern. To illustrate that the mapping between the brain and the semantic space is defined at all coordinates, we added 19 new targets (mouse, parrot, chicken, goat, lynx, peach, grapefruit, beetroot, broccoli, lettuce, plane, screw, plate, watch, tape, tram, tank, dinghy, gondola) to the interactive visualization. By reversing the mapping to obtain a linear transformation from the semantic space to the brain65, BOLD activation patterns were predicted for these novel items.
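
The two-dimensional projection could be produced, for instance, with scikit-learn's t-SNE, as in the sketch below; the specific settings (cosine metric, random seed) are assumptions, and the authors' exact t-SNE parameters are not specified here.

```python
import numpy as np
from sklearn.manifold import TSNE

# semantic_matrix: (n_items, 300) word2vec coordinates of the target objects
# (replaced here by random data so that the sketch runs stand-alone).
semantic_matrix = np.random.randn(79, 300)   # 60 targets + 19 added items

embedding = TSNE(n_components=2, metric="cosine",
                 random_state=0).fit_transform(semantic_matrix)
# embedding: (n_items, 2) coordinates for the two-dimensional semantic map.
```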

Representational similarity analysis

In the RSA, we used the single-trial data to maximize comparability with the decoding results. We used searchlight mapping46 and the RSA toolbox66 running on MATLAB 2014a (The MathWorks, Inc., Natick, MA) to find regions where the similarity of activation patterns (activation pattern RDMs) was related to the semantic similarity of the implied target objects (model RDM).

The model RDM was based on the All available features model. That is, the semantic coordinate of each trial was calculated as the sum of the semantic coordinates of all available features of the implied target object (see model 4 above). The resulting model RDM was a 360 × 360 matrix, in which the value in each cell reflects the cosine distance between the semantic coordinates of a pair of trials. The model RDM was compared to activation pattern RDMs, which were constructed for each spherical searchlight (radius = 7 mm) centered on each voxel in the gray matter volume. The activation pattern RDMs were symmetric 360 × 360 matrices, in which the value in each cell reflects the dissimilarity (1 − Pearson correlation) of the BOLD activation patterns between a pair of trials. A whole-brain correlation map was produced by calculating Spearman’s rank correlations between the activation pattern RDMs and the semantic model RDM. The correlations were Fisher transformed in order to make them approximately normally distributed and projected back onto each searchlight’s center voxel.
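
The per-searchlight computation can be summarized by the following sketch, written in Python rather than the MATLAB RSA toolbox actually used; the function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def searchlight_rsa_score(patterns, semantic_coords):
    """patterns: (n_trials, n_voxels) BOLD data within one searchlight;
    semantic_coords: (n_trials, 300) summed word2vec coordinates."""
    brain_rdm = pdist(patterns, metric="correlation")     # 1 - Pearson r
    model_rdm = pdist(semantic_coords, metric="cosine")   # cosine distance
    rho, _ = spearmanr(brain_rdm, model_rdm)              # Spearman rank correlation
    return np.arctanh(rho)       # Fisher z, assigned to the searchlight's center voxel
```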

The correlation maps of each participant were transformed into MNI space and smoothed at 6 FWHM. The resulting normalized and smoothed images of each participant were subjected to a group-level statistical nonparametric mapping analysis (one-sample t-test) using variance smoothing of 6 FWHM and 10,000 permutations (SnPM13, version 13.1.06; http://go.warwick.ac.uk/tenichols/snpm). fMRI analyses are prone to an increased risk of false positives because statistical tests are performed on a very large number of voxels. To address this problem, we report the pseudo-t values that survive the voxel-level FWE-corrected threshold of p < 0.05 (height threshold: pseudo-t = 4.82). The uncorrected pseudo-t maps of the main RSA analysis67 are provided in an online repository: https://aaltoimaginglanguage.github.io/guess/. We also provide the results of alternative RSA analyses using the remaining models applied in the single-trial decoding analyses.

Region of interest analysis

The PRC ROI was based on FreeSurfer’s (https://surfer.nmr.mgh.harvard.edu) probabilistic PRC label68. This label encompasses the medial bank of the collateral sulcus, which corresponds to Brodmann’s cytoarchitectonic field 35, i.e., the transentorhinal cortex69 (see also refs. 47,48). The surface-based labels were converted to volume-based ROIs, after which the resulting ROIs were manually inspected. When necessary, the ROIs were corrected manually such that they continuously covered the entire medial bank of the collateral sulcus.

The BOLD activation maps inside the PRC ROIs were averaged across the six repetitions of each target object. The voxel-wise BOLD signals in the left and right hemispheres were then concatenated, resulting in a matrix with 60 rows (number of target items) and n columns, where n corresponds to the total number of voxels in the left and right PRC ROIs. These data were subjected to the zero-shot decoding scheme and used to predict the semantic coordinates of the 60 items. The semantic coordinates were built using the All available features model (i.e., all available features in the CSLB norm data25).

Code availability

The software and algorithms used are detailed in Supplementary Table 2. The custom code for the zero-shot learning algorithm and the visualization is available at https://aaltoimaginglanguage.github.io/guess/.