Participants

Six healthy volunteers participated (age: 30.8 ± 7.8 years [mean ± SD]; five women and with musical experience, i.e., playing an instrument or singing–age of onset of musical training 6.7 ± 2.1 years; years of formal music training 12.8 ± 7.7 years, self-report of practice hours on their instruments/voice across lifetime summed over all instruments [years × hours per week × 52: 4,424 ± 4,198). All participants had normal hearing and no history of psychiatric or neurological disorders. Written informed consent was obtained from all participants. The study was approved by the Ethics and Scientific committees of the Copa D’Or Hospital, Rio de Janeiro, Brazil (No 442.648) and all experiments were performed in accordance with relevant guidelines and regulations.

Stimuli

Forty pieces with duration of 46 s from different genres (Classical Music, Rock, Pop, Jazz, and Folk with and without lyrics, see Supplementary Table 1) were selected for the fMRI experiment. The selection of these emotional categories was motivated by the fact that both emotions are positively valenced and represent two key dimensions of musical emotion while having a sufficient number of examples for each dimension. We refrained from including more dimensions to keep the duration of experiments within a comfortable range for participants. The emotional categories were confirmed for each piece by four participants during the selection process prior to the experiment and rated by all participants during the experiment (see Supplementary Fig. S3). All pieces were normalized to have the same volume (RMS normalization of Adobe Audition). Next, the 40 pieces were arranged into four different medleys concatenated with a linear fade-out and fade-in of 1 s each and no further rest period. In addition, to provide a more ecological setting, we separated the medleys into two joy- and two tenderness-evoking medleys, the order of music pieces was fixed (see Supplementary Table 1).

Experimental protocol

Each subject listened to the four medleys prior to the experiment and was asked to perform either one of the following two tasks: subjects should feel the emotion evoked by the piece (1) or perform an analytic task, i.e., tracking of harmony changes that prevented them from feeling the emotion (2). Training these tasks also made the music familiar to the participants prior to the scanning sessions. FMRI data were collected from each subject during four sessions on different days. During each session, subjects listened to the four 8-min medleys (in 4 runs), including a 20-s “warm-up” piece (total scanning time during music listening: 2 h 8 min per subject). Before each medley, subjects were instructed to perform either the analytical or the feeling task so that they performed twice each task on each medley during the four scanning sessions/days. The presentation order of different medleys was counterbalanced in different sessions for each participant. The analysis aggregated the data over the two tasks; the effect of the task is beyond the scope of this work. The musical stimuli were delivered with MRI-compatible headphones (MR-Confon, www.mr-confon.de) after equalizing the unequal frequency representation of the headphones with the equalizer function provided in Audacity (http://www.audacity.de). Prior to the experiment, the volume level was adjusted individually to provide the best subjective experience using a 20-s fMRI sequence together with normalized stimuli.

Sequence parameters

Functional images were acquired with a 3T Achieva scanner (Philips Medical Systems) using a T2*-weighted echoplanar (BOLD contrast) sequence (TR = 2000 ms, TE = 22 ms, matrix = 64 × 64, FOV = 240 mm, flip angle = 90°, voxel size 3.75 mm × 3.75 mm, slice thickness = 3.75 mm, gap = 1 mm, 24 slices; 245 volumes per run, four runs on a scanning day). Slices were sampled in equidistant mode to provide optimal musical experience (the low number of slices and the equidistant mode produced a constant background noise that resulted in an overall lower perceived scanning noise). The first 10 volumes corresponded to the warmup piece and the last five volumes were added to account for delays in the hemodynamic response function. Total functional scanning time per session was 32 min. A SENSE factor of 2 and dynamic stabilization were used.

Simultaneous to the fMRI sequence, we measured respiration, electrocardiogram (logged by the Philips Achieva scanner), and galvanic skin response (Brain Products, Germany).

Data preprocessing

All fMRI volumes of the four sessions of each subject were realigned to the first volume of the first run of the first session using the FMRIB Linear Image Registration Tool (FLIRT) from FSL49. Low-frequency drifts were removed with a Savitzky-Golay filter of 242 s and polynomial order three4. Realignment was inspected visually to confirm successful motion correction. Additionally, we used retrospective image correction50 to remove physiological noise in the BOLD signal. The phases of respiration and heartbeat cycles were determined and modelled with a fourth and third-order Fourier expansion, respectively, together with a first-order interaction, yielding 18 parameters. Also, we modelled the respiration and cardiac response function as suggested by previous studies51,52. The corresponding confound regressors were created using the Matlab PhysIO Toolbox53 with open source code available as part of the TAPAS software collection: http://www.translationalneuromodeling.org/tapas/). Galvanic skin response was artifact-corrected in BrainVision Analyzer (Brain Products) and convolved with the standard hemodynamic response function of SPM12. In total, 21 physiological confound regressors were subtracted from the BOLD signal after estimating their contribution by multiple linear regression analysis. The next two steps of preprocessing included global signal correction and temporal Gaussian filter with a width of 5 s12,54. Finally, each run was normalized to have zero mean and unit variance before entering the encoding analysis.

Musical feature extraction

We used an approach similar to that of ref.12 for acoustic feature extraction. Twenty-one acoustic features capturing timbral, rhythmical, and tonal properties were extracted from the four medleys using the MIRtoolbox8. The features were extracted using a frame-by-frame analysis approach commonly used in the field of Music Information Retrieval (MIR). The duration of the frame was 25 ms with a 50% overlap between two adjacent frames for the timbral features and 3 s with a 67% overlap for the rhythmical (pulse clarity) and tonal (key clarity) features. Timbral features included zero-crossing-rate, spectral centroid, brightness (high energy–low energy ratio divided at 1,500 Hz), spectral spread, spectral rolloff, spectral entropy, spectral flatness (Wiener entropy), roughness, root-mean-square (RMS) energy, and Sub-Band Flux (10 coefficients for 10 frequency bands [0–50 Hz; 50–100 Hz; 100–200 Hz; 200–400 Hz; 400–800 Hz; 800–1,600 Hz; 1,600–3,200 Hz; 3,200–6,400 Hz; 6,400–12,800 Hz; > 12,800 Hz]). Features were convolved with the double-gamma hemodynamic response function with parameters for the auditory response as described in ref.55. After convolution features were downsampled by retaining one sample for every 160 samples to match the sampling rate of the fMRI data (TR = 2 s). The same Savitzky-Golay filter applied to fMRI data was used for the convolved musical features. All four medleys were normalized as a whole rather than individually to have zero mean and unit variance.

A priori mask for voxel selection

An a priori mask was built from the AAL atlas including regions 79‒86, i.e., Heschl’s gyrus, superior and middle temporal gyrus, and superior temporal pole. The supramarginal gyrus was included partly, as the mask was dilated with “fslmaths -dilM 7” to be permissive for the voxel selection process, yielding 4,654 voxels at the functional resolution of 3.75 × 3.75 × 4.75 mm³. Next, the mask was transformed to the individual space, using the inverse transformation estimated from the segmentation of the anatomical image (as implemented in SPM12), which had previously been co-registered with the functional images. The mask was used in the main analysis as described in the following section.

Encoding analysis

The voxel-wise encoding model of the musical features was estimated on the training set by multiple linear regression analysis with ordinary least squares method. Regularization was not needed because the number of data points (4 scanning sessions × 38 pieces in training set × 23 TRs for each piece (46 seconds) = 3,496 volumes) was orders higher than the number of model parameters (21). In the training phase, volumes of all runs and sessions were aggregated, except the volumes of the two pieces that were left out for the test phase of the leave-two-music-pieces-out cross-validation scheme. Please note, that data of the repeated presentation of the same music piece were not used in the training data.

Voxel selection

To establish the voxel selection order for the decoding stage, inner training predictions were determined with a five-fold cross-validation scheme based on the training set that consisted of 3,496 data points (leaving-out two pieces). Therefore, the feature selection was based entirely on the training data, independently from the test data (and thus avoiding circular statistics). For each voxel, weights were estimated using 80% of the training set (2,977 time points) and then used to predict the BOLD time series of the 20% left-out data points (699 time points). Temporal correlations between the measured BOLD signal and the predicted time series were then averaged across the five folds. Voxels were ranked according to this calculated average of the five prediction correlation folds and used in this order during the decoding stage.

Decoding stage (Identification)

To eliminate any boundary effect, the volumes of the transition between pieces were left out (three volumes corresponding to the beginning and three volumes corresponding to the end of the piece, counted after considering three volumes for the hemodynamic delay). During the identification stage, the test pieces were predicted individually, yielding a total of 1,560 predictions for all two of 40 pairs (780 pairs, two test pieces). The BOLD signal of the four presentations of the same piece was averaged prior to prediction, yielding a single representation of each piece to predict. Predictions were evaluated using Pearson correlation coefficient by concatenating the spatial-temporal dimensions to a one-dimensional vector. A piece was considered to be correctly identified if the prediction correlation was higher with the correct musical features than with the non-correspondent features of the other piece in the test set.

The null distribution of identification accuracy was obtained by permutation tests. The same decoding procedure as described above was applied, however, during the identification the labels of the pieces were randomly permuted. For 100 voxels and complete stimulus length, the p-value of 0.05 corresponded to 62.5% identification accuracy.

Principal component analysis: Representation of musical features across auditory cortices

The representation of musical features across voxels was assessed with a principal component analysis. To avoid any bias possible introduced by any collinearity between the musical features, temporal correlations were calculated separately between each musical feature and the BOLD time series of each of the 300 best voxels identified during the encoding stage. Then, we calculated the principal components using the function princomp (MATLAB®) over these 21 separately estimated correlations to reveal the musical feature representation across the auditory cortices. The input matrix for the principal component analysis was of size [300 × 21] for each subject. The score matrix returned by the princomp function contains the scores for each voxel for the corresponding principal components represented in the columns. The first and second column were used for visualizing the scores of the first and second principal component, respectively, as shown in Fig. 4B. The first two columns of the loadings (COEFF matrix) of the musical features on the first two principal components are shown in Fig. 4C. We concentrate in the main text on the first two principal components which explain together about 71% of the variance across the 300 voxel weights. However, as parallel analysis and Velicer’s MAP test56 indicated four important components, we included the third and fourth component in the Supplementary Fig. S2. The same analysis was repeated for only the best 100 voxels to assess the similarity of the principal components between the larger (300 voxels) and the smaller (100 voxels) input space (within subject). Similarity across individual PCs was calculated by Pearson correlations between loadings of all pairs of subjects. To further confirm the correspondence between the principal components across subjects, group principal components (PCs) were calculated over concatenated voxels from all subjects yielding an input matrix of size [1800 × 21]. The Gale–Shapley stable marriage algorithm confirmed the same order of the components by matching each of individual components to the group PCs.

Entropy calculation

Shannon entropy of the feature space was calculated for each piece X. First, the similarity between musical feature vectors at all possible time point pairs over the entire stimulus set was calculated by Pearson correlation coefficient. Next, the range of these similarity values was divided into 10 equally spaced bins. These bins were then used to determine Shannon entropy57 for each piece, using all time point pairs belonging to this piece:

$$H(X)=-\sum _{i=1}^{10}{p}_{i}lo{g}_{2}{p}_{i}$$ (1)

where p i is the frequency of a similarity value between time points of piece X in bin i.

In sum, this metric reflects the variability of similarity values between musical features at all time point pairs within a piece.

Voxel dimensions for analysis and visualization

All imaging analyses were performed with the original voxel dimensions in the subject’s space. For display purposes, results were resampled to 1 mm³ and transformed to MNI standard space using the new segmentation algorithm of SPM12 (rev 6470). The visualization of inflated surfaces was done with pycortex (http://github.com/gallantlab/pycortex)58 using trilinear interpolation of the original voxel dimensions. Functional to anatomical alignments were checked visually for each subject by overlaying Heschl’s gyrus (extracted from the Harvard-Oxford-Atlas), transformed onto the functional reference volume of each individual.

Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request. Dynamic visualizations of individual results are available at institutodor.github.io/mirviewer/S1.