Participants

One hundred twenty-two 3.5–12-year-old children (M(s.d.) = 6.7(2.3); 64 females) participated in the study. 110 children were right-handed and 3 were ambidextrous (as reported by a parent or legal guardian). This sample includes 65 children under the age of 6 years (M(s.d.) = 4.82(.81) years; 34 females; 54 RH/3 Ambi); this subset was used to test for neural differences between children who pass (n = 30; M(s.d.) = 5.2(.70); 15 females; 26 RH/2 Ambi) and fail (n = 15; M(s.d.) = 4.08(.56); 6 females; 11 RH/4 LH) false-belief tasks. Twenty children in this subset responded inconsistently to false-belief tasks (M(s.d.) = 4.75(.73); 13 female; 17 RH/1 Ambi). An additional 19 children were recruited but excluded from all analyses because they did not complete the study (n = 12), had language delays (n = 2), or moved excessively during the fMRI scan (n = 5; see fMRI Data Analysis for details). Thirty-three adult participants (ages 18–39 years; M(s.d.) = 24.8(5.3); 20 females; 32 RH/1 LH) additionally participated in the fMRI portion of the study. Child and adult participants were recruited from the local community. All adult participants gave written consent; parent/guardian consent and child assent were obtained for all child participants. Recruitment and experimental protocols were approved by the Committee on the Use of Humans as Experimental Subjects (COUHES) at the Massachusetts Institute of Technology.

fMRI stimuli

Participants watched a silent version of ‘Partly Cloudy’61, a 5.6-min animated movie38. A short description of the plot can be found online (https://www.pixar.com/partly-cloudy#partly-cloudy-1). Previous research suggests that pediatric populations move significantly less during fMRI scans when movie stimuli are used62. The stimulus was preceded by 10 s of rest, and participants were instructed to watch the movie and remain still. Participants aged five and older completed additional tasks prior to viewing this stimulus; these tasks largely involved listening to (children) or reading (adults) stories.

fMRI data acquisition

Prior to the scan, child participants completed a mock scan in order to become acclimated to the scanner environment and sounds, and to learn how to stay still. Children were given the option to hold a large stuffed animal during the fMRI scan in order to feel calm and to prevent fidgeting. An experimenter stood by child participants’ feet, near the entrance of the MRI bore, to ensure that the participant remained awake and attentive to the movie. If this experimenter noticed participant movement, she placed her hand gently on the participant’s leg, as a reminder to stay still.

Whole-brain structural and functional MRI data were acquired on a 3-Tesla Siemens Tim Trio scanner located at the Athinoula A. Martinos Imaging Center at MIT. Children under age 5 years used one of two custom 32-channel phased-array head coils made for younger (n = 3, M(s.d.) = 3.91(.42) years) or older (n = 28, M(s.d.) = 4.07(.42) years) children63; all other participants used the standard Siemens 32-channel head coil. T1-weighted structural images were collected in 176 interleaved sagittal slices with 1 mm isotropic voxels (GRAPPA parallel imaging, acceleration factor of 3; adult coil: FOV: 256 mm; child coils: FOV: 192 mm). Functional data were collected with a gradient-echo EPI sequence sensitive to Blood Oxygen Level Dependent (BOLD) contrast in 32 interleaved near-axial slices aligned with the anterior/posterior commissure, covering the whole brain (EPI factor: 64; TR: 2 s, TE: 30 ms, flip angle: 90°). Because participants were initially recruited for different studies, there are small differences in voxel size and slice gap across participants (3.13 mm isotropic with no slice gap (n = 5 adults, n = 3 7yos, n = 20 8–12yo); 3.13 mm isotropic with 10% slice gap (n = 28 adults); 3 mm isotropic with 20% slice gap (n = 1 3yo, n = 3 4yo, n = 2 7yo, n = 1 9yo); 3 mm isotropic with 10% slice gap (all remaining participants)); all functional data were subsequently upsampled in normalized space to 2 mm isotropic voxels. Prospective acquisition correction was used to adjust the positions of the gradients based on the participant’s head motion one TR back64. 168 volumes were acquired in each run; children under age five completed two functional runs, while older participants completed only one run. For consistency across participants, only the first run of data was analyzed. Four dummy scans were collected to allow the signal to reach steady-state magnetization.

fMRI data analysis

fMRI data were analyzed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm)65 and custom software written in Matlab and R. Functional images were registered to the first image of the run; that image was registered to each participant’s anatomical image, and each participant’s anatomical image was normalized to the Montreal Neurological Institute (MNI) template. This enabled us to use group regions of interest (ROIs) and hypothesis spaces created in adult data sets, and to directly compare responses between child and adult participants. Previous research suggests that anatomical differences between children as young as 7 years and adults are small relative to the resolution of fMRI data, supporting the use of a common space for adults and children of this age (for similar procedures with children under age seven, see refs 21,66,67; for methodological considerations, see ref. 68). Registration of each individual’s brain to the MNI template was visually inspected, including checking the match of the cortical envelope and internal features like the AC–PC and major sulci. All data were smoothed using a Gaussian filter (5 mm kernel).

Artifact timepoints were identified via the ART toolbox (https://www.nitrc.org/projects/artifact_detect/)69 as timepoints for which there was (1) >2 mm composite motion relative to the previous timepoint or (2) a fluctuation in global signal that exceeded a threshold of three s.d. from the mean global signal. Participants were dropped if one-third or more of the collected timepoints were identified as artifact timepoints; this resulted in dropping five child participants from the sample (see Participants). The number of artifact timepoints differed significantly between child and adult participants (Child (n = 122): M(s.d.) = 10.5(10.6), Adult (n = 33): M(s.d.) = 2.8(4), Welch two-sample t-test: t(137.7) = 6.49, p < .000001). Among children, the number of motion artifact timepoints was not correlated with age (Spearman correlation test (n = 122): rs(120) = .02, p = .86) or ToM score (Kendall tau correlation test (n = 122): rk(120) = −.005, p = .94). The number of artifact timepoints did not differ between young (3–5-year-old) children based on false-belief task performance (linear regression tests for the effect of FB-group on number of motion artifact timepoints: NS effect of FB-group (Pass (n = 30) vs. Fail (n = 15)): b = −.04, t = −.12, p = .9; NS effect of FB-group (Pass (n = 30), Inc (n = 20), or Fail (n = 15)): b < .05, p > .9) or response inhibition (linear regression test for the effect of DCCS on number of motion artifact timepoints (n = 64): NS effect of DCCS summary score: b = .16, t = 1.18, p = .25). See Supplementary Fig. 8 for a visualization of the amount of motion per age group. Although the amount of motion was matched across children, and therefore unlikely to drive developmental effects within the child sample, we include the number of motion artifact timepoints as a covariate in all analyses. The number of artifact timepoints is highly correlated with measures of mean translation, rotation, and distance (r > .8). Because this measure is not normally distributed, Spearman correlations were used when including the amount of motion as a covariate in partial correlations.
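The two artifact criteria described above can be sketched as follows. This is an illustrative Python re-implementation, not the ART toolbox itself (which runs in Matlab); the function name, input format, and tie-breaking details are our own.

```python
# Illustrative sketch of the two artifact criteria: (1) >2 mm composite motion
# relative to the previous timepoint, (2) global-signal fluctuation beyond
# three s.d. of the run mean. Names and structure are ours, not the ART API.
from statistics import mean, stdev

def flag_artifacts(composite_motion, global_signal,
                   motion_thresh=2.0, gs_sd_thresh=3.0):
    """Return sorted indices of timepoints flagged as artifacts."""
    mu, sd = mean(global_signal), stdev(global_signal)
    flagged = set()
    for t in range(len(global_signal)):
        # (1) composite motion relative to the previous timepoint
        if t > 0 and abs(composite_motion[t] - composite_motion[t - 1]) > motion_thresh:
            flagged.add(t)
        # (2) global-signal deviation beyond gs_sd_thresh standard deviations
        if abs(global_signal[t] - mu) > gs_sd_thresh * sd:
            flagged.add(t)
    return sorted(flagged)
```

A participant would then be excluded if `len(flag_artifacts(...))` reached one-third of the run length, per the criterion above.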

Region of interest (ROI) analyses were conducted using group ROIs. ToM and pain matrix group ROIs were created in an independent group of adults (n = 20), scanned by Evelina Fedorenko and colleagues. These data were preprocessed and analyzed with procedures identical to those used for participants in the current study. Reverse correlation analyses were conducted in this separate group of adults, using 10 mm group ROIs surrounding peaks reported in previous publications (ToM regions70; Pain matrix71). Seven ToM and nine pain events were identified (ToM: 60 s total, M(s.d.) length: 8.6(4.6)s; Pain: 66 s total, M(s.d.) length: 7.3(4.4)s). We subsequently used a general linear model to analyze the BOLD activity of these participants as a function of condition, using these events. Second-level random effects analyses were used to examine the group-level response to Mental > Pain and Pain > Mental (p < .001, k = 10, uncorrected). We then drew 9 mm spheres surrounding the peak activation in each region, to create new group ROIs that were tailored to the stimulus but defined in an independent sample of adults (see Supplementary Fig. 1 and Supplementary Table 2 for more information on all group ROIs, and Supplementary Fig. 7 for details of the convergence between events across the two adult samples and ROIs).

All timecourse analyses were conducted by extracting the preprocessed timecourse from each voxel per group ROI. We applied nearest neighbor interpolation over artifact timepoints (for methodological considerations on interpolating over artifacts before applying temporal filters, see refs 72,73), and regressed out two kinds of nuisance covariates to reduce the influence of motion artifacts: (1) motion artifact timepoints; and (2) five principal component analysis (PCA)-based noise regressors generated using CompCor within individual subject white matter masks74. White matter masks were eroded by two voxels in each direction, in order to avoid partial voluming with cortex. CompCor regressors were defined using scrubbed data (i.e., artifact timepoints were identified and interpolated over prior to running CompCor).
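As an illustration of the interpolation and nuisance-regression steps, the following Python sketch applies nearest-neighbor interpolation over flagged timepoints and then residualizes a timecourse against a single nuisance regressor by ordinary least squares. The actual pipeline used SPM8 and CompCor in Matlab, with five PCA regressors rather than one; the function names here are ours.

```python
# Sketch of two preprocessing steps: nearest-neighbour interpolation over
# artifact timepoints, then removal of one nuisance regressor by OLS.
def interpolate_artifacts(ts, artifact_idx):
    """Replace each flagged timepoint with the value of its nearest clean neighbour."""
    bad = set(artifact_idx)
    clean = [i for i in range(len(ts)) if i not in bad]
    out = list(ts)
    for i in artifact_idx:
        nearest = min(clean, key=lambda j: abs(j - i))
        out[i] = ts[nearest]
    return out

def regress_out(ts, nuisance):
    """Residualize ts against a single nuisance regressor (OLS with intercept)."""
    n = len(ts)
    mx, my = sum(nuisance) / n, sum(ts) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(nuisance, ts))
    var = sum((x - mx) ** 2 for x in nuisance)
    beta = cov / var
    return [y - my - beta * (x - mx) for x, y in zip(nuisance, ts)]
```

With several nuisance regressors (as in the actual analysis), the same idea generalizes to multiple regression on the full design matrix.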

For inter-region correlation analyses only, we additionally regressed out the raw timecourse extracted from bilateral primary motor cortex (M1). Primary motor cortex ROIs were 10 mm spheres drawn around peak coordinates generated with Neurosynth (http://neurosynth.org/; search term: “primary motor,” forward inference from 273 studies; coordinates: [38,−24,58], [−38,−20,58]). These ROIs are included in the expanded inter-region correlation analysis shown in Supplementary Fig. 4; the bilateral M1 timecourse was not regressed out for this supplemental analysis. However, because this analysis showed that the within-M1 inter-region correlation increases with age among children, we regressed the bilateral M1 timecourse from the ToM and Pain timecourses for the inter-region correlation analyses reported in the main text, to ensure that the age effects in the ToM and pain networks are above and beyond developmental effects present in regions like primary motor cortex, and that within-network correlations are not falsely inflated by commonalities in signal fluctuation across the brain.

The residual timecourses were then high-pass filtered with a cutoff of 100 s. Timecourses from all voxels within an ROI were averaged, creating one timecourse per group ROI, and artifact timepoints were subsequently excluded (NaNed).
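The high-pass filtering step can be illustrated with an SPM-style discrete cosine basis, in which components with periods longer than the cutoff are projected out of the timecourse. The sketch below assumes TR = 2 s and the 100 s cutoff stated above; the basis construction is a simplified illustration, not SPM code.

```python
# Sketch of a DCT-based high-pass filter: remove the mean and all discrete
# cosine components with periods longer than the cutoff (assumed 100 s).
import math

def highpass(ts, tr=2.0, cutoff=100.0):
    n = len(ts)
    # number of low-frequency cosine regressors below the cutoff frequency
    k_max = int(math.floor(2 * n * tr / cutoff))
    basis = [[math.cos(math.pi * k * (2 * t + 1) / (2 * n)) for t in range(n)]
             for k in range(1, k_max + 1)]
    mu = sum(ts) / n
    out = [v - mu for v in ts]          # remove the constant (k = 0) term
    for b in basis:
        # DCT-II basis vectors are orthogonal; project out each component
        coef = sum(v * c for v, c in zip(out, b)) / sum(c * c for c in b)
        out = [v - coef * c for v, c in zip(out, b)]
    return out
```

A pure low-frequency drift (e.g., one full cosine cycle over the run) is removed almost exactly, while fluctuations faster than the cutoff pass through.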

In inter-region correlation analyses, each ROI timecourse was correlated with every other ROI’s timecourse, per subject, and these correlation values were Fisher z-transformed. Within-ToM correlations were the average correlation from each ToM ROI to every other ToM ROI, within-Pain correlations were the average correlation from each Pain ROI to every other Pain ROI, and across-network correlations were the average correlation from each ToM ROI to each Pain ROI. This procedure is similar to that used by ref. 42. In order to test for developmental change in within-network and across-network correlations, we conducted linear regressions to test for (1) significant differences between adults and children, in regressions that included group (child vs. adult) and number of artifact timepoints as predictors, (2) significant effects of age (as a continuous variable), ToM performance, and number of artifact timepoints among children, and (3) significant group differences between children who pass and fail explicit false-belief tasks, including number of artifact timepoints and age as predictors. In order to test whether ToM and pain networks are coherent and specialized early in childhood, we used t-tests to compare within-network versus across-network correlations in 3-year-old children (n = 17). Within-network and across-network correlation measures were normally distributed (p > .22, one-sample Kolmogorov–Smirnov tests), and variance in within-ToM, within-Pain, and across-network correlations did not differ across children and adults, or false-belief passers vs. failers (F-tests to compare two variances: children (n = 122) vs. adults (n = 33): F(32,121) > 1.1, p > .66; pass (n = 30) vs. fail (n = 15): F(14,29) > .78, p > .65).
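The within- and across-network summary measures can be sketched as follows, assuming each network is represented as a mapping from ROI name to timecourse; the Fisher z-transform is atanh(r). Function names are illustrative, not the authors' code.

```python
# Sketch of within-network and across-network correlation summaries:
# correlate every ROI pair, Fisher z-transform, and average.
import math
from itertools import combinations

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def within_network_z(rois):
    """Mean Fisher-z correlation over all ROI pairs inside one network."""
    zs = [math.atanh(pearson(rois[a], rois[b])) for a, b in combinations(rois, 2)]
    return sum(zs) / len(zs)

def across_network_z(net1, net2):
    """Mean Fisher-z correlation over all ROI pairs spanning two networks."""
    zs = [math.atanh(pearson(x, y)) for x in net1.values() for y in net2.values()]
    return sum(zs) / len(zs)
```

The resulting per-subject within-ToM, within-Pain, and across-network values are then the dependent measures in the regressions described above.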

Initial reverse correlation analyses were conducted on adult participants only. Each ROI timecourse was z-normalized, and timecourses within each network were averaged across ROIs, resulting in one timecourse for face regions, ToM regions, and the pain matrix per adult participant. Except for the first 10 timepoints (5 TRs rest, followed by 5 TRs of the movie introduction (Disney castle and Pixar logos)), the residual signal values across adult subjects for each timepoint were tested against baseline (0) using a one-tailed t-test. This procedure is similar to that used by ref. 41. Events were defined as two or more consecutive significantly positive timepoints within each network. Events were rank-ordered according to the average magnitude of response at the peak timepoint in adults, and labeled according to this ordering (e.g., event T01 is the ToM event that evoked the highest magnitude of response in the ToM network).
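The event definition above can be sketched as follows: a one-sample t statistic against zero at each timepoint (across subjects), with runs of two or more consecutive significant timepoints counted as events. For brevity, the sketch thresholds the t statistic against a caller-supplied critical value rather than computing exact one-tailed p-values, and it assumes nonzero variance at every timepoint.

```python
# Sketch of reverse-correlation event detection: per-timepoint one-sample t
# against 0 across subjects; events are runs of >= 2 significant timepoints.
import math
from statistics import mean, stdev

def find_events(data, t_crit):
    """data: list of per-subject timecourses (equal length). Returns (start, end) tuples."""
    n_subj, n_tp = len(data), len(data[0])
    sig = []
    for t in range(n_tp):
        vals = [subj[t] for subj in data]
        t_stat = mean(vals) / (stdev(vals) / math.sqrt(n_subj))
        sig.append(t_stat > t_crit)
    events, start = [], None
    for t, s in enumerate(sig):
        if s and start is None:
            start = t
        elif not s and start is not None:
            if t - start >= 2:          # keep only runs of two or more
                events.append((start, t - 1))
            start = None
    if start is not None and n_tp - start >= 2:
        events.append((start, n_tp - 1))
    return events
```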

In adults, we conducted an overlap analysis to determine whether the number of timepoints labeled as both ToM and pain events was smaller than would be expected by chance. We constructed 1000 permutations of the ToM and pain timecourses that preserved the number and duration of events. The constructed timecourses were 158 TRs in length (the experiment was 168 TRs; the first 10 TRs were excluded from the reverse correlation analysis because the movie started on TR 11). For each permutation, we randomly scrambled the order of ToM and pain events. We then filled in the timepoints between events with zeros, distributing a random proportion of zeros between events such that the total number of zeros equaled the total number of non-event timepoints in the original timecourses (ToM: 125 TRs; Pain: 116 TRs). Events within a timecourse (ToM or Pain) had to be separated by at least one timepoint, since they would otherwise be counted as a single event. The first event of each timecourse could be preceded by no zeros, and the last event of each timecourse could be followed by no zeros. We calculated the number of timepoints tagged as both ToM and pain events in each pair of permuted timecourses, and subsequently calculated the proportion of permutations that resulted in the same or a smaller amount of overlap as observed in the reverse correlation analysis.
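A minimal sketch of the permutation scheme: event durations are preserved, their order is shuffled, and the non-event zeros are distributed randomly into gaps, with at least one zero between events and possibly none at the edges. Details such as the random number generator and function names are our own.

```python
# Sketch of one permutation: shuffle event order, then distribute the fixed
# number of zero (non-event) TRs into gaps (>= 1 between events, 0 allowed at edges).
import random

def permute_timecourse(durations, length, rng):
    """Return a 0/1 list of `length` TRs with shuffled events at random positions."""
    durs = durations[:]
    rng.shuffle(durs)
    n_zeros = length - sum(durs)
    n_gaps = len(durs) + 1
    # every internal gap needs >= 1 zero; edge gaps may be empty
    gaps = [0] + [1] * (len(durs) - 1) + [0]
    for _ in range(n_zeros - sum(gaps)):
        gaps[rng.randrange(n_gaps)] += 1
    tc = []
    for i, d in enumerate(durs):
        tc += [0] * gaps[i] + [1] * d
    tc += [0] * gaps[-1]
    return tc

def overlap_count(tc_a, tc_b):
    """Number of TRs tagged as events in both timecourses."""
    return sum(a and b for a, b in zip(tc_a, tc_b))
```

Repeating this 1000 times for the ToM and pain event sets and tallying `overlap_count` yields the null distribution against which the observed overlap is compared.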

In order to test for developmental effects in the magnitude of response to ToM and pain events, we defined a peak timepoint per event as the timepoint with the highest average signal value in adults, and tested for significant correlations between the magnitude of response at peak timepoints and age (as a continuous variable), including amount of motion (number of artifact timepoints) as a covariate. Because this measure of motion is non-normally distributed, we employed Spearman correlations. For ToM regions only, we used linear regressions to test for a significant relationship between peak magnitude of response and theory of mind behavior (overall, in all children), and to test whether responses at peak timepoints differed between children who pass (n = 30) and fail (n = 15) explicit false-belief tasks. Response magnitude at all peak events was normally distributed (all p > .23, one-sample Kolmogorov–Smirnov test). Response magnitudes showed similar variance across false-belief task passers (n = 30) and failers (n = 15) (F-tests to compare two variances: all F(13,28) > .7, p > .07), with the exception of one event (T03: F(14,28) = .30, p = .02). A permutation test was used to test for group differences in magnitude of response to this event75. We ran the reverse correlation analysis in 3-year-old participants only (n = 17), in order to examine response specificity at this young age and to better understand developmental differences.

Finally, we tested whether the functional maturity of each child’s timecourse responses (i.e., similarity to adults) was related to the inter-region network correlations. We calculated the Pearson correlation between each child’s ToM timecourse (averaged across ROIs) and the average adult ToM timecourse; we similarly calculated the Pearson correlation between each child’s pain matrix timecourse and the average adult pain matrix timecourse. The timecourses used for this analysis were the same as those used for the reverse correlation analysis, prior to z-normalization (TRs 11:168). We tested whether, across children, this measure of functional maturity per network was correlated with within-network and across-network inter-region correlations, or related to ToM behavior. The neural maturity measure was normally distributed in both networks (p > .29, one-sample Kolmogorov–Smirnov test). Variance in this measure in the ToM network did not differ between children who pass (n = 30) and fail (n = 15) false-belief tasks (F test to compare two variances: F(14,29) > 1.00, p > .95). We additionally calculated and report the Pearson correlation between the average timecourse of children in each age group and the average adult timecourse.
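The neural maturity measure can be sketched as the Pearson correlation between one child's network timecourse and the adult group-average timecourse; the function name is illustrative.

```python
# Sketch of the "neural maturity" measure: Pearson correlation between a
# child's timecourse and the average adult timecourse for the same network.
import math

def neural_maturity(child_tc, adult_tcs):
    """Correlate one child's timecourse with the timepoint-wise adult average."""
    n_adults = len(adult_tcs)
    adult_avg = [sum(vals) / n_adults for vals in zip(*adult_tcs)]
    n = len(child_tc)
    mc, ma = sum(child_tc) / n, sum(adult_avg) / n
    num = sum((c - mc) * (a - ma) for c, a in zip(child_tc, adult_avg))
    den = math.sqrt(sum((c - mc) ** 2 for c in child_tc) *
                    sum((a - ma) ** 2 for a in adult_avg))
    return num / den
```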

All of the analyses reported in this manuscript should be considered exploratory, not confirmatory, in that the analyses described here were not chosen prior to data collection, and data collection was not completed with this specific set of analyses in mind. While we deliberately chose this stimulus in order to measure neural responses in very young children (ages 3–4 years), older children visited the lab to participate in a different study, and additionally completed the protocol of the current study. We then recognized the opportunity of analyzing the full cross-sectional dataset, and chose analyses based on the stimulus (time series analyses seemed to utilize more data and be more sensitive than previous analysis methods38), and on recent relevant progress in the field42,76,77.

Behavioral battery

After the scan, all children completed a behavioral task battery including (in order) an explicit theory of mind battery and a measure of nonverbal IQ (under 5 years: WPPSI block design78; over 5 years: nonverbal KBIT-II79). Children under age seven then completed a computerized version of the Dimensional Change Card Sort (DCCS) task as a measure of response inhibition. Performance on the DCCS was captured using the summary score44; one child (an inconsistent FB-task performer) did not complete the DCCS task.

Explicit ToM task and false-belief composite score

All children completed a custom-made explicit ToM battery21 (https://osf.io/G5ZPV/), which involved listening to an experimenter tell a story and answering prediction and explanation questions that required reasoning about the mental states of the characters. Because this task was designed to capture variability in ToM reasoning across a wide age range of children, the questions varied in difficulty. Easier items involved reasoning about similar and diverse desires, true beliefs, and emotion prediction; harder items included reasoning about false beliefs, moral blameworthiness, and second-order false beliefs. Two analogous booklets were used: children ages 3–4 and 10–12 years old listened to a story about students finding snacks, 5-year-old children listened to a story about students finding books, and 7–9-year-old children were split between the booklets (snacks: n = 16; books: n = 33). Different booklets were used because children of different ages participated in different studies that all included the current protocol. However, the two booklets were designed for repeated-measures designs: analogous stories and questions across the two booklets had identical syntax but different semantic content (one story was about helping children find their books, the other about finding snacks). A previous study using the ‘finding books’ booklet supports the validity of this task for capturing theory of mind development in children ages 5–12 years old21. These booklet tasks and instructions are available on the Open Science Framework (https://osf.io/G5ZPV/; DOI: 10.17605/OSF.IO/G5ZPV; ARK: c7605/osf.io/g5zpv).

Each child’s performance on the ToM battery was summarized as the proportion of questions answered correctly, out of 24 matched items (14 prediction items and 10 explanation items). An additional two control items ensured that children were paying attention; after confirming that all children answered these questions correctly, these items were not analyzed further. Children ages 3–5 years old were also categorized based on a false-belief composite score computed from six explicit false-belief questions (4 prediction, 2 explanation) within the ToM booklet. These six questions were chosen because they were canonical explicit false-belief questions describing changes in location or unexpected contents11,12,80. The composite score demonstrated acceptable reliability (Cronbach’s α = .71). Children were categorized as explicit false-belief ‘passers’ if they answered five or six of the six false-belief questions correctly, ‘inconsistent performers’ if they answered three or four correctly, and ‘failers’ if they answered two or fewer correctly.
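The composite categorization rule can be expressed compactly; the function name is ours, and item scores are assumed to be coded 0/1 (correct/incorrect).

```python
# Sketch of the false-belief composite categorization: six 0/1 item scores,
# summed; 5-6 correct = 'pass', 3-4 = 'inconsistent', 0-2 = 'fail'.
def fb_group(item_scores):
    """item_scores: six 0/1 false-belief item scores for one child."""
    assert len(item_scores) == 6
    total = sum(item_scores)
    if total >= 5:
        return "pass"
    if total >= 3:
        return "inconsistent"
    return "fail"
```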

We tested for significant correlations between age, DCCS, and ToM, and for differences in these scores between children who pass and fail false-belief tasks. We used Kendall’s rank correlation tau, given non-normal distributions of the ToM score (Shapiro–Wilk normality test: W = .90, p < .00001) and DCCS score (W = .75, p < .00001), and given the frequency of ties in both of these measures.
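Kendall's tau-b, which corrects the denominator for the ties noted above, can be computed directly from pair counts. This is a generic textbook implementation for illustration, not the authors' analysis code (which was written in R).

```python
# Sketch of Kendall's tau-b: count concordant/discordant pairs and pairs
# tied in x or y; the denominator discounts tied pairs in each variable.
import math

def kendall_tau_b(x, y):
    n = len(x)
    conc = disc = tx = ty = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                tx += 1; ty += 1      # tied in both variables
            elif dx == 0:
                tx += 1               # tied in x only
            elif dy == 0:
                ty += 1               # tied in y only
            elif dx * dy > 0:
                conc += 1             # concordant pair
            else:
                disc += 1             # discordant pair
    n0 = n * (n - 1) / 2
    return (conc - disc) / math.sqrt((n0 - tx) * (n0 - ty))
```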

Code availability

The analysis code used to generate the findings of this study is available from the corresponding author upon request.

Data availability

The fMRI and behavioral data collected and analyzed during the current study are available through the OpenfMRI project (https://openfmri.org/; Link: https://www.openfmri.org/dataset/ds000228/ DOI: 10.5072/FK2V69GD88). The ToM behavioral battery is additionally available through OSF (https://osf.io/G5ZPV/; DOI: 10.17605/OSF.IO/G5ZPV; ARK: c7605/osf.io/g5zpv). The corresponding author welcomes any additional requests for materials or data.