The human voice provides a wealth of social information, including who is speaking. A salient voice in a child’s life is mother’s voice, which guides social function during development. Here we identify brain circuits that are selectively engaged in children by their mother’s voice and show that this brain activity predicts social communication abilities. Nonsense words produced by a child’s mother activate multiple brain systems, including reward, emotion, and face-processing centers, reflecting how widely mother’s voice is broadcast throughout a child’s brain. Importantly, this activity provides a neural fingerprint of children’s social communication abilities. This approach provides a template for investigating social function in clinical disorders, e.g., autism, in which perception of biologically salient voices may be impaired.

The human voice is a critical social cue, and listeners are extremely sensitive to the voices in their environment. One of the most salient voices in a child’s life is mother’s voice: Infants discriminate their mother’s voice from the first days of life, and this stimulus is associated with guiding emotional and social function during development. Little is known regarding the functional circuits that are selectively engaged in children by biologically salient voices such as mother’s voice or whether this brain activity is related to children’s social communication abilities. We used functional MRI to measure brain activity in 24 healthy children (mean age, 10.2 y) while they attended to brief (<1 s) nonsense words produced by their biological mother and two female control voices and explored relationships between speech-evoked neural activity and social function. Compared to female control voices, mother’s voice elicited greater activity in primary auditory regions in the midbrain and cortex; voice-selective superior temporal sulcus (STS); the amygdala, which is crucial for processing of affect; nucleus accumbens and orbitofrontal cortex of the reward circuit; anterior insula and cingulate of the salience network; and a subregion of fusiform gyrus associated with face perception. The strength of brain connectivity between voice-selective STS and reward, affective, salience, memory, and face-processing regions during mother’s voice perception predicted social communication skills. Our findings provide a novel neurobiological template for investigation of typical social development as well as clinical disorders, such as autism, in which perception of biologically and socially salient voices may be impaired.

The human voice is a critical social cue for children. Beyond the semantic information contained in speech, this acoustical signal provides a wealth of socially important information. For example, the human voice provides information regarding who is speaking, a highly salient perceptual feature that has been described as an “auditory face” (1). From the earliest stages of development, human listeners are extremely sensitive to the different voices in their environment (2), reflecting the importance of this social cue to human interaction and communication.

Listeners are particularly sensitive to the familiar voices encountered in their everyday environment, and arguably the most salient vocal source in a child’s life is mother’s voice. Mother’s voice is a constant and familiar presence in a child’s environment, beginning at a time when these vocal sounds and vibrations are conducted through the intrauterine environment to the fetus’ developing auditory pathways (3). Early exposure to mother’s voice facilitates recognition of this sound source and establishes it as a preferred stimulus: From the first days of life, children can identify their mother’s voice and will actively work to hear this sound source in preference to unfamiliar female voices (2). Throughout development, communicative cues in mother’s voice convey critical information to guide behavior (4–6) and learning (7). For example, hearing a recording of one’s own mother’s voice is a source of emotional comfort for preschoolers during stressful situations, even when the content of the speech is meaningless (5). Furthermore, when school-age females experience a stressful situation, hearing their mother’s voice reduces children’s cortisol levels, a biomarker of stress, and increases oxytocin levels, a hormone associated with social bonding (4). These studies have highlighted the profound influence that mother’s voice has on children’s cognitive, emotional, and social function.

Despite the behavioral importance of mother’s voice for critical aspects of emotional and social development, little is known about the mechanisms by which socially salient vocal sources shape the developing brain. Near-infrared spectroscopy (8) and EEG (9) studies examining responses to mother’s voice have focused on young children (≤6 mo old) and have found increased neural activity for mother’s voice compared to female control voices; however, the methods used in these studies are unable to provide detailed information about the brain areas and functional circuits underlying the perception of mother’s voice. Therefore, a critical question remains: What are the neural representations of a biologically salient vocal source in a child’s brain?

To investigate this question, we used functional MRI (fMRI) and measured brain activity in 24 typically developing children (7–12 y old; see Tables S1 and S2) in response to their mother’s voice, an example of a highly socially salient vocal source in a child’s life. An important component of our experimental protocol included vocal recording sessions of each participant’s mother and two female control voices, both of whom are also mothers and were not known to the study participants, for subsequent presentation during functional brain imaging (Fig. 1A; see Methods and Audio Files S1–S6 for audio examples). During the recording sessions, mothers produced three four-syllable nonsense words, which were used to avoid activating semantic systems in the brain (10), thereby enabling a focus on the neural responses to each speaker’s vocal characteristics.

fMRI experimental design, acoustical analyses, and behavioral results. (A) Randomized, rapid event-related design: During fMRI data collection, three auditory nonsense words, produced by three different speakers, were presented to the child participants at a comfortable listening level. The three speakers consisted of the child’s mother and two female control voices. Nonspeech environmental sounds were also presented to enable baseline comparisons for the speech contrasts of interest. All auditory stimuli were 956 ms in duration and were equated for rms amplitude. (B) Acoustical analyses show that vocal samples produced by the participants’ mothers were similar to the female control voice samples for individual acoustical measures. (C) Results from behavioral ratings, collected in an independent cohort of children who did not participate in the fMRI study, show that female control voice samples were rated equally as pleasant as, and more exciting than, the mother’s voice samples. *P < 0.05; NS, not significant. (D) Children who participated in the fMRI study were able to identify their mother’s voice with high levels of accuracy, supporting the sensitivity of these young listeners to their mother’s voice. The horizontal line represents chance level for the mother’s voice identification task.

We had two primary goals for the data analysis. First, we wanted to probe neural representations and circuits elicited by mother’s voice across all participants. We hypothesized that the critical role of mother’s voice in social and emotional learning and its function as a rewarding stimulus would facilitate a distinct representation of this sound source in the minds of children, reflected by neural activity and connectivity patterns in auditory, voice-selective (11), reward (12), and social cognition (13) systems in the brain. The second goal of the analysis was to explore individual differences in brain responses to mother’s voice among children. We reasoned that children’s social communication and language function could potentially account for individual differences in brain responses to mother’s voice. Just as children are known to show a range of cognitive and language abilities, they also demonstrate a range of social abilities (14). Given the important contribution of mother’s voice to social communication (4–6), we hypothesized that the strength of functional connectivity between voice-selective cortex and reward and affective processing regions would predict social function in neurotypical children.

To examine the robustness and reliability of these particular brain connections for predicting social communication scores, we performed a support vector regression (SVR) analysis (24–26). Results showed that the strength of each of these brain connections was a reliable predictor of social communication function (left aSTS gPPI seed to left NAc: r = 0.62, P < 0.001; to right amygdala: r = 0.49, P = 0.004; to right hippocampus: r = 0.59, P < 0.001; to right fusiform: r = 0.54, P = 0.002; right pSTS gPPI seed to right OFC: r = 0.58, P < 0.001; to right AI: r = 0.66, P < 0.001; to right dACC: r = 0.66, P < 0.001).
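For illustration, the prediction step of this analysis can be sketched in a few lines of scikit-learn. This is a minimal, hypothetical reconstruction: it assumes one connectivity beta per child as the predictor and SRS-2 social communication scores as the target, runs on placeholder data, and uses leave-one-out cross-validation, which is an illustrative choice rather than the exact scheme used here.

```python
# Hedged sketch of an SVR-based prediction of social communication scores.
# `conn_beta` stands in for one gPPI connectivity beta per child (e.g., left
# aSTS seed to left NAc); `srs` stands in for SRS-2 scores. Placeholder data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVR

rng = np.random.default_rng(0)
conn_beta = rng.normal(size=24)                     # n = 24 children
srs = 60.0 - 4.0 * conn_beta + rng.normal(size=24)  # lower score = better function

X = conn_beta.reshape(-1, 1)
predicted = np.empty_like(srs)
for train, test in LeaveOneOut().split(X):
    model = SVR(kernel="linear").fit(X[train], srs[train])
    predicted[test] = model.predict(X[test])

r, p = pearsonr(predicted, srs)                     # r(predicted, observed)
print(f"r = {r:.2f}, P = {p:.4f}")
```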

Connectivity of right-hemisphere voice-selective cortex and social communication abilities. The whole-brain connectivity map shows that children’s social communication scores covaried with the strength of functional coupling between the right-hemisphere pSTS (Upper Left) and the OFC of the reward pathway (Upper Right) as well as the AI and dACC of the salience network (Lower). Scatterplots show the distributions and covariation of STS connectivity strength in response to mother’s voice and standardized scores of social function. Greater social communication abilities, reflected by lower social communication scores, are associated with greater brain connectivity between the STS and these brain regions.

Connectivity of left-hemisphere voice-selective cortex and social communication abilities. The whole-brain connectivity map shows that children’s social communication scores covaried with the strength of functional coupling between the left-hemisphere aSTS (Top) and the left-hemisphere NAc (Center Left), right-hemisphere amygdala (Center Right), right-hemisphere hippocampus (Bottom Left), and FG, which overlapped with the FG2 subregion (Bottom Right). Scatterplots show the distributions and covariation of aSTS connectivity strength in response to mother’s voice and standardized scores of social communication abilities. Greater social communication abilities, reflected by lower social communication scores, are associated with greater brain connectivity between the STS and these brain regions. a.u., arbitrary units.

We then investigated individual differences in children’s brain connectivity by performing a regression analysis between the strength of STS connectivity and social and language measures. Results from whole-brain regression analyses showed a striking relationship: Children’s social communication scores, assessed using the Social Responsiveness Scale (SRS-2) (22), covaried with the strength of functional connectivity among multiple STS gPPI seeds and the brain systems identified in the univariate analysis (Fig. 3). Specifically, standardized scores of social communication were correlated with the strength of brain connectivity for the [mother’s voice > female control voices] gPPI contrast between left-hemisphere anterior STS (aSTS) and left-hemisphere NAc of the mesolimbic reward pathway, right-hemisphere amygdala, hippocampus, and fusiform gyrus (FG), which overlapped with the FG2 subregion (19). Moreover, social communication scores were correlated with the strength of brain connectivity between right-hemisphere posterior STS (pSTS) and OFC of the reward system and the AI and dACC of the salience network (Fig. 4). Scatterplots show that both brain connectivity and social communication abilities vary across a range of values and that greater social function, reflected by lower social communication scores, is associated with greater brain connectivity between the STS and these reward, affective, salience, and face-processing regions. In contrast, language abilities, assessed using the Core Language Score from the Clinical Evaluation of Language Fundamentals, 4th edition (CELF-4) (23), correlated only with connectivity between left-hemisphere medial STS (mSTS) and right-hemisphere HG and inferior frontal gyrus (Fig. S6).

The voxelwise analysis of responses to mother’s voice identified multiple functional systems encompassing primary auditory and voice-selective temporal cortex, cortical structures of the visual ventral stream, and heteromodal regions associated with affective and reward function and salience detection. A prominent hypothesis states that the STS is a key node of the speech perception network that connects low-level auditory regions with heteromodal regions important for reward and affective processing of these sounds (21). Therefore, our next analysis examined the functional connectivity of the STS, using the generalized psychophysiological interaction (gPPI) model, with the goal of identifying the brain network that shows greater connectivity during mother’s voice compared to female control voice perception.

We next examined whether the presence of pleasant vocal features in the control voices could elicit increased activity in brain systems activated by mother’s voice (Fig. 2). This analysis was based on independent behavioral ratings of the vocal stimuli, which revealed that vocal pleasantness ratings were significantly greater for one of the female control voices compared to the other control voice (P < 0.001). Both whole-brain and region of interest (ROI) analyses showed no differences in brain response between the two control voices in auditory, voice-selective, face-processing, reward, salience, or default mode brain regions (see SI Methods, Control voice analysis). These results indicate that more intrinsically pleasant vocal characteristics alone are not sufficient to drive brain activity in the wide range of brain systems engaged by mother’s voice.

We next examined whether the extensive brain activation in response to mother’s voice (Fig. 2) is specific to this stimulus or, alternatively, if a similar extent of activation is elicited by female control voices when compared to nonvocal environmental sounds. This particular comparison was used in a seminal study examining the cortical basis of vocal processing in adult listeners (11), and results from the current child sample are consistent with this previous work, showing strong activation in bilateral voice-selective STG and STS (Fig. S5) for this contrast. Moreover, female control voices elicit activity in bilateral amygdala and supramarginal gyri and in left-hemisphere medial HG (mHG). Importantly, this analysis comparing female control voices and environmental sounds failed to identify reward, salience, and face-processing regions or the IC. Together, these results not only demonstrate that responses to mother’s voice are highly distributed throughout a number of brain systems but also show that activity in many of these regions, encompassing reward, salience, and face-processing systems, is specific to mother’s voice.

Signal levels in default mode (Upper) and occipital (Lower) regions in response to mother’s voice and female control voices. Regions were selected for signal-level analysis based on their identification in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration- and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. All ROIs in these bar graphs are 5-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). **P < 0.01.

Signal levels in mesolimbic reward regions (Upper) and the amygdala and salience network (Lower) in response to mother’s voice and female control voices. Regions were selected for signal-level analysis based on their identification in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration- and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. NAc and amygdala ROIs are 2-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast; all other ROIs in these bar graphs are 5-mm spheres centered at the peak for these regions in the [mother’s voice > female control voices] contrast (Fig. 2 in the main text). **P < 0.01; *P < 0.05.

Signal levels in primary auditory regions (Upper) and voice-selective cortex (Lower) in response to mother’s voice and female control voices. Primary auditory regions were identified a priori from previous auditory studies (IC ROIs) (56) and from cytoarchitectonic maps (Te ROIs) (55), and voice-selective cortical regions were selected for signal-level analysis based on previous investigations of voice-selective cortex (bilateral pSTS; refs. 11, 37) or their identification in the [mother’s voice > female control voices] contrast (bilateral mSTS and aSTS; see Fig. 2 in the main text). Values plotted for mother’s voice and female control voices are referenced to duration- and energy-matched environmental sounds, e.g., [mother’s voice > environmental sounds]. The signal-level analysis was performed because stimulus-based differences in fMRI activity can result from a number of different factors. Significant differences were inherent to this ROI analysis, because they are based on results from the whole-brain GLM analysis (52); however, results provide important information regarding the magnitude and sign of fMRI activity. **P < 0.01; *P < 0.05.

Brain activity in response to mother’s voice. Compared to female control voices, mother’s voice elicits greater activity in auditory brain structures in the midbrain and superior temporal cortex (Upper Left), including the bilateral IC and primary auditory cortex (mHG) and a wide extent of voice-selective STG (Upper Center) and STS. Mother’s voice also elicits greater activity in occipital cortex, including fusiform gyrus (FG) (Lower Left), and in heteromodal brain regions serving affective functions, anchored in the amygdala (Upper Right), core structures of the mesolimbic reward system, including NAc, OFC, and vmPFC (Lower Center), and structures of the salience network, including the AI and dACC (Lower Right). No voxels showed greater activity in response to female control voices compared to mother’s voice.

In the fMRI analysis, we first identified brain regions that showed greater activation in response to mother’s voice compared to female control voices. By subtracting out brain activation associated with hearing female control voices producing the same nonsense words (i.e., controlling for low-level acoustical features, phoneme and word-level analysis, auditory attention, and other factors), we estimated brain responses unique to hearing the maternal voice. We found that mother’s voice elicited greater activity in a number of brain systems, encompassing regions important for auditory, voice-selective, reward, social, and visual functions. First, mother’s voice elicited greater activation in primary auditory regions, including bilateral inferior colliculus (IC), the primary midbrain nucleus of the ascending auditory system, and bilateral posteromedial Heschl’s gyrus (HG), which contains the primary auditory cortex (Fig. 2). The auditory association cortex of the superior temporal plane, including bilateral planum temporale and planum polare, also showed significantly greater activation in response to mother’s voice, with slightly greater activation in the right hemisphere. Next, mother’s voice elicited enhanced bilateral activation in voice-selective superior temporal gyrus (STG) and superior temporal sulcus (STS), extending from posterior (y = −48) to anterior (y = 14) aspects of the lateral temporal cortex. Mother’s voice also elicited greater activity in the medial temporal lobe, including the left-hemisphere amygdala, a key node of the affective processing system. Structures of the mesolimbic reward pathway also showed greater activation in response to mother’s voice than to female control voices, including the bilateral nucleus accumbens (NAc) and the ventral putamen of the ventral striatum, orbitofrontal cortex (OFC), and ventromedial prefrontal cortex (vmPFC). Mother’s voice also elicited greater activation in posterior medial cortex bilaterally, encompassing the precuneus and posterior cingulate cortex, a key node of the default mode network (17), a system involved in processing self-referential information (18). Additionally, mother’s voice elicited increased activity in multiple regions of the occipital cortex, including right-hemisphere intracalcarine, lingual, and fusiform cortex, with overlap of the FG2 subregion of the fusiform, which is associated with visual face processing (19). Greater activation also was evident in the anterior insula (AI) and the dorsal anterior cingulate cortex (dACC), two key structures of the salience network (20). Finally, preference for mother’s voice was evident in frontoparietal regions, including right-hemisphere pars opercularis [Brodmann area (BA) 44] and triangularis (BA 45), and in bilateral angular, supramarginal, and precentral gyri. The signal level in the majority of these brain regions showed increased activity relative to baseline in response to mother’s voice (see SI Methods and Figs. S1–S4 for results from the signal-level analysis). No brain regions showed significantly greater activation for female control voices compared to mother’s voice.

We next examined perceptual attributes of the stimuli. Of particular interest are the attributes associated with the pleasantness and excitement (a child-friendly proxy for “engagingness”) of the vocal samples: If the vocal characteristics of the mother’s voice samples are more rewarding and exciting than those of the female control voices, this difference could potentially account for brain effects associated with hearing mother’s voice. We administered a separate behavioral experiment in an independent cohort (i.e., children who did not participate in the fMRI study) of 27 elementary school children (mean age: 11.1 y). In this experiment, participants rated the 24 mother’s voice stimuli used in the fMRI experiment and the two female control stimuli based on how pleasant and exciting these voices sounded (SI Methods). We found no statistical difference between pleasantness ratings for the control voices and the mean pleasantness ratings for the mother’s voice samples (Fig. 1C, Left); however, female control voices showed greater excitement ratings than the mother’s voice samples (P = 0.023) (Fig. 1C, Right). Importantly, these behavioral results show that the vocal qualities of the two female control voices used in the fMRI experiment were equally as pleasant as, and were not less exciting than, the mother’s voice stimuli.

We conducted acoustical analyses and behavioral experiments to characterize the physical and perceptual attributes of mother’s voice and female control voice samples. The goal of these analyses was to determine if there were differences between mother’s voice and female control voice samples that could account for differences in fMRI activity beyond the biological salience of mother’s voice. Human voices are differentiated according to a number of acoustical characteristics, including features that reflect the anatomy of the speaker’s vocal tract, such as the pitch and harmonics of speech, and learned aspects of speech production, which include speech rhythm, rate, and emphasis (15, 16). Acoustical analysis of the vocal samples used in the fMRI scan showed that control voice samples were qualitatively similar to mother’s voice samples across multiple spectrotemporal acoustical features (Fig. 1B).
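As an illustration of this kind of feature extraction, the brief sketch below estimates a fundamental frequency (pitch) track for one vocal sample using librosa; the file name is hypothetical, and librosa stands in for whatever acoustical analysis tools were actually used.

```python
# Hedged sketch of a basic acoustical measurement: estimate the fundamental
# frequency (pitch) track of one vocal sample. File name is hypothetical.
import numpy as np
import librosa

y, sr = librosa.load("mother_teebudishawlt.wav", sr=None)  # 44.1-kHz recording
f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)  # plausible female F0 range
f0 = f0[~np.isnan(f0)]                                     # keep voiced frames only
print(f"median F0 = {np.median(f0):.1f} Hz, F0 SD = {np.std(f0):.1f} Hz")
```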

SI Methods

Participants. The Stanford University Institutional Review Board approved the study protocol. Parental consent and children's assent were obtained for all evaluation procedures, and children were paid for their participation in the study. A total of 32 children were recruited from around the San Francisco Bay Area for this study. Six participants were excluded because of excessive movement, one was excluded because of infrequent contact with their biological mother, who also was unavailable for a vocal recording, and another participant was excluded because of scores in the “severe” range on standardized measures of social function. Parent reports from the final sample of 24 participants showed that these children were raised in families with a wide range of socioeconomic backgrounds, with 25% of participants coming from households earning ≤$100K/y. Socioeconomic status was not correlated with children’s social communication skills as assessed using the SRS-2 (P > 0.50), the key behavioral measure described in the analysis. All children were required to have a full-scale IQ >80, as measured by the WASI (41). All children were right-handed and had no history of neurological, psychiatric, or learning disorders; no personal or family (first-degree) history of developmental cognitive disorders or heritable neuropsychiatric disorders; no evidence of significant difficulty during pregnancy, labor, delivery, or the immediate neonatal period; and no abnormal developmental milestones as determined by neurologic history and examination. Participants were the biological offspring of the mothers whose voices were used in this study (i.e., none of our participants were adopted, and therefore none of the mothers’ voices were from an adoptive mother), and all participants were raised in homes that included their mother. Participants’ neuropsychological and language characteristics are provided in Tables S1 and S2, respectively.

Data Acquisition Parameters. All fMRI data were acquired in a single session at the Richard M. Lucas Center for Imaging at Stanford University. Functional images were acquired on a 3-T Signa scanner (General Electric) using a custom-built head coil. Participants were instructed to stay as still as possible during scanning, and head movement was minimized further by placing memory-foam pillows around the participant’s head. A total of 29 axial slices (4.0-mm thickness, 0.5-mm skip) parallel to the anterior/posterior commissure line and covering the whole brain were imaged by using a T2*-weighted gradient-echo spiral in-out pulse sequence (43) with the following parameters: repetition time (TR) = 3,576 ms; echo time = 30 ms; flip angle = 80°; one interleaf. The 3,576-ms TR is the sum of (i) the stimulus duration of 956 ms; (ii) a 300-ms silent interval buffering the beginning and end of each stimulus presentation (600 ms total of silent buffers) to avoid backward and forward masking effects; (iii) the 2,000-ms volume acquisition time; and (iv) an additional 20-ms silent interval that helped the stimulus computer maintain precise and accurate timing during stimulus presentation. The field of view was 20 cm, and the matrix size was 64 × 64, providing an in-plane spatial resolution of 3.125 mm. Blurring and signal loss arising from field inhomogeneities were reduced by applying an automated high-order shimming method before data acquisition.
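As a quick check of the timing arithmetic, the four intervals listed above sum exactly to the reported TR:

```python
# Arithmetic check of the sparse-sampling TR described above (values in ms).
stimulus = 956        # nonsense-word duration
buffers = 2 * 300     # silent buffers before and after each stimulus
acquisition = 2000    # volume acquisition time
timing_pad = 20       # extra interval for stimulus-computer timing
assert stimulus + buffers + acquisition + timing_pad == 3576
```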

fMRI Task. Auditory stimuli were presented in 10 separate runs, each lasting 4 min. One run consisted of 56 trials of mother’s voice, female control voices, environmental sounds, and catch trials, which were pseudorandomly ordered within each run. Stimulus presentation order was the same for each subject. Each stimulus lasted 956 ms. Before each run, child participants were instructed to play the “kitty cat game” during the fMRI scan. While lying down in the scanner, children were first shown a brief video of a cat and were told that the goal of the cat game was to listen to a variety of sounds, including “voices that may be familiar,” and to push a button on a button box only when they heard kitty cat meows (catch trials). The function of the catch trials was to keep the children alert and engaged during stimulus presentation. During each run, four or five exemplars of each stimulus type (i.e., nonsense words produced by that child's mother, female control voices, and environmental sounds) and three catch trials were presented. At the end of each run, the children were shown another engaging video of a cat. Across the 10 runs, a total of 48 exemplars of each stimulus condition were presented to each subject. Speech stimuli were presented to participants in the scanner using E-Prime v1.0 (Psychology Software Tools, 2002). Participants wore custom-built headphones designed to reduce the background scanner noise to ∼70 A-weighted dB (dBA) (44, 45). Headphone sound levels were calibrated before each data-collection session, and all stimuli were presented at a sound level of 75 dBA. Participants were scanned using an event-related design. Auditory stimuli were presented during silent intervals between volume acquisitions to eliminate the effects of scanner noise on auditory discrimination. One stimulus was presented every 3,576 ms, and the interstimulus interval was not jittered. The total interval between stimulus presentations was 2,620 ms and consisted of a 300-ms silent buffer, 2,000 ms for a volume acquisition, another 300-ms silent buffer, and a 20-ms silent interval that helped the stimulus computer maintain precise and accurate timing during stimulus presentation.

fMRI Preprocessing. fMRI data collected in each of the 10 functional runs were subjected to the following preprocessing procedures. The first five volumes were not analyzed to allow for signal equilibration. A linear shim correction was applied separately for each slice during reconstruction by using a magnetic field map acquired automatically by the pulse sequence at the beginning of the scan. Functional images were first realigned to their first volume using SPM8 analysis software (www.fil.ion.ucl.ac.uk/spm) and then corrected for deviant volumes resulting from spikes in movement. Translational movement in millimeters (x, y, z) was calculated based on the SPM8 parameters from the realignment procedure for each subject. We used a despiking procedure (46) similar to those implemented in the Analysis of Functional NeuroImages (AFNI) toolkit maintained by the National Institute of Mental Health (Bethesda, MD) (47). Volumes with movement exceeding 0.5 voxels (1.562 mm) or spikes in global signal exceeding 5% were interpolated using adjacent scans. The majority of repaired volumes occurred in isolation. After the interpolation procedure, images were further corrected for slice-timing errors. To normalize the functional images to standard Montreal Neurological Institute (MNI) space, each individual's functional images were first coregistered to their structural T1 images, and the T1 images were then transformed to MNI space. The transformation parameters were subsequently applied to the functional images, which were then resampled to 2-mm isotropic voxels. Finally, functional images were smoothed with a 6-mm full-width half-maximum Gaussian kernel to decrease spatial noise prior to statistical analysis.
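The volume-repair rule can be sketched in a few lines of NumPy. The sketch below is an illustration of the stated thresholds (0.5 voxels of scan-to-scan movement or a 5% global-signal spike, repaired by interpolating adjacent scans), not the actual despiking implementation.

```python
# Illustrative sketch of the volume-repair (despiking) rule. `data` is a 4-D
# (x, y, z, t) array and `motion_mm` holds SPM translation parameters (t, 3);
# flagged volumes are replaced by the mean of their temporal neighbors.
import numpy as np

def repair_volumes(data, motion_mm, move_thresh_mm=1.562, spike_frac=0.05):
    gs = data.reshape(-1, data.shape[-1]).mean(axis=0)        # global signal per volume
    gs_jump = np.abs(np.diff(gs, prepend=gs[0])) / gs.mean()  # fractional change
    step = np.linalg.norm(np.diff(motion_mm, axis=0, prepend=motion_mm[:1]), axis=1)
    bad = (step > move_thresh_mm) | (gs_jump > spike_frac)    # 0.5 voxels = 1.562 mm
    repaired = data.copy()
    for t in np.where(bad)[0]:
        lo, hi = max(t - 1, 0), min(t + 1, data.shape[-1] - 1)
        repaired[..., t] = 0.5 * (data[..., lo] + data[..., hi])  # neighbor interpolation
    return repaired, bad
```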

Movement Criteria for Inclusion in fMRI Analysis. For inclusion in the fMRI analysis, we required that each functional run have a maximum scan-to-scan movement of <6 mm and that no more than 15% of volumes were corrected in the despiking procedure. Moreover, we required that all individual subject data included in the analysis consist of at least seven functional runs that met our criteria for scan-to-scan movement and percentage of volumes corrected; subjects who had fewer than seven functional runs that met our movement criteria were not included in the data analysis. All 24 participants included in the analysis had at least seven functional runs that met our movement criteria. Fifteen of the participants had 10 runs of data that met these movement criteria; two subjects had nine runs of data that met movement criteria; five subjects had eight runs of data; and two subjects had seven runs that met criteria.
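Expressed as code, these inclusion rules reduce to two simple predicates; the per-run data layout below is a hypothetical illustration.

```python
# Sketch of the run- and subject-level inclusion rules described above.
# `runs` is a hypothetical list of (max_scan_to_scan_mm, fraction_repaired)
# tuples, one per functional run.
def run_passes(max_scan_to_scan_mm, fraction_repaired):
    return max_scan_to_scan_mm < 6.0 and fraction_repaired <= 0.15

def subject_included(runs):
    good_runs = [r for r in runs if run_passes(*r)]
    return len(good_runs) >= 7          # at least seven usable runs required
```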

Voxelwise Analysis of fMRI Activation. The goal of the voxelwise analysis of fMRI activation was to identify brain regions that showed differential activity levels in response to mother’s voice, female control voices, and environmental sounds. Brain activation related to each speech task condition was first modeled at the individual subject level using boxcar functions with a canonical hemodynamic response function and a temporal derivative to account for voxelwise latency differences in hemodynamic response. Environmental sounds were not modeled to avoid collinearity, and this stimulus served as the baseline condition. Low-frequency drifts at each voxel were removed using a high-pass filter (0.5 cycles/min), and serial correlations were accounted for by modeling the fMRI time series as a first-degree autoregressive process (48). Voxelwise t statistic maps for each condition were generated for each participant using the general linear model (GLM) along with the respective contrast images. Group-level activation was determined using individual-subject contrast images and second-level one-sample t tests. The main contrasts of interest were [mother’s voice vs. female control voices], [female control voices vs. mother’s voice], [female control voices vs. environmental sounds], [female control voice 1 vs. female control voice 2], and [female control voice 2 vs. female control voice 1]. Significant clusters of activation were determined using a voxelwise statistical height threshold of P < 0.01, with familywise error corrections for multiple spatial comparisons (P < 0.01; 128 voxels) determined by Monte Carlo simulations (49, 50) implemented in a custom Matlab script. To examine GLM results in the IC, a small subcortical brain structure, we used a small-volume correction at P < 0.01. To define specific cortical regions, we used the Harvard–Oxford probabilistic structural atlas (51) with a probability threshold of 25%.
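A rough nilearn re-expression of the subject-level model is sketched below; the original analysis was run in SPM8, so this is an approximate stand-in, with hypothetical file names and illustrative onsets.

```python
# Hedged nilearn sketch of the subject-level GLM described above (an
# approximate stand-in for the SPM8 analysis; file names are hypothetical).
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

events = pd.DataFrame({
    "onset":      [3.576, 7.152, 10.728],         # illustrative onsets (s)
    "duration":   [0.956] * 3,                    # 956-ms stimuli
    "trial_type": ["mother", "control", "catch"], # environmental sounds unmodeled (baseline)
})

model = FirstLevelModel(
    t_r=3.576,
    hrf_model="spm + derivative",  # canonical HRF plus temporal derivative
    noise_model="ar1",             # first-degree autoregressive serial correlations
    high_pass=0.5 / 60.0,          # 0.5 cycles/min expressed in Hz
    smoothing_fwhm=6,
)
model = model.fit("run01_bold.nii.gz", events=events)
z_map = model.compute_contrast("mother - control")  # [mother's voice > control voices]
```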

Signal-Level Analysis. Group mean activation differences for key brain regions identified in the whole-brain univariate analysis were calculated to examine the basis for the [mother’s voice > female control voices] group differences (Fig. 2). This analysis was performed because stimulus differences can result from a number of different factors. For example, both mother’s voice and female control voices could elicit reduced activity relative to baseline, and significant stimulus differences could be driven by greater negative activation in response to female control voices. Significant stimulus differences were inherent to this ROI analysis, because they are based on results from the whole-brain GLM analysis (52); however, the results provide important information regarding the magnitude and sign of results in response to both stimulus conditions. The baseline for this analysis was calculated as the brain response to environmental sounds. A number of ROIs were constructed using coordinates reported in previous studies: IC ROIs were 5-mm spheres centered at ±6, −33, −11 (53, 54); primary auditory cortical ROIs (Te1.0, Te1.1, and Te1.2) were identified a priori from cytoarchitectonic maps (55), and bilateral pSTS coordinates were identified from previous investigations of voice-selective cortex (11, 37). All other ROI coordinates used in the signal-level analysis were based on peaks identified in the [mother’s voice > female control voices] group map. This analysis included 13 ROIs in bilateral STC, nine ROIs in bilateral frontal cortex, seven ROIs in bilateral parietal cortex, three ROIs in bilateral occipital cortex, one ROI in the anterior cingulate, and five subcortical structures. Cortical ROIs were defined as 5-mm spheres, and subcortical ROIs were 2-mm spheres, centered at the peaks in the [mother’s voice > female control voices] group map. Signal level was calculated by extracting the β-value from individual subjects’ contrast maps for the [mother’s voice > environmental sounds] and [female control voices > environmental sounds] comparisons. The mean β-value within each ROI was computed for both contrasts in all subjects. The group mean β and its SE for each ROI are plotted in Figs. S1–S4.
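In nilearn terms, the ROI extraction amounts to averaging contrast betas within spheres at the reported peaks; the sketch below uses the IC coordinates given above, with hypothetical contrast-image file names.

```python
# Sketch of the signal-level ROI analysis: mean contrast betas within spheres
# centered at reported peaks. IC coordinates are from the text; the contrast
# image file names are hypothetical.
from nilearn.maskers import NiftiSpheresMasker

ic_seeds = [(-6, -33, -11), (6, -33, -11)]        # bilateral IC, 5-mm spheres
masker = NiftiSpheresMasker(seeds=ic_seeds, radius=5.0)

# One value per ROI and contrast, each referenced to environmental sounds
beta_mother = masker.fit_transform("con_mother_gt_env.nii.gz")
beta_control = masker.fit_transform("con_control_gt_env.nii.gz")
```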

Effective Connectivity Analysis. Effective connectivity analysis was performed using gPPI (42), a method more sensitive than PPI to context-dependent differences in connectivity. At the individual subject level, the time series from the seed region is first deconvolved to uncover neuronal activity and then multiplied with the task design waveforms to form an interaction term. This interaction term is then convolved with the hemodynamic response function (HRF) to form the gPPI regressor, and the resulting time series is regressed against all other voxels in the brain. The goal of this analysis was to examine connectivity patterns of the voice-selective network, with a focus on voice-selective temporal cortex, which is hypothesized to be a hub of the network linking auditory regions with heteromodal regions important for reward and affective processing of these sounds (1, 21). Therefore, we constructed four STS/STG ROIs that were identified from the univariate analysis [mother’s voice > female control voices] group t map (Fig. 2; mSTS and aSTS/STG) and two ROIs that were identified from previous investigations of voice-selective cortex (pSTS) (11, 37). For the STS regions identified from the univariate analysis, we identified peaks in this t map for mSTS and aSTS/STG regions bilaterally and constructed nonoverlapping 5-mm spherical ROIs centered at these peaks. These six ROIs then were used as seeds in six separate whole-brain gPPI models. Significant clusters of activation were determined using a voxelwise statistical height threshold of P < 0.01, with familywise error corrections for multiple spatial comparisons (P < 0.01; 128 voxels) determined using Monte Carlo simulations (49, 50).
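The logic of a gPPI regressor can be sketched conceptually in NumPy: the (deconvolved) seed activity is multiplied by the task time course and reconvolved with the HRF. The sketch below assumes the deconvolution step, which the gPPI toolbox performs properly, has already been done, and uses a simplified HRF.

```python
# Simplified sketch of forming one gPPI regressor; an illustration of the
# interaction-and-reconvolution logic, not a usable pipeline.
import numpy as np
from scipy.stats import gamma

def spm_like_hrf(tr, length_s=32.0):
    """Double-gamma HRF sampled at the TR (a simplified SPM-style shape)."""
    t = np.arange(0.0, length_s, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

def gppi_regressor(seed_neuronal, task_boxcar, tr):
    """seed_neuronal: deconvolved seed activity per scan; task_boxcar: 0/1 per scan."""
    interaction = seed_neuronal * task_boxcar   # psychophysiological product
    return np.convolve(interaction, spm_like_hrf(tr))[: len(interaction)]
```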

Brain-Behavior Analysis. Regression analysis was used to examine the relationship between brain signatures of mother’s voice perception and social and language skills. Social function was assessed using the Social Communication subscale of the SRS-2 (22). For our measure of language function, we used the CELF-4 (23), a standard instrument for measuring language function in neurotypical children. Regression analyses were conducted using the Core Language score of the CELF, a measure of general language ability. Brain-behavior relationships were examined using analysis of both activation levels and effective connectivity. We first performed a voxelwise regression analysis in which the relation between fMRI activity and social and language measures was examined using images contrasting mother’s voice vs. female control voices. We then performed a voxelwise regression analysis between STC connectivity and standardized social and language measures using gPPI images generated for each participant by contrasting responses to mother's voice vs. responses to female control voices. Significant clusters were determined using a voxelwise statistical height threshold of P < 0.01, with familywise error corrections for multiple spatial comparisons (P < 0.01; 128 voxels) determined using Monte Carlo simulations (49, 50).
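The group-level covariation analysis can be sketched with nilearn's second-level GLM as a stand-in for the SPM analysis; file names and the score file are hypothetical placeholders.

```python
# Sketch of the voxelwise brain-behavior regression with nilearn, standing in
# for the SPM analysis. File names and the score file are hypothetical.
import numpy as np
import pandas as pd
from nilearn.glm.second_level import SecondLevelModel

gppi_maps = [f"sub{i:02d}_gppi_mother_gt_control.nii.gz" for i in range(1, 25)]
srs_scores = np.loadtxt("srs2_social_communication.txt")   # one score per child

design = pd.DataFrame({"srs2_social": srs_scores, "intercept": np.ones(24)})
model = SecondLevelModel().fit(gppi_maps, design_matrix=design)
z_map = model.compute_contrast("srs2_social")              # covariation with SRS-2
```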

Functional Brain Connectivity and Prediction of Social Function. To examine the robustness and reliability of brain connectivity between STS and reward, affective, salience detection, and face-processing brain regions for predicting social communication scores, we performed a confirmatory cross-validation (CV) analysis that employs a machine-learning approach with balanced fourfold CV combined with linear regression (25). In this analysis, we extracted individual subject connectivity beta values, taken from the [mother’s voice > female control voices] gPPI contrast, in left-hemisphere NAc, right-hemisphere amygdala, fusiform cortex, and hippocampus (i.e., the left-hemisphere aSTS gPPI seed) and in right-hemisphere OFC, AI, and dACC (the right-hemisphere pSTS gPPI seed). Mean gPPI beta values for each brain connection (e.g., left-hemisphere aSTS seed to left-hemisphere NAc) were separately entered as the independent variable in a linear regression analysis with SRS-2 social communication standard scores as the dependent variable. First, r(predicted, observed), a measure of how well the independent variable predicts the dependent variable, was estimated using a balanced fourfold CV procedure. Data were divided into four folds so that the distributions of dependent and independent variables were balanced across folds. Data were randomly assigned to four folds, and the independent and dependent variables were tested in one-way ANOVAs, repeating as necessary until both ANOVAs were nonsignificant to guarantee balance across the folds. A linear regression model was built using three folds, leaving out the fourth, and this model was used to predict the data in the omitted fold. This procedure was repeated four times to compute a final r(predicted, observed) representing the correlation between the data predicted by the regression model and the observed data. Finally, the statistical significance of the model was assessed using a nonparametric testing approach. The empirical null distribution of r(predicted, observed) was estimated by generating 1,000 surrogate datasets under the null hypothesis that there was no association between social communication subscores and brain connectivity.
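A minimal sketch of this procedure, assuming one connectivity beta and one SRS-2 score per child, is given below; the ANOVA-based fold balancing is implemented as a simple resample-until-balanced loop.

```python
# Sketch of balanced fourfold CV with linear regression and a permutation
# null, following the description above. Inputs `x` (connectivity betas) and
# `y` (SRS-2 scores) are 1-D arrays of equal length.
import numpy as np
from scipy.stats import f_oneway, pearsonr
from sklearn.linear_model import LinearRegression

def balanced_folds(x, y, n_folds=4, alpha=0.05, rng=None):
    rng = rng or np.random.default_rng()
    while True:                                  # reshuffle until folds are balanced
        folds = rng.permutation(np.arange(len(x)) % n_folds)
        gx = [x[folds == k] for k in range(n_folds)]
        gy = [y[folds == k] for k in range(n_folds)]
        if f_oneway(*gx).pvalue > alpha and f_oneway(*gy).pvalue > alpha:
            return folds

def cv_prediction_r(x, y, n_folds=4, rng=None):
    folds = balanced_folds(x, y, n_folds, rng=rng)
    pred = np.empty_like(y, dtype=float)
    for k in range(n_folds):                     # train on three folds, predict the fourth
        train, test = folds != k, folds == k
        pred[test] = LinearRegression().fit(x[train, None], y[train]).predict(x[test, None])
    return pearsonr(pred, y)[0]                  # r(predicted, observed)

def permutation_p(x, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = cv_prediction_r(x, y, rng=rng)
    null = np.array([cv_prediction_r(x, rng.permutation(y), rng=rng) for _ in range(n_perm)])
    return observed, float(np.mean(null >= observed))   # nonparametric P value
```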

Stimulus Design Considerations. Previous studies investigating the perception (2, 5) and neural bases (8, 9) of mother’s voice processing have used a design in which one mother’s voice serves as a control voice for another participant. However, for a number of reasons, in this study we used a design in which all participants heard the same two control voices. First, we wanted to be able to perform analyses comparing brain responses between the two control voices (see Results, Analysis of Control Voices in the main text), which would not have been possible had the participants heard different control voices. There also was an important practical limitation with using mother’s voices as control voices for other children: Although we make every effort to recruit children from a variety of communities in the San Francisco Bay Area, some level of recruitment occurs through contact with specific schools, and in other instances our participants refer their friends to our laboratory for inclusion in our studies. In these cases, it is a reasonable possibility that our participants may know other mothers involved in the study and therefore may be familiar with these mothers’ voices; that familiarity would limit the control we were seeking in our control voices. Importantly, the Health Insurance Portability and Accountability Act (HIPAA) guidelines are explicit that participant information is confidential, and therefore there would be no way to probe whether a child knew any of the other families involved in the study. Given these analytic and practical considerations, we concluded that it would be best to use the same two control voices, which we knew were unfamiliar to the participants, for all participants’ data collection.

Stimulus Recording. Recordings of each mother were made individually while her child was undergoing neuropsychological testing. Mother’s voice stimuli and control voices were recorded in a quiet conference room using a Shure PG27-USB condenser microphone connected to a MacBook Air laptop computer. The audio signal was digitized at a sampling rate of 44.1 kHz with 16-bit resolution. Mothers were positioned in the conference room to prevent early sound wave reflections from contaminating the recordings. To provide a natural speech context for the recording of each nonsense word, mothers were instructed to repeat three sentences, each of which contained one of the nonsense words, during the recording. The first word of each of these sentences was their child’s name, which was followed by the words “that is a,” followed by one of the three nonsense words. A hypothetical example of a sentence spoken by a mother for the recording was “Johnny, that is a keebudishawlt.” Before beginning the recording, mothers were instructed on how to produce these nonsense words by repeating them to the experimenter until the mothers had reached proficiency. Importantly, mothers were instructed to say these sentences using the tone of voice they would use when speaking with their child during an engaging and enjoyable shared learning experience (e.g., if their child asked them to identify an item at a museum). The vocal recording session resulted in digitized recordings of the mothers repeating each of the three sentences ∼30 times to ensure multiple high-quality samples of each nonsense word for each mother. A second class of stimuli included in the study was nonspeech environmental sounds. These sounds, which included brief recordings of laundry machines, dishwashers, and other household sounds, were taken from a professional sound effects library.

Stimulus Postprocessing. The goal of stimulus postprocessing was to isolate the three nonsense words from the sentences that each mother spoke during the recording session and to normalize them for duration and rms amplitude for inclusion in the fMRI stimulus presentation protocol, the pleasantness and excitement ratings experiment, and the mother’s voice identification task. First, a digital sound editor (Audacity: https://sourceforge.net/projects/audacity/) was used to isolate each utterance of the three nonsense words from the sentences spoken by each mother. The three best versions of each nonsense word were selected based on the audio and vocal quality of the utterances (i.e., eliminating versions that were mispronounced, included vocal creak, or were otherwise not ideal exemplars of the nonsense words). These nine nonsense words then were normalized to 956 ms in duration, the mean duration of the nonsense words produced by the female control voices, using Praat software, as in previous studies (56). A 10-ms fade (ramp and damp) was performed on each stimulus to prevent click-like sounds at the beginning and end of the stimulus, and then stimuli were equated for rms amplitude. These final stimuli were evaluated for audibility and clarity to ensure that postprocessing manipulations had not introduced any artifacts into the samples. The same process was performed on the control voices and environmental sounds to ensure that all stimuli presented in the fMRI experiment were the same duration and rms amplitude.
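The amplitude-related postprocessing steps can be sketched in NumPy, shown below; the duration normalization itself was performed in Praat and is not reproduced, and the target rms value is an arbitrary illustrative choice.

```python
# Sketch of the 10-ms onset/offset ramps and rms equating described above.
# Assumes a mono float waveform at 44.1 kHz; `target_rms` is illustrative.
import numpy as np

def fade_and_equate_rms(wave, sr=44100, fade_ms=10, target_rms=0.05):
    n = int(sr * fade_ms / 1000)                 # samples in a 10-ms ramp
    out = wave.astype(float).copy()
    out[:n] *= np.linspace(0.0, 1.0, n)          # onset ramp
    out[-n:] *= np.linspace(1.0, 0.0, n)         # offset damp
    return out * (target_rms / np.sqrt(np.mean(out ** 2)))  # equate rms amplitude
```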

Pleasantness and Excitement Ratings for Vocal Stimuli. To examine the relative pleasantness and engagingness of the vocal stimuli used in the fMRI experiment, we performed two behavioral experiments in an independent cohort of 27 children (mean age ± SD: 11.1 ± 1.2 y; sex: 10 female, 17 male). Participants were seated in a quiet room in front of a laptop computer, and headphones were placed over their ears. In one experiment, participants were presented with trials of either a mother’s voice sample or a control voice sample of the nonsense word “teebudishawlt.” After each stimulus presentation, the participant rated the vocal sample for pleasantness on a four-point scale as “very unpleasant,” “unpleasant,” “pleasant,” or “very pleasant.” In the second experiment the same procedures were used, but participants rated each vocal sample on a four-point scale for engagingness. Because there was concern that 8- to 10-y-old children might not understand the meaning of the word “engaging” in the context of this experiment, consistent with a previous study (57), we used the following four-point scale for this experiment: “totally boring,” “a little boring,” “a little exciting,” or “totally exciting.” Each vocal stimulus (i.e., 24 mother’s voice samples plus the two control voice samples) was presented once to each child in both the pleasantness and engagingness ratings tasks. The order of stimulus presentation was randomized for each participant and experiment; half of the participants performed the pleasantness ratings task first, and the other half of the participants performed the engagingness ratings task first. The vocal samples used in these behavioral experiments are the same as those used in the fMRI experiment. To examine statistical differences between ratings for mother’s voice and female control voices, we performed independent samples t tests comparing the mean ratings for mother’s voice samples and participant ratings for both control voices.
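The statistical comparison reduces to an independent-samples t test, sketched below with placeholder rating arrays.

```python
# Sketch of the ratings comparison with SciPy. The arrays are placeholders:
# `mother_ratings` stands in for ratings of the 24 mother's-voice samples and
# `control_ratings` for ratings of the two control-voice samples.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
mother_ratings = rng.normal(2.5, 0.4, size=24 * 27)   # 24 samples x 27 raters
control_ratings = rng.normal(2.7, 0.4, size=2 * 27)   # 2 samples x 27 raters

t_stat, p_value = ttest_ind(mother_ratings, control_ratings)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```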