Study design

We report an assessor-blinded, parallel-group RCT47,48 of a music intervention (MT) compared to a non-music control intervention (NM) for improving social communication and fronto-temporal brain connectivity in school-age children with ASD. The trial (isrctn.org: ISRCTN26821793) was conducted between April and December 2016 in Montreal, Canada with ethics approval from the Montreal Neurological Institute (MNI) at McGill University. Written informed consent was obtained from parents/guardians of participants.

Participants

Children aged 6–12 years, meeting Diagnostic and Statistical Manual of Mental Disorders, Fourth edition criteria for ASD49, were screened from January to August 2016 (Fig. 1). Exclusion criteria were (1) individual music therapy within 6 months prior to study, (2) private musical lessons for a cumulative period of 1 year prior to study, (3) group music therapy in school, (4) <35 weeks of gestation, (5) hearing disorders or (6) a medical history of neurological disease. Power analysis using evidence-based effect size estimation11 suggested that detecting a large effect (d = 0.8) of music therapy on social communication with 80% power at P < .05 would require a sample of n = 50 (25 per arm).
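The sample-size figure above can be roughly reproduced with the standard normal-approximation formula for a two-sided, two-sample comparison of means. The sketch below is illustrative only: the function name `n_per_arm` is hypothetical, and the source does not state which software or formula was used for the power analysis.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    which is an assumption here, not the authors' documented procedure.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


per_arm = n_per_arm(0.8)   # large effect, 80% power, alpha = .05
total = 2 * per_arm
print(per_arm, total)
```

Under this approximation the result agrees with the 25 per arm reported above; a calculation based on the t distribution would typically give a slightly larger number.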

Fig. 1: CONSORT study diagram for the study comparing neurobehavioural outcomes of a music intervention with a non-music intervention for children with Autism Spectrum Disorder.

Baseline assessment

Assessment at baseline consisted of two sessions. In the first session, detailed demographics on socioeconomic status50 (SES), handedness51, music experience history and past and current intervention history of the child were obtained. Participant diagnosis was confirmed using a best-estimate diagnosis of ASD supported by an Autism Diagnostic Observation Schedule (ADOS52), Autism Diagnostic Interview–Revised53 or Childhood Autism Rating Scale54 and a detailed clinical assessment report. Additionally, parent-reported behavioural outcomes on the Social Responsiveness Scale (SRS-II55), the Children’s Communication Checklist (CCC-256), the maladaptive behaviour subscale of the Vineland Adaptive Behaviour Scales (VABS-MB57) as well as the Beach Family Quality of Life Scale (FQoL58) were obtained. Children’s cognitive ability was assessed using the Wechsler Abbreviated Scale of Intelligence (WASI-II59). If the child had completed an intelligence quotient (IQ) test (WASI-I/II/WISC-IV/V) within 2 years of the study, available scores were used. Children’s language ability was assessed using the Sentence Repetition subtest of the Clinical Evaluation of Language Fundamentals (CELF-460,61) and receptive vocabulary was measured using the Peabody Picture Vocabulary Test (PPVT-462). Musical ability was assessed using the Montreal Battery for Evaluation of Musical Abilities63. Detailed baseline characteristics of participants are provided in Table 1.

Table 1: Baseline characteristics of participants

In the second session, participants completed a 20-minute MRI scan in a 3 Tesla Siemens Magnetom TimTrio scanner with a 32-channel head coil at the MNI. During this scan, participants were asked to fixate on a cross-hair on the screen. Resting-state BOLD echo-planar images were obtained in 38 slices with a 3.5 mm3 voxel resolution, covering the entire brain (TR = 2340 ms; TE = 30 ms; matrix size, 64 × 64; field of view (FOV), 224 mm; flip angle, 90°). One hundred and forty volumes were obtained in 5 minutes 32 s. Participants also completed a high-resolution sagittal T1-weighted anatomical scan with a voxel resolution of 1 mm3 and an acceleration factor of 2. Participants and their parents underwent a detailed orientation procedure before the MRI scan to ensure comfort and compliance and to maximize data quality64. Audio-visual media aids and mock scanner trials were used in most cases to motivate the participants. Participants’ wakefulness and motion during the actual scans were monitored using an MRI-compatible infra-red camera.

Randomization and blinding

Fifty-one participants were randomized to MT (n = 26) or NM (n = 25) using the covariate-adaptive method65: the first 20 participants were randomized by a simple coin toss and the remaining 31 with the MinimPy software (http://minimpy.sourceforge.net/). Randomization was performed by the first author (M.S.), who was not involved in assessing behavioural outcomes. MinimPy is a free, open-source desktop program implemented in Python that allocates subjects to treatment groups in a clinical trial using a stochastic covariate-adaptive minimization algorithm66. The success of randomization was assessed by comparing the baseline similarity of the intervention groups. All other assessors and authors were blind to group allocation. Our attempt to blind parents (who assessed parent-rated outcomes) was only partially successful, with 31 of the 51 parents reporting awareness of group allocation. Data were independently double entered to ensure accuracy and stored on an electronic server with restricted, password-controlled access.
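To illustrate the general idea of stochastic covariate-adaptive minimization, the sketch below assigns a new participant to the arm that would leave the marginal covariate totals most balanced, with a biased coin so that allocation is not fully deterministic. It is a simplified, generic version and not MinimPy's implementation; the function name, the covariates (`sex`, `age_band`) and the `p_best` probability are all hypothetical, since the source does not list the minimization factors or settings that were used.

```python
import random
from collections import defaultdict


def minimization_assign(new_participant, allocated, arms=("MT", "NM"),
                        p_best=0.8, rng=random):
    """Pick an arm for one participant using a simplified minimization rule.

    new_participant: dict mapping factor name -> level (hypothetical factors).
    allocated: list of (covariate dict, arm) pairs for earlier participants.
    p_best: probability of following the balance-preferred arm (assumed value).
    """
    imbalance = {}
    for candidate in arms:
        total = 0
        for factor, level in new_participant.items():
            counts = defaultdict(int)
            for covariates, arm in allocated:
                if covariates.get(factor) == level:
                    counts[arm] += 1
            counts[candidate] += 1  # hypothetically place the newcomer here
            total += max(counts[a] for a in arms) - min(counts[a] for a in arms)
        imbalance[candidate] = total

    best = min(imbalance.values())
    preferred = [a for a in arms if imbalance[a] == best]
    if len(preferred) == len(arms):
        return rng.choice(arms)          # all arms tie: plain random choice
    if rng.random() < p_best:
        return rng.choice(preferred)     # usually follow the balanced arm
    return rng.choice([a for a in arms if a not in preferred])


# Toy allocation history, invented purely for illustration.
allocated = [
    ({"sex": "M", "age_band": "6-8"}, "MT"),
    ({"sex": "M", "age_band": "6-8"}, "MT"),
    ({"sex": "F", "age_band": "9-12"}, "NM"),
]
new_participant = {"sex": "M", "age_band": "6-8"}
print(minimization_assign(new_participant, allocated))
```

In this toy history the new participant shares both covariate levels with two children already in MT, so NM is the balance-preferred arm and is chosen with probability `p_best`.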

Interventions and fidelity

Both interventions (Fig. S1) involved 45-minute individual weekly sessions conducted over 8–12 weeks by the same accredited therapist (M.T.) using established approaches. Using a child-centric approach, MT made use of musical instruments, songs and rhythmic cues while targeting communication, turn-taking, sensorimotor integration, social appropriateness and musical interaction47,67,68,69. NM was designed as a structurally matched “active comparison” play-based intervention to control for non-specific factors, such as positive treatment expectancies, intervention support, therapist attention and emotional engagement. Both interventions were conducted in the same setting and targeted similar outcomes using theoretically motivated approaches70, such as creating a shared experience, building meaningful relationships and emphasizing self-expression71, through varied activities targeting common goals such as verbal and social communication, multisensory integration and emotional regulation (SI Table S1). The primary difference was the use of music as a central component in MT. All sessions were video-recorded to assess treatment fidelity72 (Supplementary Information).

Outcomes

Behavioural outcomes

Primary behavioural outcomes included a social communication battery consisting of the CCC-2 to measure pragmatic communication, SRS-II to measure symptom severity and PPVT-4 to measure receptive vocabulary. Secondary outcomes were FQoL and the maladaptive behaviours subdomain of the VABS. Outcomes were selected to provide both direct and parent-reported evaluations of treatment-related change using measures that have good psychometric properties, limited practice effects and applicability to a wide range of individuals73,74 and were collected at baseline and post-intervention for n = 50 participants (Supplementary Information).

Statistical analysis

Behavioural outcomes were analysed by fitting linear mixed-effects models (LMEMs) with restricted maximum-likelihood estimation to cope with missing data, inhomogeneity of dependent-variable variance across factor levels and unequal group sizes. LMEMs with treatment group (MT, NM), timepoint (baseline, post-intervention) and their interaction as fixed effects, and a participant intercept as a random effect, were estimated for all primary and secondary behavioural outcomes75. Prior to analysis, data were checked for normality. A group × timepoint interaction indicating a change in MT vs. NM post-intervention at P < .016 (Bonferroni-corrected from an alpha-level of P = .05 to account for three primary behavioural outcomes) was considered significant. Clinical significance was limited to changes from baseline to post-intervention within MT or a significant difference between MT and NM post-intervention, as confirmed by post hoc Tukey tests at an alpha-level of P = .05. An intention-to-treat analysis was carried out, whereby missing data from any drop-out participants were replaced with baseline data. Both unstandardized scores (beta-coefficients and mean difference)76 and standardized effect sizes (standardized mean difference, Cohen’s d) are reported, since standardized effect sizes are often influenced by study design and the complexity of the models used. Standardized effect sizes are calculated as the difference in change scores between groups divided by the pooled within- and between-group standard deviation77. The unstandardized measure is a simple effect size (with 95% confidence intervals (CIs)) in terms of mean difference and does not depend on variance estimates78. All statistical analyses were done in R v3.3.479.
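Three of the steps described above lend themselves to a small worked sketch: the Bonferroni-corrected alpha, replacing missing post-intervention scores with baseline values, and a standardized difference in change scores. The example below is a minimal illustration in Python rather than the authors' R code. All scores are invented toy values, the function names are hypothetical, and dividing by the pooled standard deviation of change scores is an assumption that may not match the exact pooling used in the study.

```python
from math import sqrt
from statistics import mean, stdev

ALPHA = 0.05
N_PRIMARY_OUTCOMES = 3
bonferroni_alpha = ALPHA / N_PRIMARY_OUTCOMES  # about .0167 (reported as .016)


def carry_baseline_forward(baseline, post):
    """Replace missing post-intervention scores (None) with the baseline score."""
    return [b if p is None else p for b, p in zip(baseline, post)]


def change_scores(baseline, post):
    """Post-intervention minus baseline, per participant."""
    return [p - b for b, p in zip(baseline, post)]


def standardized_difference(change_a, change_b):
    """Difference in mean change divided by the pooled SD of change scores."""
    n_a, n_b = len(change_a), len(change_b)
    pooled_var = ((n_a - 1) * stdev(change_a) ** 2 +
                  (n_b - 1) * stdev(change_b) ** 2) / (n_a + n_b - 2)
    return (mean(change_a) - mean(change_b)) / sqrt(pooled_var)


# Toy data, invented for illustration; None marks a drop-out.
mt_base, mt_post = [10, 12, 14, 16], [14, 15, None, 21]
nm_base, nm_post = [11, 13, 15, 17], [12, 13, 16, None]

mt_change = change_scores(mt_base, carry_baseline_forward(mt_base, mt_post))
nm_change = change_scores(nm_base, carry_baseline_forward(nm_base, nm_post))
d = standardized_difference(mt_change, nm_change)
print(round(bonferroni_alpha, 4), mt_change, nm_change, round(d, 3))
```

Note that carrying the baseline forward assigns drop-outs a change score of zero, which makes the estimated treatment effect conservative when drop-out is more common in the intervention arm.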

Neuroimaging outcomes

Primary neuroimaging outcomes were intrinsic functional brain connectivity of fronto-temporal brain networks measured using rsfMRI at baseline and post-intervention. RSFC methods provide an approach for investigating how musical engagement may alter functional connectivity among several brain regions. RSFC metrics of inter-regional correlations specifically afford the advantage of being task-independent, have high test–retest reliability and provide reliable estimates of brain functional connectivity80. RSFC metrics also have limited practice effects and may provide an objective method to measure response-to-intervention40. Here we tested the extent to which music alters fronto-temporal RSFC in six fronto-temporal seed regions.

Image preprocessing

Resting-state images were first preprocessed using FSL (v. 5.0.9; www.fmrib.ox.ac.uk, FMRIB’s Software Library, FMRIB, Oxford, UK)81,82 via the SeeBARS pipeline developed at the Center for Research on Brain, Language and Music83. Image preprocessing steps consisted of removal of the first five volumes in each scan series, removal of non-brain tissue using BET81, slice-time correction, motion correction (using a six-parameter affine transformation implemented in FLIRT), global intensity normalization, spatial smoothing (Gaussian kernel of FWHM = 6 mm), temporal high-pass filtering (100 s) and temporal band-pass filtering (0.01–0.1 Hz). To achieve the transformation between the low-resolution functional data and standard space (MNI152: average T1 brain image constructed from 152 normal subjects), two transformations were performed: (1) T2*-weighted image to T1-weighted structural image (using a 7 degree of freedom (DOF) transformation) and (2) T1-weighted structural image to average standard space (using a 12 DOF linear affine transformation, voxel size = 2 × 2 × 2 mm3). In addition, physiological noise was removed using the method described by Vahdat and colleagues83. The global signal was calculated by averaging the time series over all voxels in the brain. In total, 18 nuisance regressors were used in the first-level analysis: white matter, cerebrospinal fluid and global signal with their derivatives, and six motion parameters with their derivatives84,85. Additional motion scrubbing was done following the guidelines in Power et al. (2012)86. Volumes with framewise displacement (FD) ≥ 0.5 mm or DVARS ≥ 50 (DVARS being the spatial root mean square of the data after temporal differencing) were masked from whole-brain analysis. Participants with >35% of volumes censored at either timepoint were excluded from further analysis (n = 6; MT = 2, NM = 4).
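The motion-scrubbing rule can be sketched as follows. Framewise displacement is computed in the commonly used way, as the sum of absolute volume-to-volume changes in the six motion parameters with rotations converted to millimetres on a 50 mm sphere. This is an illustrative sketch, not the SeeBARS pipeline: the function names are hypothetical, the motion and DVARS values are invented, and treating the thresholds as "at or above" is an assumption.

```python
HEAD_RADIUS_MM = 50.0  # sphere radius used to convert rotations to mm


def framewise_displacement(motion):
    """FD per volume from rows of [tx, ty, tz, rx, ry, rz].

    Translations are in mm and rotations in radians; the first volume gets 0.
    """
    fd = [0.0]
    for prev, curr in zip(motion, motion[1:]):
        translation = sum(abs(c - p) for p, c in zip(prev[:3], curr[:3]))
        rotation = sum(abs(c - p) * HEAD_RADIUS_MM
                       for p, c in zip(prev[3:], curr[3:]))
        fd.append(translation + rotation)
    return fd


def censor_mask(fd, dvars, fd_thresh=0.5, dvars_thresh=50.0):
    """True for volumes to censor: FD or DVARS at or above its threshold."""
    return [f >= fd_thresh or d >= dvars_thresh for f, d in zip(fd, dvars)]


def exclude_participant(mask, max_fraction=0.35):
    """True when more than max_fraction of volumes are censored."""
    return sum(mask) / len(mask) > max_fraction


# Toy four-volume run, invented for illustration.
motion = [
    [0.0, 0.0, 0.0, 0.00, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.00, 0.0, 0.0],
    [0.1, 0.3, 0.0, 0.01, 0.0, 0.0],
    [0.1, 0.3, 0.0, 0.01, 0.0, 0.0],
]
dvars = [10.0, 12.0, 20.0, 55.0]

fd = framewise_displacement(motion)
mask = censor_mask(fd, dvars)
print(fd, mask, exclude_participant(mask))
```

In a real analysis the exclusion check would be applied separately at baseline and post-intervention, since the criterion above refers to either timepoint.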

Statistical analysis

Seeds were defined as 6 mm spheres around coordinates in the left and right Heschl’s gyrus (HG; ±46 −18 10), left and right inferior frontal gyrus (±50 18 7) and left and right temporal pole (TP; ±38 10 −28; Fig. S3). These seeds are known to anchor fronto-temporal networks that are involved in language and communication and are altered in ASD87. The timeseries for each of the six seeds was used to generate individual participant-level maps using whole-brain general linear models at baseline and post-intervention. The unthresholded participant-level maps were then entered into a group-level analysis. To assess potential differences between groups at baseline, independent-sample t tests were computed for maps from all seeds. No baseline differences between groups were found in any of the six RSFC networks (all P > .05). To compare groups post-intervention, we used adjusted analysis of covariance (ANCOVA) with post-intervention RSFC as the dependent variable and intervention group, mean-centred baseline RSFC, age, IQ and mean FD86 as covariates. Covariate-adjusted ANCOVA models account for baseline imbalance and for the correlation between baseline and post-intervention measures, which increases statistical power and minimizes bias88,89,90,91,92. Z-scores of parameter estimates were used to measure connectivity strength. In RSFC maps where a difference between groups was observed, we evaluated whether post-intervention RSFC was related to improvement in behavioural outcomes (measured by difference scores) in a whole-brain analysis. Z-statistics were extracted for each participant from the post-intervention RSFC maps and used in a linear regression model to evaluate the strength of the association between RSFC and behavioural improvement. To account for multiple comparisons, random-field theory with a cluster-forming threshold of P < .001 was applied93. To account for the six seeds, a Bonferroni correction was applied, giving a final alpha-level of P = .00016 for significance testing. All locations are described in MNI coordinates.
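As a small illustration of the seed definitions and the per-seed threshold, the sketch below mirrors each coordinate across the midline to obtain six seeds, tests whether a point lies within a seed sphere, and divides the cluster-forming threshold by the number of seeds. The names are hypothetical, and reading "6 mm spheres" as a 6 mm radius is an assumption, since the source does not say whether 6 mm is the radius or the diameter.

```python
# Right-hemisphere MNI coordinates (mm) as given in the text; in MNI space a
# negative x value denotes the left hemisphere.
SEED_COORDS = {
    "HG": (46, -18, 10),    # Heschl's gyrus
    "IFG": (50, 18, 7),     # inferior frontal gyrus
    "TP": (38, 10, -28),    # temporal pole
}
SEED_RADIUS_MM = 6.0        # assumption: 6 mm interpreted as the radius


def build_seeds(coords):
    """Create left and right seeds by mirroring the x coordinate."""
    seeds = {}
    for name, (x, y, z) in coords.items():
        seeds[f"L_{name}"] = (-abs(x), y, z)
        seeds[f"R_{name}"] = (abs(x), y, z)
    return seeds


def in_sphere(point, centre, radius=SEED_RADIUS_MM):
    """True when a point (mm) lies within the sphere around a seed centre."""
    return sum((p - c) ** 2 for p, c in zip(point, centre)) <= radius ** 2


seeds = build_seeds(SEED_COORDS)

CLUSTER_FORMING_P = 0.001
alpha_per_seed = CLUSTER_FORMING_P / len(seeds)  # Bonferroni across six seeds

print(seeds)
print(alpha_per_seed)
```

Dividing .001 by six gives approximately .000167, which appears to correspond to the reported alpha-level of .00016 after truncation.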