Participants

The PNC is a single-site, community-based study of 9498 youths aged 8–22 years. Data from the initial cross-sectional sample, reported here, were collected from November 2009 to December 2011. For extensive details on the PNC, see refs [31, 34]. Importantly, selection was not based on psychiatric or substance use symptoms. PNC inclusion criteria were residence in the tristate area of Pennsylvania, New Jersey, and Delaware; proficiency in English; and ability to provide informed consent/assent. Exclusion criteria were significant developmental delays or physical conditions that would interfere with participation in assessments. Criteria were intentionally broad to enhance generalizability. Participants who underwent neuroimaging were additionally screened for standard MRI contraindications. Participants and their guardians (for participants under 18) provided written informed consent/assent. Institutional Review Boards at the University of Pennsylvania and Children’s Hospital of Philadelphia approved the protocol. See Figure S1 for a flowchart of sample construction.

Substance use and psychopathology assessment

Details of the PNC substance use assessments have been reported previously [35]. Briefly, most participants received an abbreviated version of the Minnesota Center for Twin and Family Research self-report substance use assessment [36], which was privately self-administered on a laptop computer. The measure assessed lifetime use of several substances; for cannabis and alcohol, additional questions queried age at first use, frequency of past-year use, and methods of access. In the initial phase of the project, prior to implementing the self-report measure, the Kiddie-Schedule for Affective Disorders and Schizophrenia (K-SADS) substance screening interview was administered (6% of participants); it was subsequently replaced to accommodate high participant volume and reduce social desirability biases. Because this screener did not query frequency of use, from participants administered the K-SADS screener we include only the subset who denied cannabis use. To further reduce social desirability biases, participants were assessed separately from collaterals (e.g., parents) and informed that reported information would be kept confidential except as legally required. Participants endorsing use of fake drugs (e.g., “cadrines”) were excluded from analyses (n = 8), consistent with previous work [35].

Analyses were conducted in participants between 14 and 22 years old (n = 781), given limited cannabis use under age 14 [see ref. 35]. Informed by prior work from our group [35] and others [37, 38], we divided cannabis users into frequent users (≥ “3–4 times per week”; n = 38) and occasional users (≤ “1–2 times per week”; n = 109). Information regarding abstinence and urine toxicology was not acquired. To examine associations with cumulative recent cannabis use, as opposed to remote use, we included only cannabis users who endorsed use within the past year, removing 62 participants from analysis.
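The grouping scheme above can be sketched in code. This is an illustrative Python sketch, not the authors’ analysis code (analyses were conducted in R); the response-option labels below the two named cut-points and all variable names are hypothetical placeholders.

```python
# Hypothetical response options; only "3-4 times per week" and
# "1-2 times per week" are named in the text as the grouping cut-points.
FREQUENT = {"3-4 times per week", "daily or almost daily"}
OCCASIONAL = {"less than once a month", "about once a month",
              "2-3 times per month", "1-2 times per week"}

def use_group(past_year_frequency, used_past_year):
    """Return 'frequent', 'occasional', or None (not assigned to a user group).

    Participants endorsing only remote (not past-year) use are excluded,
    mirroring the removal of 62 remote-use participants described above.
    """
    if not used_past_year:
        return None
    if past_year_frequency in FREQUENT:
        return "frequent"
    if past_year_frequency in OCCASIONAL:
        return "occasional"
    return None
```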

As described previously [31], a computerized K-SADS collected information on symptoms, duration, distress, and impairment for lifetime mental health symptoms. Empirically derived psychopathology factor scores were generated from this measure to parse psychopathology into symptom dimensions (see Supplementary Methods and ref. [39]).

Neuroimaging acquisition

A subset of the PNC (n = 1601) received structural MRI, as previously reported [34]. Imaging data were acquired on a single MRI scanner (Siemens 3T TIM Trio, Erlangen, Germany; 32-channel head coil) using the same sequences for all participants. A magnetization‐prepared, rapid acquisition gradient‐echo (MPRAGE) T1‐weighted image was acquired with the following parameters: TR 1810 ms; TE 3.51 ms; FOV 180 × 240 mm; matrix 192 × 256; 160 slices; slice thickness/gap 1/0 mm; TI 1100 ms; flip angle 9°; effective voxel resolution of 0.93 × 0.93 × 1.00 mm; total acquisition time 3:28 min.

Three highly trained image analysts independently assessed structural image quality, as previously described in detail [32]. Briefly, prior to rating study images, the three raters were trained to >85% concordance with faculty consensus ratings on an independent training sample of 100 images. T1 images were rated on a 0–2 Likert scale (0 = unusable [3.1%]; 1 = usable with some artifact [16.9%]; 2 = none or almost no artifact [80.0%]). Images rated 0 were excluded. The average quality rating across the three raters was included as a covariate in all models to control for the confounding influence of subtle variation in image quality.
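The quality-control step can be sketched as follows. This is an illustrative Python sketch, not the authors’ procedure; in particular, it assumes (as a labeled simplification) that an image counts as “rated 0” when the rounded consensus of the three ratings is 0, which the text does not specify.

```python
from statistics import mean

def image_qc(ratings):
    """ratings: the three raters' scores (0 = unusable, 1 = some artifact,
    2 = none or almost no artifact).

    Returns (include, avg). The average rating is carried forward as a
    covariate in all group-level models. ASSUMPTION: exclusion applies when
    the rounded consensus rating is 0 (the text only says images rated 0
    were excluded).
    """
    avg = mean(ratings)
    return round(avg) > 0, avg
```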

Structural image processing

We evaluated multiple measures of brain structure, including cortical thickness (CT), volume, and GMD. Image processing and analysis were primarily conducted with Advanced Normalization Tools (ANTs). We also performed supplementary processing with FreeSurfer to evaluate the robustness of results.

ANTs image processing

CT was quantified using tools in ANTs [40]. To avoid registration bias and maximize sensitivity to regional effects that can be impacted by registration error, a custom adolescent template and tissue priors were created using data from 140 PNC participants balanced for age and sex. Structural images were processed and registered to this custom template using the ANTs cortical thickness pipeline [41], which includes brain extraction, N4 bias field correction [40], Atropos tissue segmentation [42], SyN diffeomorphic registration [43], and diffeomorphic registration-based CT (DiReCT) estimation in volumetric space [44]. Regional CT was computed by averaging voxel-wise CT estimates over anatomically-defined regions, delineated as described below.

To parcellate the brain into anatomically-defined regions, we used an advanced multi-atlas labeling approach. Specifically, 24 young adult T1-weighted volumes from the OASIS data set [45], manually labeled by Neuromorphometrics, Inc., were registered to each subject’s T1-weighted volume using SyN diffeomorphic registration [43]. Using multiple manually-labeled atlases yields more accurate estimates of anatomical regions than relying on any single manually-labeled atlas [46]. The label sets were synthesized into a final parcellation via joint label fusion [46]. To increase tissue specificity, the volume of each parcel was computed within the intersection of that parcel and a prior-driven gray matter cortical segmentation from the ANTs cortical thickness pipeline.
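The intersection step for tissue-specific parcel volumes can be illustrated with a minimal Python sketch. It is not the ANTs implementation: the flat-list data layout, function name, and tissue codes (1 = CSF, 2 = GM, 3 = WM) are assumptions for illustration.

```python
# Effective voxel size from the acquisition (0.93 x 0.93 x 1.00 mm).
VOXEL_MM3 = 0.93 * 0.93 * 1.00

def parcel_gm_volume(parcel_labels, tissue_seg, parcel_id, gm_class=2):
    """Volume (mm^3) of one parcel restricted to the gray matter segmentation.

    parcel_labels and tissue_seg are voxel-wise label lists of equal length;
    only voxels belonging to the parcel AND classified as GM are counted.
    """
    n = sum(1 for p, t in zip(parcel_labels, tissue_seg)
            if p == parcel_id and t == gm_class)
    return n * VOXEL_MM3
```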

Finally, GMD was calculated via Atropos [42], with an iterative segmentation procedure initialized using 3-class K-means segmentation. This procedure produces both a discrete 3-class hard segmentation and a probabilistic GMD map (soft segmentation) for each subject. GMD was calculated within the intersection of this 3-class segmentation and the subject’s volumetric parcellation [33]. Importantly, this method is distinct from that of most prior studies, which use GMD interchangeably with voxel-based morphometry analyses [e.g., 47]; such analyses display relationships closer to those of volumetric analyses than of density parameters [33].
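Regional GMD as described above amounts to averaging the soft segmentation within the hard-segmentation/parcellation intersection. The Python sketch below is illustrative only (the data layout and names are assumed, not from the Atropos implementation).

```python
def regional_gmd(gm_probability, tissue_seg, parcel_labels, parcel_id,
                 gm_class=2):
    """Mean GM probability (soft segmentation) over voxels that lie in both
    the hard GM segmentation and the given parcel, i.e., GMD for that region.

    All three inputs are voxel-wise lists of equal length; tissue codes
    (2 = GM) are an assumed convention.
    """
    vals = [gp for gp, t, p in zip(gm_probability, tissue_seg, parcel_labels)
            if t == gm_class and p == parcel_id]
    return sum(vals) / len(vals) if vals else float("nan")
```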

Supplementary FreeSurfer image processing

Cortical reconstruction of T1 images was performed using FreeSurfer version 5.3 [48]. See Supplementary Methods for detailed procedures.

Group-level analyses

Group differences in structural imaging variables were explored using ANOVAs and Kruskal–Wallis tests. In all models, we controlled for potentially confounding variables, including race, sex, overall psychopathology, average quality rating, and nonlinear effects of age. Nonlinearities were modeled using generalized additive models (GAMs) with penalized splines, as implemented in the ‘mgcv’ R package. Interactions between group and age, as well as between group and quadratic and cubic expansions of age, were examined in the same framework. We examined group differences for global (mean GMD, total brain volume [TBV], mean CT), lobar (frontal, temporal, occipital, parietal, insular, limbic), and regional measurements. Significant omnibus ANOVAs were followed up with pairwise t-tests. False discovery rate (FDR) correction was used to account for multiple comparisons throughout. Analyses were conducted in R version 3.3.

Follow-up nonparametric, data-driven analyses were conducted to probe for pairwise group differences while limiting the potential influence of non-normal distributions or outliers. Mean differences were estimated across 10,000 bootstrap resamples, as implemented in the ‘boot’ package in R. The ‘boot’ package allows resampling within strata, preserving group proportions and yielding 10,000 samples with roughly equivalent group distributions. Studentized 95% confidence intervals were then obtained. To account for multiple comparisons, p-values were derived from the confidence intervals, as previously detailed [49], and FDR-corrected.
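The stratified bootstrap can be sketched in a few lines of Python. This is an illustrative simplification, not the authors’ R code: resampling is done within each group (preserving group sizes, analogous to the ‘boot’ strata argument), but a simple percentile interval stands in for the studentized interval reported in the text.

```python
import random

def stratified_bootstrap_ci(group_a, group_b, n_boot=10_000, seed=1):
    """Percentile 95% CI for the mean difference (a - b), resampling each
    group separately so group sizes/proportions are preserved.

    NOTE: percentile intervals are a simplification; the study used
    studentized intervals from R's 'boot' package.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(group_a) for _ in group_a]
        b = [rng.choice(group_b) for _ in group_b]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return lo, hi
```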

As described below, these analyses revealed predominantly non-significant effects. Because non-significant tests do not in themselves support the null hypothesis, we performed follow-up equivalence tests to examine whether effects of a particular magnitude could be statistically rejected, allowing greater specificity in bounding the magnitude of potential group differences. Two one-sided t-tests (TOST) evaluated equivalence for each pairwise comparison [50], as implemented in the ‘equivalence’ package in R. TOST requires lower and upper effect size bounds. Given the larger samples required for TOST [50], we first examined an effect size conventionally considered medium in magnitude, setting the equivalence bounds at d = −0.5 and d = 0.5. Follow-up analyses used a narrower boundary of d = −0.3 to 0.3 to compare occasional users and non-users (frequent-user comparisons were not conducted at this boundary due to limited power). For each comparison, two one-sided t-tests were run: one testing whether the effect exceeded the lower bound and one testing whether it fell below the upper bound. In these tests, the null hypothesis is non-equivalence, i.e., the presence of an effect at least as large as the boundary.
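The TOST logic can be sketched as follows. This Python sketch substitutes a large-sample normal approximation for the t distributions used by R’s ‘equivalence’ package, so it is illustrative rather than a reimplementation; the standard-error formula for d is likewise an approximation.

```python
from math import erf, sqrt

def _norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def tost_equivalence(mean1, mean2, sd_pooled, n1, n2, d_bound=0.5):
    """Two one-sided tests of equivalence on Cohen's d.

    Null hypothesis: |true d| >= d_bound (non-equivalence). Returns the
    larger of the two one-sided p-values; equivalence is declared when it
    falls below alpha. Normal approximation stands in for the t tests.
    """
    d = (mean1 - mean2) / sd_pooled
    se = sqrt(1 / n1 + 1 / n2)                      # approximate SE of d
    p_lower = 1 - _norm_cdf((d + d_bound) / se)     # H0: d <= -d_bound
    p_upper = _norm_cdf((d - d_bound) / se)         # H0: d >= +d_bound
    return max(p_lower, p_upper)
```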

Supplementary analyses

We conducted several supplementary analyses to examine whether potentially confounding variables may have influenced the observed results. First, we repeated analyses of global, lobar, and regional measurements with additional covariates: in one set of models, item-wise alcohol variables were included as covariates (see Supplementary Methods; alcohol data were unavailable for n = 35 participants); in another, estimated IQ, as assessed by the Reading subtest of the Wide Range Achievement Test-4 [51], was included as a covariate. Finally, we used propensity score matching [52] to match all cannabis users to a subsample of non-users on age and sex and re-ran all analyses (see Supplementary Methods).
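The matching step can be illustrated with a greedy 1:1 nearest-neighbor sketch on propensity scores. This is a generic illustration, not the authors’ procedure from ref. [52]: it assumes propensity scores have already been estimated (e.g., by logistic regression of cannabis-use status on age and sex), and all names are placeholders.

```python
def nearest_neighbor_match(user_scores, nonuser_scores):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    user_scores / nonuser_scores: propensity scores (assumed precomputed).
    Returns {user_index: matched_nonuser_index}. Requires at least as many
    non-users as users.
    """
    available = dict(enumerate(nonuser_scores))
    matches = {}
    for i, s in enumerate(user_scores):
        # Pick the still-unmatched non-user with the closest score.
        j = min(available, key=lambda k: abs(available[k] - s))
        matches[i] = j
        del available[j]
    return matches
```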