Phenotypes

The phenotypic data were obtained in the longitudinal surveys on lifestyle, health, and personality of the NTR (e.g., Boomsma et al. 2002, 2006). The study protocols were approved by the Central Ethics Committee on Research Involving Human Subjects of the VU University Medical Center, Amsterdam. All participants provided informed consent. The study in young twins was approved also by the Central Committee on Research Involving Human Subjects. More details regarding the phenotyping in the NTR study can be found elsewhere (van Beijsterveldt et al. 2013; Willemsen et al. 2013).

Initiation of Cannabis use (‘ever/never’)

Initiation was assessed by a multiple choice question (i.e., “At which age did you experiment with cannabis for the first time?”) in the NTR surveys 1993, 1995, 2000, and by an open-ended question (“Have you ever tried hashish or cannabis? If yes, at which age?”) in survey 2009. These surveys were sent to all adult twin families and were returned by 23 597 individuals. In addition, data collection in adolescent twins and sibs which took place since 1987 in age-specific surveys (around age 14 and age 16), included a multiple choice question (“Have you ever used soft drugs such as hashish or cannabis?”) assessing frequency of use (on an eight-category scale ranging from ‘never’ to ‘more than 40 times’) in the whole life, in the last 12 months and in the last 4 weeks. This question was completed by 16 556 participants. The phenotypic data obtained from subjects who reported at more than one time point were checked for consistency, and unreliable measures were discarded. Due to inconsistencies, 284 self-reported measures were dropped. Next, the measurements were collapsed into a dichotomous phenotype (i.e., ever/never used cannabis). Furthermore, we included in the analysis only family members for whom both phenotypes and genotypes were available, i.e., N = 6744 participants. Of these, 5387 individuals reported never to have used cannabis, whereas the remaining 1357 individuals had initiated cannabis use. The age at the time of the last survey ranged from 10.5 to 94 years (mean age = 39.09, SD = 17.45). The participants were clustered within 3479 families varying in size from 1 to 9 family members (i.e., parents, siblings, spouses). More than half of the sample (60.9 %) consisted of females.

Age at onset

A subset of the genotyped NTR sample (N = 5148) had declared never to have used cannabis, or declared an age at onset older than 10 years of age in survey 2009 (which included an open ended question on age at onset, see above). Among them, 852 (16.6 %) had initiated cannabis use, whereas 4296 observations had not initiated at the time of data collection (i.e., censored observations). The participants were clustered within 2992 families of sizes varying from 1 to 8 members. Females represented 62.3 % of the sample and the age ranged between 16 and 99 years (mean age = 46.93, SD = 17.54).

Genotypes

Genotyping was performed based on buccal or blood DNA samples collected in different research projects (see e.g., Willemsen et al. 2010). Imputation was performed based on the 1000G GIANT phase1 panel as a first reference set, and on the GONL version 4 as a second reference set (see Supplementary Methods for details). As best guess genotypes (computed using Beagle, Browning and Yu 2009) were used in the analyses, we applied stringent post imputation quality thresholds on the imputation quality measure (i.e., we retained only SNPs with an imputation quality score above 0.8) and for the Hardy–Weinberg equilibrium test (α = 1 × 10−4). Both the imputation quality and Hardy–Weinberg equilibrium (i.e., based on the summed genotype probability counts) were assessed in the phenotyped sample using SNPTEST (Marchini, 2007). The GoNL- and the 1000G-based imputed datasets contained ~6 million well imputed SNPs (i.e., with a mean imputation quality score above 0.96 in both datasets). The association and survival analyses were carried-out by varying the reference panel used for imputation, while including the same phenotyped sample (i.e., 6744 and 5148 participants, respectively). The analyses included no monozygotic twin pairs, because genotypic data were available for only 1 twin of a pair in the GoNL dataset.

Statistical analyses

Estimating the heritability of initiation

We used the Genome-wide Complex Trait Analysis (GCTA) software (Yang et al. 2011) to estimate the amount of variance in initiation explained collectively by the SNPs. The aim of this analysis is to obtain an indication of the total signal in the SNPs, without identifying individual SNPs. Genetic similarity among the phenotyped individuals was computed based on best guess genotypes at 5 928 887 loci observed or imputed using the GoNL reference panel. The analyzed SNPs had a MAF larger than 1 %, imputation quality greater than 0.8 and showed no significant deviation from Hardy–Weinberg equilibrium given α = 1 × 10−4. The sample with observed initiation status (N = 6744 related individuals of Dutch ancestry) and the relevant covariates included in the genomewide SNP-based analysis (see below) were also used in the GCTA analysis. Furthermore, one of a pair of closely genetically related individuals (i.e., with an estimated genetic relatedness larger than 0.025) was dropped, which left for the analysis 3616 distantly related individuals. We specified the prevalence as equal to 22 %, value chosen in line with the prevalence of cannabis use estimated in Europeans (European Monitoring Centre for Drugs and Drug Addiction, 2012). Heritability of age at onset was not estimated as GCTA cannot handle survival data. We also investigated the relationship between chromosome length and the amount of variance explained in the trait. Consistent with the model of a polygenic trait, we expect—on average—the longer chromosomes to explain a larger amount of the variance. We tested this in a linear regression (one-tailed test) where we regressed the estimated proportion of variance explained by each chromosome on the chromosome length.

Power analysis

We performed a Monte Carlo power analysis to obtain an indication on the size of the genetic effects detectable in our sample. To this end, we simulated 10 000 samples consisting of 3690 families of various configurations reflecting the unbalanced structure of families included in the analyses, i.e., families consisting of singletons, two parents or families comprising sibships sizes 1–6 with 0, 1 or 2 parents. Genotypes in Hardy–Weinberg equilibrium were generated at a locus with a MAF of 0.5 and explaining 1.5 and 1 % variance in the phenotype. The normally distributed phenotype was simulated conditional on the locus and then dichotomized using a cut-off point corresponding to a z-score of 0.85 to mimic the 20 % prevalence of initiation observed in the NTR sample. The correlations between spouses, full siblings and parent-offspring estimated in our sample equaled 0.39, 0.35 and 0.15, respectively. An α = 1 × 10−8 was used to assess the power to detect association. To model association we used a generalized equations estimation (GEE) procedure with an exchangeable working correlation matrix and a sandwich correction to correct the standard errors for misspecification of the background model (Minica et al. 2014).

Empirical power analysis showed that our sample affords 45.3 and 87.4 % power to detect GVs explaining 1 and 1.5 % phenotypic variance, respectively (genomewide alpha = 1 × 10−8). Relative to the logistic model, the survival model is expected to show superior power especially for locating low frequency causal GVs (see e.g., van der Net et al. 2008). However, the above power computations are informative also for the age at onset phenotype given the large overlap among the samples included in the two analyses and the slightly lower size of the sample we used in the survival analysis.

SNP-based association analysis of initiation

To test association, initiation was regressed on the best guess genotype and covariates. The covariates were sex, age at the last survey, the birth cohort (i.e., two birth cohorts containing individuals born between 1951 and 1970 and 1971–1999, respectively, and the 1915–1950 birth cohort as the reference category), 3 principal components to correct for Dutch population substructure (Abdellaoui et al. 2013), and sample specific covariates to account for batch and for chip effects. A GEE (Carey et al. 2012) logistic model was employed. To model the familial relatedness, we used an exchangeable working correlation matrix. This accounts for the familial correlations by means of a single correlation among the family members. The effect of possible misspecification of the familial covariances on the standard errors was corrected by means of a sandwich correction (Minica et al. 2014; Dobson 2002). The sandwich-corrected GEE approach was implemented by using the R-package gee (Carey et al. 2012), accessed from Plink (Purcell et al. 2007) which communicates with R (Team 2013) via the Rserve package (Urbanek 2013).

SNP-based survival analysis of age at onset

A Cox proportional hazards regression model was employed to model age at onset as a function of genotype and—as above—of other relevant covariates (i.e., birth cohort, sex, three PCs and study specific covariates). We included this approach as it utilizes all available information on the age of initiation among those who have initiated. It is expected to show superior power relative to an analysis of the “ever-never” dichotomy or an analysis restricted to those who initiated (see e.g. Kiefer et al. 2013). The Cox proportional hazard regression analysis was performed genomewide by accessing the survival R-package (Therneau 2014) from Plink. In fitting the model, we used the cluster option to get sandwich corrected standard errors that are robust to possible misspecification of the familial covariance matrix.

Gene-based analyses of initiation and age at onset