Preprint publication

The article was previously published as a preprint in bioRxiv (BIORXIV/2018/363853; https://doi.org/10.1101/363853).

Ethics statement

This study was approved by the IRB of the “Institut Pasteur” of Paris (IRB00006966 Institut Pasteur, approval 2010-003). All participants over 18 and parents of the participants under 18 provided written informed consent to take part in the study (for more details see S1 Appendix).

Genotyping

The cohort available for the genotyping is shown in Fig. 1 and S1 Fig. It includes 36 individuals with autism, 208 controls, 132 close relatives of the individuals with autism (61 siblings, 3 half-siblings and 68 parents) and 9 close relatives from the controls. DNA was extracted from blood leukocytes. The genotyping was performed at the “Centre National de Recherche en Génomique Humaine (CNRGH)” using the Infinium IlluminaOmni5-4 BeadChip (>4.3 millions of markers) from Illumina. Sample quality controls such as Sex check (based on the X chromosome homozygosity rate), Mendel errors (transmission errors within full trios) and Identity By State (IBS, see section below) were performed using PLINK 1.90.38

Population genetic structure

Prior to population genetic structure analysis, the genotyping dataset of the Faroese cohort was merged to the 1000 Genomes project dataset (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). Genotyping rate and the total number of markers after the merge was ∼ 0.93% and ∼ 4.27 millions, respectively. For the estimation of the ancestry, SNPs with genotyping call rate < 100%, failing Hardy Weinberg equilibrium test (p < 10−3) or on sex chromosomes were filtered out of the merged genotyping dataset (∼ 2.60 million markers passed filters). Genome-wide pairwise IBS calculations and Multidimensional scaling (mds) analysis on genome-wide IBS pairwise distance matrix was calculated using PLINK 1.90. IBS values have been calculated for 376 individuals from Faroe Islands and 2504 individuals from 1000 Genomes project (26 populations; see S1 Appendix) with the following calculation: 1−(0.5 × IBS1 + IBS2)/N; N is the number of tested markers; IBS1 and IBS2 are the number of markers for which one pair of individuals share either 1 or 2 identical allele(s), respectively. Out of the 376 individuals, 32 individuals were removed from further analyses, including 7 ancestry outliers in controls, 9 siblings of controls, one swap and 15 control individuals involved in pairs with IBS score higher than 0.9. For all studies, we used the first three principal components to adjust for population stratification.

For the estimation of the inbreeding coefficient, SNPs with genotyping call rate < 95%, minor allele frequency < 0.05, strong linkage disequilibrium r > 0.5 or failing Hardy Weinberg equilibrium test (p < 10−6) were filtered out of the Faroe SNP genotyping dataset. All homozygosity analyses were performed with Plink 1.09 on autosomes including identification of Runs Of Homozygosity (ROH) and Inbreeding coefficients calculation. For ROH detection, a threshold of 50 consecutive homozygous SNPs with a minimum density of 1 SNP/5000 kb and no minimum length was used following Gazal et al.’s guidelines.39 We allowed no heterozygous markers in the 50 SNPs-window. Inbreeding coefficients were calculated by estimating the proportion of the autosomal genome that is in ROH. This method was proposed by McQuillan and al (2008) and has been shown to be the most reliable, especially with small sample size.40 Faroe inbreeding coefficients were compared to inbreeding coefficient of the 1000 genomes project populations.

Genome-wide association study (GWAS)

Prior to association analyses, SNPs with genotyping call rate < 90%, minor allele frequency < 0.05 or failing Hardy Weinberg equilibrium test (p < 10−6) were filtered out of the Faroe SNP genotyping dataset. The global genome wide genotyping call rate of all the individuals was superior to 90%. A total of 1,690,491 variants and 212 independent individuals (including 36 cases and 176 controls) passed these filters. Allelic, recessive and dominant GWAS were performed with Plink 1.09 using Chi-squared statistics. Manhattan and Quantile-Quantile (Q-Q) plots were generated using R. Gene and gene-set (including SFARI, pLI > 0.9 and Brain gene lists) analyses were performed with MAGMA v1.06 using principal components regression and linear regression models, respectively.

Genome-wide polygenic score (GPS) for autism

The computation of the GPS (also named PRS for polygenic risk score) was performed with the tool PRSice241 on the SNP array data using as a reference the PGC GWAS summary statistics.4 SNPs were not imputed since we used high density arrays (over 4 million SNPs). Briefly, GPS is calculated as a weighted sum of the number of risk alleles carried by an individual, where the risk alleles and their weights are defined by the loci and their measured effects as detected by a previous GWAS. GPS analysis requires the estimation of a P-value significance threshold in order to include in the GPS calculation only variants below this P-value threshold. For our dataset, PRSice2 with default parameters was used (except for the P-value threshold for which a step of 0.01 was used) and defined a P-value threshold of 0.2 which gives us a R2 (squared correlation coefficient) of 0.036 (S10 Fig).

Whole-Exome Sequencing (WES)

Blood leukocytes DNA from 286 individuals was enriched for exonic sequences through hybridization SureSelect Human All Exon V5 (Agilent) by the CNRGH. For 67 individuals for whom the available quantity of DNA was low, we used a low-input protocol using only 200 ng of DNA compared to 3 µg for the normal protocol. The captured DNA was sequenced using a HiSeq 2000 instrument (Illumina). Coverage/depth statistics have been used for quality control. We required that more than 90% of each exome had 10× coverage and more than 80% had 20× coverage. Short read sequences were then aligned to hg19 with BWA v0.7.8, duplicate reads were removed with PicardTools MarkDuplicates. Reads with a global quality under 30 or a mapping quality under 20 were excluded from the analysis. Variants were predicted using FreeBayes and GATK42 with a minimum of 10 reads covering the position. VEP (using RefSeq and Ensembl 91) was used to annotate the variants. We used the GEMINI43 framework that automatically integrates the VCF file into a database for exploring genetic variant for disease and population genetics. Genetic variants were analyzed using GRAVITY, a Cytoscape plugin that we designed for visualizing WES results using Protein-Protein Interaction networks (http://gravity.pasteur.fr/). Since WES does not detect the FMR1 amplification, 33 individuals with autism were tested for Fragile-X syndrome using the AmplideXTM FMR1 PCR kit from Theradiag. No individual were carrier of a “pre-mutation” or “full-mutation” of CGG repeats in the 5’ UTR region of the fragile X mental retardation-1 (FMR1) gene.

Copy-number variants (CNVs)

CNVs were identified from both SNP genotyping and WES data. Quality controls were the following: call rate > 0.99, standard deviation of the Log R ratio < 0.35, standard deviation of the B allele frequency < 0.08 and absolute value of the wave factor < 0.05. CNVs were detected by both PennCNV and QuantiSNP algorithms using the following filters: > = 3 consecutive probes, CNV size > 1 kb and CNV detection confidence score > = 15. CNV detections from PennCNV and QuantiSNP were merged using CNVision.44 CNVs with CNVision confidence score < 30, CNV size < 50 kb, overlap > 50% with segmental duplication or known large assembly gaps (greater than 150 kb) or copy number = 2 in pseudo autosomal regions (PARS) in males were filtered out. CNV annotations were performed using ANNOVAR45 and CNV frequencies in Faroese and in database of genomic variant cohorts (DGV, http://dgv.tcag.ca/dgv/app/home) were assessed using in house python scripts based on reciprocal overlap in size > = 80%. CNV calling from Illumina genotyping data was used only for segregation analysis across families. We also detected CNVs from the WES sequencing data using the XHMM software.46 CNVs with QSOME score < 90, number of targets < 5, or overlap > 50% with segmental duplication or known large assembly gaps (greater than 150 kb) were filtered out. CNV annotations were performed using ANNOVAR45 and CNV frequencies in Faroese were assessed using in house python scripts based on reciprocal target overlap ≥50% and using only independent cases and controls (n = 143). For the burden analysis of rare CNVs, only XHMM CNV calling from WES data was used and CNVs with frequency > 0.01 were filtered out. De novo and inherited CNVs were validated by visual inspection using SnipPeep (http://snippeep.sourceforge.net/).

Gene-set lists and prioritization of variants

Three gene-set lists were used: (i) “SFARI genes” (n = 990) that includes genes implicated in autism15 (Simons Foundation Autism Research Initiative gene database–https://gene.sfari.org/); (ii) “pLI > 0.9 genes” that includes genes with strong probability of being loss-of function intolerant (n = 3230); (iii) “Brain genes” that includes genes specifically or strongly expressed (above 1 Standard Deviation) in fetal or adult human brain using data from Su et al. (n = 3591).47

A combination of approaches was used to prioritize the genes and to estimate the deleterious effect of a variant. We prioritized genes using gene sets (SFARI genes, pLI > = 0.9 and Brain genes). We prioritized Likely Gene Disruptive (LGD) variants (stopgains, splice site variants, frameshift indels) over missense variants or synonymous variants. Additionally, we used the CADD score14 (a CADD > = 30 means that the variants belong to the 0.1% most deleterious variants) to assess the deleterious effect of missense variants. Minor allele frequency (MAF) was estimated in the general population from the gnomAD database. In order to filter out common variants that was not listed in gnomAD, we also excluded variants that were present in more than 15% of our Faroese control cohort. For the detection of deleterious homozygous variants, we kept only LGD and MIS30 with MAF < 1%.

Burden analysis

Rare variant association studies (MAF < 5%) were performed using EPACTS v3.2.6 (https://genome.sph.umich.edu/wiki/EPACTS). Prior to association analysis, variants identified by WES were filtered using VCFtools (http://vcftools.sourceforge.net/man_latest.html) with the following metrics: minimum genotyping quality ≥30, min depth of coverage ≥ 10, maximum of missing data ≤ 10, only bi-allelic sites and no site failing Hardy Weinberg equilibrium test (p < 10−6). The annotation of the variants was done using EPACTS and the variants included in the Gene-wise association analyses were non-synonymous, essential splice site, normal splice site, start loss, stop loss and stop gain variants. Logistic Score Test (“b.score” in S6 Table) was used to test single variant association (n Cases = 36; n Controls = 107; n Variants = 155,284). For Gene-wise tests, we used two approaches (including n Cases = 36; n Controls = 107 and n groups = 15,005): (i) collapsing burden test using EMMAX (Efficient Mixed Model Association eXpedited,48 “CMC-EMMAX” in S4 Table) and (ii) Optimal SNP-set sequence Kernel Association Test (“SKAT-O” in S5 Table). The advantage of the CMC-EMMAX is that this test is accounting for population structure and high relatedness between individual (based on kinship matrix). The advantage of SKAT is that this test is particularly powerful in the presence of protective and deleterious variants and null variants. For both Gene-wise tests, a 10−6 ≤ MAF ≤ 0.05 was used.

Statistical power of the analyses

All the statistical tests performed were one-sided since the objectives were to identify some enrichments in autism participants. Achieved power and necessary effect size (using Cohen’s d) to achieve a power of 0.8 was computed using G*Power (http://www.gpower.hhu.de/). For CNVs (Fig. 2), given our sample size, the post-hoc achieved power is 0.65 for CNV Loss in all genes, and 0.93 for CNV Gain in all genes. The details on subsets of genes are indicated in S12 Table and the sensitivity to detect effect sizes at a statistical power of 0.8 is indicated in S13 Table. For inbreeding coefficient (Fig. 3a), an effect-size of d = 0.43 was observed. Given our sample size, the post-hoc achieved power is 0.74. To achieve a power of 0.8, an effect-size of d = 0.47 was needed. For LGD + MIS30 rare homozygous variants (Fig. 3b), the observed effect-size is d = 0.68. Given our sample size, the post-hoc achieved power is 0.96. To achieve a power of 0.8, an effect-size of d = 0.49 was needed. For GPS (Fig. 4a), the observed effect-size is d = 0.45. Given our sample size, the post-hoc achieved power is 0.77. To achieve a power of 0.8, an effect-size of d = 0.47 was needed. For GPS between individuals with and without ID (Fig. 4b), the observed effect-size is d = 0.80. Given our sample size, the post-hoc achieved power is 0.70. To achieve a power of 0.8, an effect-size of d = 0.92 was needed. For enrichment in de novo events for SNVs (Section “Contribution of de novo variants” of the manuscript), given our sample size, an effect size of d = 0.60 was required to achieve a power of 0.8.