Our approach to derive polygenic resilience scores for SZ is presented schematically in Fig. 1. Supplementary Table 1 outlines decision points that occurred in our analysis and parameters used at each point, including steps for truncating samples and variants for the GWAS and deriving an informative SNP set for resilience scoring. Wherever possible, we adhered to the precedent set by the PGC-SZ Working Group [5].

Fig. 1 An Illustration of our method for deriving polygenic resilience scores for a complex disorder Full size image

Description of case–control GWAS samples

A description of sample ascertainment procedures used in each study is available in the supplementary text published by the PGC-SZ Working Group [5]. All subjects were confirmed to be independent based on relatedness tests using directly genotyped SNPs. Cases had a clinical diagnosis of SZ-spectrum disorders or SZ based on Diagnostic and Statistical Manual for Mental Disorder (DSM)-Version IV or International Classification of Diseases (ICD), 10th revision. Control ascertainment varied across studies. Our study inherited the same issues that the PGC-SZ faced in that some of the studies comprised control sets that were not screened for SZ, which to the best of our knowledge included the Gottingen Research Association for Schizophrenia (GRAS, controls n = 1232) and the Icelandic deCODE Genetics, Inc. sample which served as an out-of-sample replication set (controls n = 138,761). If there happen to be controls affected with SZ in our discovery sample (unlikely to be a large number as population prevalence of SZ is ~1%), then the loss of power potentially incurred in our analysis should be proportional to published GWAS by the PGC, which in that study was deemed reasonable [5].

Meta-analysis of 51 GWASs in PGC2-SZ

We obtained GWAS summary statistics per study for the same 45 European-ancestry case–control studies, three East Asian-ancestry case–control studies, and three European trio-based studies assembled by the PGC Schizophrenia Working Group for their published wave 2 data set (PGC2-SZ) [5]. We then meta-analyzed single-variant association results genome-wide using inverse-variance fixed-effect summary statistics for these 51 studies using the inverse-variance fixed-effect method in the software METAL [10].

LD-clumped SNPs and weights for risk scoring

We obtained from the PGC-SZ a list of 103,129 linkage-disequilibrium (LD)-independent SNPs and effect-size weights derived from the 51-study GWAS meta-analysis of SZ [5], along with imputed GWAS data for 45 of the 51 studies analyzed by Ripke et al. [5] (excluding six studies privately owned by pharmaceutical companies and unavailable to us for secondary analysis: four Johnson & Johnson and Roche samples, the Pfizer sample, and the Eli Lilly sample).

Identifying subjects with high levels of polygenic risk

The variance in SZ explained by polygenic risk score (PRS) was maximized at a p-value bin of p < 0.05 [5], thus we used that threshold for selecting risk alleles for our PRS analysis. PRS were standardized using z-score scaling within each study. Subjects were then ranked by PRS and a percentile-based threshold was used to identify the highest-risk controls, i.e., “resilient” controls, along with cases with similarly high PRS. We set an arbitrary threshold at the 90th percentile of PRS in controls, and called controls above that threshold ‘resilient’ (different thresholds could be chosen). SZ cases whose PRS was between the 90th percentile for controls and the maximum PRS for controls were retained as the comparison group (Fig. 1). This method produced 3786 high-risk resilient controls (includes 83 pseudo-controls from trio studies) and 18,619 equal-risk cases (includes 494 cases from trio studies) for analysis. The total number of cases and controls retained per study after risk score matching is provided in Supplementary Table 2.

Ripke et al., reported that subjects in the highest PRS decile exhibit an increased risk for SZ (OR = 8–20) compared with the baseline rate of SZ in the lowest PRS decile [5], which is on-par with the estimated increase in risk for SZ among persons with affected first-degree relatives [11]. Building on these findings, Supplementary Table 3 shows that the population in the upper 10th percentile of PRS (who were included in our analysis) have an absolute risk of 40% and an increase in relative risk of 1.91 compared with all subjects above the lowest 10th percentile of PRS, which is similar to the relative increase in risk for those with an affected aunt or uncle [11].

Derivation of polygenic resilience scores

Marginal SNP effects on resilience were computed using logistic regression models with Plink v1.9 [12] including adjustment for four principal components derived from autosome-wide SNP data that were significantly different between high-risk controls and equal-risk cases at a significance threshold of p < 0.1 to correct for population stratification (Supplementary Table 4). Any variant that showed an association with SZ risk from our meta-analysis of the 51 PGC2-SZ studies at a p < 0.05, or variants that were in LD with risk variants (at a R2 > 0.2 in a 1 megabase window), were discarded from this GWAS of resilience as a conservative measure, to avoid re-discovering risk variants and to ensure that any polygenic resilience score derived from our analysis was independent of the polygenic risk score used to stratify the samples. Our choice of threshold for pruning away SNPs in LD was used by the PGC and others [13,14,15]. We considered setting a stricter threshold of R2 ≤ 0.0, but this would have removed almost the entire genome and left only six variants for evaluation. Marginal SNP effects were obtained per study, and then pooled using inverse-variance fixed effect meta-analysis [10] to arrive at a final GWAS summary statistic per SNP.

Generating a polygenic score formula for resilience was done in a series of steps. Clumping the GWAS summary statistics output (Plink command: --clump-kb 500 --clump-r2 0.2 --clump-p1 1.0 --clump-p2 1.0) was performed using a reference panel that represents the predominant ancestry found in the GWAS sample (i.e., 1000 Genomes European phase 1 version 3, n = 379). We removed variants according to the following exclusion criteria referenced by the PGC-SZ working group in ref. [5] to retain an informative SNP set for polygenic resilience scoring: (i) variants in the MHC region (chr6:25 Mb-34 Mb), (ii) variants in the chromosome 8 inversion region (chr8:7 Mb–14 Mb), (iii) variants with an imputation quality score <0.9, and (iv) variants that are strand-ambiguous (AT/CG genotypes) or were small insertions/deletions. In addition, we removed variants with a minor allele frequency <0.05 and variants present in 10 or fewer studies. A total of 80,822 LD-independent SNPs associated with resilience to SZ were available for inclusion in calculating “resilience scores” in our training and test sets.

Resilience scores were determined by counting the number of protective alleles for sets of variants defined by p-value cutoffs (p < 0.0001, <0.001, <0.01, <0.05, <0.1, <0.2, <0.3, <0.5, <0.7, <1.0) and multiplying allele counts by the natural logarithm of the resilience odds ratio for each variant. We adopted a method used by the PGC [5] to estimate the amount of variance in resilience status that can be attributed to resilience scores. The approach is a two-model regression. In the first regression, resilience scores were specified as a predictor variable to estimate the odds of being a high-risk resilient control versus a matched-risk case (treated as the reference) per standard unit increase in resilience score. Four principal components derived from genome-wide SNP data that were significantly different between high-risk controls and high-risk cases were included in this regression as covariates to control for population stratification. A second logistic regression model was fit to estimate the amount of additive variance that the four principal component covariates that were specified in the first model contributed to resilience status. The variance in resilience status explained by each model was (based on Nagelkerke’s pseudo-R2) was calculated using the R package fmsb (version 0.6.1). We calculated the difference in R2 values between in the two models, yielding a value of R2 that was attribute to the amount of variance in resilience status explained by resilience scores.

Correlation and interaction analyses of risk and resilience scores

We calculated Pearson’s correlation coefficients for risk and resilience scores in four separate groups: (1) SZ cases, (2) controls, (3) subjects with low-risk scores for SZ grouped by case–control status, and (4) and subjects with high-risk scores grouped by case–control status. Ultra-high-risk cases were excluded from the analysis (i.e., SZ cases with a risk score exceeding the maximum risk score of controls) as they lacked a matching set of controls. In addition, we performed a logistic regression analysis using all PGC samples (excluding ultra-high-risk cases) to test for a non-linear effect of risk and resilience scores on case–control status. Subjects were then split into deciles based on risk score. With logistic regressions, we computed the odds of a being a case between the bottom decile compared with each of the other deciles, as well as comparing the estimate the change in odds of being a control versus a case within each decile based on the unit increase in resilience score. The top 10 principal components for ancestry were included as covariates in the regression models.

Replication

We had direct access to imputed GWAS data for two independent studies that were withheld entirely from all discovery analyses and used exclusively in our replication analysis (i.e., the Molecular Genetics of Schizophrenia (MGS), and the Danish iPSYCH study) along with the summary results from a third fully Icelandic sample from deCODE Genetics [16,17,18]. Subjects in the MGS (n controls = 2,482, n cases = 2,638) sample were ascertained from the US and Australia, which included cases with a DSM-IV diagnosis of SZ or schizoaffective disorder, and controls with no known history of mood, anxiety, substance use, psychotic, or bipolar disorders. Cases from the iPSYCH Consortium sample (n controls = 10,175, n cases = 3634) were ascertained from the Danish Civil Registration System and linked to the Danish Psychiatric Central Research Register to obtain diagnoses of SZ, whereas controls were ascertained by random sampling from the Danish Civil Registration System and removing individuals with a diagnosis of SZ or bipolar disorder. The third case–control replication set was the Icelandic population-based sample generated by deCODE Genetics, Inc [18]. comprised of 138,761 controls and 873 cases. Risk scores were calculated in the replication samples using SNP rsIDs and weights derived from the 51-study GWAS meta-analysis of risk. We applied our 90th percentile threshold method to identify high-risk controls and equal-risk cases. Resilience scores were calculated in the replication samples using the formulae derived in the discovery sample, and logistic regression models were used to estimate the effect of resilience scores on affection status and the proportion of variance in resilience explained by resilience scores after adjusting for select principal components to control for population stratification. For the MGS and iPSYCH samples, we selected principal components that significantly differed between high-risk controls and equal-risk cases at a significance threshold of p < 0.05 (three for MGS and two for iPSYCH). The top 10 principal components were used for the deCODE sample. We performed an inverse-variance fixed-effect meta-analysis with the R package metafor (version 2.0.0) using the natural logarithm of odds ratios and standard errors for resilience scores obtained from the MGS, Danish iPSYCH, and Icelandic deCODE samples, in order to assess the aggregate predictive capacity of resilience scores. Using code adapted from Ricopili (https://github.com/Nealelab/ricopili), the proportion of variance in resilience status explained by resilience scores was transformed to the liability scale assuming a population prevalence of 10% based on the 90% cutoff used to define resilience [19].

Gene annotations

We downloaded a GTF file containing the positions of 57,820 protein- and non-coding genes, RNAs, and pseudogenes from the human reference genome version GRCh37.p13 from GENCODE [20]. The mapping of variants to genes was performed using the R package GenomicRanges. We retained annotations for protein-coding genes, non-coding genes (microRNAs, snoRNAs, snRNA, lincRNA), and antisense genes.

Enrichment of resilience SNPs in risk genes

After annotating SNPs to genes, we computed association scores per gene for both risk and resilience to SZ by averaging the z-scores of intragenic SNPs within a given gene obtained from our meta-analyses. Gene scores were determined using 480,469 intragenic risk SNPs and 1,681,145 intragenic resilience SNPs. We did not extend the coordinates of genes beyond the intragenic region when mapping SNPs to genes. Only risk SNPs at a p < 0.05 significance level were included to ensure that risk SNPs were mostly LD-independent (R2 < 0.2) from resilience SNPs found within the same gene. A linear regression model was used to predict the per-gene risk association score using the per-gene resilience association score while simultaneously adjusting for gene length (in kilobases), the number of SNPs per kilobase of gene length, the average minor allele frequency of variants in each gene, and the chromosome the gene was located on. Cluster-robust standard errors were computed to correct for heteroscedasticity potentially caused by LD between genes.