Subject ascertainment and diagnosis

The GPC is a large cosmopolitan sample of repository and newly ascertained schizophrenia and bipolar disorder cases and screened controls, with considerable representation of individuals with African, European, and Latino ancestries. In the present analysis, we considered as cases all individuals with a diagnosis of schizophrenia or schizoaffective disorder. Details of ascertainment and diagnosis are given in the Supplemental Material.

Single nucleotide polymorphism (SNP) genotyping and imputation

Genotyping of N = 33,422 participants was performed on Illumina Infinium arrays in a total of 11 “batches” (Table 1): four cohorts were ascertained as being primarily of African ancestry (OmniExpress 2.5 and Multi-Ethnic Global Array); three cohorts were of broadly Latino background (OmniExpress 2.5 and Multi-Ethnic Global Array); one included participants of any background (Global Screening Array); and three consisted mainly of European participants (OmniExpress and PsychArray) selected as part of parallel research initiatives. Typed variants were aligned to the human reference genome (GRCh37). Within each genotyping batch, we excluded any variant with missingness greater than 2% or a Hardy-Weinberg equilibrium P value < 10⁻⁶. Our scripts for pre-processing GWAS array data are downloadable from https://github.com/freeseek/gwaspipeline.
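
As an illustration only, the per-batch variant filter described above can be sketched in Python. The `VariantQC` record, its field names, and the example variants are hypothetical and are not taken from the gwaspipeline scripts; the sketch assumes that missingness and Hardy-Weinberg P values have already been computed for each variant.

```python
from dataclasses import dataclass


@dataclass
class VariantQC:
    """Hypothetical per-variant QC summary for one genotyping batch."""
    variant_id: str
    missing_rate: float  # fraction of samples without a genotype call
    hwe_p: float         # Hardy-Weinberg equilibrium test P value


def passes_batch_qc(v, max_missing=0.02, min_hwe_p=1e-6):
    """Keep a variant unless missingness > 2% or HWE P value < 1e-6."""
    return v.missing_rate <= max_missing and v.hwe_p >= min_hwe_p


# Made-up example variants (illustrative values only).
variants = [
    VariantQC("rs_a", 0.001, 0.40),  # passes both filters
    VariantQC("rs_b", 0.050, 0.40),  # fails on missingness
    VariantQC("rs_c", 0.001, 1e-9),  # fails on HWE
]
kept = [v.variant_id for v in variants if passes_batch_qc(v)]
```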

Table 1 GPC sample sizes by genotyping batch and assigned ancestry. For constituent datasets in the current analysis (Genotyping Wave/Batch), the commercial genotyping array and the numbers of individuals assigned to African, Latino, and European ancestry groups are displayed. Within each ancestry group, the reported total is based on those quantities appearing in boldface.

Computational phasing was performed for each genotyping batch using Eagle (v2.3.5) [31] with default parameters. Statistical genotype imputation was performed for each genotyping batch using Minimac3 (v2.0.1) [32] with default parameters and publicly available reference haplotypes from the 1000 Genomes Project (1KGP) Phase 3 [33].

Relationship inference, population structure and ancestry assignment

We used the KING software package [34] to identify duplicates and infer familial relationships in the full GPC cohort using a set of overlapping, genotyped variants. Within genotyping batches, we excluded from each pair of duplicates the sample with the larger fraction of missing genotypes. Next, we retained one sample from each remaining pair of duplicates or first-degree relatives (i.e., parent-offspring or sibling pairs), preferentially retaining cases from affected/unaffected relative pairs. For diagnostically concordant pairs, we considered the degree (and direction) of case-control imbalance in each of the originating batches in terms of effective sample size, where N_eff = 4/(1/N_cases + 1/N_controls). We preferentially assigned samples to batches with smaller ratios of N_eff:N when this ameliorated case-control imbalance, and updated batch-wise values of N_cases, N_controls, and N_eff after each assignment.
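
The effective sample size used above can be made concrete with a short Python sketch. The function names are hypothetical, and the example counts are made up for illustration; they are not GPC batch sizes.

```python
def effective_sample_size(n_cases, n_controls):
    """N_eff = 4 / (1/N_cases + 1/N_controls); 0 if either group is empty."""
    if n_cases <= 0 or n_controls <= 0:
        return 0.0
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def neff_ratio(n_cases, n_controls):
    """Ratio N_eff : N, which equals 1 for a perfectly balanced batch."""
    n_total = n_cases + n_controls
    if n_total == 0:
        return 0.0
    return effective_sample_size(n_cases, n_controls) / n_total


balanced = neff_ratio(1000, 1000)    # equal numbers of cases and controls
imbalanced = neff_ratio(500, 1500)   # three controls per case
```

A balanced batch has a ratio of 1, and the ratio falls as case-control imbalance grows, which is why a smaller ratio flags a batch that could benefit from an additional case or control.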

Principal components analysis (PCA) was performed with GCTA (v1.2.4) [35], using a genome-wide genetic relatedness matrix (GRM) estimated for the full GPC dataset and reference samples from the 1KGP Phase 3 data [33] based on 34,918 genotyped SNPs. For each individual, we estimated genome-wide average proportions of African (AFR), European (EUR), Admixed American (AMR), East Asian (EAS), and South Asian (SAS) ancestry from global ancestry PCs using a simple linear mixed model. Using these estimated proportions and defining significant admixture as 25% or more of a given continental origin, we assigned individuals to three broad ancestry groups: 10,070 African (≥25% AFR and <25% AMR, <25% EAS, <25% SAS); 4324 Latino (≥25% AMR and <25% AFR, <25% EAS, <25% SAS); and 10,580 European (<25% AFR, <25% AMR, <25% EAS, <25% SAS) (Fig. 1). Clustering of individuals in each broad ancestry group with the 1KGP reference populations is shown in Supplemental Figs. 1–3. We refer to the admixed African and Latino ancestry GPC cohorts as GPC-AA and GPC-Latino, respectively.
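
The threshold rules above amount to a small decision procedure, sketched below in Python. The function name and the dictionary input format are hypothetical. The sketch assumes the ancestry proportions have already been estimated, and it returns `None` for individuals who match none of the three stated rules (for example, those with both AFR and AMR at or above 25%), since the text does not say how such individuals were handled.

```python
def assign_ancestry_group(props, threshold=0.25):
    """Apply the 25% admixture rules to estimated ancestry proportions.

    `props` maps "AFR", "AMR", "EAS", "SAS" to proportions in [0, 1].
    Returns "African", "Latino", "European", or None if no rule applies.
    """
    afr = props["AFR"] >= threshold
    amr = props["AMR"] >= threshold
    eas = props["EAS"] >= threshold
    sas = props["SAS"] >= threshold

    if eas or sas:
        return None          # all three groups require <25% EAS and SAS
    if afr and not amr:
        return "African"
    if amr and not afr:
        return "Latino"
    if not afr and not amr:
        return "European"
    return None              # both AFR and AMR >= 25%: outside the stated rules


# Made-up individual: 60% AFR, small AMR/EAS/SAS components.
example = assign_ancestry_group({"AFR": 0.60, "AMR": 0.05, "EAS": 0.01, "SAS": 0.02})
```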

Fig. 1 Ancestry assignment and Manhattan plots for trans-ancestry meta-analyses of GPC-AA and GPC-Latino with PGC-SCZ2. a PCA-based clustering of GPC participants shaded by broad ancestry assignment. b Red and blue dashed lines denote thresholds for genome-wide significance (P < 5 × 10⁻⁸) and replication follow-up in PGC-SCZ2 (P < 10⁻⁶). For newly genome-wide significant regions, the top SNP within a 3 Mb region is displayed as a diamond; nearby SNPs in linkage disequilibrium (r² > 0.1) are highlighted.

Genome-wide association and trans-ancestry meta-analysis

Within each broadly defined ancestry group, we tested for association between imputed genotype dosages and a diagnosis of schizophrenia (or schizoaffective disorder, SAD) by logistic regression using PLINK [36, 37], including the first six ancestry PCs and site/cohort indicator variables as covariates. Within each analysis, we retained variants with imputation quality (INFO) of 0.3 or greater and minor allele frequency (MAF) of at least 1%, based on average values calculated for the combined ancestry cohort. We combined association results across ancestry groups under fixed effects (i.e., inverse variance weighted) and Han and Eskin’s random effects (RE2) models, as implemented in METASOFT [38]. The Han and Eskin random effects model is optimized to detect allelic associations in the presence of heterogeneity [38]. We also applied this method to combine male- and female-specific association results for X chromosome variants.
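
For readers unfamiliar with the fixed effects model, the inverse-variance-weighted combination for a single variant can be sketched as follows. This is a generic textbook calculation written for illustration, not METASOFT's code or interface, and it does not cover the RE2 model. The function name and the example effect sizes are hypothetical.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed effects meta-analysis for one variant.

    `betas` are per-study log odds ratios and `ses` their standard errors.
    Returns (pooled beta, pooled SE, z score, two-sided normal P value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    # Two-sided P value under the standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Made-up example: two studies with identical estimates and precision.
beta, se, z, p = fixed_effects_meta([0.10, 0.10], [0.05, 0.05])
```

With two equally precise studies the pooled estimate is unchanged and the standard error shrinks by a factor of √2, which is the behavior expected of inverse-variance weighting.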

In our primary trans-ancestry meta-analyses, we combined genome-wide summary statistics for the African and Latino ancestry GWAS with the PGC-SCZ2 study results. The discovery phase of PGC-SCZ2 included 34,241 cases and 45,604 controls from 46 European and 3 East Asian case-control studies, and 1235 parent-affected offspring trios from 3 family-based samples of European ancestry [12]. The PGC-SCZ2 summary statistics are publicly available (https://www.med.unc.edu/pgc/results-and-downloads) and have been used in dozens of follow-up studies, and thus represent a meaningful benchmark for genetic analysis. We applied the same filters for SNP association results as described in the original study (INFO ≥ 0.6, MAF ≥ 1%, and present in at least 20 of 49 studies) and interpreted the PGC-SCZ2 results as being broadly representative of findings based on European populations.

Consistency of directions of allelic effects

Linkage disequilibrium (LD) based “clumping” was used to obtain approximately independent sets of SNPs (r² < 0.1 within a 500 kilobase (kb) window) using the 1KGP Phase 3 European (EUR) data, preferentially retaining the most significant SNP in the PGC-SCZ2 analysis (among those meeting filtering criteria in the relevant GPC analysis). For varying P value thresholds applied to the PGC-SCZ2 results, we used a binomial sign test to determine if the proportion of same-direction effects in the admixed African or Latino analyses was greater than expected by chance (i.e., a one-sided test of whether this fraction is greater than 0.5). Reciprocal analyses comparing the observed directions of effects in PGC-SCZ2 to the African and Latino ancestry results were also performed, with LD-clumping based on the corresponding 1KGP reference population.
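
The one-sided sign test can be written out exactly with the Python standard library. The sketch below is illustrative only: the function name is hypothetical, and the inputs are simply the number of clumped SNPs whose effect direction agrees between the two analyses and the total number of SNPs compared.

```python
from math import comb


def sign_test_one_sided(n_same, n_total):
    """Exact P(X >= n_same) for X ~ Binomial(n_total, 0.5).

    Tests whether the fraction of same-direction effects exceeds 0.5.
    """
    tail = sum(comb(n_total, k) for k in range(n_same, n_total + 1))
    return tail / 2 ** n_total


# Made-up example: all 10 of 10 SNPs share the same direction of effect.
p_all_same = sign_test_one_sided(10, 10)
```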

Polygenic risk score profiling

We performed polygenic risk score profiling based on the PGC-SCZ2 summary statistics (the “training” dataset), testing these scores for association with case-control status in African, Latino, and European cohorts from the GPC (the “target” datasets). For each pair of training and target datasets, results for overlapping SNPs (or indels) meeting quality control requirements (imputation quality ≥ 0.3 and MAF ≥ 1%) were subjected to LD-based clumping in the appropriate reference population from the 1KGP (r² < 0.1 within a 500 kb window); for analyses of African, Latino, and European cohorts, we utilized reference data for AFR, AMR, and EUR populations, respectively. For SNPs significant at varying P value thresholds (P_T) in the training dataset, individual-level scores were constructed by summing, across SNPs, the number of copies of a given allele weighted by its corresponding effect estimate (i.e., the log-transformed odds ratio in the training dataset). We evaluated the significance of case-control differences using logistic regression, with ancestry-based principal components (PCs) and a study indicator variable as covariates. Predictive values of these scores are reported both in terms of Nagelkerke’s pseudo-R² (fmsb package in R) [39] and after adjusting for sample and population prevalences of 1% for schizophrenia or bipolar disorder (i.e., the liability scale) [40]. We examined how varying strengths of LD among SNPs used to construct a polygenic score influence within- and cross-ancestry genetic prediction by repeating these procedures and increasing the threshold for “clumping” correlated markers (pairwise r²) to 0.5 and 0.8.
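
The score construction for one individual can be sketched as a weighted sum. The sketch below is a simplified illustration under stated assumptions: the function name, the dictionary inputs, and the example odds ratios are hypothetical; SNPs are assumed to be already clumped and aligned to the same effect allele in both datasets; and the inclusion rule is taken to be P < P_T, which the text does not specify exactly.

```python
import math


def polygenic_score(dosages, log_odds, p_values, p_threshold):
    """Sum of allele dosages weighted by training-set log odds ratios.

    Only SNPs with training P value below `p_threshold` (assumed rule) and
    with a dosage available in the target individual contribute.
    """
    score = 0.0
    for snp, beta in log_odds.items():
        if p_values[snp] < p_threshold and snp in dosages:
            score += dosages[snp] * beta
    return score


# Made-up training statistics (odds ratios converted to log odds ratios).
log_odds = {"rs1": math.log(1.10), "rs2": math.log(0.95), "rs3": math.log(1.20)}
p_values = {"rs1": 1e-8, "rs2": 0.03, "rs3": 0.5}
# Made-up imputed dosages (0 to 2 copies of the effect allele).
dosages = {"rs1": 2.0, "rs2": 1.0, "rs3": 0.0}

score_strict = polygenic_score(dosages, log_odds, p_values, 1e-6)
score_lenient = polygenic_score(dosages, log_odds, p_values, 0.05)
```

Relaxing P_T admits more SNPs into the sum, which is why scores are evaluated over a range of thresholds.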

Because genetic prediction is generally worse when comparing training and testing datasets of divergent ancestry, with greater attenuation of predictive value for more divergent populations [27, 28, 41], we constructed analogous polygenic scores based on the African and Latino GWAS results. For within-ancestry prediction, we maintained the independence of training and testing datasets via an iterative “leave-one-out” procedure in which each cohort was omitted, and the remaining samples re-analyzed; the resultant summary statistics represented independent training datasets. For cross-ancestry prediction from the African or Latino GWAS, summary statistics from the primary mega-analysis were utilized.
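
The leave-one-out scheme for within-ancestry prediction can be outlined as a loop over cohorts. This is a structural sketch only: the function name and cohort labels are hypothetical, and the re-analysis of the training cohorts (the GWAS itself) is not shown.

```python
def leave_one_cohort_out(cohorts):
    """Yield (training cohorts, held-out target cohort) for each cohort in turn."""
    for held_out in cohorts:
        training = [c for c in cohorts if c != held_out]
        yield training, held_out


# Made-up cohort labels; each split keeps training and target data disjoint.
splits = list(leave_one_cohort_out(["cohort_A", "cohort_B", "cohort_C"]))
```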

Trans-ancestry fine-mapping of schizophrenia loci

We attempted to fine-map 276 autosomal and X-chromosome regions around statistically independent SNPs with association P value < 10⁻⁶ in the publicly available PGC-SCZ2 summary statistics. For each index SNP, we considered SNPs correlated at r² ≥ 0.6 within a 3 megabase window which had P < 10⁻⁴ in the PGC-SCZ2 discovery analysis. We constructed credible SNP sets by combining their posterior probabilities until the sum exceeded 99%, following the approach of Huang et al. [42]. Credible sets for meta-analytic models representing the PGC-SCZ2 discovery phase and its combined analysis with GPC-AA were compared on the basis of total length and number of credible SNPs, and the smallest observed P value among these SNPs; we followed up regions attaining greater significance in the combined PGC-SCZ2/GPC-AA analysis and for which the credible set in the combined analysis represented a shorter genomic interval than the corresponding interval in the PGC-SCZ2 analysis. We considered a region to be “fine-mapped” if the genomic interval for the reduced credible set was smaller than the corresponding interval for SNPs with LD r² ≥ 0.6 to the index SNP (based on 1KGP EUR reference data).
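
The accumulation step for a credible set can be sketched as below. The sketch assumes that per-SNP posterior probabilities for a region have already been computed and sum to 1, and that SNPs are added in order of decreasing posterior probability; the computation of the posteriors themselves is not shown. The function names, SNP labels, posterior values, and positions are all hypothetical.

```python
def credible_set(posteriors, coverage=0.99):
    """Add SNPs by decreasing posterior probability until the sum exceeds `coverage`."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, total = [], 0.0
    for snp, prob in ranked:
        selected.append(snp)
        total += prob
        if total > coverage:
            break
    return selected


def interval_length(snps, positions):
    """Genomic span (in base pairs) covered by a set of SNPs."""
    coords = [positions[s] for s in snps]
    return max(coords) - min(coords)


# Made-up posteriors and base-pair positions for one region.
posteriors = {"snp_a": 0.90, "snp_b": 0.08, "snp_c": 0.015, "snp_d": 0.005}
positions = {"snp_a": 1_000_000, "snp_b": 1_020_000, "snp_c": 990_000, "snp_d": 1_500_000}

cs = credible_set(posteriors)
span = interval_length(cs, positions)
```

Comparing `span` and the number of SNPs in `cs` between two meta-analytic models mirrors the comparison of credible sets by total length and SNP count described above.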