PRS usage and performance in worldwide populations

How well different ancestry groups were represented in the first decade of polygenic scoring research (2008–2017, inclusive) is shown in Fig. 1a, which presents cumulative distributions of numbers of studies for specific ancestry groups across time. The field has been dominated by European ancestry studies. Across the 733 studies examined (see Methods for inclusion criteria and Supplementary Data 1 for a list of studies), 67% included exclusively European ancestry participants. A further 140 studies (19%) were conducted in exclusively Asian populations, most commonly in East Asian countries (e.g., China and Japan). Only 3.8% of the polygenic studies from this first decade concerned African, Latino/Hispanic, or Indigenous populations, taken together. Note that Fig. 1 retains population names from the original reports (e.g., Native American and Middle Eastern) in order to maintain consistency in terminology; in the figure, "Combined" denotes that more than one ancestry group was included in the study (e.g., European ancestry and Asian ancestry participants). These results are similar to those reported by Popejoy and Fullerton3, who noted that non-European ancestry representation in GWASs was almost exclusively in Asian populations, and East Asian populations in particular.

Fig. 1 Ancestry representation in the first decade of polygenic scoring studies (2008–2017; N = 733 studies). a Cumulative numbers of studies by year are denoted by color. The stacked bar graph below the cumulative distribution plot shows proportional ancestry by year. b Stacked bar charts depict world ancestry representation (left) and polygenic scoring study representation (right). c The percentage representation for each ancestry group is given, such that 100% would indicate equal representation in the world and in polygenic scoring studies. For example, European ancestry samples are over-represented (460%) whereas African ancestry samples are under-represented (17%).

By comparing representation of particular ancestry groups with world population estimates for those groups (Fig. 1b), it is possible to quantify the over- or under-representation of each major ancestry group. European ancestry representation was ~460% of what it would be if representation were proportional to world ancestry. In contrast, African ancestry samples (17% of proportional representation) and Latino samples (19%) were under-represented relative to world populations. East and South Asian samples are combined in this figure, but it should be noted that representation of East Asian samples is much higher than that of South Asian samples, which have been included in very few polygenic scoring studies to date. Relative to world populations for these groups, Middle Eastern and Oceanic populations have the lowest representation in polygenic scoring studies (10% and 0%, respectively).
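The percentage representation described above is a simple ratio of shares, and a minimal sketch may make the scale concrete. The function name and the example shares below are hypothetical illustrations; they are not the values underlying Fig. 1.

```python
def representation_pct(study_share: float, world_share: float) -> float:
    """Share of studies as a percentage of the share of world population.

    100 means proportional representation; values above 100 indicate
    over-representation and values below 100 under-representation.
    """
    if world_share <= 0:
        raise ValueError("world_share must be positive")
    return 100.0 * study_share / world_share


# Hypothetical shares (study share, world share); NOT the data behind Fig. 1.
example_shares = {
    "group_A": (0.60, 0.15),
    "group_B": (0.03, 0.15),
}

for group, (study_share, world_share) in example_shares.items():
    print(group, round(representation_pct(study_share, world_share)))
```

Under this definition, a group with no studies at all scores 0%, whatever its share of the world population.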

Having analyzed the use of polygenic scores in different ancestry groups (above), we next assessed the performance of polygenic scores in multiple ancestry groups. Since most large-scale GWAS have been conducted in primarily (or exclusively) European ancestry individuals42, our a priori hypothesis was that polygenic scores would perform best among European ancestry individuals, and less well for other populations. Figure 2 provides an overview of polygenic score performance across ancestry groups. Results from all complex genetic phenotypes are analyzed together in order to increase the amount of data available for analysis. In Fig. 2, each point represents one within-study comparison between a non-European ancestry sample and the matched (within-study) European ancestry sample. The vertical black line represents equal performance in the non-European ancestry sample, as compared to the matched European ancestry sample from the same study.

Fig. 2 Forest plot of performance shows variation in polygenic score performance by ancestry (26 studies). Each row in the forest plot (left) represents one pair of polygenic analyses (i.e., in a non-European ancestry sample and a matched European ancestry sample from the same study). Phenotypes, citation information, and available effect sizes are given for each comparison. The vertical black line at 100% corresponds to equal performance in the non-European ancestry and the European ancestry samples. Vertical colored lines denote median standardized effect sizes for each of the major ancestry groups. On the top right, median values for standardized effect sizes are given for each major ancestry group. Standard errors are not provided because many studies lacked sufficient information; however, statistical significance of each non-European ancestry analysis is denoted by point size. HDL-C high density lipoprotein cholesterol, VLDL very low-density lipoprotein, GERA Genetic Epidemiology Research on Aging, OR odds ratio, UKB UK Biobank, IgAN immunoglobulin A nephropathy, AUC area under the curve, BP blood pressure, BMI body mass index

As shown in Fig. 2, polygenic score performance was worst among African ancestry samples. The median effect size of polygenic scores in African ancestry samples was only 42% that of matched European ancestry samples (t = −5.97, df = 24, p = 3.7 × 10⁻⁶). Relative to matched European ancestry samples, performance was also lower in South Asian (60%) and East Asian (95%) samples, but not significantly so (see top right portion of Fig. 2). In sum, an expectation of poorer polygenic score performance in non-European ancestry populations seems reasonable given these data. Attenuation of predictive performance is likely to be most extreme in samples of African ancestry, consistent with the greater average genetic distance between European and African ancestry populations than between European and other ancestry populations28,43.
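The within-study comparison summarized above can be sketched as follows: each non-European effect size is expressed as a percentage of its matched European effect size, and the median is taken per ancestry group. The function names and all numbers are hypothetical, and the sketch assumes that both effect sizes in a pair are reported on the same scale.

```python
from statistics import median


def relative_performance(non_eur_effect: float, eur_effect: float) -> float:
    """Non-European effect size as a percentage of the matched European one."""
    return 100.0 * non_eur_effect / eur_effect


def median_by_ancestry(rows):
    """Median relative performance per ancestry group.

    rows: iterable of (ancestry, non_european_effect, matched_european_effect).
    """
    grouped = {}
    for ancestry, non_eur, eur in rows:
        grouped.setdefault(ancestry, []).append(relative_performance(non_eur, eur))
    return {ancestry: median(values) for ancestry, values in grouped.items()}


# Hypothetical within-study pairs; these are NOT the 26 studies in Fig. 2.
pairs = [
    ("ancestry_X", 1.0, 2.0),
    ("ancestry_X", 3.0, 4.0),
    ("ancestry_Y", 1.0, 1.0),
]
print(median_by_ancestry(pairs))
```

A value of 100 corresponds to the vertical black line in Fig. 2, i.e., equal performance in the two matched samples.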

Methodological choices impact polygenic score distributions

We now consider questions about possible differences in polygenic scores among ancestral populations. Polygenic scores, as currently calculated, vary with ancestry. Indeed, polygenic scoring practices from as early as 2009 accounted for this12. The method used by Purcell et al.12 in 2009 (and frequently since) includes two steps for mixed ancestry samples. First, samples are separated into more ancestrally homogeneous subgroups (using visual inspection of plots of principal components calculated on all genetic data from all samples). Second, principal components are calculated again within each of these more ancestrally homogeneous subgroups, and are used as covariates in polygenic scoring analyses, which are conducted separately within each subgroup. However, some research groups are not aware of these methodological recommendations, and others (understandably) prefer a more inclusive approach of analyzing all samples together (instead of creating subgroups). Figure 3 demonstrates why care must be taken in the treatment of ancestry in polygenic scoring studies.
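The two-step procedure described above can be sketched on synthetic data. This is only an illustration under stated assumptions: a median split on the first principal component stands in for visual inspection of PC plots, the choice of five within-group PCs is arbitrary, and all function names and data are hypothetical rather than taken from the original analyses.

```python
import numpy as np


def top_pcs(genotypes, n_pcs):
    """Scores on the top principal components of a samples x variants matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def prs_coefficient(genotypes, prs, phenotype, n_pcs=5):
    """Regress phenotype on PRS plus within-group PCs; return the PRS coefficient."""
    pcs = top_pcs(genotypes, n_pcs)
    design = np.column_stack([np.ones(len(prs)), prs, pcs])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return float(coef[1])


# Synthetic data (hypothetical): two groups with different allele frequencies.
rng = np.random.default_rng(0)
n_per_group, n_variants = 200, 50
freqs = [rng.uniform(0.1, 0.5, n_variants), rng.uniform(0.5, 0.9, n_variants)]
geno = np.vstack(
    [rng.binomial(2, f, size=(n_per_group, n_variants)) for f in freqs]
).astype(float)
prs = geno @ rng.normal(size=n_variants)   # hypothetical variant weights
phenotype = rng.normal(size=len(prs))      # pure noise; a real phenotype goes here

# Step 1: split samples using PCs computed on all samples together.
pc1 = top_pcs(geno, 1)[:, 0]
in_first_cluster = pc1 > np.median(pc1)

# Step 2: recompute PCs within each subgroup and analyze subgroups separately.
estimates = {}
for name, mask in (("cluster_1", in_first_cluster), ("cluster_2", ~in_first_cluster)):
    estimates[name] = prs_coefficient(geno[mask], prs[mask], phenotype[mask])
print(estimates)
```

In real data the split is rarely this clean, which is one reason admixed or closely related populations are hard to assign to subgroups.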

Fig. 3 Polygenic score distributions vary by ancestry and methodological choices. For polygenic score construction, clumping is often used, and investigator-driven choices can produce large differences in score distributions for global populations. Polygenic score distributions for the five major 1000Genomes populations are plotted, showing how investigator-driven choices impact score distributions. For all plots, weights were derived from the UK Biobank height GWAS. Both the r² values used in clumping (r² = .2, .05, .01; see columns) and the 1000Genomes populations used for clumping (ALL, EUR, AFR, AMR, EAS, SAS; see rows) were varied. a, b correspond to the p-value threshold (pT) applied to the height summary statistics. a pT = genome-wide significant variants (p < 5 × 10⁻⁸); b pT = full genome variants (p < 1). PRS polygenic risk score, ALL union of five 1000Genomes populations

As shown in Fig. 3, methodological choices in the construction of polygenic risk scores can cause dramatic differences in distributions of polygenic risk scores for worldwide populations (polygenic scores were constructed for all 1000Genomes participants, N = 2577, see Methods for additional details). In general, the inclusion of more variants caused greater dispersion of distributions for these 1000Genomes populations (i.e., comparing panel A with panel B; genome-wide significant variants to all variants). Lower r² thresholds for clumping tended to make worldwide population distributions more similar (left to right, in both A and B), and use of particular populations for clumping also dramatically affected distributions, particularly when East and South Asian populations were used for clumping. For further demonstration of the large effect that these methodological choices have on polygenic score distributions, see Supplementary Fig. 1 (polygenic scores for height18, weighted with GIANT) and Supplementary Fig. 2 (polygenic scores for PTSD10, from a multi-ancestry GWAS from the Psychiatric Genomics Consortium, PGC).
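To illustrate how the choices discussed above enter score construction, the following is a simplified sketch of greedy clumping followed by scoring. It is not the pipeline used for Fig. 3: it omits the physical distance window that clumping tools normally apply, assumes every variant is polymorphic in the reference panel, and uses hypothetical function names and toy data.

```python
import numpy as np


def clump(reference_genotypes, p_values, r2_threshold, p_threshold):
    """Greedy clumping on a reference panel (samples x variants).

    Variants with p < p_threshold are visited from most to least significant;
    a variant is kept only if its r^2 with every kept variant is below
    r2_threshold.
    """
    r2 = np.corrcoef(reference_genotypes, rowvar=False) ** 2
    kept = []
    for j in np.argsort(p_values):
        if p_values[j] >= p_threshold:
            continue
        if all(r2[j, k] < r2_threshold for k in kept):
            kept.append(int(j))
    return kept


def polygenic_score(genotypes, betas, kept):
    """Weighted sum of allele counts over the retained variants."""
    return genotypes[:, kept] @ betas[kept]


# Toy reference panel: variant 1 duplicates variant 0; variant 2 has r^2 = 0.25 with it.
panel = np.array(
    [[0, 0, 2], [1, 1, 0], [2, 2, 1], [0, 0, 1], [1, 1, 2], [2, 2, 0]], dtype=float
)
p_values = np.array([1e-10, 1e-9, 1e-3])
betas = np.array([0.5, 0.4, -0.2])

for r2_threshold in (0.5, 0.2):
    kept = clump(panel, p_values, r2_threshold, p_threshold=1.0)
    print(r2_threshold, kept, polygenic_score(panel, betas, kept))
```

Because the r² matrix is computed from whichever panel is passed in, swapping the reference population changes which variants survive clumping, and therefore the resulting score distribution.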

As stated above, it is typical for researchers to separate samples into more ancestrally homogeneous groups prior to polygenic scoring analyses; Fig. 3 demonstrates why this is one sensible analytical choice. However, this is not always done. Some researchers are unaware of the extent to which ancestry can impact variant frequencies, and others may choose not to split samples because there is no clear choice regarding how multiple admixed and/or similar populations should be split. In all instances, proper use of principal components (PCs) or other methods of correcting for ancestry is critical. Further, Fig. 3 implies that there is no single recommendation for the number of PCs needed, given that PCs correlate with 1000Genomes populations (see Supplementary Fig. 3). Underscoring this point, Supplementary Fig. 4 shows the magnitude of correlations between 1000Genomes participants’ polygenic scores (for height, BMI, and schizophrenia) and the first 20 PCs (N = 2577; see Supplementary Fig. 5 for representative scatterplots and Methods for additional details). Two key conclusions can be drawn from these figures. First, multiple, sometimes non-consecutive, PCs are correlated with polygenic scores for each of these phenotypes. For example, for height polygenic scores constructed with GIANT summary statistics, it is primarily the first PCs (1–4) that are significantly correlated with polygenic scores, but non-consecutive and later PCs are also correlated (e.g., PCs 7 and 12, in this example). Second, results vary somewhat across the range of p-value thresholds (pT) used for constructing polygenic scores. These points underscore that using only a small number of PCs (e.g., under 10) may be inadequate for polygenic scoring analyses of mixed ancestry and admixed samples, and that careful inspection of data and plots is always needed.
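One way to carry out the kind of inspection described above is to correlate polygenic scores with each of the first 20 PCs and flag those that stand out. The sketch below is illustrative only: the normal approximation to the p-value and the Bonferroni cutoff are assumptions made here, not the procedure used for Supplementary Fig. 4, and the data are simulated.

```python
import math

import numpy as np


def pc_correlations(prs, pcs):
    """Pearson r and approximate two-sided p-value for PRS versus each PC column."""
    n = len(prs)
    results = []
    for k in range(pcs.shape[1]):
        r = float(np.corrcoef(prs, pcs[:, k])[0, 1])
        t = r * math.sqrt((n - 2) / max(1.0 - r * r, 1e-12))
        p = math.erfc(abs(t) / math.sqrt(2.0))  # normal approximation, large n
        results.append((k + 1, r, p))
    return results


def flag_pcs(results, alpha=0.05):
    """1-based indices of PCs passing a Bonferroni-corrected threshold."""
    cutoff = alpha / len(results)
    return [k for k, _, p in results if p < cutoff]


# Simulated example: the score depends on PC1 and, more weakly, on PC7.
rng = np.random.default_rng(1)
pcs = rng.normal(size=(2000, 20))
prs = pcs[:, 0] + 0.5 * pcs[:, 6] + rng.normal(size=2000)

flagged = flag_pcs(pc_correlations(prs, pcs))
print(flagged)
```

In this simulation a covariate set limited to the first few PCs would miss PC7, which mirrors the point that correlated PCs need not be consecutive.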

Putative correlations between worldwide phenotypes and PRSs

Finally, we turn to the most difficult question: what causes differences in polygenic scores, as currently calculated, among different populations? Differences could be real or artifactual (i.e., due to bias in data and/or methods), and six categories of explanations are listed below.

(1) True differences due to drift

(2) True differences due to selection

(3) True differences in genetic effects due to environmental differences (gene-environment interactions)

(4) Bias due to uncorrected population stratification in discovery and/or training samples

(5) Bias due to discovery/training population data and/or polygenic scoring methods. Specifically, linkage disequilibrium (LD) structure and variant frequency are captured imperfectly with current methods (including genotyping and imputation), and they vary across populations, and currently available data resources are unequally representative of diverse worldwide populations.

(6) Random error in the estimation of GWAS betas

Drift has been implicated as an explanation for population differences in polygenic scores among populations32, but others have reported that drift is insufficient to explain such differences33. Further, initial estimates of the strength of polygenic selection on height in European ancestry populations33,37 have recently been greatly reduced30,31, based on findings of uncorrected population stratification in summary statistics from the GIANT Consortium30,31. There is also disagreement about whether or not differences in average polygenic scores among populations might contribute to differences in phenotypic values among the same populations (which could also be due to environmental variation). Some have noted apparent positive correlations between average polygenic scores and phenotypes for BMI34, lupus35, and height as calculated using GIANT Consortium scores33,36,37. As described below, we include more data than used previously to address questions about potential correlations between worldwide height polygenic scores and height phenotypes.

Using 1000Genomes data (as described in the Methods) and commonly used but different methodological choices in the construction of polygenic scores, we demonstrate that no simple conclusions can be drawn about polygenic scores and height for worldwide populations. In Fig. 4 we plot average polygenic scores for height of 1000Genomes populations on the x-axis, using three sources of weights for constructing scores (PRS = polygenic risk score):

4a (top row) GIANT Consortium18-based scores: PRS_height_GIANT

4b (middle row) UK Biobank44-based scores from the NealeLab: PRS_height_UKBiobank

4c (bottom row) East Asian GWAS-based scores45 from He et al.: PRS_height_EastAsian

Fig. 4 Scatterplots of height polygenic scores (x-axis) and phenotypic height (y-axis). Plots demonstrate that correlations between polygenic scores for height and height are not consistent across discovery GWAS. The y-values for height are the same for each plot and reflect average height of individuals in the country of origin for each population included. Average heights (y-axis) are from a different height GWAS used to construct polygenic scores (x-axis). Three different GWAS of height were used (i.e., three rows) with three different p-value thresholds (i.e., three columns) for the construction of polygenic scores. a GIANT-based polygenic scores for height. b UK Biobank-based polygenic scores for height. c East Asian-based polygenic scores for height. The last two plots are missing because only genome-wide significant variants were available for the East Asian GWAS of height. p and r values for each plot are for correlation tests between polygenic scores for height (x-axis) and height (y-axis). GWAS genome-wide association study, GIANT Genetic Investigation of ANthropometric Traits, PRS polygenic risk score. Population abbreviations within scatterplots are those used by the 1000Genomes Consortium and are available in Supplementary Table 3

On the y-axis, we plot average height for countries of origin for 1000Genomes populations, when available (see Methods for details and exclusions).

As shown in Fig. 4a, height phenotypes for worldwide populations (y-axis) are positively correlated with GIANT-based18 polygenic scores for height (x-axis), but not with UK-Biobank-based polygenic scores (4b) or East Asian GWAS based polygenic scores (4c). Polygenic scores constructed using only genome-wide significant variants from GIANT (top left) were positively correlated with height phenotypes (r = .67, p = .002), as were scores constructed using larger numbers of GIANT-based variants (e.g., all variants, top right, r = .59, p = .008). Results in 4b and 4c demonstrate that correlations (or lack of correlations) between height and polygenic scores for height are dependent on discovery GWAS. There are numerous reasons why polygenic scores differ between studies. However, recent findings suggest that correction for population stratification may not have been adequate in GIANT30,31, and therefore the positive correlations observed in 4a could be partially due to uncorrected population stratification. The dependence of correlation estimates on discovery GWAS is further illustrated in 4c, in which the point estimate for correlation between height and East Asian GWAS based polygenic scores for height is negative (r = −.11, p = .643). Power in discovery GWAS is also relevant, and greater confidence should be assigned to the results in 4a and 4b because both European ancestry discovery GWAS were adequately powered to detect hundreds of height loci, whereas the East Asian height GWAS was only adequately powered to detect 17 loci. Finally, methodological choices in polygenic score construction (see Fig. 3) must also be considered. 
The shape, dispersion, and even the ordering of distributions of polygenic scores for different 1000Genomes populations depend on polygenic score construction parameters, which would necessarily result in different correlations with population phenotypes. This is one additional reason why differences in polygenic scores among populations cannot be naively interpreted.
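The population-level correlations reported for Fig. 4 combine two steps: averaging individual scores within each population, then correlating those averages with population mean heights. A minimal sketch follows; the function names, population labels, scores, and heights are all hypothetical.

```python
import numpy as np


def population_mean_scores(scores, populations):
    """Average individual polygenic scores within each population label."""
    labels = sorted(set(populations))
    pops = np.asarray(populations)
    means = np.array([scores[pops == label].mean() for label in labels])
    return labels, means


def score_height_correlation(scores, populations, mean_height_by_pop):
    """Pearson r between population mean scores and population mean heights."""
    labels, means = population_mean_scores(scores, populations)
    heights = np.array([mean_height_by_pop[label] for label in labels])
    return float(np.corrcoef(means, heights)[0, 1])


# Hypothetical data: six individuals from three populations.
populations = ["POP_A", "POP_A", "POP_B", "POP_B", "POP_C", "POP_C"]
scores = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])
mean_height_cm = {"POP_A": 160.0, "POP_B": 165.0, "POP_C": 180.0}

print(score_height_correlation(scores, populations, mean_height_cm))
```

Rerunning the same function with scores built from a different discovery GWAS, or with different clumping parameters, changes only the `scores` input, and that alone is enough to change the sign or size of the correlation.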

More research is needed to better understand the exact causes of differences in score distributions across populations and their putative relationships to phenotypes. Future research must also account for environmental effects on phenotypes, as well as variability in measurement validity and reliability across populations. Even for the relatively simple example of height (which is easily measured and for which major environmental influences are relatively well understood), our analyses suggest that a great deal of caution should be used in drawing conclusions about polygenic score differences underlying worldwide phenotypic differences, until data resources are significantly improved (i.e., well-powered GWAS in diverse populations) and until a deeper understanding of relevant population genetics principles has emerged. As discussed further below, even more caution will be required for other phenotypes such as psychiatric disorders.