African risk allele frequencies differ from other continents

We tested whether there are any systematic biases in genetic estimates of disease risk by analyzing allele frequencies at 3036 GWAS loci for each continental population in the 1000 Genomes Project. Contrary to null expectations, mean risk allele frequencies are not the same for each population (Fig. 2a). Overall, African populations have significantly higher risk allele frequencies compared to non-African populations (mean difference + 1.15%, p value = 0.0213, paired Wilcoxon signed-rank test). Population-level differences in risk allele frequencies persist when disease associations are binned into seven different categories. Compared to other populations, African populations have the highest risk allele frequency for metabolic (p value = 0.0055), morphological (p value = 0.0949), cancer (p value = 0.1169), neurological (p value = 0.0995), and miscellaneous (p value = 0.3865, paired Wilcoxon signed-rank tests) diseases. African populations have intermediate frequencies of risk alleles at the loci that are associated with GI or liver diseases (p value = 0.6965) and lower frequencies of risk alleles at the loci that are associated with cardiovascular disease (p value = 0.0140, paired Wilcoxon signed-rank tests). These statistical comparisons reflect allele frequency differences at individual SNPs. Among non-African populations there is no underlying trend. Some of the continental patterns described here are at odds with clinical data (e.g., health disparities involving cardiovascular disease in African-Americans [43]). This discrepancy between clinical data and allele frequencies suggests that genetic disease risks may be misestimated for individuals with African ancestry.
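The paired comparison described above can be sketched as follows. This is a minimal illustration, not the analysis code used in this study: the function name and the toy frequencies are hypothetical, and the p value uses a normal approximation without tie or continuity correction.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W_plus, p_value). Zero differences are dropped and tied absolute
    differences receive average ranks. The variance is not corrected for ties,
    so the p value is approximate.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value

# Hypothetical risk allele frequencies at the same loci in two populations.
raf_african = [0.62, 0.15, 0.80, 0.41, 0.33]
raf_non_african = [0.50, 0.20, 0.71, 0.40, 0.36]
w_plus, p_value = wilcoxon_signed_rank(raf_african, raf_non_african)
mean_diff = sum(a - b for a, b in zip(raf_african, raf_non_african)) / len(raf_african)
```

Pairing by locus matters here: each SNP contributes one African and one non-African frequency, so the test asks whether the per-locus differences are centered on zero.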

Fig. 2 Known disease associations lead to misestimates of genetic disease risks. a Risk allele frequencies at published disease-associated loci from the NHGRI-EBI GWAS Catalog vary by population. “*” indicates a statistically significant allele frequency difference between African and non-African populations (p values < 0.05, paired Wilcoxon signed-rank tests). n = number of disease-associated loci per disease category. b Proportion of disease-associated loci where the risk allele is ancestral, as opposed to derived

Disease categories that have a larger proportion of ancestral alleles tend to have elevated risk allele frequencies in Africa (Fig. 2b). After binning GWAS loci by disease category, we find that the differences in the mean frequency of risk alleles between African and non-African populations are highly correlated with the proportion of risk alleles that are ancestral (r² = 0.842). Accurate estimation of genetic disease risks across global populations may hinge upon knowledge of whether risk-increasing alleles are ancestral or derived.
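The category-level correlation can be illustrated with a short sketch. The per-category values below are invented for illustration and are not the values plotted in Fig. 2; the function name is likewise hypothetical.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values for seven disease categories:
# proportion of risk alleles that are ancestral, and the African minus
# non-African difference in mean risk allele frequency (percentage points).
prop_ancestral = [0.38, 0.41, 0.43, 0.45, 0.46, 0.48, 0.52]
raf_difference = [-1.9, -0.2, 0.6, 1.1, 1.8, 2.0, 3.4]
r_squared = pearson_r_squared(prop_ancestral, raf_difference)
```

With only seven categories, each point carries considerable weight, so an r² computed this way is best read alongside the underlying scatter.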

Ancestral and derived alleles yield different patterns of genetic disease risk

For loci that are not associated with any disease, the null expectation is that ancestral and derived allele frequencies will be broadly similar across global populations. Just because Homo sapiens emerged in Africa does not mean that African genomes have an excess of ancestral alleles—all human populations share the same evolutionary distance to chimpanzees. Due to the out-of-Africa bottleneck, African genomes are more likely to be heterozygous for derived alleles, and non-African genomes are more likely to be homozygous for derived alleles. Examining WGS data from the 1000 Genomes Project, we find that derived allele frequencies (DAF) are similar for each population (Fig. 3a). However, disease-associated loci need not exhibit the same pattern.

Fig. 3 Empirical patterns depend on whether disease-associated alleles are ancestral or derived. a Mean derived allele frequencies of non-disease SNPs from whole genome sequencing and genotyping arrays. 1000 Genomes Project data are shown. b Joint SFS of published GWAS loci. Ancestral risk alleles are labeled red and derived risk alleles are labeled blue. c The frequencies of ancestral risk alleles are higher in Africa (+ 9.51% on average), and the frequencies of derived risk alleles are lower in Africa (− 5.40% on average). Dashed lines indicate mean values. d Continental differences in risk allele frequencies are minimal for young SNPs. Disease-associated loci are binned by DAF and whether risk alleles are ancestral or derived

The joint site frequency spectrum (SFS) enables the frequencies of individual risk alleles to be compared between African and non-African populations. Similar numbers of disease associations are found above and below the diagonal in Fig. 3b. However, conditioning on whether risk alleles are ancestral or derived reveals a striking pattern: 69.2% of ancestral risk alleles are found at higher frequency in African populations (red dots below the diagonal), and 64.5% of derived risk alleles are found at higher frequency in non-African populations (blue dots above the diagonal). The magnitudes of allele frequency differences between populations also vary for ancestral and derived risk alleles. We find that ancestral risk alleles are found at much higher frequencies in Africa, and derived risk alleles are found at moderately lower frequencies in Africa (Fig. 3c). Specifically, the mean difference in ancestral risk allele frequencies between African and pooled non-African populations is + 9.51%, and the mean difference in derived risk allele frequencies between African and pooled non-African populations is − 5.40% (p value < 2.2 × 10⁻¹⁶ for both comparisons, Wilcoxon signed-rank tests). The overall continental difference in risk allele frequencies of + 1.15% arises because 44% of presently known disease-associated loci have ancestral risk alleles.
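The last step above is a weighted average, which a few lines of arithmetic make explicit. The function names are hypothetical; the inputs are the rounded values quoted in the text, so the result (about + 1.16%) matches the reported + 1.15% only up to rounding.

```python
def overall_difference(frac_ancestral, diff_ancestral, diff_derived):
    """Mean African minus non-African risk allele frequency difference,
    weighted by the share of loci with ancestral vs. derived risk alleles."""
    return frac_ancestral * diff_ancestral + (1 - frac_ancestral) * diff_derived

def break_even_fraction(diff_ancestral, diff_derived):
    """Share of ancestral risk alleles at which the overall difference is zero."""
    return -diff_derived / (diff_ancestral - diff_derived)

# Rounded values from the text, in percentage points.
overall = overall_difference(0.44, 9.51, -5.40)
break_even = break_even_fraction(9.51, -5.40)
```

Under these rounded inputs the overall difference would vanish if roughly 36% of risk alleles were ancestral, which shows how sensitive the continental difference is to the ancestral/derived mix of known loci.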

Derived allele frequencies serve as proxies for SNP age [44], and we find that older disease-associated loci are more likely to have large differences in continental allele frequencies. For each 20% DAF bin (pooled data), we calculated the difference in risk allele frequencies between African and non-African populations. In sharp contrast to other DAF bins, published disease loci with DAF ≤ 0.2 exhibit only a small amount of bias (Fig. 3d). This pattern occurs regardless of whether risk alleles are ancestral or derived. Note that SNPs with DAF ≤ 0.2 tend to be younger than 125,000 years old, assuming an effective population size of 10,000 individuals and generation times of 25 years [44].
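The binning step can be sketched as below. The record layout (`daf`, `risk_allele`, `raf_afr`, `raf_non_afr`) and the function names are assumptions made for illustration, and the toy loci are invented.

```python
from collections import defaultdict

# Upper edges of the five 20% DAF bins; a DAF of exactly 0.2 falls in the first bin.
BIN_UPPER_EDGES = [0.2, 0.4, 0.6, 0.8, 1.0]

def daf_bin(daf):
    """Return the index (0-4) of the 20% bin containing a pooled DAF."""
    if daf < 0:
        raise ValueError("DAF must lie in [0, 1]")
    for index, upper in enumerate(BIN_UPPER_EDGES):
        if daf <= upper:
            return index
    raise ValueError("DAF must lie in [0, 1]")

def mean_difference_by_bin(loci):
    """Mean African minus non-African risk allele frequency for each
    (DAF bin, ancestral/derived) group."""
    groups = defaultdict(list)
    for locus in loci:
        key = (daf_bin(locus["daf"]), locus["risk_allele"])
        groups[key].append(locus["raf_afr"] - locus["raf_non_afr"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Hypothetical loci for illustration only.
toy_loci = [
    {"daf": 0.10, "risk_allele": "derived", "raf_afr": 0.08, "raf_non_afr": 0.10},
    {"daf": 0.20, "risk_allele": "derived", "raf_afr": 0.22, "raf_non_afr": 0.20},
    {"daf": 0.70, "risk_allele": "ancestral", "raf_afr": 0.50, "raf_non_afr": 0.25},
]
differences = mean_difference_by_bin(toy_loci)
```

Grouping on both the DAF bin and the ancestral/derived label keeps the two classes of risk allele separate, so opposing shifts do not cancel within a bin.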

Choice of study population contributes to misestimates of genetic disease risk

Most disease associations have been discovered in study cohorts with European ancestry, and this can bias the estimation of genetic disease risks in diverse global populations. Empirical data reveal the effects of GWAS study populations; many disease-associated alleles segregate at intermediate frequencies in non-African populations but are found at extremely low or high frequencies in Africa (compare the vertical and horizontal borders of Fig. 3b). This occurs because statistical power is maximized at intermediate frequencies, and most disease-associated loci have been discovered in non-African populations. Existing GWAS have discovered relatively few disease alleles that segregate only in African populations.

To further isolate the effects of different study populations, we simulated a large number of GWAS results, varying the continental ancestry of each study cohort. Importantly, our GWAS simulations did not assume that there are any underlying differences in hereditary disease risks across populations. We find that computer simulations recapitulate empirical patterns at known disease loci and that GWAS of bottlenecked non-African populations yield different results than GWAS of African populations (Fig. 4). Simulated GWAS that use an African (AFR) cohort yield similar risk allele frequencies across each of the five continental populations. However, simulated GWAS that use American (AMR), East Asian (EAS), European (EUR), or South Asian (SAS) cohorts produce a set of disease-associated loci with elevated frequencies of ancestral risk alleles in Africa (Fig. 4a) and reduced frequencies of derived risk alleles in Africa (Fig. 4b). These simulation results indicate that systematic allele frequency differences between populations need not be due to any underlying difference in risk (recall that our simulations did not assume the existence of any underlying differences in disease risks across populations). The effects of European study cohorts are still seen when GWAS simulations use data from WGS, as opposed to genotyping arrays (Table 1). We also find that continental differences in risk allele frequencies occur if GWAS simulations use a more stringent p value filter, or simulations assume different modes of inheritance, including dominant or recessive disease alleles (Additional file 1: Table S1 and Additional file 2: Table S2). Additionally, GWAS simulations of study cohorts that contain a mixture of individuals from different populations still yield disease-associated loci with continental biases in risk allele frequencies (MIX in Fig. 4). These results suggest that pooling samples with different ancestries is unlikely to completely alleviate the problem of misestimating genetic disease risks. Regardless of the choice of study cohort, allele frequencies are similar for each non-African population, reflecting the relatively recent divergence times between these populations.
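The role of statistical power in this ascertainment effect can be illustrated with a simple approximation. This sketch is not the simulation framework used in this study: it assumes an allelic test with equal numbers of cases and controls, a small per-allele effect (so that the variance of the log odds ratio is roughly 1 / (n · p · (1 − p))), a hypothetical odds ratio of 1.2, and the conventional genome-wide threshold of 5 × 10⁻⁸.

```python
import math
from statistics import NormalDist

def gwas_power(freq, n_cases, log_odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided allelic association test.

    Assumes n_cases == n_controls and a small effect, so that the expected
    Z statistic is log_odds_ratio * sqrt(n_cases * freq * (1 - freq)).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    expected_z = log_odds_ratio * math.sqrt(n_cases * freq * (1 - freq))
    # Probability that a N(expected_z, 1) statistic exceeds z_crit in either tail.
    return std_normal.cdf(expected_z - z_crit) + std_normal.cdf(-expected_z - z_crit)

# Hypothetical allele: intermediate frequency in the study cohort,
# near fixation in another population.
effect = math.log(1.2)
power_in_study_cohort = gwas_power(0.50, 3500, effect)
power_elsewhere = gwas_power(0.95, 3500, effect)
```

Because power depends on p · (1 − p) in the study cohort, the same allele is far more likely to be reported when it segregates at intermediate frequency in that cohort. Conditioning on detection therefore selects loci whose frequencies in other populations are free to be extreme.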

Fig. 4 GWAS simulations reveal the effects of different study cohorts. Mean risk allele frequencies in different continental populations are shown for each study cohort (3036 disease associations per simulation). Despite the absence of any underlying differences in risk, disease-associated loci that are detected in non-African study cohorts have biased frequencies. a GWAS simulations where the ancestral allele increases risk. b GWAS simulations where the derived allele increases risk

Table 1 Differences in allele frequencies between African and European populations for different genotyping technologies

We also examined the effects of genotype-by-environment (GxE) interactions by allowing effect sizes to vary by population in our GWAS simulations. In general, results from these simulations mirror the results of other simulations: ancestral risk allele frequencies are higher in African populations than in non-African populations, and derived risk allele frequencies are lower in African populations than in non-African populations (Additional file 3: Figure S1). Compared to African study cohorts, European study cohorts magnify these allele frequency differences between populations. Choice of study cohort imposes a filter on effect sizes, as SNPs with very small effect sizes do not yield detectable associations (compare gray pre-GWAS effect sizes to red and blue post-GWAS effect sizes in Additional file 3: Figures S1-S3). Large effect sizes enable high-frequency ancestral alleles and low-frequency derived alleles to be detected in a GWAS. The results described above are also robust to systematic biases in effect sizes, i.e., scenarios where pre-GWAS European effect sizes tend to be larger than African effect sizes or vice versa (Additional file 3: Figures S2 and S3).

Genotyping arrays and SNP ascertainment bias cause disease risks to be misestimated

Many commonly used genotyping arrays contain SNPs that were ascertained in a relatively small number of European individuals. This ascertainment bias results in allele frequency distributions that vary by genotyping platform. Compared to WGS data, derived allele frequencies are higher for SNPs on the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray. SNPs on genotyping arrays also exhibit continental biases (Fig. 3a). Specifically, we find that derived allele frequencies in African populations are markedly lower than derived allele frequencies in non-African populations (p value < 2.2 × 10⁻¹⁶ for both arrays, Wilcoxon signed-rank tests).

The joint SFS of non-African and African populations further reveals the effects of SNP ascertainment bias. Examining WGS data, we find that similar numbers of SNPs have elevated derived allele frequencies in non-African and African populations (Additional file 3: Figure S4a). By contrast, the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray are enriched for SNPs with higher derived allele frequencies outside of Africa (i.e., SNPs above the diagonal in Additional file 3: Figure S4b and Additional file 3: Figure S4c). Importantly, this pattern mirrors what is seen for empirical GWAS data (Additional file 3: Figure S4d), which suggests that genotyping arrays contribute to continental differences in risk allele frequencies at known disease-associated loci.

Because many disease associations involve imputed SNPs, we also tested whether continental differences in risk allele frequencies persist for disease-associated loci that are not on the Affymetrix Genome-Wide Human SNP Array 6.0. For this empirical set of disease-associated loci, we find that SNPs with ancestral risk alleles have higher allele frequencies in Africa (+ 8.63% on average) and that SNPs with derived risk alleles have lower allele frequencies in Africa (− 4.83% on average). This suggests that biases persist even for imputed SNPs.

Continental differences in allele frequencies persist even if whole genome sequencing and large sample sizes are used

Simulations of GWAS results were used to infer the extent to which misestimates of disease risks depend upon genotyping technology (Table 1). Here, simulations assume European ancestry for each study cohort and sample sizes of 3500 cases and 3500 controls. We find that different genotyping arrays yield similar results: the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray yield ancestral risk allele frequencies that are 10.7% and 11.0% higher in Africa and derived risk allele frequencies that are 8.0% and 8.2% higher in Europe, respectively. Somewhat surprisingly, continental differences in allele frequencies also occur for GWAS simulations that use WGS data. Focusing on WGS GWAS simulations, ancestral risk allele frequencies are 9.7% higher in Africa, and derived risk allele frequencies are 7.2% higher in Europe. These patterns arise because of our choice of study cohort and because sample sizes of 3500 cases and 3500 controls have relatively little power to detect rare disease alleles.

Continental biases in risk allele frequencies occur even if GWAS use large sample sizes. Simulated GWAS with fewer than 10,000 European cases and controls yield large differences in African and non-African allele frequencies (Fig. 5). This occurs regardless of whether simulations use SNPs from the genotyping arrays or WGS. Increasing GWAS sample sizes increases the statistical power to detect associations with rare alleles. However, our simulations reveal that there are diminishing returns for increasing sample sizes, especially if GWAS use genotyping arrays. Well-powered studies with hundreds of thousands of cases and controls still yield notable differences in continental allele frequencies, even if WGS is used (Fig. 5). These results indicate that WGS is unable to completely mitigate the effects of different study populations.

Fig. 5 GWAS simulations reveal that continental differences in allele frequencies persist even if whole genome sequencing and large sample sizes are used. Bean plots show the results of 1000 simulations per set of parameter values (3036 disease associations per simulation). Simulations using SNPs on genotyping arrays are represented by light shading, and simulations using WGS data are represented by dark shading. Colors indicate whether risk alleles are ancestral (red) or derived (blue). Sample sizes shown are the number of cases and the number of controls

Correcting for ancestral and derived risk alleles leads to improved genetic risk scores

Standardized genetic risk scores (GRS) were generated for 2504 individuals and 7 different disease categories. This involved integrating a curated list of disease-associated loci from the NHGRI-EBI GWAS Catalog with individual-level genotype data from the 1000 Genomes Project. Positive GRS values indicate genomes that contain more risk alleles than the global mean, and negative GRS values indicate genomes that contain fewer risk alleles than the global mean. Standardized GRS are scaled in terms of standard deviations from the mean, i.e., they are Z-scores. In general, different populations have GRS distributions that mirror what is seen for allele frequency data (compare Fig. 6 to Fig. 2a). We find that African individuals have uncorrected GRS that differ from other populations (p value = 0.0037 for GI or liver diseases and p value < 2.2 × 10⁻¹⁶ for all other disease categories, Mann-Whitney U tests). These differences are larger for metabolic, cancer, and cardiovascular disease risks. There is a substantial amount of overlap between the GRS distributions of each non-African population, and this pattern occurs for all disease categories. Within each population, there is also a large range of GRS values. Also note that admixed genomes from the Americas (AMR in Fig. 6) have GRS that are broadly similar to other non-African genomes. Although GRS reflect an individual’s genetic propensity for different disease categories, we caution against over-interpreting these results. This is because GRS have been built from a biased set of disease-associated loci.
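A standardized score of this kind can be sketched as follows. The sketch assumes an unweighted count of risk alleles per genome and a population standard deviation taken across all individuals; both are assumptions made for illustration, as are the function name and the toy genotypes.

```python
from statistics import mean, pstdev

def standardized_grs(genotypes):
    """Convert per-locus risk allele counts (0, 1, or 2) into Z-scores.

    genotypes maps an individual ID to a list of risk allele counts, one per
    disease-associated locus. Scores are centered on the global mean and
    scaled by the global (population) standard deviation.
    """
    raw = {ind: sum(counts) for ind, counts in genotypes.items()}
    values = list(raw.values())
    global_mean = mean(values)
    global_sd = pstdev(values)
    if global_sd == 0:
        raise ValueError("all individuals carry the same number of risk alleles")
    return {ind: (score - global_mean) / global_sd for ind, score in raw.items()}

# Hypothetical genotypes at four loci for three individuals.
toy_genotypes = {
    "ind1": [0, 1, 0, 1],
    "ind2": [1, 1, 2, 0],
    "ind3": [2, 2, 1, 1],
}
grs = standardized_grs(toy_genotypes)
```

Because the mean and standard deviation are computed over the pooled sample, a positive score means more risk alleles than the global average, regardless of the individual's own population.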

Fig. 6 Genetic risk scores (GRS) before and after correcting for ancestral and derived risk alleles. GRS probability densities for each continental population are shown (solid lines, uncorrected GRS; dashed lines, corrected GRS for African genomes). n = number of disease-associated loci per disease category. Arrows indicate the shift in African GRS after correcting for whether risk alleles are ancestral or derived. “*” indicates uncorrected African GRS that are significantly different than non-African GRS, and “©” indicates corrected African GRS that are significantly different than non-African GRS (p values < 0.05, Mann-Whitney U tests)

GRS corrections reduce some, but not all, of the population-level differences in predicted disease risks. Here, we compensate for continental differences in ancestral and derived risk allele frequencies by generating corrected GRS for African genomes. We find that African individuals have corrected GRS that are similar to other populations for metabolic (p value = 0.8080), morphological (p value = 0.0671), and neurological (p value = 0.7116, Mann-Whitney U tests) disease risks. By contrast, African individuals have corrected GRS that differ from other populations for GI or liver, cancer, miscellaneous, and cardiovascular disease risks (p value < 2.2 × 10⁻¹⁶ for each disease category, Mann-Whitney U tests). Corrections involve a leftward shift in the GRS of African genomes, the magnitude of which depends on the proportion of ancestral risk alleles for each disease category (compare the size of arrows in Fig. 6). We observe three different outcomes: minimal effects, over-correction, and reduction of bias. Cardiovascular risk predictions for African genomes were largely unchanged (i.e., GRS still appear to underestimate the risks of cardiovascular disease in individuals of African descent). Two disease categories (GI or liver and miscellaneous diseases) have corrected GRS distributions that differ more between African and non-African populations than uncorrected GRS distributions. The remaining four disease categories (metabolic, morphological, cancer, and neurological diseases) have corrected GRS distributions that overlap heavily with other populations. Although the correction method used here alleviates some forms of bias, our results suggest that GRS can be further improved by considering additional parameters.