In this section, we will demonstrate that in theory and in practice, most methods that use genetic associations are vulnerable to subtle, but important problems that derive from population structure. A key claim in this paper is that associations between genetic loci and traits have been reliably established, but estimates of effect sizes are less robust. Many uses of population structure depend crucially on unbiased effect size estimates.

The claim that population structure may have been underexplored is not new. It is now understood that structure may have led to different signatures of selection between UK Biobank and the GIANT consortium in height (Berg et al. 2018; Sohail et al. 2018). The problems may not be specific to the study of selection: Berg et al. note that “population structure corrections in GWAS may not always work exactly as expected” whilst Sohail et al. conclude that “polygenic adaptation signals based on large numbers of SNPs below genome-wide significance are extremely sensitive”. Population structure has been recently confirmed as a key part of the problem (Barton et al. 2019; Berg et al. 2019; Sohail et al. 2019), and other authors report residual associations between PCs, geography and traits in the UK Biobank (Haworth et al. 2018; Liu et al. 2018).

Population structure is correlated with phenotypes

To understand why effect estimates may be biased, it is helpful to revisit ideas in population genetics. Populations do differ genetically by genetic drift and/or selection, and as a consequence these populations will also have different genetic phenotypes. For example, ancient populations had different “genetic heights” (Mathieson et al. 2015), with some potentially being taller than any modern population. Height, and other traits, appear to be “omnigenic” (Boyle et al. 2017); that is, there is no region of the genome not in linkage disequilibrium (LD) with SNPs causal for these traits. Since modern populations are a mixture of older populations, SNPs causal for the trait are themselves correlated with ancestry. It follows that the estimate of the effect of a SNP on a trait can be an underestimate when correcting for population structure.

The justification for PCA correction for population structure (Price et al. 2006) is to correct for non-causal linear associations between ancestry and phenotype (Fig. 1a). Causality is hard to define because we rarely measure the exact cause, but proxy it; here we are interested in proxies that are genetic and act through biological pathways. A non-causal association can be generated when population structure is associated with both allele frequency and the phenotype (Fig. 1a). For example, genetic drift simultaneously changes phenotype and SNP frequencies by chance. Weak genetic drift as experienced by larger populations over short timescales is additive, which corresponds to an additive effect on PCs (McVean 2009). Larger genetic drift, as produced by extreme bottlenecks or consanguinity, is not additive as the SNP frequency distribution becomes skewed and SNPs may become fixed or lost from a population. PCA correction and related methods are less useful when such drifted populations are included (Lawson et al. 2018).

Fig. 1 Causal models including ancestry for the effect of a SNP (G) on a trait (T). a Correction for structure will be accurate when ancestry (A) is confounding T. b Correction for structure may give biased inference when ancestry is associated with the causal pathway (T A , which may not be measured) by which the SNP acts. For example, T = skin cancer is associated with T A = skin tone. c Correction for structure will be incomplete when ancestry is associated with the environment (E) due to shared history and geography (H), for example T = BMI with E = diet choice. d Correction for structure when using causal inference is robust to complexity, provided the assumptions of Mendelian randomization (see text) are met; particularly all remaining effects of ancestry go through the trait (T) so there is no direct effect of ancestry (A) on the outcome (O) Full size image

Admixture can change SNP frequencies genome-wide, and small admixture variation is ubiquitous. Even large modern human populations not homogeneous—each individual has a slightly different ancestry proportion from earlier populations. The most ancient detectible human admixture event—Neanderthal introgression into Eurasians—has a mean of around 2% (Sankararaman et al. 2014), but varies substantially between populations and individuals (Wall et al. 2013). Many features of Neanderthal ancestry can be correctly understood using GWAS, which is associated causally with some phenotypes including increasing the risk of depression (Simonti et al. 2016), and non-causally with others, for example skin color (because Neanderthal genes entered the modern human gene pool outside of Africa).

Admixture has the potential to interact with family studies. Siblings have the same expected value of ancestry, with them both receiving a random realized amount. Realized, rather than expected, ancestry is a better predictor of phenotypes (Speed and Balding 2015). Such admixture variation can tag an environmental covariate, for example alcohol consumption influenced by ALDH2 (Price et al. 2002). It could also tag another phenotype that has a confounding relationship, for example, when mixed-race siblings vary in skin tone they may experience different societal pressures (Song 2010), which would be plausibly associated with educational attainment (Light and Strayer 2002) and other phenotypes. A causal analysis would include this pathway—i.e., in the examples, ALDH2 is causal for alcohol consumption and skin tone for education. However, in the second example the inference does not fit our definition of being biologically caused since it is mediated solely through modifiable societal norms.

Whilst genetic drift can create phenotypic variation between populations, selection does so much more rapidly (Nielsen 2005). If a phenotype is under selection in a particular population, all SNPs that causally affect that phenotype (and also those in linkage disequilibrium) will change in frequency, inducing an association between ancestry and phenotype. Further, where some of the variants affecting a selected phenotype are pleiotropic or in LD with SNPs for another phenotype, selection can generate genetic associations between the phenotype under selection and other phenotypes. In extreme cases selection can lead to allele frequencies being almost perfectly correlated with population structure. The LCT gene (Bersaglieri et al. 2004) which is associated with lactase persistence, and similarly a variant in ADH1B (Galinsky et al. 2016) which influences alcohol metabolism, both stratify by population.

Impact on Mendelian randomization

Population structure bias has also been discussed in relation to Mendelian randomization (Davey Smith and Ebrahim 2003; Davey Smith and Hemani 2014; Didelez and Sheehan 2007; Lawlor et al. 2008), an approach which uses a SNP or groups of SNPs as an instrument or “proxy” to test whether an exposure causes an outcome. Mendelian randomization estimates the causal effect under the assumptions (Davies et al. 2018) that; (a) SNPs are associated with the exposure; (b) SNPs do not influence the outcome through a pathway independent of the exposure; (c) that there are no confounders of the SNPs–outcome relationship. Population substructure differences can in theory affect both the strength of genetic instruments and induce confounding, for example in the study of lactase persistence (Campbell et al. 2005; Davey Smith et al. 2009), but there is little evidence the problem is widespread.

The loci that are particularly useful for Mendelian randomization may be particularly susceptible to bias from population structure. This is because strong associations are generated through strong selection, which as discussed above is typically structured. For example, Mendelian randomization studies for alcohol consumption in Europeans typically use the variant in ADH1B as a genetic proxy (Holmes et al. 2014; Howe et al. 2019; Lawlor et al. 2013, 2014; Zuccolo et al. 2013). The ADH1B variant is associated with ancestry at the country and continental level (Li et al. 2011).

Understanding ancestry correction

For detection in GWAS, a sensible aim is to have the most stringent control of any potential bias, including for phenotype stratification. In addition to PCA correction for stratification, GWAS has also been controlled using genomic control (Devlin et al. 2001) which accounts for confounding by scaling test statistics using an inflation factor to ensure that “null” SNPs (as represented by the median) behave as expected under the null model. However, if all SNPs have a true effect this approach is under-powered. Linkage disequilibrium can be exploited to separate real from confounding signals, implemented in the popular tools LDAK (Speed et al. 2012; Speed and Balding 2019) and LDSC (Bulik-Sullivan et al. 2015). The premise is that if every SNP has an effect then SNPs that are in regions of higher LD will have larger measured associations because they are composites of their own effects and those around it. These methods confirmed that large-scale GWAS results detect real associations, but what about the size of the effect?

A central goal of genetic association studies is to estimate the “true” causal effect of a SNP (G) on a trait (T). The “true” effect is defined as the effect of G on T when all other traits that are not in the pathway between G and T (i.e., confounders) are accounted for (Fig. 1). Correction of GWAS for ancestry (A) is designed to remove non-causal associations when observable ancestry (C A , which might be PCs) not in the causal pathway (Fig. 1a). However, it also removes causal associations when ancestry is associated with traits in the pathway (Fig. 1b); a phenomenon often called vertical pleiotropy. Corrected estimates of the G–T associations exclude the G–A–T association. However, they also exclude the G–T A –T association and hence may under-estimate the effect size. For example, if G increases the risk of skin cancer by changing skin tone, its effect size will be underestimated if skin tone is predicted by ancestry. In general, because modern populations are mixtures of ancient populations, many SNPs with a biological effect (including ADH1B and Lactase) may associate with ancestry PCs due to having been common in only one ancestral population.

Ancestry can also associate with the environment (E) and hence also environmental confounders (C E ) (Fig. 1c). There is no causal relationship between A and T via C E and so if G is associated with E then correction is desirable to obtain a less biased estimate of the causal effect of G on T. However, the measured ancestry A is unlikely to account for all association between E and T, so observing an environmental effect indicates the need for additional phenotyping of that environment. For example, if a culture has a diet that reduces BMI then controlling for ancestry only partially corrects for diet. The same problem occurs if observable ancestry (e.g., PCs) do not completely capture the true ancestry.

Genome-wide genetic measures are strongly affected by population structure

A SNP–trait association estimate may be biased after ancestry correction when there is a correlation between the (true) SNP–trait effect and the contribution to an ancestry observable from the SNP (e.g., PC loading). There are many SNPs contributing to ancestry measures, so the bias for each SNP–trait estimate is likely to be small, but genome-wide estimates sum this bias. For example, heritability estimates can in theory be biased by population structure through the prediction of non-genetic covariates (Browning and Browning 2011; Dandine-Roulland et al. 2016), though the scale of the problem is not well quantified for most phenotypes. The robustness of heritability estimates to the existence of internal population structure can at least be tested (Speed et al. 2012, 2014).

Another genome-wide task is to use genetic data to directly predict phenotypes (called genomic prediction, Meuwissen et al. 2001). A predictor is learned using one dataset, then applied to genetic data from others which may be more or less similar in terms of the populations than make it up. This “out of sample” use case makes prediction particularly vulnerable to bias. As demonstrated in Fig. 2, conservative estimates of effect sizes are less useful than a bias–variance tradeoff accounting for the intended populations to be predicted. Adjustment for the PCs is likely to create a higher mean-square error, and it systematically reduces the variance explained in a heritability analysis. The model correcting for ancestry would be preferred for prediction only if (a) it contained enough predictive power to capture real phenotypic differences, and (b) the use case involved generalization into populations for which ancestry may have different effects; for example, predicting skin cancer would be concerning if the predicted population’s skin tone fell outside the range of study population or was caused by different underlying SNPs.

Fig. 2 When should we use PCA correction? a In simulation settings (see “Methods”) it is straightforward to construct scenarios where correction helps or hinders prediction of traits. Top: two populations are produced with different genetic phenotype, either by drift or selection. Middle: these are mixed to make modern populations. Bottom: in Case 1 the phenotype is associated with true population structure, which can be overcorrected. In Case 2 confounding non-genetic association is included in the prediction. b–d Show results for this simulation. b Correcting for confounding using PCA reduces prediction accuracy when traits are genetically associated with population structure. c Genetic structure can predict non-genetic confounding leading to apparently good performance on similarly biased populations. d PC correction can protect against this confounding at the cost of reduced performance Full size image

Genetic “prediction … is generally not robust to minor changes in the population” (Goddard et al. 2016). LD in Africans is lower than in Europeans, which makes prediction harder (de los Campos et al. 2010, 2015). A recent study claims that “effect sizes for European ancestry-derived polygenic scores are only 36% as large in African ancestry samples” (Duncan et al. 2018). Yet in consumer genomics (Multhaup and Lehman 2017) and many applications in medicine (Bloss et al. 2011) including drug response (Roden and George 2002), prediction is the primary goal, and ancestry is known to be important (Foll et al. 2014). Prediction is also important for ancient genomics, for example the recent reconstruction of the facial features and dark skin tone of “Cheddar Man” in Neolithic Britain (Brace et al. 2018). Prediction protocols typically involve summing the effect of all SNPs that are reliably associated with phenotype to make a “genetic score”—two early examples include coronary heart disease (Ripatti et al. 2010) and gout (Dehghan et al. 2008). However, other prediction models may be necessary when the genetic architecture of the trait does not follow the infinitesimal and additive assumptions (Morgante et al. 2018).

With the availability of much larger datasets, there is increasing discussion about whether polygenic risk prediction should be included in clinical care. For example, Khera et al. (2018) created a polygenic risk score consisting of millions of variants. The top 8% of the population by this score had comparable risk of coronary heart disease to carriers of rare monogenic mutations. While key coronary heart disease loci such as 9P21 have been replicated worldwide (Battram et al. 2018; Dong et al. 2013; Kral et al. 2011; Schunkert et al. 2011), the generalizability of polygenic risk scores of millions of SNPs across different populations requires further study.

Interaction between Mendelian randomization and population structure

Causal inference via Mendelian randomization exploits the effect of G on O that goes via the trait T. If the assumptions are met, Mendelian randomization estimates are robust to bias in G–T estimates, as long as there is no (uncorrected) direct A–O effect. One important mechanism by which G–O (gene–outcome) associations might go via A is linkage disequilibrium (LD). If the instrument SNP G is only a proxy for the true causal SNP, and LD differs between populations, then in theory G can be a strong genetic instrument in one population but weak in another. There is an absence of evidence for this phenomenon, likely due to a lack of large African datasets for whom LD is very different than Europeans.

In recent years, Mendelian randomization studies have increasingly used a two-sample design (Hartwig et al. 2016); in which estimates of the SNP–exposure and SNP–outcome relationships are taken from separate GWAS using non-overlapping samples. Here, an implicit assumption is that the two different samples used to estimate these relationships are drawn from the same underlying population. Typically, the two-sample design will use individuals of similar ancestry (e.g., restricting to individuals of recent European ancestry), however the effect of using two similar but ancestrally distinct samples within a broad definition such as Europeans is currently unclear. We will show below that even if the two samples are from the same population, there may be less power to detect the PCs in a smaller population. This can theoretically result in differential correction and, as a consequence, may bias causal estimates.

Simulating phenotypes with population stratification

For genome-wide questions including heritability analyses and prediction, it is easy to construct scenarios in which either correction or non-correction for structure can be misleading. Figure 2 describes two simulation scenarios: case 1 in which true genetic signal for a trait is associated with population structure (e.g., height), and case 2 in which population structure associates non-causally with the trait through the environment. PC correction is conservative when phenotypes are truly associated with ancestry (Fig. 2b). When ancestry is predictive of the environment (Fig. 2c) it can even increase genetic associations through non-causal pathways. However, when genes have moved into new environments, PC correction reduces bias (Fig. 2d).

Detecting population structure is essential for correcting for it

If not appropriately modeled, phenotype stratification can bias GWAS, heritability estimation, prediction, and Mendelian randomization. However, no single bias-correction approach is necessarily the correct choice for all scenarios. Even if the correct strategy is known, measurement of population structure is critical. As with any parameters estimated from a dataset, increasing sample size increases the ability to detect population structure (Patterson et al. 2006). Within the UK, there was no detectible structure in a subset of around 1000 people from the UK10K project (The UK10K Consortium 2015). However, with over 100,000 people (Galinsky et al. 2016) from the UK Biobank project (Sudlow et al. 2015) several axes of variation are visible in the PCs. Importantly, the latent structure proxied by these axes of variation were still in the data before they could be detected, and so correction on smaller datasets will systematically under-correct for stratification. This may explain why estimates from a single large study are different from a meta-analysis of smaller ones, though to our knowledge this has not been studied.

To the extent that detection of population structure is a problem, better methodology offers a solution. Methods based on “chromosome painting” (Lawson et al. 2012) exploit linkage disequilibrium to better detect population structure. Specifically, the approach counts recent sharing of segments of DNA that are identical by descent, rather than SNP frequencies, to detect recent structure. From the 2039 individuals in the People of the British Isles (PoBI) dataset (Leslie et al. 2015) there was only 1 geographically meaningful PC but over 50 populations detectable with chromosome painting. Studies sampled from a single location such as many cohort studies (for example ALSPAC, Boyd et al. 2013) are typically PCA corrected but the PCs are too weak to capture the real variation and thus the population is assumed to be “homogeneous”. There is a “detection threshold” in ancestry for which we can calculate the sample size required to detect ancestry variation of a given size (McVean 2009). To our knowledge, there has been no systematic study of the importance of residual population structure in small samples.

Exploiting the PoBI dataset, chromosome painting in ALSPAC (Haworth et al. 2018) (Fig. 3) reveals dramatic genetic heterogeneity which is associated with phenotype, here shown for educational attainment. In this case, the bias is predominantly associated with migration: people who move are more likely to be educated. In ALSPAC, genetic ancestry can predict 8% of the variation in education; for comparison, the most recent published whole-genome genetic score explains 3.2% (Okbay et al. 2016), and a mega-scale analysis is expected to generate a genetic score explaining 10% of the variance (Martin 2018). These results are based on meta-analyses of many studies, in which PC correction may not have sufficiently controlled for population structure.

Fig. 3 Population structure can be detected in ALSPAC using the external UK reference dataset PoBI and chromosome painting (see “Methods”). This structure is associated with phenotype, and is not found using regular PCA. a Inferred (see “Methods”) education level of people migrating from different regions of the UK into the ALSPAC cohort based in Bristol; scale is 1 = no education, 2 = vocational, 3 = GSCEs (age 16), 4 = further education (age 18), 5 = degree (reproduced from Haworth et al. 2018). Participants with ancestry further from Bristol have considerably higher education, suggesting differential migration by education. b Variance explained in education by chromosome painting PCs (8%) and regular PCA (0.8%). c The chromosome painting PC locations of individuals and populations for chromosome painting PC 3 and 5, which have the largest associations with education. PoBI mean label locations are shown, along with ALSPAC individuals (white dots) and a kernel smoothing of education Full size image

It is unclear how many of these GWAS hits are in fact hits for ancestry and hidden population structure or migration. The problem exists in many other phenotypes (Martin et al. 2017): for example “height is predicted to decrease with genetic distance from Europeans” which is not empirically observed. Interpretation of these results is left to the discussion.