Significance Measurements of prehistoric human skeletal remains provide a record of changes in height and other anthropometric traits over time. Often, these changes are interpreted in terms of plastic developmental response to shifts in diet, climate, or other environmental factors. These changes can also be genetic in origin, but, until recently, it has been impossible to separate the effects of genetics and environment. Here, we use ancient DNA to directly estimate genetic changes in phenotypes and to identify changes driven not by genetics, but by environment. We show that changes over the past 35,000 y are largely predicted by genetics but also identify specific shifts that are more likely to be environmentally driven.

Abstract The relative contributions of genetics and environment to temporal and geographic variation in human height remain largely unknown. Ancient DNA has identified changes in genetic ancestry over time, but it is not clear whether those changes in ancestry are associated with changes in height. Here, we directly test whether changes over the past 38,000 y in European height predicted using DNA from 1,071 ancient individuals are consistent with changes observed in 1,159 skeletal remains from comparable populations. We show that the observed decrease in height between the Early Upper Paleolithic and the Mesolithic is qualitatively predicted by genetics. Similarly, both skeletal and genetic height remained constant between the Mesolithic and Neolithic and increased between the Neolithic and Bronze Age. Sitting height changes much less than standing height—consistent with genetic predictions—although genetics predicts a small post-Neolithic increase that is not observed in skeletal remains. Geographic variation in stature is also qualitatively consistent with genetic predictions, particularly with respect to latitude. Finally, we hypothesize that an observed decrease in genetic heel bone mineral density in the Neolithic reflects adaptation to the decreased mobility indicated by decreased femoral bending strength. This study provides a model for interpreting phenotypic changes predicted from ancient DNA and demonstrates how they can be combined with phenotypic measurements to understand the relative contribution of genetic and developmentally plastic responses to environmental change.

Stature, or standing height, is one of the most heavily studied human phenotypes. It is easy to measure in living individuals and relatively straightforward to estimate from skeletal remains. As a consequence, geographic variation and temporal changes in stature are well documented (1⇓–3), particularly in western Europe, where there is a comprehensive record of prehistoric changes (4). The earliest anatomically modern humans in Europe, present by 42,000 to 45,000 y before present (BP) (5, 6), were relatively tall (mean adult male height in the Early Upper Paleolithic was ∼174 cm). Mean male stature then declined from the Paleolithic to the Mesolithic (∼164 cm) before increasing to ∼167 cm by the Bronze Age (4, 7). Height can respond rapidly in a developmentally plastic manner to changes in environment, as demonstrated by large increases in Europe, and worldwide, during the secular trends of the 19th and 20th centuries (1, 4). In European countries today, mean adult male height is ∼170 to 180 cm (1). It is broadly agreed that prehistoric changes were likely to have been driven by a combination of environmental (e.g., climate or diet) and genetic factors including drift, admixture, and selection (4, 7⇓–9), although the effects of these variables cannot be separated based on skeletal data alone. In this study, by combining the results of genome-wide association studies (GWAS) with ancient DNA (aDNA), we directly estimate the genetic component of stature and test whether population-level skeletal changes between ∼35,000 and 1,000 BP are consistent with those predicted by genetics.

Height is highly heritable (10⇓⇓⇓–14) and therefore amenable to genetic analysis by GWAS. With sample sizes of hundreds of thousands of individuals, GWAS have identified thousands of genomic variants that are significantly associated with the phenotype (15⇓–17). Although the individual effect of each of these variants is tiny [on the order of ±1 to 2 mm per variant (18)], their combination can be highly predictive. Polygenic risk scores (PRS) constructed by summing together the effects of all height-associated variants carried by an individual can now explain upwards of 30% of the phenotypic variance in populations of European ancestry (16). In effect, the PRS can be thought of as an estimate of “genetic height” that predicts phenotypic height, at least in populations closely related to those in which the GWAS was performed. One major caveat is that the predictive power of PRS is much lower in other populations (19). The extent to which differences in PRS between populations are predictive of population-level differences in phenotype is currently unclear (20). Recent studies have demonstrated that such differences may partly be artifacts of correlation between environmental and genetic structure in the original GWAS (21, 22). These studies also suggested best practices for PRS comparisons, including the use of GWAS summary statistics from large homogenous studies (instead of metaanalyses), and replication of results using summary statistics derived from within-family analyses that are robust to population stratification.

Bearing these caveats in mind, PRS can be applied to ancient populations thanks to recent technological developments that have dramatically increased aDNA sample sizes. These have provided remarkable insights into the demographic and evolutionary history of both modern and archaic humans across the world (23⇓–25), particularly in Europe, and allow us to track the evolution of variants underlying phenotypes ranging from pigmentation to diet (26⇓⇓–29). In principle, PRS applied to ancient populations could similarly allow us to make inferences about the evolution of complex traits. A few studies have used PRS to make predictions about the relative statures of ancient populations (29⇓–31) but looked at only a few hundred individuals in total and did not compare their predictions with stature measured from skeletons. Here, we compare measured skeletal data to genetic predictions and directly investigate the genetic contribution to height independent of environmental effects acting during development.

Discussion We showed that the well-documented temporal and geographic trends in stature in Europe between the EUP and the post-Neolithic period are broadly consistent with those that would be predicted by PRS computed using present-day GWAS results combined with aDNA. However, because of the limited predictive power of current PRS, we cannot provide a quantitative estimate of how much of the variation in phenotype between populations might be explained by variation in PRS. Similarly, we cannot say whether the changes were continuous, reflecting evolution through time, or discrete, reflecting changes associated with known episodes of replacement or admixture of populations that have diverged genetically over time. Finally, we find cases where predicted genetic changes are discordant with observed phenotypic changes—emphasizing the role of developmental plasticity in response to environmental change and the difficulty in interpreting differences in PRS in the absence of phenotypic data. Our results indicate 2 major episodes of change in genetic height. First, there was a reduction in standing-height PRS—but not sitting-height PRS—between the EUP and LUP, coinciding with a substantial population replacement (33). These genetic changes are consistent with the decrease in stature—driven by leg length—observed in skeletons during this time period (4, 64, 74, 75). One possibility is that the stature decrease in the ancestors of the LUP populations could have been adaptive, driven by changes in resource availability (76) or to a colder climate (61). Comparison between patterns of phenotypic and genetic variation suggest that, on a broad scale, variation in body proportions among present-day people reflects adaptation to environment largely along latitudinal gradients (77, 78). EUP populations in Europe would have migrated relatively recently from more southern latitudes and had body proportions that are typical of present-day tropical populations (75). The populations that replaced them would have had more time to adapt to the colder climate of northern latitudes. On the other hand, we do not find genetic evidence for selection on stature during this time period—suggesting that the changes could have been neutral and not adaptive. The second episode of change in height is either between the Neolithic and post-Neolithic, or during the post-Neolithic period. This period is characterized by the eastward movement of substantial amounts of “Steppe ancestry” into Central and Western Europe (27, 30, 38, 50). Our results are thus consistent with previous results that migration and admixture from Bronze Age populations of the Eurasian steppe increased genetic height in Europe (29, 30). Whether this increase was driven by selection in the ancestors of these populations remains unresolved. The geographic gradient of increasing skeletal stature is unclear in the Paleolithic, largely west to east in the Mesolithic (7, 64) and largely south to north by the Bronze Age (4, 7, 9). Latitudinal, but not longitudinal, patterns are qualitatively consistent with geographic patterns in PRS, suggesting that, like temporal variation, both genetics and environment contribute to geographic variation. A major confounding factor in analysis of temporal and geographic variation in PRS, particularly in the Bronze Age, is genetic population structure. Present-day European population structure is correlated with geography and largely driven by variation in Steppe ancestry proportion, with more Steppe ancestry in Northern Europe and less in Southern Europe (38). Suppose that environmental variation in stature is also correlated with geography, and that Northern Europeans are taller than Southern Europeans for entirely nongenetic reasons. Then, GWAS that do not completely correct for stratification will find that genetic variants that are more common in Steppe populations than Neolithic populations are associated with increased height. When these GWAS results are used to compute PRS for ancient populations, they will predict that Steppe ancestry populations were genetically taller simply because they are more closely related to present-day Northern Europeans (21, 22). In this study, we attempted to avoid this confounding in 2 ways: first, by using GWAS effect sizes from the UK Biobank—a homogenous dataset that should be well controlled for population stratification, and second, by replicating our results after reestimating the effect sizes within siblings, which should be robust to population stratification, although less precise. However, we cannot exclude the possibility that some confounding remains, for example, because although we re-estimated effect sizes using the within-siblings design, we still ascertained loci using the GWAS results. A related concern is that, even in the absence of bias, variance explained by the PRS is likely to decrease as we move back in time and ancient populations become less closely related to present-day populations (19). However, our results indicate that the PRS still captures enough of the genetic variance to predict changes in phenotype even though, presumably, we would gain even more resolution using a PRS that captured more of the genetic variation in the phenotype. As well as genetic contributions to phenotype, our results shed light on possible environmental contributions. In some cases, we can make hypotheses about the relationship between environmental or lifestyle changes, and genetic change. For example, if we interpret change in femur bending strength as reflecting a decrease in mobility, the coincident Mesolithic/Neolithic change in hBMD PRS can be seen as a genetic response to this change. However, in the Neolithic/post-Neolithic periods, the 2 observations are decoupled. This emphasizes the role of developmental plasticity in response to changes in environment, and of joint interpretation of phenotypic and genetic variables. Even when looking at the same phenotype, we find cases where genetic predictions and phenotypic data are discordant—for example, in post-Neolithic sitting height. We must therefore be cautious in the interpretation of predicted genetic patterns where phenotypes cannot be directly measured, even if it is possible to control stratification. Predicted genetic changes should be used as a baseline, against which nongenetic effects can be measured and tested.

Methods aDNA and Polygenic Risk Score Construction. We collected published aDNA data from 1,071 ancient individuals, taken from 29 publications. The majority of these individuals had been genotyped using an in-solution capture reagent (“1240k”) that targets 1.24 million SNPs across the genome. Because of the low coverage of most of these samples, the genotype data are pseudohaploid. That is, there is only a single allele present for each individual at each site, but alleles at adjacent sites may come from either of the 2 chromosomes of the individual. For individuals with shotgun sequence data, we selected a single read at each 1240k site. We obtained the date of each individual from the original publication. Most of the samples have been directly radiocarbon dated, or else are securely dated by context. We summarized the genetic relationships between ancient and present-day groups by computing F ST using smartpca v16000 (79) (SI Appendix, Table S1) and multidimensional scaling using pairwise distances computed using plink v1.90b5.3 (options—distance flat-missing 1-ibs) (80) (SI Appendix, Fig. S1C) and unsupervised ADMXITURE (81) (SI Appendix, Fig. S1D). We obtained GWAS results from the Neale Laboratory UK Biobank page (http://www.nealelab.is/uk-biobank/; round 1, accessed February and April 2018). To compute PRS, we first took the intersection of the 1240k sites and the association summary statistics. We then selected a list of SNPs to use in the PRS by selecting the SNP with the lowest P value, removing all SNPs within 250 kb, and repeating until there were no SNPs remaining with P value less than 10−6. We then computed PRS for each individual by taking the sum of genotype multiplied by effect size for all included SNPs. Where an individual was missing data at a particular SNP, we replaced the SNP with the average frequency of the SNP across the whole dataset. This has the effect of shrinking the PRS toward the mean and should be conservative for the identification of differences in PRS. We confirmed that there was no correlation between missingness and PRS, to make sure that missing data did not bias the results (correlation between missingness and PRS, ρ = 0.02; P = 0.44, SI Appendix, Fig. S11). Finally, we normalized the PRS across individuals to have mean 0 and SD 1. We estimated within-family effect sizes from 17,358 sibling pairs in the UK Biobank to obtain effect estimates that are unaffected by stratification. Pairs of individuals were identified as siblings if estimates of IBS0 were greater than 0.0018 and kinship coefficients were greater than 0.185. Of those pairs, we only retained those where both siblings were classified by UK Biobank as “white British,” and randomly picked 2 individuals from families with more than 2 siblings. We used Hail (82) to estimate within-sibling pair effect sizes for 1,284,881 SNPs by regressing pairwise phenotypic differences between siblings against the difference in genotype. We included pairwise differences of sex (coded as 0/1) and age as covariates, and inverse-rank–normalized the phenotype before taking the differences between siblings. To combine the GWAS and sibling results, we first restricted the GWAS results to sites where we had estimated a sibling effect size and replaced the GWAS effect sizes by the sibling effects. We then restricted to 1240k sites and constructed PRS in the same way as for the GWAS results. To test whether the differences in the GWAS and GWAS/Sibs PRS results can be explained by differences in power, we created subsampled GWAS estimates that matched the sibling in the expected SEs, by determining the equivalent sample size necessary and randomly sampling N s u b individuals. N s u b = N s i b / ( 2 v a r ( δ s i b ) ) , where δ s i b is the difference in normalized phenotype between siblings after accounting for the covariates age and sex. Stature Data. We obtained stature data from Ruff (2018) (4) (data file and notes available at https://www.hopkinsmedicine.org/fae/CBR.html), which also includes estimated body mass, femoral midshaft anteroposterior strength (FZx), and other osteometric dimensions. Statures and body masses were calculated from linear skeletal measurements using anatomical reconstruction or sample-specific regression formulae (4, 58). We calculated sitting height as basion–bregma (cranial) height plus vertebral column length. Analysis was restricted to 1,159 individuals dated earlier than 1165 BP (651 males and 508 females), of which 1,130 had estimates for stature, 1,014 for FZx, and 236 for sitting height. Sitting and standing height were standardized for sex by adding the mean difference between male and female estimates to all of the female values. Sex differences in stature remain relatively constant over time (4), making it reasonable to adjust all female heights by the same mean value. Male/female counts in each period were as follows: EUP, 13/10; LUP, 15/8; Mesolithic, 56/37; Neolithic, 130/91; and post-Neolithic, 437/362. For FZx, we first standardized for sex as we did for stature, then divided each by estimated body mass multiplied by biomechanical femur length (4). Grouping. We grouped individuals into broad categories based on date and, in some cases, archeological and genetic context. All individuals were assigned to one time period group, based on median age estimates of the sample obtained from the original publications. Date ranges for each time period are based on a combination of historical, climatic, and archaeological factors. The EUP comprises all samples older than 25,000 BP, which roughly coincides with the end of the Last Glacial Maximum. The LUP begins when the European glaciers are beginning to recede (25,000 BP) and extends until 11,000 BP and a shift in lithic technology that is traditionally used to delineate the beginning of the Mesolithic period. Transitions between the Mesolithic, Neolithic, and Bronze Age are staggered throughout Europe, so creating universally applicable date ranges is not possible. We instead defined overlapping transition periods between the Mesolithic and Neolithic periods (8500 to 5500 BP) and between the Neolithic and post-Neolithic (5000 to 3900 BP). For the genetic data, samples in the overlapping periods were assigned based on genetic population affiliation, inferred using supervised ADMIXTURE (81), which, in most of Western Europe, corresponds closely to archaeological context (38, 48). In particular, the Mesolithic/Neolithic overlap was resolved based on whether each individual had more (Neolithic) or less (Mesolithic) than 50% ancestry related to northwest Anatolian Neolithic farmers. The Neolithic/post-Neolithic overlap was resolved based on whether individuals had more than 25% ancestry related to Bronze Age Steppe populations (“Steppe ancestry”; see ref. 83 for details). For the skeletal data, group assignment in the overlapping periods was determined by the archaeology of each site. Broadly, sites belonging to the Neolithic have transitioned to agricultural subsistence. Similarly, post-Neolithic populations are broadly defined by evidence of metal working (Copper, Bronze, and Iron Ages, and later periods). In particular, we included Late Eneolithic (Copper Age) sites associated with Corded Ware and Bell Beaker material culture in the post-Neolithic category, but for consistency with the genetic classifications, we included 8 Early Eneolithic (before 4500 BP) individuals in the Neolithic category, since this precedes the appearance of Steppe ancestry in Western Europe. We excluded samples more recent than 1165 BP. Linear Models. We fitted a series of linear models to changes in both PRS and stature data with time. In the most general model, we allow both the intercept and slope to vary between groups. We then either force some of the slopes to be zero, or some of the adjacent groups to have identical parameters. We describe the models using underscores to indicate changes in parameters, lowercase to indicate slopes (change with respect to time) fixed to zero, and uppercase to indicate free slopes (i.e., linear trends with time). For example, “E_L_M_N_B” is the most general model, “elmnb” indicates that all groups have the same mean and there is no change with time, and “ELMN_B” indicates that the first 4 groups share the same parameters, and the post-Neolithic has different parameters. The models shown in Figs. 1 and 2 are “e_lmn_b” (A and B), “e_lm_nb” (C), “ELMN_B” (D and E), and “ELM_NB” (F). To analyze geographic variation, we used the residuals of the “ELMN_B” model for the PRS and “ELM_NB” for skeletal stature, and fitted regressions against latitude and longitude. Polygenic Selection Test. We computed bootstrap P values for the Q x statistic (73) by recomputing the statistic for random sets of SNPs in matched 5% derived allele frequency bins (polarized using the chimpanzee reference gnome panTro2). For each bootstrap replicate, we keep the original effect sizes but replace the frequencies of each SNP with one randomly sampled from the same bin. Unlike the PRS calculations, we ignored missing data, since the Q x statistic uses only the population-level estimated allele frequencies and not individual-level data. We tested a series of nested sets of SNPs (x axis in Fig. 5), adding SNPs in 100 SNP batches, ordered by increasing P value, down to a P value of 0.1. Simulated GWAS Data. We simulated GWAS, generating causal effects at a subset of around 159,385 SNPs in the intersection of SNPs, which passed QC in the UK Biobank GWAS, are part of the 1240 k capture, and are in the POBI dataset (84). We assumed that the variance of the effect size of an allele of frequency f was proportional to [f(1 − f)]α, where the parameter α measures the relationship between frequency and effect size (85). We performed 100 simulations with α = −1 (the most commonly used model, where each SNP explains the same proportion of phenotypic variance) and 100 with α = −0.45 as estimated for height (85). We then added an equal amount of random noise to the simulated genetic values, so that the SNP heritability equaled 0.5. We tested for association between these SNPs and the simulated phenotypes. Using these results as summary statistics, we computed PRS and Q x tests using the pipeline described above.

Acknowledgments I.M. was supported by a Research Fellowship from the Alfred P. Sloan Foundation (FG-2018-10647) and a New Investigator Research Grant from the Charles E. Kaufman Foundation (KA2018-98559). Skeletal data were collected in collaboration with Brigitte Holt, Markku Niskanen, Vladimir Sladék, and Margit Bernor, with the support of the National Science Foundation (BCS-0642297 and BCS-0642710) and the Grant Agency of the Czech Republic and the Academy of Finland and Finnish Cultural Foundation. We thank Jeremy Berg and Eva Rosenstock for helpful comments on an earlier version of the manuscript. This research made use of the UK Biobank Resource under Application 33923. The project was initially conceived during discussions at the workshop “Human Stature in the Near East and Europe in a Long-Term Perspective” at the Freie Universität Berlin, April 25–27, 2018, organized as part of the Emmy-Noether-Projekt “LiVES” funded by German Research Foundation Grant RO4148/1 (principal investigator, Eva Rosenstock).

Footnotes Author contributions: S.L.C., C.B.R., R.M.M., and I.M. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1910606116/-/DCSupplemental.