Under a complex demographic model that recapitulates the human expansion out of Africa, with empirical and variable recombination rates, age estimation in GEVA maintained a similarly high level of accuracy (ε = 0.198, ρ = 0.937; Fig 2B). In this situation, PSMC modeled the more dynamic demographic histories between haplotypes with higher accuracy than GEVA (correlation of true and inferred TMRCA for discordant pairs: ρ = 0.915 versus ρ = 0.775). However, its time discretization resulted in artifacts at more recent times (TMRCA for concordant pairs: ρ = 0.892) that were not present in GEVA (ρ = 0.932), leading to worse performance when estimating allele age (ε = 0.409, ρ = 0.900; S2 Fig), at substantially higher computational cost. We next introduced realistic data complications by reproducing empirically estimated genotype errors in simulated data (see S2 Text), as well as errors arising through in silico haplotype phasing (Fig 2B). We found age estimation in GEVA to remain largely unbiased and strongly correlated with true age after the inclusion of data error (ε = 0.346, ρ = 0.925; S3 Fig) and after phasing (ε = 0.430, ρ = 0.921; S4 Fig). The PSMC-based approach continued to show higher bias and reduced correlation at the same set of variants, both after error (ε = 1.042, ρ = 0.882) and after phasing (ε = 1.009, ρ = 0.880). Sequencing errors may introduce false signals of pairwise differences between haplotypes, and phasing errors may lead to an underestimation of haplotype lengths at variants that are relatively young, for which we then overestimate the TMRCA and hence allele age (particularly for alleles younger than approximately 100 generations).

We found that PSMC-based age estimates performed similarly well to those from GEVA (ρ = 0.952), though the time discretization increased bias (ε = 0.530) and, in particular, led to overestimation of the age for the youngest variants. We note that PSMC was not designed for this purpose and hence is not optimized for estimating allele age. Pairwise estimates of the TMRCA between concordant haplotypes were highly correlated with true TMRCA in both GEVA (ρ = 0.922) and PSMC (ρ = 0.919), but the correlation for discordant pairs was lower in GEVA (ρ = 0.586) than in PSMC (ρ = 0.766); see S1 Fig. Such differences are tolerated when estimating allele age because the time of mutation is estimated from the composite distribution of TMRCA posteriors from many pairwise comparisons performed at a single locus. Events that occurred distant in time (relative to the time of mutation) therefore have little influence on the estimate.
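To make the composite idea concrete, the following is a minimal sketch rather than the GEVA implementation. It assumes that each concordant pair contributes the probability that its TMRCA is no older than a candidate age, that each discordant pair contributes the probability that its TMRCA is no younger, and that pairs are independent. The log-normal stand-in posteriors, the function names, and the toy values are all hypothetical.

```python
import math

def lognormal_cdf(t, median, sigma):
    """CDF of a log-normal distribution, used here as a stand-in TMRCA posterior."""
    z = (math.log(t) - math.log(median)) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def composite_age_mode(concordant, discordant, grid):
    """Return the grid point with the highest composite score.

    concordant, discordant: lists of (median, sigma) describing each pair's
    TMRCA posterior (hypothetical parameterization).
    """
    best_t, best_logscore = None, -math.inf
    for t in grid:
        logscore = 0.0
        # Concordant pairs bound the age from below: P(TMRCA <= t).
        for median, sigma in concordant:
            logscore += math.log(max(lognormal_cdf(t, median, sigma), 1e-300))
        # Discordant pairs bound the age from above: P(TMRCA >= t).
        for median, sigma in discordant:
            logscore += math.log(max(1.0 - lognormal_cdf(t, median, sigma), 1e-300))
        if logscore > best_logscore:
            best_t, best_logscore = t, logscore
    return best_t

# Candidate ages from 10 to 10,000 generations, log-spaced.
grid = [10 ** (1 + 3 * i / 300) for i in range(301)]
concordant = [(300, 0.5), (450, 0.5), (500, 0.5)]
discordant = [(900, 0.5), (1200, 0.5), (5000, 0.5)]
age = composite_age_mode(concordant, discordant, grid)
```

In this toy example the discordant pair with a TMRCA near 5,000 generations barely changes the score anywhere near the mode, which illustrates why distant events have little influence on the estimate.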

(A) Density scatterplots showing the relationship between true allele age (geometric mean of lower and upper age of the branch on which a mutation occurred; x axis) and estimated allele age (y axis), using GEVA with the in-built HMM methodology (left) and PSMC (right) for the same set of 5,000 variants. Data were simulated under a neutral coalescent model with sample size N = 1,000, effective population size N_e = 10,000, and with constant and equal rates of mutation (μ = 1 × 10⁻⁸) and recombination (r = 1 × 10⁻⁸) per site per generation. Variants were sampled uniformly from a 100-Mb chromosome, with allele count 1 < x < N. Colors indicate relative density (scaled by the maximum per panel). Upper inserts indicate the fraction of sites where the point estimate (mode of the composite posterior distribution) of allele age lies above the upper age of the branch on which it occurred (^), below the lower age (˅), or within the age range of the branch (∘). Lower inserts indicate the Spearman rank correlation statistic ρ, squared Pearson correlation coefficient (on log scale) r², interval-adjusted bias metric ε (see S2 Text), and RMSLE. Also shown is a LOESS fit (second-degree polynomials, neighborhood proportion α = 0.25; dashed line). (B) The relationship between true and inferred ages for 5,000 variants sampled uniformly from a simulation under a complex demographic model with N = 1,000, N_e = 7,300, μ = 2.35 × 10⁻⁸, and variable recombination rates from human chromosome 20 (63 Mb). Allele age was estimated on haplotype data as simulated and without error (top), with error generated from empirical estimates of sequencing errors (middle), and with additional error arising from in silico haplotype phasing (bottom); see S2 Text. Allele age was estimated using scaling parameters as specified for each simulation. A further breakdown of results using mutation and recombination clocks alone, as well as the inferred pairwise TMRCAs, is available for A (S1 Fig) and B (S2 Fig, S3 Fig, S4 Fig). GEVA, Genealogical Estimation of Variant Age; HMM, hidden Markov model; LOESS, locally estimated scatterplot smoothing; PSMC, pairwise sequentially Markovian coalescent; RMSLE, root mean-square log₁₀ error; TMRCA, time to the most recent common ancestor.
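Two of the summary statistics named in this caption, the Spearman rank correlation and the RMSLE, can be computed with standard-library Python. The sketch below is illustrative only: the function names and toy ages are assumptions, and the interval-adjusted bias ε is omitted because its definition is given in S2 Text.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmsle(true_ages, estimated_ages):
    """Root mean-square error of log10(estimate) against log10(truth)."""
    squared = [(math.log10(e) - math.log10(t)) ** 2
               for t, e in zip(true_ages, estimated_ages)]
    return math.sqrt(sum(squared) / len(squared))

# Toy ages in generations: every estimate is off by a factor of two,
# but the ordering of variants by age is preserved.
true_ages = [10, 100, 1000, 10000]
estimated_ages = [20, 50, 2000, 5000]
rho = spearman(true_ages, estimated_ages)
error = rmsle(true_ages, estimated_ages)
```

The toy data show why both statistics are reported: a rank correlation of 1 can coexist with a nonzero error on the log scale.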

To validate GEVA, we performed coalescent simulations under different demographic models; see Fig 2. Using a standard coalescent model with constant mutation and recombination rates, we found low bias (relative error, ε = 0.268; see S2 Text) for allele age estimates and high correlation between true and inferred age (Spearman's ρ = 0.953; Fig 2A). We compared our approach for estimating the TMRCA to the computationally more demanding pairwise sequentially Markovian coalescent (PSMC) methodology [13], which forms the basis of many applications in ancestral inference [14, 26]. PSMC estimates a model of the demographic history between pairs of chromosomes (over a discretized grid of time intervals) and can, for every position in the genome, return the inferred posterior distribution on the TMRCA, thus enabling a composite-likelihood estimator of allele age as in GEVA.

Finally, we considered the variant rs80194531, in which the derived allele causes an Asn78Thr substitution in the Zinc finger E-box–binding homeobox 1 (ZEB1) gene. The variant is reported as pathogenic for corneal dystrophy [43] but is present at 6% in African ancestry samples within the TGP or the SGDP. We estimated the age of the variant to be 5,866 generations old (120,000 to 180,000 years), again with consistency between the TGP and the SGDP (5,879 and 5,854 generations, respectively; Fig 3C; https://human.genome.dating/snp/rs80194531). Such an ancient age seems inconsistent with the reported dominant pathogenic effect [43]. Moreover, of the 1.2 million variants found at comparable frequencies (5%–7%) in African ancestry individuals within the TGP, we found that 46% were estimated to be younger than the rs80194531 allele, suggesting that the age of this variant is not unusual for its frequency.

We next considered the protein-coding missense variant rs3827760 in the Ectodysplasin A Receptor (EDAR) gene, where the derived G allele (Val370Ala substitution) is found at high frequency in East Asian populations (87% in the TGP, 82% in the SGDP) and American populations (39% and 80%, respectively) and is associated with sweat, facial and body morphology, and hair phenotypes [41, 42]. We estimated the variant to be 1,456 generations old, approximately 29,000 to 44,000 years (Fig 3B; https://human.genome.dating/snp/rs3827760), again with strong concordance between the TGP (1,513 generations) and the SGDP (1,346 generations). Our estimate is consistent with previous estimates and limited evidence from ancient DNA studies [7, 44]. Our results further show that most individuals carrying the allele share a common ancestor close to the time the allele arose through mutation. This may suggest that the variant rapidly rose in frequency following its origin, which is consistent with previous findings of strong positive selection of this variant in East Asia [41]. We compared this result to variants of similar age (1,450 ± 100 generations) in the Atlas of Variant Age (see further below): we found 1,043,376 variants dated in the TGP, of which 2,073 have reached a frequency higher than 30% globally and only 130 above 80% in East Asian populations, demonstrating how unusual such a rapid rise in frequency is.

(A) Estimated TMRCAs for concordant (left) and discordant (right) pairs of chromosomes for the derived T allele at rs182549, which lies within an intron of MCM6 and affects regulation of LCT [33], which encodes lactase. Each bar reflects the approximate 95% credible interval (ETPI) for a pair, ordered by posterior mean (black dots). Data from the TGP (green) [2] and the SGDP (orange) [36] were used. The frequency of the variant in the SGDP, the TGP, and the different population groups in the TGP is shown (top left). The inferred allele age in generations from each data source and the combined estimate are shown (bottom right) and converted to an approximate age in years, assuming 20–30 years per generation. See https://human.genome.dating/snp/rs182549 for additional results. (B) As for panel A for the derived G allele of rs3827760, which encodes the Val370Ala variant in EDAR and is associated with sweat and facial and body morphology [41, 42]; also see https://human.genome.dating/snp/rs3827760. Our filtering approach is to remove the smallest number of concordant and discordant pairs necessary (shown in pink) to obtain concordant and discordant sets with nonoverlapping mean posterior TMRCAs. (C) As for panel A for the derived C allele of rs80194531, which encodes the Asn78Thr substitution in ZEB1, reported as pathogenic for corneal dystrophy [43]; also see https://human.genome.dating/snp/rs80194531. Abbreviations refer to ancestry groups. AFR, African; AMR, American; EAS, East Asian; EDAR, Ectodysplasin A Receptor gene; ETPI, equal-tailed probability interval; EUR, European; GEVA, Genealogical Estimation of Variant Age; LCT, Lactase gene; MCM6, Minichromosome Maintenance Complex Component 6 gene; SAS, South Asian; SGDP, Simons Genome Diversity Project; TGP, 1000 Genomes Project; TMRCA, time to the most recent common ancestor; ZEB1, Zinc finger E-box–binding homeobox 1 gene.
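The filtering rule described for panel B can be sketched as a threshold search. This is one possible reading of the rule, not the published implementation: it assumes that "nonoverlapping" means every retained concordant mean lies at or below a cut point and every retained discordant mean lies above it. The function name and toy values are hypothetical.

```python
def filter_pairs(concordant_means, discordant_means):
    """Remove the fewest pairs so that kept concordant means do not
    overlap kept discordant means.

    Returns (kept_concordant, kept_discordant, n_removed).
    """
    # Only observed values (plus "below everything") need to be tried as cut points.
    candidates = [float("-inf")] + sorted(set(concordant_means) | set(discordant_means))
    best_removed, best_tau = None, None
    for tau in candidates:
        # Concordant pairs above the cut and discordant pairs at or below it are dropped.
        removed = (sum(c > tau for c in concordant_means)
                   + sum(d <= tau for d in discordant_means))
        if best_removed is None or removed < best_removed:
            best_removed, best_tau = removed, tau
    kept_c = [c for c in concordant_means if c <= best_tau]
    kept_d = [d for d in discordant_means if d > best_tau]
    return kept_c, kept_d, best_removed

# Toy posterior mean TMRCAs in generations; one concordant pair (950) is an outlier.
concordant_means = [120, 300, 410, 950]
discordant_means = [600, 800, 1500]
kept_c, kept_d, n_removed = filter_pairs(concordant_means, discordant_means)
```

The exhaustive search is quadratic in the number of pairs, which is adequate for illustration; a sorted sweep would give the same answer faster.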

To evaluate the performance of GEVA on empirical data, we first considered variants affecting the well-studied lactase persistence (LP) trait, for which numerous approaches, including the use of archaeological data, genetic data, and a biological understanding of the functional and evolutionary impact of previously associated variants, have resulted in consensus expectations for the age. The LCT gene encodes the lactase enzyme but is regulated by variants in an intron of the neighboring MCM6 gene (Minichromosome Maintenance Complex Component 6). We estimated the age of the derived T allele of the rs182549 variant (G/A-22018), which is at a frequency of approximately 50% in European populations and forms part of a haplotype associated with LP [33]. Under a model that jointly considers mutational and recombinational information, we estimated the allele to be 688 generations old (Fig 3A), originating approximately 14,000 to 21,000 years ago, depending on assumptions about generation time in humans [34, 35]. Our estimate is based on data from two different sources, the 1000 Genomes Project (TGP) [2] and the Simons Genome Diversity Project (SGDP) [36], which, when estimated separately, give very similar ages (692 and 687 generations, respectively). The full result data set for this variant is available online: https://human.genome.dating/snp/rs182549. We obtained a similar age estimate of 693 generations for the derived A allele of the rs4988235 (C/T-13910) variant (see https://human.genome.dating/snp/rs4988235), which is also strongly associated with LP and in near perfect association with rs182549, though we note that there is evidence for multiple origins of the variant [37]. Previous estimates of the age of these variants range between 2,200 and 21,000 years [38], putting our estimates on the higher end of this range. Multiple sources of information suggest that these variants only achieved high frequency in European populations within the last 10,000 years (approximately 400 generations) [39] and that LP alleles were rare until the advent of dairy farming in Europe [40]. Our results therefore suggest that the mutation conferring the strongly selected phenotype (estimated to have a selection coefficient of up to 15% in European and up to 19% in Scandinavian populations [39]) was present for hundreds of generations before its rapid sweep through the population.
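The conversion from generations to years used throughout is simple arithmetic under the assumption of 20 to 30 years per generation stated in the Fig 3 caption. A small sketch (the function name is hypothetical):

```python
def generations_to_years(generations, years_per_generation=(20, 30)):
    """Convert an age in generations to a (low, high) range in years."""
    low, high = years_per_generation
    return generations * low, generations * high

lp_age_years = generations_to_years(688)  # (13760, 20640)
```

Rounded, this gives the range of approximately 14,000 to 21,000 years quoted for rs182549.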

Such heterogeneity in the relationship between allele age and frequency, coupled with heterogeneous and unknown sampling strategies, complicates the use of frequency as a means of assessing variants for potential pathogenicity during the interpretation of individual genomes. The Atlas of Variant Age potentially offers a more direct approach for screening variants, given the high probability of elimination of nonrecessive deleterious variants within a few generations [50]. To assess the value of allele age in the interpretation of potentially pathogenic variants, we estimated the ages of variants that had effects predicted as damaging by Polymorphism Phenotyping v2 software (PolyPhen-2) [51] or deleterious by Sorting Intolerant From Tolerant software (SIFT) [52] in the TGP (Fig 4B). Of the approximately 70,000 variants analyzed, 50% of damaging and 49% of deleterious variants were estimated to have arisen within the last 500 generations (10,000 to 15,000 years), compared to 41% of benign (PolyPhen-2) and 42% of tolerated (SIFT) variants (S7 Fig). Compared with control sets of variants (those annotated as benign or tolerated and matched for allele frequency within the focal ancestry group), variants annotated as damaging or deleterious had a notable dearth of older variants (>1,000 generations) for a given frequency, consistent with theoretical expectations and previous findings [19, 53, 54]. Our results suggest that old alleles can largely be excluded from consideration of pathology (though recent origin is not evidence in favor of pathogenicity).

(A) The relationship between estimated allele age and frequency as observed within a given population group in the TGP sample. Of the 45.4 million variants available in the Atlas of Variant Age, 43.2 million were dated using TGP data alone; we excluded variants with low estimation quality and inconsistent ancestral allele information (see S3 Text), retaining 34.4 million variants. Each line shows the cumulative age distribution of variants within a given frequency bin (see legend) within a population group; circles indicate median and interquartile range. Panels on the left show the frequency-stratified cumulative distribution of estimated age for variants at nonzero frequencies as observed within a given ancestry group. The number of variants available per group is shown (top left). Panels on the right show the distributions of geographically restricted variants that only segregate within a group (number of available variants shown on bottom right). A summary of variants shared between different ancestry groups in the TGP is provided in S6 Fig. (B) Differences in allele age distributions for approximately 70,000 variants in the TGP that are annotated as impacting protein function by PolyPhen-2 (left) and SIFT (right), compared to a reference set of variants (those annotated as benign by PolyPhen-2 or tolerated by SIFT), matched for allele frequency within a given ancestry group. These results are presented in more detail in S7 Fig. AFR, African; AMR, American; EAS, East Asian; EUR, European; PolyPhen-2, Polymorphism Phenotyping v2 software; SAS, South Asian; SIFT, Sorting Intolerant From Tolerant software; TGP, 1000 Genomes Project.

We find substantial variation in the relationship between estimated age and allele frequency, depending on the population in which frequency is measured and the geographical distribution of the variant (Fig 4A). Variants in African ancestry groups are typically older than in other groups and also have the greatest variance in age for a given frequency. For example, variants below 0.5% (within a given population) have a median age of 670 generations in African ancestry groups, 377 generations in East Asian ancestry groups, and 488 generations in Europeans (see S2 Table). The age distribution of variants restricted to a particular ancestry group (or shared between them) indicates the degree of connection between populations. For example, there are many variants up to 5,000 generations old (100,000 to 150,000 years) that are restricted to African ancestry groups yet are observed at frequencies up to 10%, but variants in this frequency range that are restricted to East Asian or South Asian ancestry groups are typically under 1,000 generations (20,000 to 30,000 years) or 1,300 generations old (26,000 to 39,000 years), respectively. Conversely, cosmopolitan variants that are shared among every ancestry group are typically older than 2,000 generations (40,000 to 60,000 years) despite being observed at global frequencies below 0.5% (S6 Fig). Variants restricted to American ancestry groups are typically younger than 750 generations (15,000 to 22,500 years), consistent with existing knowledge about the settlement of the Americas via the Bering land bridge that connected Asia and North America during the last glacial maximum around 15,000 to 23,000 years ago [45, 46]. We note, however, that recent admixture and the sampling strategies of the different data sets [47, 48] can have a strong impact on age distributions. For example, variants at high frequency within American populations but that are nevertheless restricted to just American and African populations are, on average, younger than lower-frequency variants (within American populations) with the same geographical restriction (S6 Fig). These variants likely arose recently within Africa and entered American populations through admixture, rising to high frequency through population bottlenecks [49]. Similarly, variants found only within European and African populations but that have a frequency below 0.5% in Europeans are, on average, older than variants observed at higher frequencies in Europeans and older than variants restricted to only Europeans in the same frequency range, suggesting recent gene flow from Africa into Europe.

We next sought to characterize the age distribution of genetic variation across the human genome, for which we applied GEVA to more than 45 million variants identified in the TGP or the SGDP. More than 32 billion haplotype pairs were analyzed to estimate shared haplotype segments and TMRCAs. For variants present in both data sources (13.7 million), we additionally estimated the age by combining pairwise TMRCA distributions that were inferred independently in each sample after confirming that separately obtained age estimates agreed (Spearman’s ρ = 0.862; see S5 Fig). We make this information, referred to as the Atlas of Variant Age for the human genome, publicly available as an online database (https://human.genome.dating). A breakdown by chromosome of the number of variants dated and haplotype pairs analyzed is given in S1 Table. Further details are given in S3 Text.

Shared ancestry

Finally, we investigated the extent to which patterns of sharing of variants of different ages could power approaches for learning about genealogical history. Previous work has highlighted the descriptive value of genetic variants in identifying individuals who share recent common ancestry and patterns of demographic isolation and migration [10, 55, 56], though it has also highlighted the challenges of interpreting the output of approaches such as principal component analysis (PCA) [57, 58]. Conversely, numerous model-based approaches have been developed that use patterns of variant and haplotype sharing to infer underlying demographic parameters [14, 59–61], though these typically make strong simplifying assumptions about the space of possible histories.

Here, we present a nonparametric approach for combining descriptive and inferential approaches to learn about ancestral connections between individual genomes (and groups of individuals) based on variant age information. Unlike existing methods that assign time-invariant ancestry proportions to individual genomes by reference to contemporary populations [9, 62], we can estimate the fraction a given genome shares because of common ancestry with any other genome at different points in time, referred to as the cumulative coalescent function (CCF). We use a fast dynamic-programming approach to estimate a maximum likelihood CCF between any pair or group of individuals (Fig 5A; see S4 Text). Using simulated data, we found that inference of ancestral relationships between individuals (CCFs inferred from estimated allele ages) correctly reflected differences of relatedness among individuals within and between different ancestry groups and revealed patterns qualitatively consistent with past demographic events (see S5 Text), though we note that uncertainty in variant age estimates may cause oversmoothing of coalescent profiles, in particular in the distant past (>20,000 generations ago).
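As an illustration of a monotone maximum likelihood fit of this kind, the sketch below uses the pool-adjacent-violators algorithm, which is a stand-in and not the dynamic-programming procedure of S4 Text. It assumes that a derived variant of age t in the target genome is shared with the comparator whenever the two genomes coalesced more recently than t at that locus, so that the probability of sharing is a non-decreasing function of age; it also assumes independent variants and error-free ages. Function and variable names are hypothetical.

```python
def estimate_ccf(ages, shared):
    """Non-decreasing maximum likelihood fit of P(shared | age).

    ages:   estimated allele ages of derived variants in the target genome
    shared: 1 if the comparator genome carries the variant, else 0
    Returns (sorted_ages, fitted_ccf_values).
    """
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    # Each block stores [number of shared variants, number of variants].
    blocks = []
    for i in order:
        blocks.append([float(shared[i]), 1])
        # Pool neighbouring blocks while they violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return [ages[i] for i in order], fitted

# Toy data: sharing becomes more likely for older variants.
ages = [50, 120, 300, 800, 2000, 5000, 12000, 30000]
shared = [0, 0, 1, 0, 1, 1, 1, 1]
sorted_ages, ccf = estimate_ccf(ages, shared)
# ccf == [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
```

The shared variant at 300 generations followed by an unshared one at 800 generations is pooled into a single step of 0.5, which is how the fit absorbs local violations of monotonicity.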

Fig 5. Age-stratified variant sharing to characterize ancestral relatedness. (A) Overview of approach for estimating the CCF for a pair of haplotypes, the fraction of the genomes of the two samples that have coalesced by a given time. Derived variants within a target genome are identified, their estimated ages are obtained from the Atlas of Variant Age, and their presence (black circles) or absence (white circles) in another (comparator) genome of interest is recorded. Variants are sorted by allele age (indicated by color), t, to obtain a naive maximum likelihood estimate of the CCF, Λ(t), using dynamic programming (assuming independence of variants and ignoring error in variant age estimates). (B) Selected pairwise CCFs for the two haploid genomes (chromosome 5) of HG00733, a Puerto Rican individual from the TGP (top: maternally derived; bottom: paternally derived), compared to 8 haplotypes from 4 individuals, including their mother and father. Maternal and paternal genomes were used for phasing; hence, the inferred parental genomes are the transmitted (and untransmitted) genomes. The CCFs inferred with genomes from the entire TGP sample are shown in S1 Movie. The full result data set for HG00733 is available at https://human.genome.dating/ancestry/HG00733. (C) Inferred genome-wide CCFs (averaged per diploid individual across autosomes) for a Siberian Eskimo from the SGDP (ID: S_Eskimo_Sireniki-1) to all other sampled individuals (top panel). Colors indicate ancestry by geographic region (see legend). The CCF can also be expressed as a CIF (middle panel) to reflect the increase in shared ancestry within a given time period. Each row represents an individual from the SGDP, ordered by the AUC of the CCF and scaled such that the maximum per column is equal to one. The color bar (right) indicates the ancestry group of sorted individuals. The CIF within any time epoch can be expressed as an effective population size (N_e equivalent) from the maximum over reference samples, providing a summary of the rate at which common ancestor events occurred (bottom panel). The full result data set for this individual is available at https://human.genome.dating/ancestry/S_Eskimo_Sireniki-1. AFR, African; AMR, American; AUC, area under the curve; CCF, cumulative coalescent function; CIF, coalescent intensity function; EUR, European; SGDP, Simons Genome Diversity Project; TGP, 1000 Genomes Project. https://doi.org/10.1371/journal.pbio.3000586.g005

To illustrate the value of this nonparametric approach in describing the history of individuals and groups, we first considered the coalescent history between a single individual of American (Puerto Rican) ancestry from the TGP (ID: HG00733) and all others in the TGP sample, using GEVA age estimates for variants on chromosome 5 (Fig 5B). As a positive control, we included the parents of HG00733 (HG00732 and HG00731), who reach a CCF of near 1 in the most recent epoch (though note that the parents were used for haplotype phasing, which estimates transmitted haplotypes; hence the CCF reaching 1 rather than the expected one-half). We show the ancestry of HG00733 shared with every individual in the TGP sample in S1 Movie; also see https://human.genome.dating/ancestry/HG00733. Within the first 100 generations, we see additional coalescence with the untransmitted parental chromosomes and other individuals from the Puerto Rican sample. The earliest common ancestry outside Puerto Rico is seen with a European individual in Spain (paternal side; approximately 60 generations ago) and a Mexican ancestry individual (maternal side; approximately 80 generations ago). Coalescence with individuals sampled from outside the Americas occurs further back in time (>100 generations ago), initially with European individuals (predominantly around 300 to 600 generations ago), then uniformly with non-African individuals around 1,500 to 4,000 generations ago, more strongly with African individuals around 5,000 to 15,000 generations ago, and uniformly with all individuals around 20,000 generations ago. Because of the impact of data errors on age estimation of recent variants (highlighted above), the absolute timings of the early events are likely substantially overestimated. However, we expect the relative ordering of events to be robust and consistent among sample comparisons (see S5 Text).

The CCFs to all other members of a reference panel (averaged across all chromosomes in both haploid genomes) provide an overview of the genealogical relationships for a target individual. As an example, we inferred the CCF profiles of a Siberian Eskimo to all other individuals in the SGDP (Fig 5C, top), showing common ancestry to other Central Asian and Siberian individuals within a few hundred generations, substantial common ancestry with American individuals before 1,000 generations, and typically more recent common ancestry with East Asians than West Eurasians or Africans; see https://human.genome.dating/ancestry/S_Eskimo_Sireniki-1. Notably, relatively little additional coalescence is seen during a period around 2,000 to 10,000 generations ago, which is a pattern shared among non-African individuals (also see S1 Movie) and agrees with previous findings of a period of reduced coalescence, peaking 100,000 to 200,000 years ago [14].

The CCF can also be represented as a coalescent intensity function (CIF; see S4 Text), which measures the rate of change of common ancestry over time (Fig 5C, middle). The CIF reveals additional structure; for example, around 3,000 to 20,000 generations ago, those parts of the Siberian Eskimo's genome that have not yet coalesced with other genomes sampled from the same ancestry group have a very low CIF, while the CIF to the African ancestry samples (which have had very little coalescence until this point) is relatively high (though note the absolute rate remains very low over this period).

We can further summarize the coalescent profiles for the target individual by computing the maximum CIF over the sample. This statistic captures properties analogous to the effective population size parameter, N_e, in population genetics modeling, in which the expected coalescent intensity is inversely related to population size. We refer to the maximum CIF in the following as the N_e equivalent (Fig 5C, bottom), though we note that (in the case in which there are genuine populations) it is likely to be downward biased compared to existing methods; for example, we would expect PSMC [13] to infer absolute values of ancestral population size more accurately. We therefore use N_e equivalents to provide a relative summary of genealogical histories across the entire cohort.
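The relationship between the CCF, the CIF, and the N_e equivalent can be sketched as follows. The exact definitions are given in S4 Text, so the forms used here are assumptions: the CIF in an epoch is taken to be the increase in CCF divided by the fraction not yet coalesced at the start of the epoch and by the epoch length, and the N_e equivalent is taken to be 1/(2 × maximum CIF), following the pairwise coalescence rate of 1/(2N) per generation in a diploid population of constant size. All names and values are hypothetical.

```python
def coalescent_intensity(times, ccf):
    """Per-generation coalescence rate in each epoch between consecutive time points."""
    rates = []
    for k in range(1, len(times)):
        remaining = 1.0 - ccf[k - 1]        # fraction not yet coalesced
        dt = times[k] - times[k - 1]        # epoch length in generations
        gained = ccf[k] - ccf[k - 1]        # fraction coalescing in this epoch
        rates.append(gained / (remaining * dt) if remaining > 0 else 0.0)
    return rates

def ne_equivalent(cifs_by_comparator):
    """N_e equivalent per epoch from the maximum CIF over all comparators."""
    n_epochs = len(next(iter(cifs_by_comparator.values())))
    result = []
    for k in range(n_epochs):
        max_cif = max(cif[k] for cif in cifs_by_comparator.values())
        result.append(1.0 / (2.0 * max_cif) if max_cif > 0 else float("inf"))
    return result

# Toy CCFs for one target genome against two comparators.
times = [0, 1000, 2000]
ccfs = {
    "comparator_1": [0.0, 0.05, 0.0975],
    "comparator_2": [0.0, 0.01, 0.03],
}
cifs = {name: coalescent_intensity(times, ccf) for name, ccf in ccfs.items()}
ne = ne_equivalent(cifs)  # approximately 10,000 in both epochs
```

Because the maximum is taken over comparators, the most closely related reference sample determines the value in each epoch, so the statistic is better read as a relative summary than as an absolute population size.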

We estimated all pairwise CIFs in TGP (S2 Movie), which we aggregated across autosomes to generate a coalescent profile between each pair of individuals (see S6 Text). Likewise, we estimated the ancestry shared between each individual pair in SGDP (S3 Movie); CIFs were further aggregated among the 130 population groups (Fig 6). These reveal how the rates and structure of coalescence have changed over time, with the most recent epoch (up to approximately 200 generations ago) dominated by coalescence within populations, but also identify recent connections between groups such as between southern Siberian and northern East Asian populations (Fig 6A). Several populations such as the Kusunda (Nepal), Saami (Finland), and Negev Bedouins (Israel) show strong within-group coalescence up to this point, though by 500 generations ago they are coalescing primarily with other populations. The epoch around 800 generations ago is dominated by structure broadly corresponding to the continental level (Fig 6B), though some African populations—for example, the Mbuti (Congo), but more dramatically the Khomani San (South Africa) and Ju'hoansi (Namibia)—remain isolated up to approximately 1,500 generations ago (30,000 to 45,000 years), which overlaps with previous findings [63], and we see these two populations to be strongly connected for an extended period further back in time (S3 Movie). Around 800 generations ago, there is very little remaining structure among West Eurasian populations, but many additional intercontinental connections are now identified. For example, we see a north-to-south gradient of decreasing coalescence between American populations and Siberian or East Asian populations. 
In particular, we identify strong coalescence of American ancestry individuals with Siberian Eskimos, Aleutian Islanders, and Tlingit people in a period between 500 and 1,000 generations ago and very little structure among American, Siberian, and East Asian populations as a whole further back than 1,000 generations ago (S3 Movie), which agrees with previous results regarding the human migration into the Americas, extended isolation, and subsequent dispersal across the continent [46]. By 4,000 generations ago, we see high levels of coalescence between non-African and African populations (Fig 6C) and essentially no structure in the epoch around 20,000 generations ago (Fig 6D). We note that the exact timings of demographic events (signified by periods of intense or reduced coalescence), while showing consistent patterns within and among population groups, may carry additional noise due to uncertainty in allele age estimates.

Fig 6. Age-stratified connections between ancestry groups in the publicly available SGDP sample. The CCF was inferred for all 556 haploid target genomes with all other comparator genomes in the SGDP sample and then aggregated by ancestry group (mean of CCFs from individuals within a population) and across chromosomes, with populations as defined in the SGDP (see legend on the right). (A–D) The ancestry shared between populations is indicated by the CIF over a given time interval (epoch), shown as a matrix with populations sorted from north to south within continental regions. Intensities were computed from aggregated CCFs to summarize relationships between populations; colors indicate intensity scaled per target population (rows) by the maximum over comparator populations. Ancestral connections are shown at different epochs back in time: around 200 generations ago (A), 800 generations (B), 4,000 generations (C), and 20,000 generations (D). The conversion (top right) assumes 20–30 years per generation. A more detailed summary, showing the ancestry shared between individuals over a sliding time window (epoch), is shown in S3 Movie. (E) The maximum CIF for individuals from different ancestry groups (continental regions) expressed as effective population size (N_e) equivalents over time, estimated from CCFs aggregated per diploid individual and summarized by the median and interquartile range per group. Triangles indicate the epochs shown in panels A–D. A further breakdown of N_e equivalents estimated from nonaggregated CCFs per chromosome is shown in S8 Fig. CCF, cumulative coalescent function; CIF, coalescent intensity function; SGDP, Simons Genome Diversity Project. https://doi.org/10.1371/journal.pbio.3000586.g006