Abstract The founding of New World populations by Asian peoples is the focus of considerable archaeological and genetic research, and there persist important questions on when and how these events occurred. Genetic data offer great potential for the study of human population history, but there are significant challenges in discerning distinct demographic processes. A new method for the study of diverging populations was applied to questions on the founding and history of Amerind-speaking Native American populations. The model permits estimation of founding population sizes, changes in population size, time of population formation, and gene flow. Analyses of data from nine loci are consistent with the general portrait that has emerged from archaeological and other kinds of evidence. The estimated effective size of the founding population for the New World is fewer than 80 individuals, approximately 1% of the effective size of the estimated ancestral Asian population. By adding a splitting parameter to population divergence models it becomes possible to develop detailed portraits of human demographic history. Analyses of Asian and New World data support a model of a recent founding of the New World by a population of quite small effective size.

Citation: Hey J (2005) On the Number of New World Founders: A Population Genetic Portrait of the Peopling of the Americas. PLoS Biol 3(6): e193. https://doi.org/10.1371/journal.pbio.0030193 Academic Editor: Andy G. Clark, Cornell University, United States of America Received: July 12, 2004; Accepted: March 25, 2005; Published: May 24, 2005 Copyright: © 2005 Jody Hey. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Competing interests: The author has declared that no competing interests exist. Abbreviation: IM, isolation with migration

Discussion The method described is one of several new approaches that can glean information about ancestral population sizes [30,45–47]. By including a new parameter for population splitting, it is possible to generate estimates not only of the size of the ancestral population, but also of the founding size of each founder population. Taken together, the analyses in this study suggest a recent founding of the New World Amerind-speaking peoples by a small population of effective size near 70, followed by population growth in the New World. It is interesting that the analyses do not suggest much population size change in Asia since the time of the founding of the New World population. Given the very broad distributions for θ 1 , it is possible that the true value of this parameter is much higher than suggested by the peak location, and that there has been considerable population growth in Asia. The analyses reveal very broad distributions for migration parameters, and although the peak locations suggest that gene flow has been fairly high (2Nm values greater than 1; see Table 3), the estimated probabilities of migration rates having been zero are also high (Figure 3G and 3H). Also, because Eskimo-Aleut and Na Déne speakers were not included in this study, the question of separate migrations for these groups has not been addressed [3]. As parameter-rich as the method is, neither this nor any mathematical model can be expected to fully represent the complex history of two related populations. However, the same is essentially true of narrative models, as investigators are always constrained by limited data and the need to keep explanations as simple as possible given their data. In this light, the IM model provides a fairly complete framework for some oft-debated questions on human history. With the addition of a new parameter, the IM framework can now also be used to address questions about the founding size of populations and of population size change. In the context of human demographic history, the most problematic assumption under the IM model is that each population is panmictic. Certainly this is not the case today, and it is likely to have even been less true in times past. This raises the general and important question of how local patterns of population structure affect regional or global estimates of diversity [44,48,49]. Although this question cannot be answered here, the analyses do suggest that some kinds of departures from panmixia have not occurred. For example, if the New World had been founded by a local population that had long been separated from other Asian populations, then the estimate of t would be expected to reflect this older population structure, rather than the founding of the New World. Our generally low estimates of t argue against this scenario. Similarly, if the sampled Asian populations had been highly structured, with many long-separated local populations, then this would have inflated the estimates of N A and N 1 , respectively. However, the generally low estimates of effective population size argue against this particular kind of population structure. The analyses presented here share with some other genetic studies estimated dates for the peopling of the Americas that are more recent than archeologically based estimates [8,9,16]. However, the difficulty of estimating such recent events using genetic data alone should not be overestimated [18]. When considering human populations within the past few tens of thousands of years, two gene copies that share the same haplotype will often have had a common ancestor far longer ago than any of the dates in question. Similarly, genetic evidence on the peopling of the Americas has been interpreted both as consistent with multiple migrations [12] and as indicating just a single founder event [16,19,50]. Divergent interpretations are understandable, given that a finding of two populations that share sequence haplotypes at a locus can be taken as evidence of two quite different models: (1) a recent population separation; or (2) gene exchange between populations. The available data do not yet allow precise estimates of founding time nor of whether there has been gene flow between the New World and Asia following the initial founding event. However, the new method implements a parameter-rich model of divergence and has the potential to recover the history of complex divergence processes. The method can also be applied to a large number of loci, with large sample sizes, and in the future can be expected to provide increasingly detailed portraits of human population divergence.

Materials and Methods Selected loci and samples. Given the prevailing model of the founding of New World populations via a Bering land bridge, the descendant populations were defined as the Amerind speakers of the New World and the peoples of northeastern Asia. Greenberg et al. [3] proposed that New World populations include three linguistic groups (Eskimo-Aleut, Na Déne, and Amerind), each associated with a separate episode or period of migration. Because of the limited number of published comparative DNA sequence studies that include samples from Eskimo-Aleut and Na Déne group, the present study was limited to samples from Amerind-speaking populations. Asian samples were limited to those from China, Mongolia, Korea, and Siberia. These are partly arbitrary boundaries selected as a balance between the need to include as many loci as possible and uncertainty about the present locations of descendants of those Asian populations that gave rise to the founders of the New World. The model fitting requires data from loci that do not show evidence of recombination and that do not show clear evidence of directional or balancing selection. All available datasets from the literature that met these criteria and that had multiple DNA sequences from both of the designated sample regions were selected. The selected loci are listed in Table 1. The input data file is provided in Dataset S1, and a list of sample locations is provided in Protocol S1. Model development. At the center of the method for estimating the parameters is an expression for the posterior probability distribution of model parameters Θ, given the data. For the case of multiple loci where Θ refers to the vector of parameters of the model, X i refers to the data for locus i, and G i is the genealogy for locus i [33]. With n loci, the full set of parameters includes six or seven demographic parameters, depending on the inclusion of s, as well as n locus-specific mutation rate scalars [33]. A genealogy includes the topology of an ultrametric tree, the associated coalescence times, and the times of migrations on each branch of the tree [30]. For a given locus i, the probability f(X i |G i ) is calculated using the mutation model for that locus and the branch lengths in the genealogy. The probability f(G i |Θ) is calculated using expressions from basic coalescent theory [30,51–55]. By integrating over all possible genealogies that are consistent with the data, the results obtained are not conditioned on any particular estimate of the genealogy, and they necessarily incorporate all of the stochastic variance that arises among independent loci under the model. The integration in Equation 1 cannot be solved directly for any but the simplest of models, but it can be approximated using a Markov chain simulation [56]. This approach was originally applied to the IM model by Nielsen and Wakeley [30], and then augmented to include multiple loci [33] and additional mutation models [32,57]. Over the course of a simulation the genealogy for a given locus varies for topology, branch lengths, and migration times. However, the probability of the data for a locus given a particular genealogy, f(X i |G i ), depends only upon the branch lengths and the mutation model for that locus [30]. Although inclusion of s will affect the genealogies that arise in the course of the simulation, there will be no effect on the calculation of the probability of the data for a given genealogy (i.e., f(X i |G i ) is not a function of s), and thus including s has no effect on the applicability of the method to diverse mutation models. In contrast, the probability of a genealogy given a set of parameter values, f(G i |Θ), depends strongly on s because the probability of individual coalescent and migration times are functions of population size. The calculation of f(G i |Θ) is most directly done by taking the product of the probabilities of each of the coalescent and migration events that occur in the genealogy. Griffiths and Tavare [55] developed the general theory for the probability distribution of coalescence times when the population size is changing. Given a function v(τ)= N τ /N 0 of the population size at time τ, relative to that at time 0, they provide a general expression for the distribution of coalescent times. For population 1, the effective size goes from N 1 at time zero, to sN A at time t. If it is assumed that the size change is exponential over this period, then for population 1, and for population 2, One additional complication that arises is that when the population is growing exponentially back into the past (decreasing in size as time moves forward), there is a finite probability that the time to coalescence will be infinity [58]. Thus, for population 1 when sN A is less than N 1 , it is necessary to calculate the probability of coalescence time conditioned on there being a coalescent event. Migration under an exponentially changing population size can also be incorporated under this same framework with two changes. First, unlike coalescence, where the rate is inversely proportional to population size, the rate of migration is directly proportional to population size. Second, as time goes backward in the coalescent, the migration rate from population 1 to population 2 (i.e., m 1 ) actually corresponds to the movement of genes from population 2 to population 1 as time moves forward. This means that in the coalescent under changing population size, we expect that the migration rate from population 1 to 2 will vary with the size of population 2. Thus the corresponding relative rate function for migration from population 2 to population 1 is and for migration in the reverse direction it is These intensity functions for coalescence and migration were used to develop an expression for f(G i |Θ) that includes s, and that could be directly incorporated into the update criteria for all of the demographic, mutation, and inheritance scalars described in Hey and Nielsen [33]. Also needed, in order to allow for changing population size, are the update criteria for s and the update criteria for the genealogies. For s, updates are drawn from a uniform distribution over the user-specified prior range (e.g., in the current study, an interval within the range of 0.5 to 1). An update from s to s* will affect the probability of all genealogies and thus has an acceptance probability, under the Metropolis Hastings criterion, of where n is the number of loci and G i is the current genealogy for locus i (see Equation 3 in Hey and Nielsen [33]). If we assume a uniform prior distribution for s, such that the prior probability of s, f(s), is constant for all s, and if we choose updates such that the q(s* → s) = q(s → s*) [30], then this simplifies to For genealogy updates the same proposal distribution of genealogies that was used in the case without s was retained, and then this proposal distribution was incorporated into the update criteria [59]. If f(G i |s) denotes the probability of the genealogy for locus i, given the other parameters including s, and f(G i ) is the Hastings term for the proposal probability of the genealogy for locus i, given the other parameters1 excluding s, then the update criteria for the genealogy for locus i is Performance. The IM computer program [33] was modified to include the additional parameter. The program is available from http://lifesci.rutgers.edu/~heylab/HeylabSoftware.htm#IM. For the Markov chain simulation that is implemented by the program, it is difficult to assess how well the method works, because of the need to generate large numbers of simulated datasets and because of the long run times required [33]. To conduct testing, a program was written to generate simulated datasets under the models in Figure 1. Datasets were simulated in groups of 10 or 20, each having 10–20 loci, for a given set of parameter values, and for a range of parameter values. Figure 4 shows the marginal posterior densities estimated from each of 20 independent simulations for a case of modest population growth with the following parameter values. θ 1 = 10; θ 2 , = 10; θ A = 10; t = 2.5; s = 0.2; m 1 = 0.04; and m 2 = 0.1. For each parameter, the mean of the 20 estimates is shown, and in general these are fairly close to the true value, though there is considerable variance for the peak locations in individual runs. To test whether the locations of these distributions are consistent with the true values of the parameters (i.e., the values used in the simulations), probabilities were combined by treating each simulation as an independent test of the same hypothesis [60]. For each posterior density p i , i = 1,…,20, is the chance that a parameter value is more extreme (i.e., departs more from the mean of the distribution) than is the actual true value. That is, if x is the area of the curve to the left of the true value then p i = 2x if x < 0.5 and p i = 2(1 − x) if x > 0.5. If the p i 's are uniformly distributed, then the quantity PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 4. The Marginal Densities Obtained by Fitting the Model with Population Size Change to Simulated Data The input parameters for the simulations were as follows: (A) θ 1 = 10; (B) θ 2 = 10; (C) θ A = 10; (D) t =2.5, (E) s = 0.2, (F) m 1 = 0.04; (G) m 2 = 0.2 ; and t = 5 (t/2N A = 0.5). For each simulated dataset, coalescent simulations were done for each of 20 loci with identical mutation rates under an infinite sites mutation model, each with sample sizes of 10 for each of the two populations. Each simulated dataset was analyzed using wide uniform prior distributions for each parameter. Each analysis began with a burn-in period of 300,000 steps followed by a primary chain of 3 million to 10 million steps. The curves for parameters θ 1 through m 2 are shown in (A) through (G), respectively. For each figure, the true parameter value used in the simulations is shown as a black vertical bar, and the mean of the estimates for the 20 simulations (based on peak locations) is shown as a gray vertical bar. https://doi.org/10.1371/journal.pbio.0030193.g004 is χ2 distributed with 40 degrees of freedom (i.e., two times the number of densities). The z values were as follows: θ 1 , 35.5; θ 2 , 26.4; θ A , 41.7; t, 41.1; s, 26.4; m 1 , 29.9; m 2 , 44.1; and the mean of the seven values was 35.0. In the corresponding χ2 distribution, 90% of the probability mass falls above 29.05; 50% falls above 39.3; and 10% falls above 51.8 [61]. Clearly these values are not entirely independent of each other, but they all fall in the middle of the χ2 distribution with a mean (35.0) close to the 50% point of the χ2 distribution (39.3). From these simulations, and many others (additional results provided in Protocol S1), it is clear that sample sizes do need to be large for the posterior distributions to be informative. With data from fewer than five loci or fewer than ten individuals per population per locus, it is often the case that distributions are very flat or that there are multiple peaks. There is a tradeoff in sampling effort required for different kinds of histories. When t is small, sampling effort should be shifted to larger sample sizes per locus, whereas when t is large, sampling effort should be shifted toward more loci. This tradeoff is a byproduct of the fact that the stochastic variance among loci, that is associated with coalescent and migration events in genealogies at times near t, goes up as t increases. Another tradeoff that arises is between s and the migration rate parameters. Just as the frequency of polymorphic sites can be used to estimate changes in population size [62], it can also be appreciated that the information for s must reside in the distribution of times of node intervals in the descendant populations. Migration can have dramatic effects on node interval times within populations. In practice, via simulation, the method does not resolve a sharp peak for s for populations that have had more than moderate amounts of migration (e.g., 2Nm values are greater than 0.5; see Protocol S1). Analyses. Each of the three analyses were done using at least three independent runs, with ten or more independent chains under Metropolis coupling [33] as described by Geyer [63]. Each chain was initiated with a burn-in period of 100,000 updates, and the total run length of each analysis was between 10 million and 30 million updates. The mixing properties of individual runs were monitored by measuring the autocorrelation of individual parameters over the course of the run, and by estimating the effective sample size for each of the parameters as a function of the autocorrelation estimates (see p. 499 in [64]). Analyses were taken to have converged upon the stationary distribution if independent runs generated similar distributions, with each having a lowest effective sample size of 50 for the time parameter (the parameter to show the slowest rate of mixing). To convert estimates of parameters that include the mutation rate to more easily interpreted units, a value of 6 million y since the splitting of human and chimpanzee lineages was used [65–69]. The geometric mean of the human-chimpanzee DNA sequence divergence of each locus, except ATM (see Table 2), was calculated and then used as a molecular clock calibration for converting the estimate of the time parameter, t, to divergence in years. The geometric mean mutation rate across these loci was estimated to be 4.66 × 10−6 mutations per year. The geometric mean is used rather than an arithmetic mean, because under the multilocus model, the mutation rate by which demographic parameters are scaled is the geometric mean of the individual locus-specific mutation rates [33]. To convert the estimates of the population mutation rate parameters (θ 1 , θ 2 , and θ A ) to estimates of effective population size (N 1 , N 2 , and N A , respectively) a measure of mutation rate on a scale of generations is needed. Thus, an assumption was made of 20 y per generation, and the geometric mean divergence between humans and chimpanzees for each species contrast was divided by 12 million y then multiplied by 20 y per generation. These calculations yielded a geometric mean value of 9.32 × 10−5 mutations per generation. These mutation rate values were then used to convert individual θ estimates to effective population size estimates (i.e., θ = 4Nu, and N = θ/4u). Migration parameters in the model can be used to obtain population migration rate estimates (i.e., M = 2Nm, the product of the effective number of gene copies and the per gene copy migration rate) using an estimate of the population mutation rate (θ = 4Nu). Thus θ × m/2 = (4Nu × m/u)/2 = 2Nm [32].

Supporting Information Dataset S1. Peopling of Americas Data File: Nine Loci This is the input file that contains all of the data and that was analyzed using the IM computer program. https://doi.org/10.1371/journal.pbio.0030193.sd001 (582 KB TXT). Protocol S1. Additional Simulations and List of Sample Locations https://doi.org/10.1371/journal.pbio.0030193.sd002 (92 KB DOC).

Acknowledgments John Wakeley, Tad Schurr, and David Meltzer provided input on an early draft of the paper. Rasmus Nielsen provided some helpful suggestions on parameter updating. Thanks also to three reviewers for very helpful suggestions and critique.

Author Contributions JH conceived and designed the model and analyses, selected the datasets, wrote the computer programs, performed the analyses, and wrote the paper.