The starting point for our model was first proposed by Cavalli-Sforza and Edwards [1], and we draw additionally on related models by Nicholson et al. [33] and Coop et al. [48]. Our goal is to provide a statistical framework for inferring population networks that is motivated by an explicit population genetic model, but sufficiently abstract to be computationally feasible for genome-wide data from many populations (say, 10–100). Our primary aim is to represent the topology of relationships between populations, rather than the precise times of demographic events.

Our approach to this problem is to first build a maximum likelihood tree of populations. We then identify populations that are poor fits to the tree model, and model migration events involving these populations. Below, we first describe this approach in an idealized setting, and then describe the modifications necessary for implementation in practice.

Model

In the most simple case, consider a single SNP, and let the allele frequency of one of the alleles at this SNP in an ancestral population be . (We use a lowercase to denote that this is a parameter rather than a random variable. We initially consider distributions conditional on ). Now consider a descendant population . We model , the allele frequency of the SNP in population , as: (1)with (2)where is a factor that reflects the amount of genetic drift that has occurred between the ancestral population and . This Gaussian model was first introduced by Cavalli-Sforza and Edwards [1], and the motivation for this model is outlined in Nicholson et al. [33], if the amount of genetic drift between the two populations is small (at most on a timescale of the same order as the effective population size), then the diffusion approximation to a Wright-Fisher model of genetic drift leads to Equation 2 with , where is the number of generations separating the two populations, and is the effective population size [33]. We do not model the boundaries of the allele frequencies at zero and one, nor do we consider new mutations. This means that this model will be most accurate for alleles that were at intermediate frequency in the ancestral population.

Now consider a descendant population of ; let us call this population , and the allele frequency in the population . Using the same model: (3) (4)where (5)

We can write down the expectation and variance of as: (6) (7)and: (8) (9)We then assume that the amount of genetic drift between all the populations is small. This implies that is well-approximated by in Equation 5, and hence the amount of genetic drift between and is approximately independent of the amount of genetic drift between and [35]. With these simplifications: (10) (11)We thus have a model for , conditional on : (12)

Multiple populations. Now consider a set of four populations, all related to an ancestral population by a tree, as depicted in Figure 1A. Let the allele frequencies in the four populations be denoted , , , and , respectively, and the vector of all four frequencies be . We want to write down a joint distribution for given the tree. We start by writing down the covariance between any two populations with respect to the ancestral allele frequency (i.e. ). This is simply the variance of the common ancestor of the two populations: (13) (14) (15)and so on (Figure 1B). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. Simple examples. A. An example tree. B. The covariance matrix implied by the tree structure in A. Note that the covariance here is with respect to the allele frequency at the root, and that each entry has been divided by to simplify the presentation. C. An example graph. The migration edge is colored red. Parental populations for population 3 are labeled and ; see the main text for details. D. The covariance matrix implied by the graph in C; again, each entry has been divided by . The migration terms are in red, and the non-migration terms are in blue. https://doi.org/10.1371/journal.pgen.1002967.g001 Let us use to denote the variance-covariance matrix of allele frequencies between populations implied by the tree. Now, if we knew the value of , we could model as: (16)where and denotes the multivariate normal distribution. The covariance matrix is distinct from the sample covariance matrix; we return to this distinction later. This multivariate normal model was proposed by [28]. Additionally, a multivariate normal model was used by Coop et al. [48] and Weir and Hill [49], although these authors did not explicitly model the variance-covariance matrix in terms of a tree.

Modeling migration. To extend this framework to include migration, we allow populations to have ancestry from multiple parental populations [35], [40]. The contribution of each parental population is weighted; if we assume admixture occurs in a single generation, these weights are related to the fraction of alleles in the descendant population that originated in each parental population. Consider population 3 in Figure 1C (recall that the allele frequency in this population is ). We have labeled the two parental populations and ; let the allele frequencies in these populations be and , respectively. We model as: (17)where . Note that there is some question as to how to weight , the genetic drift specific to population 3. In reality, it comes from three sources: drift since but before the population mixture, drift since but before the population mixture, and drift since the mixture. These three components should have weights , , and 1, respectively. However, the three components are not all separately identifiable. For ease of computation, we estimate only a single drift parameter in this situation, and weight it by (Text S1, Figure S1). Additionally, there is a choice of whether the edge from or the edge from should be considered the “migration” edge; these two possibilities not identifiable in the model. In practice, we set the edge with the largest weight as the non-migration edge, and the other edge (or edges) as the “migration” edge(s). With these simplifications, the variance of can be written in the mixture case as: (18) (19)We can now consider multiple populations related by a graph instead of a tree (Figure 1C). The variance-covariance matrix can be filled in as before, but now including terms for migration (Figure 1D). This model can be written in terms of a directed acyclic graph (the lack of cycles follows from the fact that no population can contribute genetic material to its own ancestor), where the parameters correspond to edge lengths (Text S1). For subsets of up to four populations, this model is closely related to the “ statistics” used as tests for treeness by Reich et al. [37] (Text S1).

Normalization. As described above, depends on the ancestral allele frequency . This means that the true variance-covariance matrix will differ by a scaling factor between SNPs with different values of . In much work on this type of model, investigators have normalized allele frequencies to account for this. One potential normalization is the arcsine square-root transformation [1]. However, a drawback to this normalization is that it is non-linear; the expected value of the allele frequency in the descendant populations is no longer , but pushed towards the boundaries, which could induce spurious correlations between the most drifted populations [50]. Another plausible transformation would be to scale all allele frequencies by , where is the mean allele frequency across populations [19], [33]. Both of these transformations increase the influence of polymorphisms that were rare in the ancestral population. However, these are precisely the loci where the approximation of our model to a true population genetics model is most likely to break down. For these reasons, we choose to work directly with untransformed allele frequencies.

Properties of the sample covariance. In practice, the multivariate normal model in Equation 16 is impractical because we do not know the ancestral values of allele frequencies, but instead only the values in sampled descendant populations. This means that cannot be estimated directly from data. However, consider instead the sample covariance matrix , where , where , is the number of populations, and and are the sample allele frequencies in populations and . is closely related to as follows: (20) (21) (22) (23)In the following section, we will describe how we perform inference based on the sample covariance matrix .

Estimation from finite samples and many SNPs. Now assume that we have genotyped SNPs in each of populations. Our goal is to use these data to estimate the population history graph : that is, is a rooted, directed, acyclic graph with specified branch lengths and mixture weights, as described above and in Text S1. Let the sample allele frequency at SNP in population be . We can estimate the sample covariance matrix : (24)where . The fact that in practice we have finite samples from each population and a finite number of SNPs has two effects on this matrix. First, sampling leads to a biased estimation of the terms on the diagonal, since sampling can be thought of as adding an amount of “drift” to each population (as well as smaller effects on the off-diagonal terms; see Text S1 for details). Additionally, it leads to some uncertainty in all terms. To account for the biased diagonal terms, in practice we calculate a corrected version of that removes this bias (Text S1). To account for uncertainty in the values of this matrix, we use a block resampling approach (see below). Finally, with multiple SNPs, we are working with SNPs with many different values of . In this case, the terms described above can be thought of as ; i.e., the mean across SNPs of . We now want to write down a likelihood for given (which in turn implies the expected sample covariance matrix ). One possibility would be to use the Wishart distribution, since the sample covariance matrix of multivariate normal random variables has this form. However, computation of the Wishart density involves computationally-intensive matrix inversion, so we took an alternative approach. Consider the observed covariance between populations and , . If we had a large number of independent genomic regions and estimated this quantity separately in each independent region, the sampling distribution would be approximately normal with mean (by appeal to the central limit theorem). We thus model as: (25)where is the standard error in the estimation of . Because the allele frequencies at nearby SNPs are correlated (i.e., there is linkage disequilibrium), a naive estimate of that treated each SNP as independent would be too small. We instead take a resampling approach to estimate . We split the genome into blocks, such that there are SNPs per block (with chosen so that the block sizes are larger than blocks of linkage disequilibrium) [36]. (If does not divide evenly into , we discard the remaining SNPs.) We then calculate separately in each block. Let be the sample covariance between two populations and in block . Now, (26)and (27)If we take each pair of populations in turn, we can write down a composite likelihood for : (28)where is a Gaussian density with mean (computed from by Equation 23) and variance evaluated at .

Measuring model fit. Finally, we wanted to define measures for how well the model fits the data. First, we define the matrix of residuals in this model, . These quantities are useful for visualization and fitting: (29)Positive residuals indicate pairs of populations where the model underestimates the observed covariance, and thus populations where the fit might be improved by adding additional edges. Negative residuals indicate pairs of populations where the model overestimates the observed covariance; these are a necessary outcome of having positive residuals, but can also sometimes be interpreted as populations that are forced too close together due to unmodeled migration elsewhere in the graph. These residuals can be used to define a measure of the fraction of the variance in that is explained by . Let us call this fraction : (30)where and . This quantity approximates the fraction of the variance in relatedness between populations that is accounted for by the model.

Implementation. We implemented an algorithm, called TreeMix, that searches for the graph that maximizes the composite likelihood in Equation 28. A search that enumerates all graphs is infeasible unless is very small, so to simplify the search we make the assumption that the history of the sampled populations is approximately tree-like. We thus start by searching for the maximum likelihood tree, taking an algorithmic approach similar to Felsenstein [30]. After building the tree, we fix the position of the root. (In the tree model the position of the root is not identifiable, as the evolution of allele frequencies along the tree is reversible under the Gaussian model when drift is assumed to be small. In a graph model, though the position of the root is partially identifiable, in all applications we assume that the position of the root is fixed using prior information about known outgroups). We then calculate the residual covariance matrix, , and add migration edges in a directed matter. First, we find the pairs of populations with the maximum residuals. We then attempt adding a migration edge between populations in the vicinity of each of the population pairs. For each attempted graph (or tree) topology, we optimize the branch lengths and migration edge weights (Methods). After finding the single migration edge that most increases the likelihood, we attempt a series of local changes to the graph structure (Methods). We then iterate over this procedure to add additional migration edges. In principle, migration edges could be added until they are no longer statistically significant (see the following paragraph). In our experience, however, we prefer to stop adding migration events well before this point so that the resulting graph remains interpretable.