We then used PLINK's inference of the fraction of single-marker IBD (Z0, Z1, and Z2; [58] ) to identify very close relatives, finding 25 pairs that are first cousins or closer (including duplicated samples), and excluded one individual from each pair. We grouped samples into populations mostly by reported country, but also used reported language in a few cases. Because of the large Swiss sample, we split this group into three by language: French-speaking (CHf), German-speaking (CHd), or other (CH). Many samples reported grandparents from Yugoslavia; when possible we assigned these to a modern-day country by language, and when this was ambiguous or missing, we assigned these to “Yugoslavia.” Most samples from the United Kingdom reported this as their country of origin; however, the few that reported “England” or “Scotland” were assigned this label. This left us with 2,257 individuals from 40 populations; for sample sizes, see Table 1 . Table S1 further breaks this down, and unambiguously gives the composition of each population. Physical distances were converted to genetic distances using the hg36 map, and the average human generation time was taken to be 30 years [39] .

We used the two European subsets of the POPRES dataset—the CoLaus subset, collected in Lausanne, Switzerland, and the LOLIPOP subset, collected in London, England; the dataset is described in [34] . Those collected in Lausanne reported parental and grandparental country of origin; those collected in London did not. We followed [13] in assigning each sample to the common grandparental country of origin when available, and discarding samples whose parents or grandparents were reported as originating in different countries. We took further steps to restrict to individuals whose grandparents came from the same geographic region, first performing principal components analysis on the data using SMARTPCA [57] , and excluding 41 individuals who clustered with populations outside Europe (the majority of such were already excluded by self-reported non-European grandparents). These individuals certainly represent an important part of the recent genetic ancestry of Europe, but are excluded because we aim to study events stemming from older patterns of gene flow, and because we do not model the whole-genome dependencies in recently admixed genomes.

(A) Bias in inferred length with lines x = y (dotted) and a loess fit (solid). Each point is a segment of true IBD (copied between individuals), showing its true length and inferred length after postprocessing. Color shows the number of distinct, nonoverlapping segments found by BEAGLE, and the length of the vertical line gives the total length of gaps between such segments that BEAGLE falsely inferred was not IBD (these gaps are corrected by our postprocessing). (B) Estimated false positive rate as a function of length. Observed rates of IBD blocks, per pair and per cM, are also displayed for the purpose of comparison. “Nearby” and “Distant” means IBD between pairs of populations closer and farther away than 1,000 km, respectively. Below, the estimated power as a function of length (black line), together with the parametric fit c(x) of equation (1) (red dotted curve).

To find blocks of IBD, we used fastIBD (implemented in BEAGLE; [31] ), which records putative genomic segments shared IBD by pairs of individuals, along with a score indicating the strength of support. As suggested by the authors, in all cases we ran the algorithm 10 times with different random seeds, and postprocessed the results to obtain IBD blocks. Based on our power simulations described below, we modified the postprocessing procedure recommended by [31] to deal with spurious gaps or breaks introduced into long blocks of IBD by low marker density or switch error, as follows: We called IBD segments by first removing any segments not overlapping a segment seen in at least one other run (as suggested by [31] , except with no score cutoff); then merging any two segments separated by a gap shorter than at least one of the segments and no more than 5 cM long; and finally discarding any merged segments that did not contain a subsegment with score below 10 −9 . As shown in Figure 6 , this resulted in a false positive rate of between 8–15% across length categories, and a power of at least 70% above 1 cM, reaching 95% by 4 cM. After postprocessing, we were left with 1.9 million IBD blocks, 1 million of which were at least 2 cM long (at which length we estimate 85% power and a 10% false positive rate).

We then tested the extent to which the effect of sample size varied by population, for IBD blocks in several length categories (binning block lengths at 1, 2, 4, and 10 cM). Suppose that F xy is the number of IBD blocks found between populations x and y in the analysis of the full dataset, and S xy is the number found in the analysis of the smaller dataset (counted between the same individuals each time). We then assume that F xy and S xy are Poisson with mean and , respectively, so that conditioned on N xy = F xy +S xy (the total number of blocks), S xy is binomial with parameters N xy and . We are looking for deviations from the null model under which the effect of smaller sample size affects all population pairs equally, so that for some constant C. We therefore fit a binomial GLM [64] with a logit link, with terms corresponding to each population—in other words: We found statistically significant variation by population (i.e., several nonzero α x ), but all effect sizes were in the range of 0–4%; estimated parameters are listed in Table S2 . Notably, the coefficient corresponding to the French-speaking Swiss (the population with the largest change in sample size) was fairly small. We also fit the model not assuming additivity when x = y—that is, adding coefficients α xx to the formula for p xx —but these were not significant. These results suggest that sample size variation across the POPRES data has only minor effects on the distribution of IBD blocks shared across populations.

Finally, one concern is that as fastIBD calls IBD based on a model of haplotype frequencies in the sample, it may be unduly affected by the large-scale sample size variation across the POPRES sample. In particular, the French-speaking Swiss sample is very large, which could lead to systematic bias in calling IBD in populations closely to the Swiss samples. To investigate this, we randomly discarded 745 French-speaking Swiss (all but 100 of these), and a random sampling of the remaining populations (removing 812 in total, leaving 1,445). We then ran BEAGLE on chromosome 1 of these individuals, postprocessing in the same way as for the full sample. Reassuringly, there was high concordance between the two—we found that 98% (95%) of the blocks longer than 2 cM found in the analysis with the full dataset (respectively, with the subset) were found in both analyses. Overall, more blocks were found by the analysis with the smaller dataset; however, by adjusting the score cutoff by a fixed amount, this difference could be removed, leaving nearly identical length distributions between the two analyses. This is a known attribute of the fastIBD algorithm, and can alternatively be avoided by adjusting the model complexity (S. Browning, personal communication).

We found that overall, the false positive rate was around 1/10th of the observed rate, except for very long blocks (longer than 5 cM or so, where it was close to zero), and for very short blocks (less than 1 cM, where it approached 0.4). As fastIBD depends on estimating underlying haplotype frequencies, it is expected to have a higher false positive rate in populations that are more differentiated from the rest of the sample. There was significant variation in false positive rate between different populations, with Spain, Portugal, and Italy showing significantly higher false positive rates than the other populations we examined (see Figure S11 ). This variation was significant only for blocks shorter than 2 cM across all population pairs, with the exception of pairs of Portuguese individuals, where the upwards bias may be significant as high as 4 cM.

To estimate the false positive rate, we randomly shuffled segments of diploid genome between individuals from the same population (only those 12 populations with at least 19 samples) so that any run of IBD longer than about 0.5 cM would be broken up among many individuals. Specifically, as we read along the genome we output diploid genotypes in random order; we shuffled this order by exchanging the identity of each output individual with another at independent increments chosen uniformly between 0.1 and 0.2 cM. This ensured that no output individual had a continuous run of length longer than 0.2 cM copied from a single input individual, while also preserving linkage on scales shorter than 0.1 cM. The results are shown in Figure 6B ; from these we estimate that the mean density of false positives x cM long per pair and per cM is approximately: (2)a parametric form again chosen by examination of the data and fit by maximum likelihood.

We also need a reasonably accurate assessment of our bias and false positive rates for our inference of numbers of genetic ancestors from the IBD length spectrum. Although the estimated IBD lengths were approximately unbiased, we fit a parametric model to the relationship between true and inferred lengths after removing inferred blocks less than 1 cM long. A true IBD block of length x is missed entirely with probability 1–c(x), and is otherwise inferred to have length ; with probability γ(x), the error is positive; otherwise it is negative and conditioned to be less than x. In either case, is exponentially distributed; if , its mean is , while if , its (unconditional) mean is 1/λ − (x). The parametric forms were chosen by examination of the data; these are, with final parameter values: (1)where z + = max(z,0). The parameters were found by maximum likelihood, using constrained optimization as implemented in the R package optim [59] separately on three independent pieces: the parameters in c(x) and γ(x), the parameters in λ − , and finally the parameters in λ + ; the fit is shown in Figure S10 .

If we are to have a reasonable false positive rate, we must accept imperfect power. Power will also vary with the density and informativeness of markers and length of segment considered. For example, it is intuitive that segments of genome containing many rare alleles are easier to identify as IBD. Conversely, rare immigrant segments from a population with different allele frequencies may, if they are shared by multiple individuals within the population, cause higher false positive rates. For these reasons, when estimating statistical power and false positive rate, it is important to use a dataset as similar to the one under consideration as possible. Therefore, to determine appropriate postprocessing criteria and to estimate our statistical power, we constructed a dataset similar to the POPRES with known shared IBD segments as follows: we copied haploid segments randomly between 60 trio-phased individuals of European descent (using only one from each trio) from the HapMap dataset (haplotypes from release #21, 17/07/06 [63] ), reoriented alleles to match the strand orientation of POPRES, substituted these for 60 individuals from Switzerland in the POPRES data, and ran BEAGLE on the result as before. These segments were copied between single chromosomes of randomly chosen individuals, for random lengths 0.5–20 cM, with gaps of at least 2 cM between adjacent segments and without copying between the same two individuals twice in a row. When copying, we furthermore introduced genotyping error by flipping alleles independently with a probability of .002 and marking the allele missing with a probability of .023 (error rates were determined from duplicated individuals in the sample as given by [34] ). An important feature of the inferred data was that BEAGLE often reported true segments longer than about 5 cM as two or more shorter segments separated by a short gap, which led us to merge blocks as described above.

All methods to identify haplotypic IBD rely on identifying long regions of near identical haplotypes between pairs of individuals (referred to as identical by state, IBS). However, long IBS haplotypes could potentially also result from the concatenation of multiple shorter blocks of true IBD. While such runs can contain important information about deeper population history (e.g., [3] , [24] ), we view them as a false positives as they do not represent single haplotypes shared without intervening recombination. The chance of such a false positive IBD segment decreases as the genetic length of shared haplotype increases. However, the density of informative markers also plays a role, because such markers are necessary to infer regions of IBS.

To look for regions of unusual levels of IBD and to examine our assumption of uniformity, we compared the density of IBD tracts of different lengths along the genome, in Figure S1 . To do this, we first divided blocks up into nonoverlapping bins based on length, with cutpoints at 1, 2.5, 4, 6, 8, and 10 cM. We then computed at each SNP the number of IBD blocks in each length bin that covered that site. To control for the effect of nearby SNP density on the ability to detect IBD, we then computed the residuals of a linear regression predicting number of overlapping IBD blocks using the density of SNPs within 3 cM. To compare between bins, we then normalized these residuals, subtracting the mean and dividing by the standard deviation; these “z-scores” for each SNP are shown in Figure S1 .

We noted repeated patterns of IBD sharing across multiple populations (seen in Figure S3 ), in which certain sets of populations tended to show similar patterns of sharing. To quantify this, we computed correlations between mean numbers of IBD blocks; in Figure S7 , we show correlations in numbers of blocks of various lengths. Specifically, if I(x,y) is the mean number of IBD blocks of the given length shared by an individual from population x with a (different) individual from population y, there are n populations, and , then Figure S7 shows for each x and y: (3)the (Pearson) correlation between I(x,z) and I(y,z) ranging across . Other choices of block lengths are similar, although shorter blocks show higher overall correlations (due in part to false positives) and longer blocks show lower overall correlations (as rates are noisier, and sharing is more restricted to nearby populations). The geographic groupings of Table 1 were then chosen by visual inspection.

We assessed the overall degree of substructure within each population, by measuring, for each x and y, the degree of inhomogeneity across individuals of population x for shared ancestry with population y. We measured inhomogeneity by the standard deviation in number of blocks shared with population y, across individuals of population x. We assessed the significance by a permutation test, randomly reassigning each block shared between x and y to a individual chosen uniformly from population x, and recomputing the standard deviation, 1,000 times. (If there are k blocks shared between x and y and m individuals in population x, this is equivalent to putting k balls in m boxes, tallying how many balls are in each box, and computing the sample standard deviation of the resulting list of numbers.) Note that some degree of inhomogeneity of shared ancestry is expected even within randomly mating populations, due to randomness of the relationship between individuals in the pedigree. These effects are likely to be small if the relationships are suitably deep, but this is still an area of active research [50] , [65] . The resulting p values are shown in Figure S2 . We did not analyze these in detail, particularly as we had limited power to detect substructure in populations with few samples, but note that a large proportion (47%) of the population pairs showed greater inhomogeneity than in all 1,000 permuted samples (i.e., p<.001). Some comparisons even with many samples in both populations (where we have considerable power to detect even subtle inhomogeneity) showed no structure whatsoever—in particular, the distribution of numbers of Italian IBD blocks shared by Swiss individuals is not distinguishable from Poisson, indicating a high degree of homogeneity of Italian ancestry across Switzerland.

Inferring Ages of Common Ancestors

Here, our aim is to use the distribution of IBD block lengths to infer how long ago the genetic common ancestors were alive from which these IBD blocks were inherited. A pair of individuals who share a block of IBD of genetic length at least x have each inherited contiguous regions of genome from a single common ancestor n generations ago that overlap for length at least x. If we start with the population pedigree, those ancestors from which the two individuals might have inherited IBD blocks are those that can be connected to both by paths through the pedigree. The distribution of possible IBD blocks is determined by the number of links (i.e., the number of meioses) occurring along the two paths.

Throughout the article we informally often refer to ancestors living a certain “number of generations in the past” as if humans were semelparous with a fixed lifetime. Keeping with this, it is natural to write the number of IBD blocks shared by a pair of individuals as the sum over past generations of the number of IBD blocks inherited from that generation. In other words, if N(x) is the number of IBD blocks of genetic length at least x shared by two individual chromosomes, and N n (x) is the number of such IBD blocks inherited by the two along paths through the pedigree having a total of n meioses, then . Therefore, averaging over possible choices of pairs of individuals, the mean number of shared IBD blocks can be similarly partitioned as: (4)In each successive generation in the past, each chromosome is broken up into successively more pieces, each of which has been inherited along a different path through the pedigree, and any two such pieces of the two individual chromosomes that overlap and are inherited from the same ancestral chromosome contribute one block of IBD. Therefore, the mean number of IBD blocks coming from n/2 generations ago is the mean number of possible blocks multiplied by the probability that a particular block is actually inherited by both individuals from the same genealogical ancestor in generation n/2. Allowing for overlapping generations, the first part we denote by K(n,x), the mean number of pieces of length at least x obtained by cutting the chromosome at the recombination sites of n meioses, and the second part we denote by μ(n), the probability that the two chromosomes have inherited at a particular site along a path of total length n meioses (e.g., their common ancestor at that site lived n/2 generations ago). Multiplying these and summing over possible paths, we have that: (5)that is, the mean rate of IBD is a linear function of the distribution of the time back to the most recent common ancestor averaged across sites. The distribution μ(n) is more precisely known as the coalescent time distribution [66],[67], in its obvious adaptation to population pedigrees. As a first application, note that the distribution of ages of IBD blocks above a given length x depends strongly on the demographic history—a fraction of these are from paths n meioses long.

Furthermore, it is easy to calculate that for a chromosome of genetic length G: (6)assuming homogeneous Poisson recombination on the genetic map (as well as constancy of the map and ignoring the effect of interference, which is reasonable for the range of we consider). The mean number of IBD blocks of length at least x shared by a pair of individuals across the entire genome is then obtained by summing equation (5) across all chromosomes, and multiplying by 4 (for the four possible chromosome pairs).

Equations (5) and (6) give the relationship between lengths of shared IBD blocks and how long ago the ancestor lived from whom these blocks are inherited. Our goal is to invert this relationship to learn about μ(n), and hence the ages of the common ancestors underlying our observed distribution of IBD block lengths. To do this, we first need to account for sampling noise and estimation error. Suppose we are looking at IBD blocks shared between any of a set of n p pairs of individuals, and assume that N(y), the number of observed IBD blocks shared between any of those pairs of length at least y, is Poisson distributed with mean n p M(y), where: (7) (8)Here the false positive rate f(z), power c(x), and the components of the error kernel R(x,z) are estimated as above, with parametric forms given in equations (2) and (1). The Poisson assumption has been examined elsewhere (e.g., [27],[49]) and is reasonable because there is a very small chance of having inherited a block from each pair of shared genealogical ancestors; there a great number of these, and if these events are sufficiently independent, the Poisson distribution will be a good approximation (see, e.g., [68]). If this holds for each pair of individuals, the total number of IBD blocks is also Poisson distributed, with M given by the mean of this number across all constituent pairs. (Note that this does not assume that each pair of individuals has the same mean number, and therefore does not assume that our set of pairs are a homogeneous population.)

We have therefore a likelihood model for the data, with demographic history (parametrized by ) as free parameters. Unfortunately, the problem of inferring μ is ill-conditioned (unsurprising due to its similarity of the kernel (6) to the Laplace transform, see [69]), which in this context means that the likelihood surface is flat in certain directions (“ridged”): for each IBD block distribution N(x), there is a large set of coalescent time distributions μ(n) that fit the data equally well. A common problem in such problems is that the unconstrained maximum likelihood solution is wildly oscillatory; in our case, the unconstrained solution is not so obviously wrong, since we are helped considerably by the knowledge that μ≥0. For reviews of approaches to such ill-conditioned inverse problems, see, for example, [40] or [70]; the problem is also known as “data unfolding” in particle physics [71]. If one is concerned with finding a point estimate of μ, most approaches add an additional penalty to the likelihood, which is known as “regularization” [72] or “ridge regression” [73]. However, our goal is parametric inference, and so we must describe the limits of the “ridge” in the likelihood surface in various directions (which can be seen as maximum a posteriori estimates under priors of various strengths).

To do this, we first discretize the data, so that N i is the number of IBD blocks shared by any of a total of n p distinct pairs of individuals with inferred genetic lengths falling between x i –1 and x i . We restrict to blocks having a minimum length of 2 cM long, so that x 0 = 2. To find a discretization so that each N i has roughly equal variance, we choose x i by first dividing the range of block lengths into 100 bins with equal numbers of blocks falling in each, discard any bins longer than 1 cM, and divide the remainder of the range up into 1 cM chunks. To further reduce computational time, we also discretize time, effectively requiring μ n to be constant on each interval , with , for 1≤j≤360—so the resolution is finest for recent times, and the maximum time depth considered is 6,660 meioses, or 99,900 years ago. (The discretization and upper bound on time depth were found to not affect our results.) We then compute by numerical integration (using the function integrate in R) the matrix L discretizing the kernel given in equation (7), so that is the kernel that applied to μ gives the mean number of true IBD blocks per pair observed with lengths between x i –1 and x i , and is the mean number of false positives per pair with lengths in the same interval. We then sum across chromosomes, as before. The likelihood of the data is thus: (9)To the (negative) log likelihood we add a penalization γ, after rescaling by the number of pairs n p (which does not affect the result but makes penalization strengths comparable between pairs of populations), and use numerical optimization (the L-BFGS-B method in optim; [59]) to minimize the resulting functional (which omits terms independent of μ): (10)Often we will fix the functional form of the penalization and vary its strength, so that γ(μ) = γ 0 z(μ), in which case we will write for .

For instance, the leftmost panels in Figure 4 show the minimizing solutions μ for γ(μ) = 0 (no penalization) and for (“roughness” penalization). Because our aim is to describe extremal reasonable estimates μ, in this and in other cases, we have chosen the strength of penalization γ 0 to be “as large as is reasonable,” choosing the largest γ 0 such that the minimizing μ has log likelihood differing by no more than two units from the unconstrained optimum. This choice of cutoff can be justified as in [74], gave quite similar answers to other methods, and performed well on simulated population histories (see Text S1). This can be thought of as taking the strongest prior that still gives us “reasonable” maximum a posteriori answers. Note that the optimization is over nonnegative distributions μ also satisfying (although the latter condition does not enter in practice).

We would also like to determine bounds on total numbers of shared genetic ancestors who lived during particular time intervals, by determining, for example, the minimum and maximum numbers of such ancestors that are consistent with the data. Such bounds are shown in Figure 5. To obtain a lower bound for the time period between n 1 and n 2 generations, we penalized the total amount of shared ancestry during this interval, using the penalizations , and choosing to give a drop of 2 log likelihood units, as described above. The lower bound is then the total amount of coalescence for minimizing . The upper bound is found by penalizing total shared ancestry outside this interval—that is, by applying the penalization . It is almost always the case that lower bounds are zero, since there is sufficient wiggle room in the likelihood surface to explain the observed block length distribution using peaks just below n 1 and above n 2 . Examples are shown in Figure S14. On the other hand, upper bounds seem fairly reliable.

In the above we have assumed that the minimizer of is unique, thus glossing over, for example, finding appropriate starting points for the optimization. In practice, we obtained good starting points by solving the natural approximating least-squares problem, using quadprog [75] in R. We then evaluated uniqueness of the minimizer by using different starting points, and found that if necessary, adding only a very small penalization term was enough to ensure convergence to a unique solution.

Testing the method. To test this method, we implemented a whole-genome coalescent IBD simulator, and applied our inference methods to the results under various demography scenarios. We also used these simulations to evaluate the sensitivity of our method to misestimation of power or false positive rates. The simulations, and the results, are described in Text S1; in all cases, the simulations showed that the method performed well to the level of uncertainty discussed throughout the text and confirmed our understanding of the method described above. We also found that misestimation of false positive rate only affects estimated numbers of common ancestors by a comparable amount, and that misestimation of blocks less than 4 cM long mostly affects estimates older than about 2,000 years. Therefore, if our false positive rates above 2 cM are off by 10% (the range that seems reasonable), which would change our estimated numbers of blocks by about 1%, this would only change our estimated numbers of shared ancestors by a few percent.