Abstract A long-debated question concerns the fate of archaic forms of the genus Homo: did they go extinct without interbreeding with anatomically modern humans, or are their genes present in contemporary populations? This question is typically focused on the genetic contribution of archaic forms outside of Africa. Here we use DNA sequence data gathered from 61 noncoding autosomal regions in a sample of three sub-Saharan African populations (Mandenka, Biaka, and San) to test models of African archaic admixture. We use two complementary approximate-likelihood approaches and a model of human evolution that involves recent population structure, with and without gene flow from an archaic population. Extensive simulation results reject the null model of no admixture and allow us to infer that contemporary African populations contain a small proportion of genetic material (≈2%) that introgressed ≈35 kya from an archaic population that split from the ancestors of anatomically modern humans ≈700 kya. Three candidate regions showing deep haplotype divergence, unusual patterns of linkage disequilibrium, and small basal clade size are identified and the distributions of introgressive haplotypes surveyed in a sample of populations from across sub-Saharan Africa. One candidate locus with an unusual segment of DNA that extends for >31 kb on chromosome 4 seems to have introgressed into modern Africans from a now-extinct taxon that may have lived in central Africa. Taken together our results suggest that polymorphisms present in extant populations introgressed via relatively recent interbreeding with hominin forms that diverged from the ancestors of modern humans in the Lower-Middle Pleistocene.

It is now well accepted that anatomically modern humans (AMH) originated in Africa and eventually dispersed to all inhabited parts of the world. What is not known is the extent to which the ancestral population that gave rise to AMH was genetically isolated, and whether archaic hominins made a genetic contribution to the modern human gene pool. Answering these questions has important implications for understanding the way in which adaptations associated with modern traits were assembled in the human genome: do the genes of AMH descend exclusively from a single isolated population, or do our genes descend from divergent ancestors that occupied different ecological niches over a wider geographical range across and outside of the African Pleistocene landscape?

The introgression debate is typically framed in terms of interbreeding between AMH and Neandertals in Europe or other archaic forms in Asia. The opportunity for such hybridizations may have existed between 90 and 30 kya, after early modern humans dispersed from Africa and before archaic forms went extinct in Eurasia (1⇓⇓⇓–5). Recent genome-level analyses of ancient DNA suggest that a small amount of gene flow did occur from Neandertals into the ancestors of non-Africans sometime after AMH left Africa (6) and that an archaic “Denisovan” population contributed genetic material to the genomes of present-day Melanesians (7). Given recent fossil evidence, however, the greatest opportunity for introgression was in Africa, where AMH and various archaic forms coexisted for much longer than they did outside of Africa (5, 8–11). Indeed, the fossil record indicates that a variety of transitional forms with a mosaic of archaic and modern features lived over an extensive geographic area from Morocco to South Africa between 200 and 35 kya (12⇓⇓–15).

Although sequencing the Neandertal and Denisovan genomes has provided evidence that gene flow between archaic and modern humans is plausible, it has not aided efforts to determine the extent of introgression in African populations. Here we use a different strategy to address the question of ancient population structure in Africa. Using multilocus DNA sequence polymorphism data from extant Africans, we analyze patterns of divergence and linkage disequilibrium (LD) to detect the signature of archaic admixture (16⇓–18). Application of this approach to publicly available sequence data from the Environmental Genome Project found evidence of ancient population structure in both African and non-African populations (19). However, analyses of diversity in and around coding regions may be complicated by the effects of recent natural selection, which might contribute to unusual patterns of polymorphism. In this study we use a large resequencing dataset that includes 61 noncoding regions on the autosomes to test whether patterns of neutral polymorphism in three contemporary sub-Saharan African populations are better explained by archaic admixture. Although whole-genome polymorphism data are now available from hundreds of samples (20), they do not include individuals from African hunter-gatherer populations, which serve as important reservoirs of human genetic diversity. Our study includes two such populations (Biaka Pygmies and San), along with an agricultural population from West Africa (Mandenka). We use a model of historically isolated subpopulations (17, 21) to predict patterns of nucleotide variation expected as a consequence of no admixture (null hypothesis) vs. low levels of admixture (alternative hypothesis). We apply two complementary coalescent-based approaches, a two-population and a three-population model to test the null hypothesis, and then estimate three key parameters: the time of admixture (T a ), the ancestral split time (T 0 ), and the admixture proportion (a).

Discussion Our inference methods reject the hypothesis that the ancestral population that gave rise to AMH in Africa was genetically isolated and point to several candidate regions that may have introgressed from an archaic source(s). For example, we identified a ≈31.4-kb region within the 4qMB179 locus with highly diverged haplotypes, one of which is found at low frequency in several Pygmy groups in central Africa. We hypothesize that the unusual haplotype descends from an archaic DNA segment that entered the AMH population via admixture. The observed haplotype structure is highly unusual (P < 5 × 10−5), even when we account for recent population structure or uncertainty in the underlying recombination rate (Table S4). It is noteworthy that the two ends of the archaic haplotype correspond to recombinational hotspots in the 4qMB179 region (Fig. 3B), suggesting that an initially much longer block of archaic DNA was whittled down by frequent recombination in the hotspots. Both inferential methods also identified the 13qMB107 locus as a likely introgression candidate; however, only ≈7 kb of the surveyed region contains SNPs that are in high LD, all of which are found at the 5′ end of the sequenced region in two San individuals. To determine whether the length of the unusual pattern of SNPs extends beyond our sequenced region at 13qMB107, we examined public full genome sequence data (25). We identified a San individual (!Gubi) who carried one copy of the unusual 13qMB107 haplotype and noted a run of heterozygous sites that extended an additional ≈7 kb to the 5′ side of our sequenced region. Like the case of 4qMB179, the two ends of the unusual haplotype correspond to recombinational hotspots, and analysis of 13qMB107 yields an estimated divergence time of ≈1 Mya and a recent introgression time (≈20 kya) (Table 1). The geographic distribution of the introgressive variant at 18qMB60, a third candidate identified in the three-population model (Table 1), is very similar to that of 4qMB179, albeit consistently found at lower frequencies (Fig. 4). On the other hand, the distribution of the introgressive variant at 13qMB107 is distinguished from that of the other two candidate loci by its presence in the San and the southern African Xhosa, as well as in Mbuti from the Democratic Republic of Congo. Interestingly, the Mbuti represent the only population in our survey that carries the introgressive variant at all three candidate loci, despite the fact that no Mbuti were represented in our initial sequencing survey. Given that the Mbuti population is known to be relatively isolated from other Pygmy and neighboring non-Pygmy populations (26), this suggests that central Africa may have been the homeland of a now-extinct archaic form that hybridized with modern humans. We have relied on an indirect approach to detect ancient admixture in African populations because there are no African ancient DNA sequences to make direct comparisons with our candidate loci. As proof of principle that an indirect approach can be useful, we reexamined the RRM2P4 pseudogene on the X chromosome. Using a similar approximate-likelihood methodology, it was previously posited that a divergent allele at the pseudogene introgressed from an archaic taxon in Asia (27, 28). We compared human and Neandertal RRM2P4 sequences and found that the three derived sites that define the non-African basal lineage are shared with Neandertal (Fig. S4). Thus, we verified that this unusual human sequence, which is characterized by a deep haplotype divergence and a small basal clade, is indeed shared with an archaic form. Further genome-level (i.e., multilocus) analysis will also shed light on the process of archaic admixture, which is likely to be more complicated than we have modeled. For instance, the multimodal likelihood surface in Fig. 2 suggests that gene flow among strongly subdivided populations in Africa may characterize multiple stages of human evolution in Africa. Our results are consistent with earlier inferences supporting the role of archaic admixture in sub-Saharan Africa based on analyses of coding regions (19) and the Xp21.1 noncoding region (16). Although our estimates of isolation and admixture dates are tentative, the results point to relatively recent genetic exchange with an unknown archaic hominin that diverged from the ancestors of modern humans in the Lower-Middle Pleistocene and remained isolated for several hundred thousand years. Despite a fragmentary African fossil record, there are plenty of candidates for the source(s) of this introgression. Beginning ≈700 kya, fossil evidence from many parts of Africa indicate that Homo erectus was giving way to populations with larger brains, a change that was accompanied by several structural adjustments to the skull and postcranial skeleton (14). By ≈200 kya, individuals with more modern skeletal morphology begin to appear in the African record (8, 14). Despite these signs of anatomical and behavioral innovation, hominins with a combination of archaic and modern features persist in the fossil record across sub-Saharan Africa and the Middle East until after ≈35 kya (12, 14). Although there is currently a major debate about the meaning of this piecemeal or mosaic-like appearance of modern traits for taxonomic classification (12, 29), the evidence presented here and elsewhere suggests that long-separated hominin groups exchanged genes with forms that either were in the process of evolving fully modern features, or were already fully modern in appearance. The emerging geographic pattern of unusual variants discovered here suggests that one such introgression event may have taken place in central Africa (where there is a very poor fossil record). Interestingly, recent studies attest to the existence of Late Stone Age human remains with archaic features in Nigeria (Iwo Eleru) and the Democratic Republic of Congo (Ishango) (30⇓–32). The observation that populations from many parts of the world, including Africa, show evidence of introgression of archaic variants (6, 16, 19) suggests that genetic exchange between morphologically divergent forms may be a common feature of human evolution. If so, hybridization may have played a key role in the de novo origin of some our uniquely human traits (33).

Materials and Methods Samples and Regions Sequenced. DNA samples used in this study for resequencing were taken from publicly available cell lines administered by the Centre d’Étude du Polymorphisme Humain Human Genome Diversity Panel (34), while samples used for geographic surveys were described in refs. 22 and 27. Individual identifiers for each of these samples are described in Wall et al. (35). Our updated resequence dataset consists of 61 loci in autosomal intergenic regions (36), each ≈20 kb of primarily single-copy noncoding (i.e., putatively nonfunctional) DNA in regions of medium or high recombination (ρ ≥ 0.9 cM/Mb) at least 100 kb away from the nearest gene (37). We used a locus trio design, sequencing three fragments of ≈2 kb spaced evenly across the region (35). The sample was composed of ≈16 individuals from each population (with the exception of the San, which included nine samples), nearly all of which were males. Although the Hammer et al. (36) dataset includes 30 X-linked loci, we chose not to include them in the current analysis because of the much smaller number of X chromosomes and the need to make assumptions about sex-biased processes. Exact locations are available at http://hammerlab.biosci.arizona.edu/ArchadData/PNAS.archad.locusLocationInfo.xls, and genotyping assays and primers are available at http://hammerlab.biosci.arizona.edu/ArchadData/PNAS.primers.doc. Inferential Approach. Here we implement two types of models, denoted “two-population” and “three-population” model (Fig. 1). Because the two-population model is simpler, it has the advantage of using a broader array of summary statistics and allows evaluation over a finer grid of parameter space. The more complex three-population model is much more computationally expensive, yet has the advantage of considering all three sampled populations simultaneously. Our approximate-likelihood approach allows us to investigate the entire grid of parameter space. In contrast, full-likelihood methods require computationally intensive techniques (e.g., Markov chain Monte Carlo) that limit analysis of parameter space to regions near local maxima, whereas Bayesian methods suffer from the need to use priors that may be poorly justified in this study. Coalescent-Based Model Testing. Two-population model. For each pair of sub-Saharan African populations we consider the demographic model described in Fig. 1A and use a previously published composite-likelihood methodology (18, 19) to estimate parameters ψ = (g 1 , g 2 , T 1 , M) for growth, split time, and migration rate (see SI Materials and Methods for details of model and methodology). This method uses information from levels of diversity and the joint frequency spectrum, but not LD for estimating (composite) likelihoods. For each pair of African populations, we then use the parameters estimated above as a null model and test for the presence of additional ancient population structure (19). If archaic admixture occurs at a locus, then “archaic” SNPs on introgressed sequences would be in strong pairwise LD. Simulations suggest that both the number of such SNPs and the total distance spanned by such SNPs are elevated when archaic admixture occurs (21). To exploit these two observations and to account for the effects of intragenic recombination (18), we calculate, for each locus, a statistic, S*, shown to be sensitive to archaic admixture (18). S* looks for population-specific SNPs that are in strong LD (e.g., the square correlation r2≈1). We determine the significance of S* values from the actual data by running simulations using the previously estimated demographic parameters to obtain a distribution of S* values under the null hypothesis of no (archaic) admixture. Significantly high S* values are interpreted as departures from the null model in the direction of some unknown ancient population structure. The P values across loci are combined (assuming independence) using the method of Fisher (38). Three-population model. The three-population model has nine parameters, three of which (T a , T 0 , and a) are the key parameters for inference (SI Materials and Methods). Ancestral recombination graphs under a grid of parameter space are created using the software tool ms (39) to approximate distributions under several tolerances or bin sizes. All inference is based on approximate-likelihood computations for the three key summary statistics, D 1 , D 2 , D 3 , as described in SI Materials and Methods. Because the parameter space of our null model of no admixture is a subspace of the entire parameter space, we can make inference using a likelihood ratio test. Approximate-likelihood surfaces are generated in two stages. First, we simulate 5,000 ARGs over a coarse grid of parameter space. This allows us to reduce the parameter space to the null space and to those values within the 99% CI of the coarse grid estimates. We then run simulations using 100,000 ARGs for each parameter value, storing approximations of the summary statistic distribution using reduced tolerances. This allows us to perform bootstrap and goodness of fit (GOF) tests for larger tolerances by adding empirical probabilities from the simulations. To test the null model of no admixture, we used tolerances of δ 1 = 0.06, δ 2 = 0.05, and δ 3 = 2 to estimate the approximate likelihood for each locus using local-scale estimates of recombination rate, which yielded an estimate of T a = 40 kya, T 0 = 750 kya, and a = 1% and a log-likelihood ratio of −2.01. To estimate the significance of this value we drew 10,000 points from the maximum-likelihood location under H 0 using our 3D histogram and tabulated the probability of observing a log-likelihood ratio as small (or smaller) than −2.01 with an archaic split time no more recent than 750 kya. To better characterize the alternative model we used a two-tiered approach. First we examined a more refined grid of parameter space, and then we ran two levels of simulations. In the initial pass we generated 5,000 ARGs per parameter value, and in the second pass we took all of the values within the 99% CI and computed 3D histograms of summary statistics for each parameter value in the manner described above. We then used a parametric bootstrap to address GOF as described in SI Materials and Methods. Estimating recombination rates. The three-population methodology is highly sensitive to recombination rates. For this reason, we chose to estimate local-scale recombination and favored these estimates in our inference over the much larger-scale estimates of Kong et al. (37). To this end, we used Phase 2.1 (40, 41) using two qualitatively different strategies. The first uses HAPMAP Yoruba data (42), and the second estimates ρ using the major clade of each locus in our own resequencing data (SI Materials and Methods). We chose the major clade because recently introgressed archaic lineages, if they exist, serve as a barrier to recombination and thus will bias estimates of ρ downward. Locus-specific analyses. To infer the split and admixture times for individual introgressive candidates, we calculate the probability of observing the number of completely correlated sites for the relevant population data (e.g., the Biaka for 4qMB179), assuming panmixia, as a function of the underlying scaled recombination parameter ρ (Table S4). To estimate admixture time (T a ), we first estimate the minimum length of the diverged haplotype. Using the genetic map of Kong et al. (37), we then estimated the total recombination rate for the diverged haplotype. Given an admixture event g generations ago, the distribution of lengths of inherited chromosomal segments roughly follows an exponential distribution with mean genetic distance 1/g. It follows that the maximum-likelihood estimate for the time of admixture is T a = 1/g generations ago, with 95% CI (0.0253 T a –3.69 T a ) generations. We assume a mean generation time of 25 y. Note that we have not accounted for the additional uncertainty in estimating ρ. We use polymorphism data along with an outgroup (orang-utan) sequence to estimate the split time T 0 . We assume that exactly one archaic sequence was introduced into the modern gene pool, leaving the observed number of descendant sequence(s) in the divergent haplotypes. Our general approach was to tabulate polymorphic sites and fixed differences, noting whether the SNPs were polymorphic (or fixed) in the archaic or the modern sequences. We then run coalescent simulations (39) to estimate the likelihood of observing the actual numbers of fixed differences and polymorphisms of different categories, as a function of T 0 , N e, , μ (the mutation rate), ρ, and T a . By simulating over a grid of values with increments 0.25 Myr for T 0 , 1,000 for N e , 1 × 10−9/bp for μ, 2.5 × 10−9 for ρ, and 0.04 N e, generations for T α , we estimate a profile likelihood curve for T 0 . A total of 104 replicates for each parameter combination were sufficient to accurately estimate the likelihood.

Acknowledgments We thank J. Cahill and collaborators, including G. Destro-Bisol, T. Jenkins, H. Soodyall, and L. Louie who donated DNA samples. This research was funded by National Science Foundation HOMINID Grant BCS-0423670 (to M.F.H. and J.D.W.).