Dataset and genetic diversity

We collected 21 samples from the four major continental African linguistic groups that belong to 15 different African populations which are either agriculturalists or hunter-gatherers (Fig. 1a). In addition, we included four Eurasian samples for this study. Whole-genome sequencing of the 25 male individuals was conducted on Illumina sequencing platforms. Nine samples were newly sequenced for this project while the whole-genome shotgun read data was already published for the remaining 16 individuals. All samples were paired-end sequence at deep coverage (21–47x) (Table 1, Additional file 1: Table S1.2).

Fig. 1 Samples, genetic diversity, and runs of homozygosity. a Geographical, linguistic and life-style distribution of African individuals analyzed. b On the top, pairwise differences per kbp between individuals. Each line corresponds to the genetic differences of a specific individual to the rest of the samples. The line color corresponds to the label color of the individual in the x axis. The value given for the same individual is counted considering differences between its two chromosomes. On the bottom, total length of runs of homozygosity per individual. In blue, smaller lengths (from 0.5 to 1 Mbp); in green, intermedium lengths (from 1 to 1.5 Mbp) and in orange, the largest windows (bigger than 1.5 Mbp), the latter are a sign of inbreeding at population or individual level Full size image

Table 1 Samples and sequencing statistics Full size table

We detected a total of 12.72 million SNPs in 2 Gbp of callable genome (Additional file 1: Table S2.1). We validated the SNP calling of 21 samples by comparing their genotypes with the ones determined from SNP arrays of these individuals. Twelve HGDP samples were evaluated considering the genotypes generated on an Illumina 650Y array, while the nine genuinely sequenced for this project were genotyped in an Affymetrix’s Genome-Wide Human SNP array 6.0. On average, we achieved a genotype sensitivity of 99.67% for the autosomes, 99.56% for the X chromosome, and a heterozygous sensitivity of 99.37% for the HGDP samples. For the other nine individuals, we achieved an overall genotype sensitivity of 98.70% for the autosomes and 99.22% for the X chromosome. The heterozygous sensitivity for these samples is on average 97.25%.

Hunter-gatherers present the highest genetic diversity of all populations, with Khoisan having greater amount of genetic differences than Pygmies (Fig. 1b top, Additional file 1: Figure S4.1). The four Khoisan samples show similar measures of genetic differences to non-Khoisan samples even belonging to three different groups. Pygmies do not form a single cluster; instead, the Baka Pygmy, in comparison with Mbuti Pygmies, displays less genetic differences to other sub-Saharan and North African populations. Sub-Saharan agriculturalist individuals share highly similar values of genetic diversity relative to all other samples, with lower levels than the ones observed in hunter-gatherers but not as reduced as the non-African samples. The only exception is the Toubou individual, who also maintains similar genetic distance to other sub-Saharan samples but is genetically closer to North African and non-African samples. As expected, North African samples are genetically closer to non-African samples than to sub-Saharan individuals, showing a considerable reduction of genetic diversity.

We determined long homozygous regions, or runs of homozygosity (ROH), of at least 0.5, 1, and 1.5 Mbp of callable genome in each sample (Fig. 1b bottom, Additional file 1: Figure S4.2). Overall, the total length of ROH within a genome depends largely on the geographical origin of the individual; this is, relatively similar values are observed within continents while the amount increase as the distance to Africa gets bigger [54]. However, long ROH over 1.5 Mbp do not follow this geographical tendency. Instead, those segments are more frequent in populations in which isolation and consanguineous unions are more common. We observed that sub-Saharan agriculturalists present the lowest amounts of ROH, whereas both Khoisan and Pygmies show higher levels of ROH that are closer to the ones found in North African or Eurasian populations (Fig. 1b bottom). Moreover, there are three samples (Saharawi, Toubou, and Yoruba_HGDP00927) as well as almost all hunter-gatherers with long ROH, which might indicate in-breeding at the population or individual level.

Genetic ancestries and gene flow in African individuals

We explored the correspondence between genetic and geographic diversity in our African samples (Additional file 1: Figure S5.1). We obtained a significant correlation between the first two dimensions of a multidimensional scaling analysis from a genetic distance matrix and the coordinates of the sampled individuals in an African map (R = 0.58; p value based on 1000 replications = 0.003. Removing Bantu individuals, R = 0.655; p value based on 1000 replications = 0.001). This correlation suggests that genetics tends to fit the geographic location of the sampled individuals. In fact, we observed that genetic differentiation tends to increase monotonically with geographic distance between individuals (Additional file 1: Figure S5.2), a pattern that is consistent with a main genetic gradient among African populations. Finally, by means of a Bearing procedure [55], we found that the genetic differentiation in the African continent is in the north-west to the south-east axis (Additional file 1: Figure S5.3). This direction is similar to the north to south angle described by [56] using F st -based distances and SNP microarray data and is consistent with the Sahara desert acting as a genetic barrier between populations at both sides [56]. The fact that our pattern is somehow rotated could be explained by the particular geographical sampling scheme of our study, which tends to be on the north-west/south-east spatial axis (correlation between latitude and longitude of our sampled locations = − 0.536, p value = 0.012).

To define the genetic variation and structure in our dataset, we applied a principal component analysis (PCA) and ran ADMIXTURE [57]. For ADMIXTURE, in order to have more representative samples per population, we downloaded the “Bushman” data library from Galaxy [18, 58]. A total of 374,195 SNPs in 745 samples (the 25 of this study and an additional set of 720 samples from the array that belong to African, European, and Asian populations) were analyzed. We found that seven is the best-supported number of ancestral populations for our data (Additional file 1: Figure S6.2). We named each ancestry after the population/region with the highest proportion of each specific ancestry.

Overall, results from both analyses suggest that African populations can be clustered in four major genetic groups: Khoisan, Pygmy, sub-Saharan agriculturalist, and North Africa (Fig. 2). Consistent with the highest amount of differences observed (Fig. 1b), we found the maximum genetic variance was found between Khoisan and Eurasian populations. With the exception of the Baka individual, the other hunter-gatherer samples in our dataset are mostly represented by a single ancestry; however, it should be noted that the general picture for hunter-gatherers is more complex, with mixed ancestries for most populations (Additional file 1: Figure S6.3). On the other hand, most sub-Saharan agriculturalist individuals present some hunter-gatherer ancestry. The proportion is mainly related to the geographic distance between mixed populations. Dinkas, South African, and West African Bantus present the highest proportions of hunter-gatherer ancestries, and they are geographically the closest populations to Mbuti, Khoisan, and Baka, respectively. The East African Bantu, Laal, and Mandenka individuals show lower proportions of hunter-gatherer ancestries, with values following a dwindling gradient that is concordant with the ascending distance to the Mbuti Pygmy location. Finally, North African samples are closer to Eurasian populations than to any sub-Saharan populations, implying that the Sahara Desert might have represented a major barrier within African populations.

Fig. 2 Principal component analysis (PCA) and ADMIXTURE. a First two components of a PCA, percentage of explained variance shown in axis; African samples are grouped in four major genetic ancestries, representative samples of each ancestry are shown with a circle colored with its correspondent main genetic ancestry estimated in b, North Africans and African samples not circled might be heavily admixed according to b; b ADMIXTURE plot for the 25 samples in our dataset; the seven ancestries are named according to individuals that have almost exclusively a given ancestry. The plot for the remaining 705 samples is shown in Additional file 1: Figure S6.3 Full size image

To formally test admixture, we applied the D-statistics test [59] addressing two scenarios: the admixture between hunter-gatherer populations and their respective geographically surrounding agriculturalist populations (South African Bantu for Khoisans; Laal, Toubou, Dinka, and Eastern Bantu for Mbuti Pygmies; Yoruba and Western Bantu for Baka Pygmies), and the putative gene flow from west Eurasian to African populations. Additionally, we evaluated the latter scenario by calculating F 4 -ratio estimates [59], which provide accurate proportions of European ancestry into African populations. The ratios we constructed were f4(Han, Yoruba; X, Chimp)/f4(Han, Yoruba; French, Chimp), being X a hunter-gatherer population, and f4(Sardinian, Han; X, Yoruba)/f4(Sardinian, Han; French, Yoruba) when X refers to other African groups.

We found clear evidence of admixture between Khoisan populations and the South African Bantu individual, as well as between Dinka and Mbuti Pygmies, as this was consistently observed in several comparisons made using different African populations (Additional file 1: Tables S6.1–2). We also detected signatures of gene flow between Mbuti Pygmies and both Chadian individuals (Laal and Toubou), although with lower significance (Additional file 1: Table S6.2). By contrast, East African Bantu, West African Bantu, or Yoruba populations show no evidence of gene flow with their neighbors, Mbuti and Baka Pygmies (Additional file 1: Tables S6.2–3).

As expected, evidence for admixture between west Eurasians (represented by the French sample) and North African populations was formally identified with the D-statistics test (Additional file 1: Table S6.4). We then estimated an F 4 -ratio [29, 59] and obtained a significant proportion of the Eurasian component present in North African populations, with values as high as 84.9% for the Saharawi individual and 76.0% for the Libyan sample (Additional file 1: Table S6.5). Two other northeastern sub-Saharan populations (Toubou and East African Bantu) also stood out with highly significant D-statistics values, although of lower magnitude. This is concordant with an estimated west Eurasian ancestry proportion found of 31.4% and 14.9%, respectively (Additional file 1: Tables S6.4–5). Finally, the three Khoisan groups present significant small proportions (3.83–4.11%) of Eurasian ancestry. This signature, which was estimated with the F 4 -ratio, was not detectable by the D-statistics test (Additional file 1: Tables S6.4–5).

Effective population size over time

To unravel the ancient demographic history of the African populations that are present in our data set, we used the Pairwise Sequentially Markovian Coalescent (PSMC) model that analyzes the dynamics of the effective population size over time [60]. We included at least one representative of each of the 15 African populations and two Eurasian samples in the analysis (Additional file 1: Figure S7.1) and considered both the classical mutation rate of 2.5 × 10−8 [61] and the 1.2 × 10−8 mutations per bp per generation reported in other analyses [62, 63]. The demographic trajectories of the sub-Saharan agriculturalist populations are very similar to each other; and only South African Bantu and Toubou individuals differ partly from the rest of sub-Saharan farmer samples; however, their considerable levels of admixture with other North African or hunter-gatherer populations (Fig. 2b) might explain this trend. Therefore, in order to ease visualization, we plotted a Yoruba individual (Yoruba_HGDP00936) and two Ju|‘hoansi individuals as representatives of the sub-Saharan agriculturalist and Khoisan populations, respectively (Fig. 3 and Additional file 1: Figure S7.2 considering a mutation rate of 1.2 × 10−8).

Fig. 3 PSMC analyses on eight populations. N e and time have been scaled with a mutation rate of 2.5 × 10−8 and a generation time of 25 years Full size image

Our PSMC analysis recapitulated major demographic events that have previously been reported, including a pan-population bottleneck starting around 100 kya [60]. Out-of-Africa populations started to diverge from African populations around 100 to 110 kya and suffered the highest-in-magnitude population reduction, until their recent expansion. Khoisan individuals displayed larger N e , maintained through all time periods, as recently reported [18]. We observed that ancestors of Mbuti and Baka Pygmies, like Khoisan, maintained a larger effective population size after the split with non-Khoisan/Pygmy populations. Both Khoisan and Pygmy individuals displayed a moderate population decline compared to Eurasian or North African individuals and also compared to Yoruba, which showed intermediate gradual N e reduction. Interestingly, the Baka Pygmy sample showed a sharp increase in N e around 30 kya. In order to discard a possible spurious increase occurring in one specific time period, we changed time parameters of PSMC to obtain a finer scale. The new estimates revealed a bit more gradual increase spanning three different time intervals (Additional file 1: Figure S7.3). Finally, we also tested to which degree a putative contribution of European ancestry into sub-Saharan African genomes could affect any of the above observations. To that effect, we masked, from the genome of each sub-Saharan individual, all genomic regions of European origin, which we previously inferred with RFMix [64] by considering as reference 922 individuals from African or European populations from the 1000 Genomes Project Phase III panel. We repeated the PSMC on the masked genomes obtaining nearly identical trajectories (Additional file 1: Figure S7.4).

Archaic introgression from known hominins

Archaic introgression from either known or unknown extinct hominins has been suggested in different African populations [26, 30, 33,34,35,36,37,38,39]. In our data, we confirmed previous findings [28,29,30], as the results of the D-statistics of the form D(X = African population 1, Y = African population 2; Neanderthal/Denisova; Chimpanzee) showed that Eurasian samples as well as North African individuals exhibit a significant enrichment of Neanderthal DNA (higher in East Asia than in West Eurasia or North Africa) when compared to sub-Saharan African samples (Additional file 1: Figure S8.1). Z-score values are generally lower for signatures of Denisovan introgression than for Neanderthal, meaning that a lower proportion of gene flow is observed when admixture has taken place. Asian samples were enriched in archaic DNA from Denisovans, and the European and North African samples too, but at lower levels. This is probably due to the fact that Neanderthal and Denisova are sister groups and consequently share derived alleles that might confound their admixture signals. We found no signals of Neanderthal or Denisovan introgression in the sub-Saharan individuals, which was additionally confirmed with an F 4 -ratio test for the Neanderthal introgression (Additional file 1: Table S8.1).

Demographic model

We aimed to explore the impact of recent population admixture on the genetic landscape of sub-Saharan populations in an integrative manner, as well as the presence and nature of archaic introgression from hominin populations. To this end, we conducted an Approximate Bayesian Computation (ABC) analysis coupled to a Deep Learning (DL) framework [50] (Additional file 1: Figure S9.1).

We implemented six demographic models (Fig. 4; Additional file 1: Table S9.1) of increasing complexity from a basic one (model A). Model A summarizes accepted features of human demography [65]: (i) presence of archaic populations out of the African continent, represented by the Neanderthal and Denisovans lineages, (ii) introgression from early anatomically modern humans into Neanderthal [44, 45], (iii) introgression from an extremely archaic population into Denisovans [36], (iv) Khoisans at the root of mankind [11, 14,15,16,17,18], (v) Out-of-Africa event of AMHs [3], (vi) archaic introgression of a Neanderthal-like population after the Out-of-Africa event in Eurasian populations [30], and (vii) archaic introgression from a Denisovan-like population in East Asians [31]. Furthermore, we included recent migrations between Europeans to West Africans, Europeans to Mbutis, Europeans to Khoisans, West Africans to Mbutis, West Africans to Khoisans, Mbutis to West Africans, Mbuti to Khoisans, and Khoisans to Mbutis. These last parameters, as well as the introgression of the archaic population in Denisovans, can be considered as nuisance parameters. Model B extends model A by adding a “ghost” archaic population, XAf, directly related to the lineage leading to AMHs. In this model, XAf independently inbreeds with each of the AMH African populations. Model C extends A by considering that the ghost archaic population is directly related to the Neanderthal lineage, Xn. Model D considers that Xn appears in the archaic lineage out of Africa before the Neanderthal and Denisovan split. Model E is a mixture of model B and C. It considers two ghost archaic populations, one that directly split from the lineage that will produce the AMHs and another related to the Neanderthal lineage, both admixing with AMH populations within Africa. Finally, model F mixes the ghost features of models B and D.

Fig. 4 Tested demographic models. Left figures: topology of the demographic models for ABC-DL analyses considering East Asian (EAs), European (Eu), western sub-Saharan (WAf), Mbuti Pygmy (Mbt), and Khoisan (Kho) anatomically modern humans, Altai Neanderthal (N), Neanderthal-like population (NI) with introgressed DNA present in Eurasian populations, Denisova (D), Denisovan-like population (NI) with introgressed DNA present in East Asian populations, an archaic ghost population (Xe) that has left their footprint into Denisovan genome, a putative African extinct basal branch population (XAf), and a second putative archaic ghost population Neanderthal-like (Xn). In all models, recent migrations described in the text are allowed, but not shown in the figure to ease visualization. The posterior probability obtained with our ABC-DL approach is shown for each model; right figure: fitted B model Full size image

First, we estimated the power of the ABC-DL framework to distinguish among the six considered models by using simulated datasets from known models as observed data and running the ABC-DL framework to estimate the posterior probability of each model. Additional file 1: Table S9.2 shows the confusion matrix for the six models using 100 simulations for each model as observed data. Our analysis suggests that the ABC-DL framework cannot identify all the models with the same accuracy; model F shows the lowest P (real model = X | predicted model by ABC-DL = X) = 0.41, whereas models A, B, C, and D show posterior probabilities of correct assignment > 0.5. This is not surprising given that models E and F are the most general ones. Given these results, we applied the ABC-DL to our observed data. Out of the six considered models, the one showing the largest posterior probability is model B (P (model = B|Data) = 0.85), namely the presence of a ghost archaic population directly related with the lineage that produced the anatomically modern humans. Notably, this posterior probability of model B is 11 times greater than the one from the second most supported model (model D) (P (model = D|Data) = 0.078)), a substantial Bayes factor difference [66] that suggests that the best model out of all the compared ones is model B. Remarkably, basic model A, which does not include any kind of archaic introgression in Africa, has a posterior probability close to 0.

Next, we aimed to estimate the posterior probability of each of the 52 parameters of model B by applying the ABC-DL approach. As a preliminary step, we quantified the performance of the ABC-DL framework in simulated data. For each parameter, we ascertained 1000 simulations at random and estimated the posterior distribution using the ABC-DL. Next, we computed the factor 2 statistic (Additional file 1: Table S9.3), which is the number of times that the estimated mean is within the range 50% and 200% of the true value of the parameter (see Excoffier et al. [67] for details). In 96% of the times, the mean of the posterior distribution of the time of split of XAf with the AMH lineage is within the factor 2, suggesting high confidence in using the mean of this parameter as proxy of the real value. The factor 2 of the amount of introgression of XAf to the different African populations ranges between 77% (XAf to West African) and 72% (XAf to Khoisan) and the times that XAf introgression to the African populations is within the factor 2 range are also ~ 80%, much higher than the expected under randomness. According to the factor 2 analysis, the worse performance of using the mean as a proxy is for migration parameters, which show percentages of factor 2 of ~ 50%, similar to the ones that are observed if the mean of the posterior is sampled at random from the prior distribution. Overall, these analyses support that the mean of the posterior distribution obtained by the ABC-DL framework is a good proxy of the real value used in the simulations for most of the parameters.

Finally, we estimated the posterior distributions of the parameters that describe the most supported demographic model (Fig. 4, Table 2, and Additional file 1: Table S9.4). The ABC-DL produced posterior distributions that strongly deviated from the prior distributions that we considered (see Additional file 1: Figure S9.3) for most of the parameters, suggesting that the ABC-DL approach could properly extract the information present in the observed data to update the prior distributions of each parameter. Not surprisingly, most of the parameters showing posterior distributions similar to the prior distributions are the same that showed low factor 2 values in our former analysis. According to our ABC-DL analyses (Table 2), the AMH lineage and the one from the archaic Eurasian populations diverged 603 kya (95% credible interval (CI) ranging from 495.85 to 796.86 kya). The ghost XAf archaic population and the AMH lineage split 528 kya (95% CI of 230.16 to 700.06 kya), whereas the Denisovan and Neanderthal lineages split 426 kya (95% CI from 332.77 to 538.37 kya). Archaic introgression estimates from XAf to African populations range from 3.8% (95% CI 1.7 to 4.8%) in Khoisan and 3.9% (95% CI 1.3 to 4.9%) in Mbuti to 5.8% (95% CI 0.7 to 0.97%) in West Africa. Our analyses also identified the archaic introgression from early AMHs into Neanderthals (mean of the posterior distribution = 1.2%), yet the 95% CI included 0% (95% CI ranging from 0 to 4%).

Table 2 Mean and 95% CI of main parameters of model B Full size table

The obtained estimates of Neanderthal introgression in Eurasian populations in model B are larger (3.9%, 95% CI from 0.017 to 0.048%) than usually reported. Since sub-Saharan populations are traditionally used as outgroup for detecting archaic introgression out of Africa, we wondered whether these estimated values of archaic introgression in Eurasia could be higher than previously by the fact that we were considering in model B archaic introgression within Africa. We conducted the ABC-DL analysis using the model A, the basic model that does not consider XAf (Additional file 1: Table S9.4). The mean of the posterior distribution of the introgression of Neanderthal ancestry in Eurasian populations was 1.1% (95% CI 0.35 to 3.6%), 3.3 times smaller than that obtained in model B and closer to the range of previously reported values.