Southern African DNA sampling

A study permit was obtained from the Ministry of Health and Social Services (MoHSS), Namibia. Ethics approval to conduct whole-genome sequencing and analysis was obtained from the Institutional Review Board (IRB) or Human Research Ethics Committee (HREC) from three institutions, namely the Pennsylvania State University (IRB #28460 and IRB #28890), the University of New South Wales, Australia (HREC #08089 and HREC #08244) and the University of Limpopo, South Africa (Limpopo Provincial Government #011/2008).

Consents of the participants were obtained verbally (and documented via videotape) or in writing. The consent text was provided in English, Afrikaans or via an interpreter in the native language of each participant (Juu- and Tuu-language). Participants agreed that the data generated will be made freely available to the scientific community.

We consented three males and two females from two indigenous hunter-gatherer groups in the Northern and Southern Kalahari Desert. Each was among the eldest members of their respective communities. Inclusion in the study was also on the basis of their narrated-family history, as well as the remoteness of their geographical location, impeding ease of contact with other groups.

The indigenous Kalahari hunter-gatherers included in this study live in scattered family groups in the vast semi-desert regions of Namibia, an 823,145-km2 country on the southwest coast of Africa with ~2 million inhabitants (ref. 10 and references therein). Today Namibia is home to ~38,000 Khoisan people. In detail, KB1 and KB2 are members of a Tuu-speaking group of the southern Kalahari. NB1 and NB8 are Ju/’hoansi of the northern Kalahari region, separated by ~600 km aerial distance. MD8, belongs to the !Xun (!Kung)-speaking group relocated by the government from the Etosha plains region in the northwestern Kalahari. ABT is a direct descendant from the two major-linguistic groups in southern Africa, namely the Nguni-speakers (~60% of the people of South Africa) via his paternal Xhosa ancestry and from the Sotho-Tswana-speakers (~33% of the people of South Africa) via his maternal Motswana ancestry.

Genome sequencing and read alignments

The samples NB1, NB8, KB1, KB2, MD8 and ABT were sequenced to a depth of 27–55-fold using the Illumina HiSeq sequencing platform. The details regarding samplings, DNA extraction and sequencing are the same as described in Schuster et al.10 The genome sequences for the eight other human samples were downloaded from the NCBI Short Read Archive (SRA). These sequences were aligned to the human reference sequence (GRCh37/hg19) using the BWA (version 0.5.9) aligner (ref. 33). All default parameters were used with the exception of ‘–q 15’, which was used to soft-trim the low quality bases at the 3′ ends of the reads. The reads were then realigned using the GATK IndelRealigner (ref. 34), and the potential PCR duplicates were flagged using the MarkDuplicates tool from the Picard suite (Picard, http://picard.sourceforge.net). Sequence-read data for the six southern African genomes that were sequenced as part of this study have been deposited in the Sequence Read Archive under accession PRJNA263627.

SNP and consensus calls

The diploid consensus sequence for the autosomes was obtained using the ‘mpileup’ command. The option ‘–C 50’ was used to reduce the mapping quality of the reads with multiple mismatches. Locations were marked as missing data in the following cases: (a) The coverage at the location was less than three reads or greater than twice the average coverage of the genome; (b) The RMS mapping quality of the reads at the location was less than 10. The consensus for the X chromosome was derived similarly, but the pseudo-autosomal regions (chrX:60001-2699520 and chrX:154931044-155260560) were filtered as missing data. The heterozygous calls in the male X chromosomes were also discarded as errors.

We used SAMtools version 0.1.18 (ref. 33) to identify the locations of the SNPs, using the option ‘–C 50’ to reduce the mapping quality of the reads with multiple mismatches. SNP locations in the nuclear genome were filtered to maintain SNPs for which coverage in the sample was less than that expected using the Lander–Waterman equation35. We filtered out variant locations where the RMS mapping quality was less than 10, or if the SNP quality was less than 30.

Genotyping data sets

We obtained three genotyping SNP data sets from the following sources: genotyping data from HapMap36, HGDP (CEPH, http://www.cephb.fr/en/hgdp), and Schlebusch et al.5 (Supplementary Table 1). We identified the SNPs common to the three data sets. We merged those with the SNPs from our 14 whole-genome data sets. The resulting data set was then filtered to throw away flipped SNPs (SNPs on the non-reference strand) or possible flipped SNPs: the AT-GC SNPs, the triple-allelic SNPs within the merged data sets, as well as the SNPs that were homozygotes within each data set. This left us with 419,969 SNPs.

For this study, we selected only unrelated individuals. Five pairs of individuals, three Ju/’hoansi and two Southwestern Bantu, were identified as highly related, according to the Identical By Descent (IBD) analysis, which was run using PLINK37. The following five individuals were removed from the data sets based on the greatest fraction of missing data: KSP113, KSP116, KSP117, KSP196, and KSP205. Moreover, for analyses in which we identified populations, we removed the genetic-outlier individuals with respect to their populations. To detect outlier individuals, we performed PCA on the merged genotyping data set. We identified two individuals for removal as outliers in their population: HGDP00980 and KSP109. Therefore, we used 419,969 SNPs from 1,462 individuals for our population genetic analyses in this study (Supplementary Table 1). This SNP data set is available on ‘Bushman’ data library on Galaxy38.

Population structure and admixture estimations

We applied the ADMIXTURE program11 to the merged SNP data set to identify ancient population structure and ancestries of the 14 complete-genome samples. To reduce the effects of biased SNPs on clustering groups in the program, we used only 417,593 SNPs having a minor-allele frequency greater than 0.01 in the entire population of the 1,462 individuals. First, we used the entire data set of 1,462 individuals for the analysis with the number of ancestries K=4–14 (Supplementary Fig. 1). The higher K distinguished many isolated ethnic groups in Asia. Thus we ran the program using selected 490 individuals which are composed of African, European and Asian populations (Fig. 1).

Independently, we performed PCA analysis of the SNP genotyping data set, using EIGENSOFT12. We applied the analysis to the entire data set of 1,462 individuals (Supplementary Fig. 2) and then to only African and European populations including Central Asians (n=967) to examine the gene flow between African and European ancestries (Supplementary Fig. 3). To identify origins of our six southern African individuals precisely, we performed the PCA analysis for only African populations (n=274; Fig. 1c). In this analysis, the ≠Khomani and Nama populations were not included because of their recent gene flow from non-Khoisan ancestries, in order to better examine genetic clusters of individuals.

In order to identify admixtures in each of the six southern African individuals’ genome, we applied three methods: HAPMIX39, PCAdmix40 and dpmix41. To prepare the data sets for the run of PCAdmix, we needed to determine phased genotypes. The phased genotype data set of the same HapMap panel ( http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes.phase1_release_v3/) and the data set by Schulebusch et al.5 For the unphased data sets, we inferred a haplotype phase using BEAGLE42. PCAdmix requires three ancestor populations in order to infer the ancestry of each haplotype. Based upon the ADMIXTURE results, the samples that contained a high proportion (>0.8) of the corresponding ancestry were selected as the ancestral population, because the genetic components of the ancestral populations would likely affect the inference by PCAdmix. We used the Khoisan (n=67), Yoruba (n=85) and European (n=82) populations as putative ancestral populations for inference of ancestry of our Khoisan individuals. For the inference of ABT, we used a southeastern Bantu population with ABT as a test population. For the inference of non-African genomes, we assumed Yoruba, European and Asian (n=81) ancestral populations. We fixed the window size of the haplotype to 40 SNPs and used default for other parameters in the PCAdmix analysis. The estimated local ancestry of our six southern African samples and each from Yoruba, European and Asian samples were illustrated in Supplementary Fig. 4.

Since HAPMIX assumes only two ancestral populations, we ran the program for two sets of ancestral populations (Khoisan/Yoruba and Khoisan/European) independently. Two parameters of lambda and theta were given: the time of admixture and the admixed proportion, respectively. For both runs, we fixed lambda=5, and theta as 0.01 for admixtures from the Yoruba population and 0.001 for admixtures from the European population, according to the results of the ADMIXTURE analysis, which showed no significant admixtures from European ancestry. Before we determined the parameters, we tested several sets of parameters, and the results were pretty robust. The two sets of outputs (Khoisan/Yoruba and Khoisan/European) were unified into a single result set. When inconsistent estimates occurred between the two outputs, such as KHO/KHO versus KHO/YRI, the admixed type (KHO/YRI) was chosen, so that the unified result was biased toward detecting admixtures. When two different admixtures were inferred for one genotype, for example, KHO/YRI versus KHO/EUR, it was classified into the ‘undetermined’ category.

The third method of inferring admixtures was applied, by using the dpmix program41 on Galaxy38. We used the ‘Admixture’ tool in the ‘Genome diversity’ session. The only parameter we needed to determine was ‘genotype switch penalty’: we fixed the parameter of 10, which determined a reasonable length of admixture blocks, after testing genome switch penalties in the 2–20 range. The dpmix program allowed us to assume three ancestral populations, and we used the same three ancestral populations defined for the PCAdmix analysis.

We compared results between those three methods and collected the SNPs for which all three methods supported a consistent ancestry. The SNPs showed inconsistent ancestry among methods were assigned into the ‘undetermined’ category. The consistent results were illustrated in Supplementary Fig. 5, and a comparison between methods was shown in Supplementary Fig. 6 and Supplementary Table 2.

Effective population size inference

We ran the PSMC program17 on each of our 14 whole-genome sequencing data sets in order to infer effective population size. For the run, we used the consensus called using SAMtools as described before. The parameters and options used with PSMC were the same as the ones used in a previous study17. We measured the variance of the estimate by bootstrapping, using the option provided in the PSMC package. We repeated the run 100 times for each of the individuals (Supplementary Fig. 14).

There were a couple of issues relative to the understanding of the PSMC estimates. Since PSMC uses the heterozygote density for its inference, sequencing quality affects the PSMC estimates significantly. Sequence coverage of the 14 genome data sets varied between 19–86-fold (Table 1). Considering the associations of sequencing coverage and number of heterozygous sites (Supplementary Fig. 15), the ABT genome was sequenced to a high enough coverage (27-fold) to identify the heterozygotes. In addition, we designed an experiment to examine the impact of sequencing coverage on the PSMC estimates. We generated eight genome data sets having 10–80-fold sequencing coverage from one Yoruba genome (NA18507), which, at 86-fold, has the highest sequencing coverage, among our data sets (Table 1). Using the eight data sets, we ran PSMC independently. The pattern of changes in the PSMC estimates was very similar among those genome data sets: however the degree of change was shifted toward being more recent and smaller in size as the sequencing coverage was lowered (Supplementary Fig. 16). From this experiment, we found that a sequencing coverage cutoff of 30-fold average is sufficient for robust PSMC outputs. Therefore, half of our complete-genome sequencing data sets are suitably qualified for PSMC analysis (Table 1).

Only four genome data sets used in this study were not sequenced to a higher coverage than the ABT genome: NA12891, NA19238, NA19239 and AK1. The PSMC estimates of these four genomes were corrected, using the ‘false negative rate (FNR)’ option provided by the PSMC package. We used the KB1 genome data set to test the effectiveness of this option. We generated the KB1 genome sequence, having 15-fold sequencing coverage, and ran PSMC (KB1.15X). We used the FNR option to correct the estimates, which were shown to be KB1.15X (FNR=0.1) and KB1.15X (FNR=0.2), respectively in Supplementary Fig. 17. The original KB1 estimate was pretty similar to KB1.15X (FNR=0.2). We ended up using the FNR option for the four genomes NA12891, NA19238, NA19239 and AK1 to adjust the effect of low sequencing coverage on the PSMC estimates (Fig. 3).

The second issue was that scaling the estimates depends upon the mutation rate. Estimates of both effective population size (N e ) and time depend upon the mutation rate assumed in the PSMC analysis. Classically, the mutation rate has been known to be 2.5e−08 per site per generation18, while recent studies reported a lower rate of 1.2e−08 per site per generation30,31,32. In this study, we used 2.5e−08 per site per generation as a mutation rate, because the mutation rate estimated by a Bayesian inference (G-PhoCS; Supplementary Fig. 11) was around 2.5e−08, on the basis of our genome data set. The effect of the mutation rate is described in Supplementary Fig. 12.

In addition, the PSMC estimates for the very recent time period, such as 20–42 kyr ago to present, may not produce accurate estimations, according to the author17 and our bootstrap test (Supplementary Fig. 14). Therefore, we don’t discuss PSMC estimates in relation to recent history ~20 kyr ago, in this study.

Coalescence simulations

In order to confirm a robustness of the PSMC inference (Fig. 3), we performed coalescence simulations. Under several demographic models, nucleotide sequences were generated, using the ms program43. Commands used in this analysis and details are shown in Supplementary Fig. 9. The PSMC program was applied to the simulated sequences to confirm to reconstruct the effective population-size over time as given model and effects of recent demographic events on the PSMC estimations of past effective population size.

The estimation of population divergence time

We used PSMC analysis to infer the divergence time between populations17,44. If PSMC analysis is applied, a hybridized diploid of two haploid sequences from each population, then the time point where the inferred population size increased corresponds to the divergence time. To apply this approach, we constructed each pseudo-diploid X chromosome by combining two male X chromosomes. Supplementary Fig. 10 shows the results of PSMC analysis for pairwise pseudo-diploid X chromosomes. For example, the pairs of the European genome (JW) and each of the Khoisan, Yoruba and Asian populations are shown in Supplementary Fig. 10a. The Khoisan and European pair (JW-NB1) demonstrated a clear increase in their population size compared with the European pair (JW-NA12891) from 150–313 kyr ago, which could be the divergence time of the Khoisan and non-Khoisan populations. All other pairs stayed low in population size relative to the Khoisan and European pair, until 270–570 kyr ago. This result clearly shows the earliest population splits between the Khoisan and non-Khoisan populations, even though the time estimation is approximate.