Admixture analysis

We merged genotype data from 282 samples from 23 regional and global diversity projects, yielding a total of 5,966 individuals and 19,075 SNPs (Table S1). To address the possible effect of SNP ascertainment bias on F_ST estimation, we compared pairwise estimates for the 26 samples from the 1000 Genomes Project (ref. 31) based on our panel of genotyped SNPs vs. the whole genome sequences. The median difference was 0.0030 (95% confidence interval [−0.0002, 0.0177]), indicating that F_ST estimation was not significantly biased by SNP ascertainment or the size of our panel of SNPs.
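The panel-versus-sequence comparison above can be illustrated with a minimal sketch. It assumes the reported interval is an empirical percentile interval of the per-pair differences, which the text does not state. The F_ST values are synthetic stand-ins, not the study's estimates, and all function and variable names are hypothetical.

```python
import random
import statistics

def median_and_percentile_interval(diffs, lower=0.025, upper=0.975):
    """Return the median and an empirical percentile interval of diffs."""
    ordered = sorted(diffs)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        position = q * (n - 1)
        below = int(position)
        above = min(below + 1, n - 1)
        return ordered[below] + (position - below) * (ordered[above] - ordered[below])

    return statistics.median(ordered), (quantile(lower), quantile(upper))

# Synthetic stand-in data: 26 samples give 26 * 25 / 2 = 325 pairwise estimates.
rng = random.Random(0)
n_pairs = 26 * 25 // 2
fst_wgs = [rng.uniform(0.0, 0.15) for _ in range(n_pairs)]       # assumed sequence-based values
fst_panel = [f + rng.gauss(0.003, 0.004) for f in fst_wgs]       # assumed panel-based values

# Per-pair difference between the two estimates of the same quantity.
diffs = [p - w for p, w in zip(fst_panel, fst_wgs)]
med, (low, high) = median_and_percentile_interval(diffs)
print(f"median difference = {med:.4f}, 95% interval = [{low:.4f}, {high:.4f}]")
```

An interval that contains 0 would be read the same way as in the text: no detectable systematic shift between the two sets of estimates.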

Unsupervised clustering yielded support for 21 subcontinental ancestries (Fig. 1 and Fig. S1). The posterior mode of K was also 21, with a 100% highest posterior density interval [18, 23]. Of the 21 ancestries, 18 were previously observed (ref. 2). The only previously observed ancestry not present in this set of 21 was the ancestry predominantly found in Cushitic-speaking peoples from East Africa, which we subsequently refer to in shorthand as Cushitic ancestry. Given that Cushitic ancestry has been detected before (refs. 2, 11), its absence in the current data set indicates a need for additional sampling for proper classification. Our analysis identified three new ancestries: (1) Western African, (2) Circumpolar, and (3) Southern Asian. Our data support the hypothesis that subcontinental geography is a strong proxy for ancestry (Fig. S2). Consequently, we labeled the 21 ancestries on the basis of present-day geographic distributions. The samples that are the best proxies for these ancestries are provided in Table S2 and the mixing proportions of all ancestries for all samples are provided in Table S3. Pairwise F_ST estimates between ancestries are provided in Table S4.

Figure 1. Ancestry analysis of the global data set. The 282 samples are labeled alternating in the left and right margins. The 21 ancestral components are Kalash (black), Southern Asian (dark goldenrod), South Indian (slate blue), Central African (magenta), Southern African (dark orchid), West-Central African (brown), Western African (tomato), Eastern African (orange), Omotic (yellow), Northern African (purple), Northern European (blue), Southern European (dark olive green), Western Asian (white), Arabian (light gray), Oceanian (salmon), Japanese (red), Southeastern Asian (coral), Northern Asian (aquamarine), Sino-Tibetan (green), Circumpolar (pink), and Amerindian (gray).

To investigate the stability of the ancestries, we tested the null hypothesis that no genetic differentiation exists between the previous and current definitions for each ancestry. First, we used Mantel’s test to assess the correlation between the F_ST matrix generated with ancestries as defined in this study compared to the one generated with ancestries as previously defined (ref. 2). The matrices were matched by eliminating the three new ancestries from the current matrix and the Cushitic entry from the previous matrix, resulting in a comparison of two 18 × 18 matrices. The estimated correlation coefficient r = 0.992 was significantly different from ρ = 0 (1.28 × 10^−34 ≤ p ≤ 2.56 × 10^−5) but not significantly different from ρ = 1 (0.122 ≤ p ≤ 0.596), providing evidence for the overall stability of the clusters. Second, we tested whether F_ST was 0 for each of the 18 pairwise comparisons. For 14 ancestries, the previous and current definitions were not significantly different (Table S5). For Southeastern Asian, Sino-Tibetan, Western Asian, and South Indian ancestries, the differences were statistically significant, with changes in F_ST ranging from 0.010 to 0.021 (Table S5). Thus, seemingly small changes in the overall cross-validation score do not preclude significant changes in the allele frequency profiles of a subset of ancestries.
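For readers unfamiliar with Mantel's test, the following is a minimal sketch of its common permutation form, which tests against ρ = 0 only. It is not the procedure used in this study: the text reports ranges of p-values and a test against ρ = 1 whose details are not given here. The two 18 × 18 matrices are synthetic, and every name is hypothetical.

```python
import math
import random

def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel_test(a, b, n_perm=999, seed=0):
    """Permutation Mantel test of H0: no correlation between matrices a and b."""
    n = len(a)
    rng = random.Random(seed)
    flat_b = upper_triangle(b)
    r_obs = pearson(upper_triangle(a), flat_b)
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of a together so the matrix stays symmetric.
        permuted = [[a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if abs(pearson(upper_triangle(permuted), flat_b)) >= abs(r_obs):
            hits += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return r_obs, (hits + 1) / (n_perm + 1)

# Synthetic stand-ins for the "previous" and "current" 18 x 18 distance matrices.
rng = random.Random(1)
n = 18
pos = [rng.uniform(0.0, 1.0) for _ in range(n)]
prev = [[abs(pos[i] - pos[j]) for j in range(n)] for i in range(n)]
curr = [row[:] for row in prev]
for i in range(n):
    for j in range(i + 1, n):
        curr[i][j] = curr[j][i] = prev[i][j] + rng.uniform(0.0, 0.01)

r, p = mantel_test(curr, prev)
print(f"r = {r:.3f}, permutation p = {p:.3f}")
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent observations; shuffling individual cells would break the matrix structure and overstate significance.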

We next investigated the extent of ancestral heterogeneity throughout the hierarchy of population structure. First, we found that individuals with mixed ancestry were present on all continents (Fig. S2). Second, mixed ancestry was present in 96.8% of samples (Table S3), with a median of 6 ancestries per sample (95% confidence interval [1, 12]). To illustrate, the GBR (British in England and Scotland) sample had a mixture of 38.1% Northern European and 42.8% Southern European ancestries, with small but significant contributions from seven additional ancestries (Table S3). In the ACB sample (African Caribbeans in Barbados), “African” encompassed six ancestries and “European” encompassed four ancestries (Table S3). Similarly, the ASW sample (People with African ancestry in Southwest USA) included all 10 of these ancestries plus one additional ancestry to account for a Native American component (Table S3). The PUR sample (Puerto Ricans in Puerto Rico) had 13 ancestries. Third, consistent with earlier reports (refs. 2, 11), mixed ancestry was present in 97.3% of individuals, with a median of 4 ancestries per individual (95% confidence interval [1, 7]).
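The summary statistics in this paragraph (the share of samples with mixed ancestry and the median number of ancestries) can be derived from a table of mixing proportions along the following lines. This is only a sketch: the fixed cutoff is a hypothetical stand-in for whatever significance criterion was applied to each ancestry's contribution, and the toy rows are invented rather than taken from Table S3.

```python
import statistics

def count_ancestries(proportions, threshold=0.01):
    """Number of ancestries whose proportion exceeds the (assumed) threshold."""
    return sum(1 for p in proportions if p > threshold)

def summarize_mixture(table, threshold=0.01):
    """Return (fraction of rows with more than one ancestry, median count)."""
    counts = [count_ancestries(row, threshold) for row in table.values()]
    mixed = sum(1 for c in counts if c > 1)
    return mixed / len(counts), statistics.median(counts)

# Toy rows; labels and values are invented for illustration only.
toy = {
    "sample_A": [0.55, 0.40, 0.05, 0.00],
    "sample_B": [1.00, 0.00, 0.00, 0.00],
    "sample_C": [0.70, 0.295, 0.005, 0.00],
    "sample_D": [0.25, 0.25, 0.25, 0.25],
}

fraction_mixed, median_count = summarize_mixture(toy)
print(f"mixed: {fraction_mixed:.1%}, median ancestries: {median_count}")
```

The same function applies unchanged to per-individual rows, which is how the sample-level and individual-level figures in the text differ only in the unit of analysis.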

Migration events

We used TreeMix (ref. 37) to infer the patterns of population splits and mixtures in the evolutionary history of the 21 ancestries. By analyzing ancestries instead of samples, the underlying model infers the structure of an ancestral population by linking modern ancestries to a common ancestor using ancestry-specific allele frequencies with the effects of recent admixture removed. This analysis revealed three migration events (Fig. 2). One migration event was between Eastern African and Northern African ancestries. This event is supported by the fact that E1b1b1b1a (formerly known as E-M81), the most common Y DNA haplogroup in North Africa, is a descendant of E1b1b, commonly found in Eastern Africa (ref. 38). Another migration event was between Omotic ancestry and the node leading to Arabian, Northern African, Southern European, and Western Asian ancestries. We did not detect either of these two events previously (ref. 39). When we added the previously defined Cushitic ancestry to the current set, we did not observe either event, suggesting that both events reflected the absence of Cushitic ancestry. The third migration event, which we did observe previously, was between Northern European and Amerindian ancestries. The identification of Circumpolar ancestry resulted in the migration edge moving from the terminal tip of Amerindian ancestry to the common ancestor of Amerindian and Circumpolar ancestries.

Figure 2. (A) The migration graph. TreeMix analysis suggests that migration events occurred between (1) Eastern African and Northern African ancestries; (2) Omotic ancestry and the node leading to Arabian, Northern African, Southern European, and Western Asian ancestries; and (3) Northern European ancestry and the node leading to Amerindian and Circumpolar ancestries. (B) Majority-rule consensus tree. The migration events were suppressed to emphasize the underlying topology.

Three previously observed migration events (ref. 39) were not evident in the current analysis. One, we did not observe an event between Arabian and Cushitic ancestries, because Cushitic ancestry was not present in the current data set. When we integrated the previously defined Cushitic ancestry into the current set, TreeMix grouped Cushitic ancestry with Eastern African and Omotic ancestries and inferred a migration event between Arabian and Cushitic ancestries, consistent with our previous results. Furthermore, Arabian, Eastern African, and Omotic ancestries were not significantly different in the presence or absence of Cushitic ancestry (Table S5). Taken together, these results support the hypothesis that Cushitic ancestry was formed by a mixture event. Two, we previously observed an inferred migration event between Indian and Arabian ancestries. Indian ancestry experienced the largest amount of redefinition with the additional data, whereas Arabian ancestry did not differ (Table S5). When we replaced the previous definition of Indian ancestry with the current one, no migration event was inferred. This result suggests that the original inference of a migration event reflected an underdefined Indian ancestry. Three, we previously observed an event involving Kalash and Northern European ancestries. Kalash ancestry was not significantly different between the two data sets (Table S3). When we added the newly defined Southern Asian ancestry, we observed the Kalash-Northern European event when Kalash ancestry was not grouped in the subtree with Southern Asian ancestry (36% of runs) but not when Kalash ancestry was grouped in the subtree with Southern Asian ancestry (64% of runs).

Language

We were able to annotate 249 samples with language (Table S1). Our data set covers an estimated 21.3% of the 141 primary language families but 97.8% of people (ref. 40). By focusing on ancestries rather than samples, confounding due to recent admixture is removed. We therefore evaluated correlations among ancestries and languages (Table S6).
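As a rough illustration of an ancestry-language correlation, the sketch below computes a Pearson correlation between ancestry proportions and a 0/1 indicator of language membership, with a permutation p-value. Both choices are assumptions: this section does not specify the estimator or the test behind the reported r and p values, and a permutation test cannot produce p-values as small as those reported. The data and names are invented for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Two-sided permutation test of H0: x and y are uncorrelated."""
    rng = random.Random(seed)
    r_obs = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the language labels breaks any link to ancestry.
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= abs(r_obs):
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy data: proportion of one ancestry per unit, and whether that unit
# is annotated with a given language family (1) or not (0).
ancestry = [0.95, 0.90, 0.85, 0.80, 0.10, 0.05, 0.15, 0.00, 0.20, 0.05]
speaks = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

r, p = permutation_p_value(ancestry, speaks)
print(f"r = {r:.3f}, permutation p = {p:.3f}")
```

Pooling several languages into one indicator, as done for the "Combined" categories in the figures, amounts to taking the element-wise maximum of their 0/1 indicators before computing the correlation.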

Southern African ancestry correlates with Kwadi-Khoe, Kx’a, and Tuu languages (r = 0.960, p = 4.78 × 10^−138, Fig. 3A). Central African ancestry corresponds to Pygmies, both Eastern and Western (Table S3). Pygmies are thought to have lost their original language and now speak Niger-Congo or Nilo-Saharan languages, presumably adopted from neighboring tribes (ref. 41). Consequently, Central African ancestry does not meaningfully correlate with extant language families.

Figure 3. Correlation of ancestry and language. (A) “Combined” refers to Kwadi-Khoe, Tuu, and Kx’a, previously referred to collectively as Khoisan. (B) “+” indicates the combination of the listed language plus all languages listed to the left. Tupian, Arawakan, Quechumaran, Mayan, and Uto-Aztecan are referred to collectively as Amerind. (C) “Combined” refers to Chukotko-Kamchatkan and Eskimo-Aleut, referred to collectively as Paleo-Siberian. Note that inclusion of Yeniseian worsens the correlation. (D) “Combined” refers to Mongolic, Turkic, and Tungusic, referred to collectively as Altaic.

Eastern African ancestry correlates with the Nilo-Saharan language family (r = 0.715, p = 2.39 × 10^−40). Arabian ancestry correlates with the Semitic branch of the Afroasiatic language family (r = 0.774, p = 7.28 × 10^−51). The Cushitic branch of the Afroasiatic language family correlates with both Eastern African (r = 0.417, p = 7.17 × 10^−12) and Arabian (r = 0.336, p = 5.46 × 10^−8) ancestries. This result is consistent with our previous finding that Cushitic ancestry formed by admixture between Nilo-Saharan and Arabian ancestries (ref. 39). West-Central African ancestry correlates with both Bantu and non-Bantu languages in the Niger-Congo language family (r = 0.895, p = 2.00 × 10^−88), whereas Western African ancestry correlates with Mande languages (r = 0.797, p = 5.64 × 10^−56). West-Central and Western African ancestries are sibling ancestries (Fig. 2), but this result does not indicate whether Mande languages should be considered as part of the Niger-Congo language family.

Northern African ancestry correlates with the Berber branch of the Afroasiatic language family (r = 0.946, p = 1.48 × 10^−122). Arabian and Northern African ancestries are both descended from the lineage that includes all Out of Africa migrants, whereas Omotic ancestry is descended from the lineage that includes all sub-Saharan ancestries (Fig. 2). Omotic ancestry correlates with the Omotic languages (r = 0.777, p = 1.40 × 10^−51). Thus, the genomic data support the linguistic hypothesis that the Omotic languages are not part of the Afroasiatic family (ref. 42).

Amerindian ancestry correlates with Tupian, Arawakan, Quechumaran, Mayan, and Uto-Aztecan languages (r = 0.962, p = 6.17 × 10^−142, Fig. 3B), consistent with the hypothesized grouping of all these languages in the Amerind family (ref. 43). Circumpolar ancestry correlates with both the Eskimo-Aleut and Chukotko-Kamchatkan language families (r = 0.799, p = 1.41 × 10^−56, Fig. 3C), which collectively are known as Paleo-Siberian languages. The Athabask sample showed 64% Amerindian, 34% Circumpolar and 2% Northern Asian ancestry; accordingly, the Na-Dené language correlates with both Amerindian and Circumpolar ancestries but not with Northern Asian ancestry. Northern Asian ancestry correlates with Mongolic, Turkic, and Tungusic languages (r = 0.617, p = 1.53 × 10^−27), which have been grouped into the Altaic language family. Additionally, Northern Asian ancestry correlates with the Samoyedic branch of the Uralic family, Yukaghir languages, the Mari language isolate, and Yeniseian languages (r = 0.781, p = 2.53 × 10^−52, Fig. 3D).

Southern European ancestry correlates with both Italic and Basque speakers (r = 0.764, p = 6.34 × 10^−49). Northern European ancestry correlates with Germanic and Balto-Slavic branches of the Indo-European language family as well as Finno-Ugric and Mordvinic languages of the Uralic family (r = 0.672, p = 4.67 × 10^−34). Italic, Germanic, and Balto-Slavic are all branches of the Indo-European language family, while the correlation with languages of the Uralic family is consistent with an ancient migration event from Northern Asia into Northern Europe (ref. 39). Kalash ancestry is widely spread but is the majority ancestry only in the Kalash people (Table S3). The Kalasha language is classified within the Indo-Iranian branch of the Indo-European language family.

South Indian ancestry correlates with the Dravidian language family, the Munda branch of the Austroasiatic language family, and Nihali, which has been alternatively classified as part of the Munda branch or as an isolate (r = 0.740, p = 2.03 × 10^−44). Southern Asian ancestry correlates with the Indo-Iranian branch of the Indo-European language family as well as the Dravidian language family (r = 0.678, p = 7.96 × 10^−35). Sino-Tibetan ancestry correlates with the Sino-Tibetan language family as well as with Monguor and Mongolic (r = 0.793, p = 3.83 × 10^−55). Southeastern Asian ancestry correlates with the Mon-Khmer branch (specifically, Khmer and Vietic but not Khasi languages) of the Austroasiatic language family, the Tai-Kadai language family, and the Hmong-Mien language family (r = 0.686, p = 5.36 × 10^−36). Japanese ancestry correlates with the Japonic language family (r = 0.644, p = 1.55 × 10^−30). Oceanian ancestry correlates with the Austronesian and Papuan language families (r = 0.954, p = 3.36 × 10^−131). Western Asian ancestry correlates with Northeast Caucasian, Northwest Caucasian, and Kartvelian language families as well as the Armenian branch of the Indo-European language family (r = 0.522, p = 8.31 × 10^−19).