Supertree construction

Altogether 257 source trees (obtained by using both distance-based and character-based methods) and 44 admixture plots from 200 published studies (1990–2014) contributed to the resulting supertree dataset. They included trees based on genomic data, including both genome-wide allele frequency data and whole-genome sequences (51 trees from 33 studies), genetic trees based on autosomal data (26, 19), Y-chromosomal data (9, 9), mtDNA (25, 20), human leukocyte antigen (HLA) system (75, 57), “classical polymorphisms” (27, 8), language trees based on lexical or structural data (44, 33), admixture plots based on genomic data (43, 36) and one admixture plot based on linguistic structural data.

The resulting supertree dataset (unpublished) included 973 populations and 5 great apes or archaic hominins that featured in the source trees (see Supplementary methods). Two datasets were then created based on restricted samples of this dataset. The first dataset consisted of 186 populations and included all world regions and major linguistic groups that are reasonably well represented throughout the source trees (“representative dataset” hereafter) (Supplementary Table S2). The second dataset consisted of 52 populations from the Human Genome Diversity Project (HGDP) panel3 that are best represented throughout the source trees, plus three additional populations to represent Australia, Micronesia and Polynesia (“HGDP dataset” hereinafter).

To investigate robustness of the inferred supertree topology, we used a method inspired by the “sensitivity analysis” of Wheeler27. The analysis was carried out by successively reweighting and rerooting the data partitions, adjusting an effect of different data partitions on the resulting supertree topology. In this study, a sensitivity analysis has been used for the first time for supertree inference. We used 16 sets of parameters for both representative and HGDP samples, based on combinations of four weighting and four rooting schemes (see Methods).

Sixteen sets of the most parsimonious (MP) trees, recovered in a sensitivity analysis of the representative dataset, were analyzed using the IterPCR method28,29 in order to identify unstable (“wildcard”30) taxa which cause large polytomies in the supertree, hampering the interpretation of phylogenetic results (see Methods). Four wildcards that decreased resolution of the supertree by five or more nodes (see below) were excluded from the dataset and the pruned version of the representative dataset (182 populations) was used for subsequent analyses.

Supertree topologies and topological stability

Given the expected conflict across different types of data, the resulting supertree topologies based on the representative dataset (Fig. 1a–d and Supplementary Figs S1–S16) were surprisingly well-resolved (Supplementary Table S3).The parameter set 1.A maximizes congruence between data partitions, providing the shortest supertree with the highest CI and RI values (Supplementary Table S3). The resulting supertree topologies are, overall, robust to parameter set changes. Similarity of the resulting supertrees measured by subtree prune and regraft (SPR) distances is 99–74% (Supplementary Table S4a,b). The contribution of admixture plots to the resulting supertree topology was relatively small. The topology of the combined supertree based on parameter set 1.A, where the effect of admixture plots was maximized, was 85% similar to the parameter set 2.B where the effect of admixture plots was minimized while all other data partitions played an equal role (Supplementary Table S4a,b; Supplementary Figs S1 and S6). The admixture plots alone provided a more symmetrical topology with three superclades: the basal African, followed by the N African–W Eurasian and the Eastern superclade (Supplementary Fig. S19). However, the hierarchical clustering of populations in admixture plots is not always comparable to the order of branching events in human population history. The early divergence of some populations (e.g., Hadza, Dogon, Basque and Tibetan) implied by the admixture plots could reflect isolation and random genetic drift rather than early divergence. Some relationships probably reflect relatively recent admixture (e.g., Bantu populations of S Africa4).

Figure 1 (a) Semistrict consensus supertree of 186 human populations (outgroups not shown) based on the representative dataset and parameter set 1.A of the sensitivity analysis (all data partitions were weighted equally and all sources were considered rooted). SOUTH BANTU = Ndebele + Swati + Xhosa + Zulu (often occurred as a composite population in the source trees); AUSTRALIAN consists of Australian Aboriginal populations of unspecified ethnic origin. The color code corresponds to the recovered monophyletic or paraphyletic groups of populations. The wildcard taxa (Qatari, Andamanese, Malagasy, Dayak Ngaju) are displayed (in gray) in the most basal of all positions they acquired when included into the dataset, but were not taken into account when assessing node and group support. The circles indicate presence of the nodes in the strict (white) and semistrict (gray) consensus of 16 supertrees derived from the sensitivity analysis (a circle is absent if the respective node is absent even in the semistrict consensus). The analysis space plots (square grids) describe presence of the selected clades/groups in the supertree under individual parameter sets as either: a monophyletic clade (white); a paraphyletic group or an unresolved section compatible with monophyly or paraphyly (gray); a polyphyletic assemblage (black). Completely white grids (=the group present under all parameter sets) are substituted by small white squares. (b, c) Alternative topology for the N African–W Eurasian assemblage and the Sahul–Oceanian clade as recovered in parameter set 2.C. (d) Alternative topology for the E Asia clade as recovered in parameter sets 4.A–4.D. The nodes where the alternative topologies (b, c, d) begin in the supertree 1.A (a) are denoted by asterisks. (e) Geographic locations of 186 human populations plotted on the world map using QGIS v.2.8 (the color code corresponds to the trees). Full size image

In all 16 topologies provided by the representative dataset (Fig. 1 and Supplementary Figs S1–S16) sub-Saharan Africa is located nearest to the root of the tree, followed by N Africa, the Near East, Europe, S and C Asia, Oceania, E Asia and America. The general branching order is largely consistent with the previously published global human population-level phylogenies, despite major differences in sampling and phylogenetic inference methods used3,4,12,31. All 16 topologies based on representative dataset agreed upon the most basal position of S African Khoisan followed by C African Pygmies and the clade consisting of Fulani Afro-Asiatic (Cushitic) speaking populations as a sister group to Niger-Congo speaking populations (including Bantu). The next paraphyletic section of the supertree included Niger-Congo (Bantu), Nilo-Saharan and Afro-Asiatic (Cushitic, Omotic and Semitic) peoples and the click-speaking Hadza and Sandawe hunter-gatherers of E Africa, The clustering of Chadic speaking populations of C Africa with Niger-Congo speaking populations of this region rather than with Afro-Asiatic speaking populations of E Africa was consistent with the previously published genomic trees3,4. So was the position of Hadza and Sandawe within the ethno-linguistically heterogeneous E African section of the supertree4,12. This section was basal to the monophyletic clade including N African and Eurasian peoples. The latter consisted of the largely unresolved N African–W Eurasian assemblage (N African, Near Eastern, European and S Asian peoples) and the consistently monophyletic Eastern superclade (Sahul–Oceanian, E Asian and Beringian–American peoples). The most remarkable differences between individual topologies derived from different parameter sets concerned W Eurasia, Mainland and Island SE Asia and Oceania and E Asia. In the N African–W Eurasian assemblage, there are highly unstable relationships among its constituent sections (N Africa, Near East, Europe and S Asia), most of which tend to be para- or even polyphyletic (Fig. 1a,b). In Mainland and Island SE Asia and Oceania, different parameter sets imply a different source of the expansion into the area (either from Taiwan or from Malay Peninsula) and a varying degree of admixture of Austronesians with Sahul–Melanesian peoples (Fig. 1a,c). In E Asia, there is an unstable relationship between populations of E and SE Asia and a highly unstable position of some Siberian peoples (Evenki, Ket, Yukaghir) who were either recovered at the basal position within the E Asian clade or within the Beringian–American clade (Fig. 1d).

Sensitivity analysis of the HGDP dataset produced three distinct topologies (Fig. 2). They were, for the most part, congruent with the supertrees based on the representative dataset, although they included a few clades that were not recovered in the representative-dataset supertrees. The topology recovered under parameter set 1.A (Fig. 2a), was the most symmetrical and included monophyletic superclades as follows: sub-Saharan African (with Khoisan–Pygmy and Bantu–E African subclades), N African–W Eurasian (with S Asian, N African–Near Eastern and European subclades) and Eastern (with Sahul–Oceanian, American and E Asian subclades). The topologies recovered under parameter sets 2.A–4.C (Fig. 2b) were fully compatible with the representative-dataset supertrees and in agreement with other studies using similar population samples3,31, regardless of the tree-building techniques used. In the topology recovered under parameter set 4.D (Fig. 2c), the “Oceania” clade situated in the base of the Eastern superclade in most supertrees, was recovered as polyphyletic. The Sahul–Melanesian subclade remained basal to the rest of the Eastern superclade, while the Micronesian–Polynesian (“Remote Oceanian”) subclade was deeply nested within the E Asian populations.

Figure 2 (a) Semistrict consensus supertree of 55 human populations based on HGDP dataset and parameter set 1.A of the sensitivity analysis. Populations were renamed to correspond to those used in the HGDP panel. BANTU = Kikuyu; POLYNESIAN = Samoan + Maori; MICRONESIAN = Kosraean; MELANESIAN = Naasioi; PAPUAN = Goroka; COLOMBIAN = Piapoco + Curripaco. The color code corresponds to Fig. 1. (b) Frequency-differences consensus of 14 supertrees based on parameter sets 1.B–4.C of the sensitivity analysis. (c) Semistrict consensus supertree based on parameter set 4.D of the sensitivity analysis. The geographic color code corresponds to Fig. 1. Full size image

The most important point of conflict among the alternative supertree topologies thus concerned the position of Sahul–Melanesian and Micronesian–Polynesian peoples. Phylogenetic affinities of Sahul–Melanesian peoples varied greatly between the source trees. While in multiple studies, Sahul–Melanesia was placed basally, often as a sister-group to E Eurasia as a whole3,4,12, in others they were nested deeply within SE Asia31,32,33. These topological conflicts reflect the complex population history of Island SE Asia, from early “out-of-Africa” migration via the “southern route”34 through later interactions with Mainland SE Asia9,35 up to the putative “express-train” migration of the Austronesian speakers from Taiwan via the Philippines, Greater and Lesser Sunda Islands and Melanesia to Micronesia and Polynesia36,37,38. The phylogenetic placement of Sahul–Melanesia is further complicated by possible gene flow from India to Australia around the mid-Holocene39.

The supertree topology is notably pectinante in agreement with the previously published global human population-level phylogenies3,12,31. There were just a few apparent major radiations, namely, Bantu and related sub-Saharan populations (Fig. 1a), European or W Eurasian (Fig. 1a,b), SE Asian–Oceanian (with or without the Sahul–Melanesian peoples) (Fig. 1a,c), E Asian (Fig. 1a,d) and Beringian–American. Individual small clades or even individual terminal taxa tended to branch off from the major migration route in E Africa, Near East and S Asia. This topology is consistent with a serial founder effect model, which suggested that human populations have remained in the locations they first colonized after the out-of-Africa expansion, exchanging migrants only at a low rate with their immediate neighbors, until the long-range migrations began to happen.

Wildcard taxa

Twenty-four populations, either terminal taxa or small clades, were identified as wildcards in topologies recovered under one or more parameter sets of the sensitivity analysis (Supplementary Table S5). The populations responsible for the greatest loss of resolution (5 nodes or more) throughout the sensitivity analyses were Andamanese (a wildcard taxon in 14 parameter sets, decreasing resolution by 1–21 nodes; see below), Malagasy (12: 3–23; see below), Dayak Ngaju (2: 1 and 10; identified as either Island or Mainland SE Asians) and Qatari (2: 21; highly unstable position within N African–W Eurasian section of the supertree).

The unstable position of some populations provides clues about conflicts within the dataset, which reflects either the paucity of data or complex population history of the peoples in question. For example, the unstable position of Malagasy reflects a relatively recent (ca. 1,200 ya) migration of Austronesian-speaking people across the Indian Ocean, followed by admixture with E Africans. While linguistic evidence places Malagasy language within Barito group of W Malayo-Polynesian (Austronesian) languages36, Malagasy population exhibit genetic affinities to both SE Asian and E African populations40.

The case of Andaman Islanders is much more complicated. They were recovered either as Sahul–Melanesians or S Asians, or at the base of E Asia under different parameter sets (Supplementary Figs S17 and S18). Position of Andamanese within the Sahul–Melanesian clade is based on the analysis of structural features of language using a Bayesian clustering algorithm38. Initial genetic studies suggested that Andamanese are descendants of an early “out-of-Africa” migration41, while later studies proposed a more recent S or E Indian origin42. Recent studies agree that Andamanese represent an isolated, relatively basal lineage, with possible genetic affinities to both Sahul–Melanesia and S Asia33,39. The relatedness of Andamanese to Sahul–Melanesians, particularly the Papuans, has recently been substantiated also by genomic data43,44.

Assessment of gene–language coevolution

The question of coevolution of genes and languages is considered fundamental but rarely studied by formal phylogenetic methods. Although the genetic and linguistic evolution may often be correlated, the assumption of direct coevolution between genes and languages is evidently misleading45. Evolutionary processes shaping genetic diversity are not directly analogous to those shaping linguistic diversity46 and, consequently, genetic and linguistic data often imply different historical scenarios47.

The supertree dataset included 45 linguistic source trees and one linguistic admixture plot from 34 studies (compared to 213 genetic/genomic source trees and 43 genomic admixture plots from 170 studies). The 535 (~9%) parsimoniously informative characters based on these sources contributed only marginally to the resulting supertree topology (Supplementary Fig. 21). In fact, only a few language families have so far been analyzed phylogenetically and hence numerous areas of the supertree included no linguistic data at all. Only a few small clades were supported by linguistic characters. The supertree topology was, in general, dominated by the genetic/genomic data.

In order to test for monophyly of the proposed linguistic macrofamilies we created two datasets based on formal linguistic classifications (Supplementary Table S6) to be both optimized on and to constrain, the topology of the supertree based on the representative dataset. The first dataset was based on linguistic classification in Ethnologue48 on the level of language families (“Ethnologue” hereinafter). The second dataset (“Greenberg–Ruhlen” hereinafter) included additional characters based on linguistic classification by Ruhlen49 on the level of linguistic macrofamilies and by Greenberg & Ruhlen50 on the level of linguistic stocks within the Amerind macrofamily. The hunter–gatherer populations, who speak the languages of neighboring agriculturalist or pastoralist groups as a result of a relatively recent language shift (C African Pygmies, “Negritos” of Malaysia and Philippines) were not scored for linguistic characters (see Methods).

Optimization of the datasets based on linguistic classification on the representative-dataset supertree showed rather poor fit of the classification on the supertree topology (Ethnologue and Greenberg–Ruhlen datasets’ CI’s were 0.27 and 0.25, respectively, for the purely genetic and 0.31 and 0.28, respectively, for the combined supertree). Within Ethnologue dataset, the best fitting language families were Austronesian, S African Khoisan, Afro-Asiatic (especially Semitic languages) and Indo-European (Supplementary Table S7a). Within Greenberg–Ruhlen dataset, the macrofamily which is by far the most consistent with the supertree topology is Amerind, followed by Austric (and its constituent language families Austronesian and Austroasiatic), Afro-Asiatic, Khoisan and Indo-Hittite. We do not consider the good fit of Amerind on the supertree topology as support for the Amerind hypothesis21, but rather a consequence of geographic and genetic coherence of the presumably Amerind-speaking populations. The linguistic stocks within the Amerind macrofamily are not consistent with the supertree topology (Supplementary Table S7b). The other controversial linguistic macrofamilies, such as Eurasiatic/Nostratic, Macro-Altaic, Dene–Yeniseian and Dene–Caucasian, fitted poorly on the supertree topology. The poor fit of Macro-Altaic and the families that constitutes it (especially the Turkic) is in agreement with the fact that there is only a weak unifying genetic signal for the Turkic-speaking populations across Eurasia51. The expansion of Turkic languages has probably been largely mediated by language replacements rather than demic expansion.

The supertrees constrained by Ethnologue (Fig. 3) and Greenberg–Ruhlen datasets (Supplementary Fig. S22) summarized relationships between groups of populations speaking related languages based on genetic data (congruence between the purely genetic supertree and the supertree constrained by the Greenberg–Ruhlen dataset is illustrated by a tanglegram and by “anticonsensus” trees; Supplementary Figs S22–S24). The Ethnologue dataset included only those language families that are relatively non-controversial and the supertree constrained by Ethnologue classification (Fig. 3) is in many respects similar to the combined supertree, although it is more symmetrical. Importantly, it includes several monophyletic clades that are not based on the linguistic topological constraint used. In sub-Saharan Africa, there is a clade including peoples speaking S African Khoisan, Niger-Congo and Chadic languages. There is also a large clade including peoples speaking Afro-Asiatic (excl. Chadic), Dravidian, Uralic and Indo-European languages, a group roughly coextensive with the hypothesized Eurasiatic/Nostratic macrofamilies17,18,19; however, the Altaic, Chukotko-Kamchatkan and Eskimo-Aleut-speaking peoples (that have been hypothesized to belong to the Eurasiatic/Nostratic macrofamily as well) are not closely related genetically. The relationships between individual language families of the hypothesized Eurasiatic macrofamily17,18 (i.e., Dravidian, Kartvelian, Indo-European, Uralic, Altaic, Chukchi-Kamchatkan and Eskimo-Aleut) are largely consistent with those inferred in by Pagel et al.15. There is, however, no support for monophyly of the hypothesized Eurasiatic languages as a whole (Supplementary Table S7c and Supplementary Fig. S22a). Another large clade includes Altaic, Austric (excl. Austronesian) and Sino-Tibetan languages, together with Korean–Japanese and Ainu. Whereas Ainu was considered related either to Eurasiatic17,18,49 or Austric16 macrofamily, Korean and Japanese were seen as distant relatives of Altaic languages49. On the contrary, the close relationships between Korean–Japanese–Ainu and Sino-Tibetan peoples have no linguistic basis. The clade which includes Australian, Papuan, Melanesian and Andamanese populations is somewhat reminiscent of the controversial Indo-Pacific macrofamily20. The Na-Dene, Eskimo–Aleut and Chukotko-Kamchatkan populations are closely related to Amerind, a connection that has no linguistic basis. Within America, there is a conspicuous basal placement of populations of the Southern Cone (Andean languages according to Greenbeg–Ruhlen; Supplementary Fig. S22b), which could be indicative of an early western route used during the initial colonization of Americas.

Figure 3 The supertree constrained by Ethnologue classification. White circles indicate topological constraints. Grey circles indicate an unconstrained taxon or clade (usually a language isolate) recovered within a constrained one. The geographic color code corresponds to Fig. 1. Full size image

The hunter-gatherer groups that were deliberately not scored for linguistic characters were recovered outside of the clades they belong to based on their linguistic affiliation. C African Pygmies were recovered as a sister group to S African Khoisan (or within the Khoisan family in the supertree constrained by the Greenberg–Ruhlen classification; Supplementary Fig. S22a), providing evidence for shared ancestry among these geographically diverse groups of hunter-gatherers4. E African Hadza (but not Sandawe) were recovered at a basal position within Africa just above the S African Khoisans and C African Pygmies. “Negritos” of Malaysia were placed as a sister group to the whole Eastern superclade and “Negritos” of Philippines were recovered as a sister group to Austronesian language family (or within the clade including languages of Australia and Indo-Pacific languages in the supertree constrained by the Greenberg–Ruhlen classification; Supplementary Fig. S22a), providing evidence for the hypothesis that the “Negrito” populations represent the descendants of the early migration into the area34, with lasting genetic affinities to Sahul–Melanesia39,43,44.

Interestingly, when the supertree topology was constrained to include linguistic-compatible clades, as if the language families were indeed monophyletic, it tended to form more inclusive clades, which are more or less compatible with the proposed linguistic macrofamilies (Eurasiatic/Nostratic, Indo-Pacific and especially Amerind). It is possible that the “macrofamilies” are consistent genetically and geographically rather than linguistically; however, the possibility that historical linguistics is able to reconstruct the most basal relationships between modern human populations should be re-assessed critically6.

The supertree can provide a robust framework for studies concerning evolution of culture52. Such a framework is needed because most cross-cultural comparative studies published to date used language phylogenies5,7. Although language phylogenies provide an excellent proxy for population histories in some regions (e.g., Remote Oceania), this is not universally the case45,47. Linguistic data seems to be unable to provide a global tree of human populations due to a limited timescale over which linguistic inference is possible6. On the other hand, genetic phylogenies, although global, could be unsuitable for studies of cultural evolution, as the population history they inform of can be older than the cultural traits under investigation. A time-calibrated supertree, incorporating all time “strata” of human evolution (and informed by ancient DNA), is needed to elevate the studies of cultural evolution to a global level.