Abstract Recent arguments connecting Na-Dene languages of North America with Yeniseian languages of Siberia have been used to assert proof for the origin of Native Americans in central or western Asia. We apply phylogenetic methods to test support for this hypothesis against an alternative hypothesis that Yeniseian represents a back-migration to Asia from a Beringian ancestral population. We coded a linguistic dataset of typological features and used neighbor-joining network algorithms and Bayesian model comparison based on Bayes factors to test the fit between the data and the linguistic phylogenies modeling two dispersal hypotheses. Our results support that a Dene-Yeniseian connection more likely represents radiation out of Beringia with back-migration into central Asia than a migration from central or western Asia to North America.

Citation: Sicoli MA, Holton G (2014) Linguistic Phylogenies Support Back-Migration from Beringia to Asia. PLoS ONE 9(3): e91722. https://doi.org/10.1371/journal.pone.0091722 Editor: David Caramelli, University of Florence, Italy Received: August 14, 2013; Accepted: February 15, 2014; Published: March 12, 2014 Copyright: © 2014 Sicoli and Holton. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: Financial support was provided by Alaska EPSCoR for an RA in Mark Sicoli's lab. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction The aboriginal populations of America and Asia are linked through prehistoric migrations via the Bering Land Bridge. Our understanding of these migrations has been derived primarily from archaeological and biological data rather than from linguistics, as most migrations preceded the generally accepted 8–10,000-year limit of the traditional comparative method of historical linguistics [1], [2]. DNA evidence supports at least three migrations: the earliest, 15–40,000 BP, is referred to generically as the Paleoindian and is associated with the greatest distribution of language and cultural groups across North, Meso, and South America; the second, 12–14,000 BP, is the Na-Dene, distributed in North America from Alaska to the Pacific Northwest and from Canada to the U.S. Southwest; and the third, ca. 9000 BP, is Eskimo-Aleut, with circumpolar distribution [3], [4]. Linguists have classified Eskimo-Aleut and Na-Dene as separate language stocks, and the rest of the languages of the Americas as belonging to numerous stocks, but have otherwise been mostly silent on questions that connect Asian and American populations because, with the exception of Eskimo-Aleut, the dates of these earlier connections lie beyond the traditionally accepted limit for comparative reconstruction. Linguistic claims of more distant relationships have relied instead on the more controversial method of mass (or multilateral) comparison of lexical items subjectively judged as similar [5]. Using such methods, a Dene-Yeniseian (DY) connection linking Asia to North America has been suggested for nearly 100 years [6], but only recently has a stronger case been made using methods of linguistic reconstruction [7]; peer reviewers have received that case with cautious optimism while urging alternative methods for its evaluation [8], [9].
The hypothesis of a DY language family prompted claims of proof for the origin of Native Americans in central or western Asia [5], the relationship fitting into a popular narrative for the peopling of the Americas. Our goal here is not to address the validity of the Dene-Yeniseian hypothesis nor the type of linguistic data used to support it. Rather, we address the questions of what it means for migration theories if the DY connection is true and how we can rigorously test hypotheses relating linguistic dispersals with population migrations. We show that Bayesian analysis and neighbor-joining network modeling applied to linguistic datasets provide new insight into the implications of the DY hypothesis. We use typological data to infer linguistic phylogenies that test two dispersal hypotheses. First, Ruhlen’s conjecture that “the origin of the Yeniseian-Na-Dene population can plausibly be traced to West Asia” [5], and second, that a relationship between Yeniseian and Na-Dene represents radiation out of Beringia. We use Bayesian model comparison based on Bayes factors [10] to test the fit between the linguistic phylogenies modeling the two dispersal hypotheses. Our results support an argument that, if the Dene-Yeniseian connection is true, it more likely reflects radiation out of Beringia with both eastward migrations into North America and westward migration into Asia rather than a unidirectional migration from Asia to North America.

Materials and Methods In the last decade, computational phylogenetic tools developed primarily in evolutionary biology have been incorporated into the field of historical linguistics, bringing new methods to bear on questions of prehistoric migrations [11], [12], [13], language contact [14], language classification [15], [16], and language universals [17], [18], thereby potentially pushing the upper limit of historical linguistic inference into the Terminal Pleistocene [19], [20], [21]. Greenhill and Gray [12] advocate the use of a phylogenetic framework to test how linguistic data match migration hypotheses, observing that without such rigorous testing migration scenarios “are little more than plausible narratives.” They argue for the use of Bayesian likelihood modeling over parsimony and use Austronesian lexical cognate sets to test between competing dispersal hypotheses for the Austronesian expansion throughout the Pacific. The use of lexical cognate data closely aligns with data used to infer family relationships in the traditional comparative method of historical linguistics, and the relatively shallow time depth of Austronesian expansion makes lexical cognate data appropriate for Greenhill and Gray’s study. However, lexical cognates can be problematic at deeper time depths, where lexical retention is low, and for families that have undergone extensive lexical borrowing.
Wichmann and Saunders [22] review data and methods and propose that “[i]f one goal of linguistic phylogenetics is to infer more ancient relationships than those distinguishable by words alone, typological data may be the only choice.” Dunn and his collaborators [19], [20] pioneered the use of typological databases in modeling evolutionary history using parsimony methods to argue that a trace of phylogenetic signal is detectable from typological data of Papuan languages reflecting a time period in which Australia and New Guinea were joined by a land bridge in the late-Pleistocene continent Sahul. The use of typological data was motivated for Papuan because of the lack of retention of lexical cognates. In contrast, our motivation for using typological data in examining the prehistory of Na-Dene is an abundance of close cognates and inconsistency among isoglosses that have been argued to reflect a long history of lexical borrowing through language contact among related languages [23]. Our focus on typology specifically also takes up the challenge of using alternative methods to consider the position of Yeniseian within the proposed Dene-Yeniseian family which has been otherwise inferred primarily on the basis of lexicon and templatic morphology [7]. The abundance of cognates within Na-Dene presents a challenge when comparing the linguistics with the archaeology. Estimates of time-depth based on lexical comparison are less than 8500 years [24], but the archaeology of Alaska shows temporal horizons well beyond 10,000 years with striking technological continuities with the historically known Na-Dene populations [25]. We applied both Bayesian likelihood modeling and a neighbor joining distance method in evaluating typological features of DY, using a binary coding schema that indicates the presence or absence of phonological and morphological features. Unknown features for a taxon were coded with a question mark. 
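The binary coding described above can be illustrated with a minimal Python sketch. It assumes a toy matrix in which each taxon is a string of character states ('1' for presence, '0' for absence, '?' for unknown) and computes a simple pairwise mismatch proportion over the characters known in both taxa. The taxon names and states are hypothetical and are not drawn from the study's data, and the distance shown is only one plausible input for a distance method; the distance transformation actually applied by SplitsTree4 may differ.

```python
from itertools import combinations

# Hypothetical toy matrix (NOT the study's data): taxon -> character states.
# '1' = feature present, '0' = feature absent, '?' = unknown.
TOY_MATRIX = {
    "LangA": "1101?",
    "LangB": "10011",
    "LangC": "0?010",
}


def mismatch_distance(a, b):
    """Proportion of differing states over characters known in both taxa.

    Returns None when the two taxa share no known characters.
    """
    if len(a) != len(b):
        raise ValueError("taxa must be coded for the same number of characters")
    # Keep only the positions where neither taxon is coded as unknown.
    comparable = [(x, y) for x, y in zip(a, b) if x != "?" and y != "?"]
    if not comparable:
        return None
    return sum(x != y for x, y in comparable) / len(comparable)


def distance_table(matrix):
    """Pairwise distances keyed by alphabetically ordered taxon pairs."""
    return {
        (s, t): mismatch_distance(matrix[s], matrix[t])
        for s, t in combinations(sorted(matrix), 2)
    }


if __name__ == "__main__":
    for pair, dist in distance_table(TOY_MATRIX).items():
        print(pair, dist)
```

Skipping positions with a question mark keeps missing data from counting as either agreement or disagreement, which is one common convention rather than the only one.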
Our data matrix consists of 116 characters for 40 taxa: 2 Yeniseian languages (Ket-Kott), 37 Na-Dene (Tlingit-Eyak-Athabascan) languages, and the isolate Haida included for its potential as an outgroup. The characters we coded for were based on categories represented in Joel Sherzer’s An areal-typological study of American Indian languages north of Mexico [26], with some expansion to include more contrasts between Yeniseian and Na-Dene. Na-Dene character values were first determined from the Sherzer monograph, then checked against other published and unpublished sources in the Alaska Native Language Archive and revised where more current data was available. Yeniseian language character values were determined from a published grammar for the extinct Kott [27], and published grammars for Ket [27], [28] with the Ket coding checked by a Yeniseian specialist. Uncertainty was coded with a question mark. Of the 116 characters, 26 were excluded as uninformative—either all lacking a feature or, to a lesser degree, all possessing a feature—leaving 90 informative characters. Supporting Information for this paper includes the list of features coded as characters (File S1) and the nexus file containing the data matrix (File S2). The neighbor joining analyses used the NeighborNet algorithm of SplitsTree4 [29], an agglomerative clustering algorithm that constructs a splits graph by iteratively combining taxa clusters given the character agreement and disagreement. The Bayesian analysis used the Markov Chain Monte Carlo (MCMC) method implemented in MrBayes [30]. We compared the models using multiple methods of harmonic mean estimation and marginal likelihood scores calculated by the stepping-stone method available through the MrBayes software from which Bayes factor values could be compared. We summarized the MCMC results of the most likely model through both a consensus tree and a consensus network.
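The exclusion of uninformative characters can be sketched in a few lines of Python. The sketch assumes the same string-per-taxon representation as a binary matrix with '?' for unknown states, and it treats a character as uninformative when every known state is identical across taxa. How unknown states were handled when the study's 26 characters were excluded is not stated, so ignoring them here is an assumption, and the toy data are hypothetical.

```python
def informative_columns(matrix):
    """Indices of characters whose known states vary across taxa."""
    rows = list(matrix.values())
    n_chars = len(rows[0])
    keep = []
    for i in range(n_chars):
        # Collect the distinct known states at this character position.
        known = {row[i] for row in rows if row[i] != "?"}
        if len(known) > 1:
            keep.append(i)
    return keep


def drop_uninformative(matrix):
    """Return a new matrix containing only the varying characters."""
    keep = informative_columns(matrix)
    return {
        taxon: "".join(row[i] for i in keep)
        for taxon, row in matrix.items()
    }


if __name__ == "__main__":
    # Hypothetical toy data: character 0 is absent everywhere and
    # character 1 is present wherever it is known, so both are dropped.
    toy = {"A": "0101", "B": "0?11", "C": "0100"}
    print(informative_columns(toy))
    print(drop_uninformative(toy))
```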

Discussion Regardless of the ultimate fate of the DY hypothesis, our work demonstrates the utility of using computational phylogenetic tools to explore the implications of proposals for deep linguistic relationships. While the focus of attention on the DY hypothesis has centered on the potential existence of a linguistic connection between Asia and America, the work described here focuses instead on the implications of such a connection for human migration. Those implications can in turn be compared with evidence from the complementary fields of archaeology and biology. Should the DY hypothesis hold true, our application of computational phylogenetic methods supports an Out-of-Beringia population dispersal (Fig. 4) rather than the Out-of-Central/Western-Asia dispersal proposed by Ruhlen [5]. Bayesian comparison of models using Bayes factors based on marginal likelihood calculations provides no support for the Out-of-Central/Western-Asia hypotheses modeled by a taxonomic constraint that places Yeniseian as diverging early from a Na-Dene clade. Rather, the phylogeny with the strongest Bayes factor supports an early radiation from the center of the geographical distribution of the language family [37] in Beringia with migrations dispersing populations both along the North American Coast and back into Siberia, and subsequently population chains into the North American interior (Fig. 4). While we propose the first linguistically grounded argument for radiation out of Beringia, Tamm et al. [38] have proposed a strikingly parallel set of claims using mtDNA markers to argue for a “Beringian Standstill” before both a rapid early coastal migration into North America and back-migrations from Beringia into Asia. 
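For readers unfamiliar with the arithmetic behind a Bayes factor comparison, the following Python sketch shows how two log marginal likelihoods, such as those produced by a stepping-stone run, can be turned into a 2 ln BF value. The numeric values are placeholders invented for illustration and are not results from this study. The verbal labels follow the widely used Kass and Raftery (1995) scale, which is assumed here for convenience and may not be the scale applied in the analysis.

```python
def two_ln_bayes_factor(log_ml_a, log_ml_b):
    """2 ln BF for model A over model B, from natural-log marginal likelihoods."""
    return 2.0 * (log_ml_a - log_ml_b)


def interpret(two_ln_bf):
    """Verbal label on the Kass and Raftery (1995) scale (assumed here)."""
    if two_ln_bf < 0:
        return "favors the other model"
    if two_ln_bf <= 2:
        return "not worth more than a bare mention"
    if two_ln_bf <= 6:
        return "positive"
    if two_ln_bf <= 10:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    # Placeholder log marginal likelihoods, NOT values from the study.
    log_ml_model_a = -1000.0
    log_ml_model_b = -1004.0
    score = two_ln_bayes_factor(log_ml_model_a, log_ml_model_b)
    print(score, interpret(score))
```

Working on the log scale avoids numerical underflow, since the marginal likelihoods themselves are far too small to represent directly.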
Here, using linguistic data independent of archaeology or biology, we have contributed to a theory of population dispersal that, while not contradicting the popular narrative of pedestrian hunters entering the New World through Beringia, complicates it with the insight that this was not a one-way trip. There are several clear directions for future work. First, it would be desirable to expand on the typological data set by adding more characters. The findings we have discussed here are based on fewer than 100 informative characters, and we expect that additional data would make model comparison more robust. Such an expansion is challenging, because many of the Na-Dene and Yeniseian languages are extinct or endangered, which makes it difficult or impossible to expand the dataset evenly. Moreover, the radically templatic character of Na-Dene morphology complicates typological categorization. Another potential for further research is to bring lexical data in where possible. Using a small number of lexical characters, Wichmann et al. [39] report more tree-like delta scores for Na-Dene and Yeniseian separately than we find for the combined DY network based on typological characters. This suggests that lexical characters may provide additional insights into the structure of the DY network. We are currently building a lexical dataset as well and plan to create a partitioned data matrix that could model both lexical and typological data together. Currently, though, we do not have lexical data for as many languages as we have typological data. Finally, there are implications for future work beyond the question of the DY connection. Our modeling has also generated several hypotheses regarding the dispersal of Na-Dene speakers across Coastal and Interior North America, developing inquiry in historical linguistics with new methodologies that contribute a uniquely linguistic perspective on questions of prehistory.

Acknowledgments We thank Michael Krauss for comments on early presentations of this work, Edward Vajda for contributing data on Ket and for comments on an early presentation of this work, Ben Potter for the background of the map in Fig. 4 and for comments on an early presentation of this work, Michael Dunn for helpful comments on a draft of this paper, Brendon Fuhs and Margaret Randsell-Green for research assistance, and the PLoS ONE reviewers and editorial staff.

Author Contributions Conceived and designed the experiments: MAS. Analyzed the data: MAS. Wrote the paper: MAS GH. Conceived the typological study: MAS. Developed the data matrix: MAS. Coded data from published sources: MAS. Verified and contributed Na-Dene coding based on unpublished sources in the Alaska Native Language Archive and Yeniseian from published grammars: GH. Generated the figures: MAS.