Abstract The Turkic peoples represent a diverse collection of ethnic groups defined by the Turkic languages. These groups have dispersed across a vast area, including Siberia, Northwest China, Central Asia, East Europe, the Caucasus, Anatolia, the Middle East, and Afghanistan. The origin and early dispersal history of the Turkic peoples is disputed, with candidates for their ancient homeland ranging from the Transcaspian steppe to Manchuria in Northeast Asia. Previous genetic studies have not identified a clear-cut unifying genetic signal for the Turkic peoples, which lends support for language replacement rather than demic diffusion as the model for the Turkic language’s expansion. We addressed the genetic origin of 373 individuals from 22 Turkic-speaking populations, representing their current geographic range, by analyzing genome-wide high-density genotype data. In agreement with the elite dominance model of language expansion most of the Turkic peoples studied genetically resemble their geographic neighbors. However, western Turkic peoples sampled across West Eurasia shared an excess of long chromosomal tracts that are identical by descent (IBD) with populations from present-day South Siberia and Mongolia (SSM), an area where historians center a series of early Turkic and non-Turkic steppe polities. While SSM matching IBD tracts (> 1cM) are also observed in non-Turkic populations, Turkic peoples demonstrate a higher percentage of such tracts (p-values ≤ 0.01) compared to their non-Turkic neighbors. Finally, we used the ALDER method and inferred admixture dates (~9th–17th centuries) that overlap with the Turkic migrations of the 5th–16th centuries. Thus, our results indicate historical admixture among Turkic peoples, and the recent shared ancestry with modern populations in SSM supports one of the hypothesized homelands for their nomadic Turkic and related Mongolic ancestors.

Author Summary Centuries of nomadic migrations have ultimately resulted in the distribution of Turkic languages over a large area ranging from Siberia, across Central Asia to Eastern Europe and the Middle East. Despite the profound cultural impact left by these nomadic peoples, little is known about their prehistoric origins. Moreover, because contemporary Turkic speakers tend to genetically resemble their geographic neighbors, it is not clear whether their nomadic ancestors left an identifiable genetic trace. In this study, we show that Turkic-speaking peoples sampled across the Middle East, Caucasus, East Europe, and Central Asia share varying proportions of Asian ancestry that originate in a single area, southern Siberia and Mongolia. Mongolic- and Turkic-speaking populations from this area bear an unusually high number of long chromosomal tracts that are identical by descent with Turkic peoples from across west Eurasia. Admixture induced linkage disequilibrium decay across chromosomes in these populations indicates that admixture occurred during the 9th–17th centuries, in agreement with the historically recorded Turkic nomadic migrations and later Mongol expansion. Thus, our findings reveal genetic traces of recent large-scale nomadic migrations and map their source to a previously hypothesized area of Mongolia and southern Siberia.

Citation: Yunusbayev B, Metspalu M, Metspalu E, Valeev A, Litvinov S, Valiev R, et al. (2015) The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia. PLoS Genet 11(4): e1005068. https://doi.org/10.1371/journal.pgen.1005068 Editor: Graham Coop, University of California Davis, UNITED STATES Received: November 28, 2013; Accepted: February 11, 2015; Published: April 21, 2015 Copyright: © 2015 Yunusbayev et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited Funding: This work was supported by European Union European Regional Development Fund through the Centre of Excellence in Genomics for the Estonian Biocentre and the University of Tartu, by the Estonian Institutional Research grant IUT24-1, by the European Commission grant 205419 ECOGENE to the EBC, by the Estonian Science Foundation grant nr8973, and by the Estonian Basic Research Grant SF 0270177s08; Russian Federation President Grant for young scientists (MK-2845.2014.4) to BY; the Russian Academy of Sciences Program for Fundamental Research "Biodiversity and dynamics of gene pools" to EK; the Federal Agency of Education and Science of the Russian Federation (state contracts 02.740.11.0701 and P325 to EK); the Russian Foundation for Basic Research (grant number 11-04-00652_a to EK); the Russian Foundation for Humanities (grant number 13-11-02014/U to EK); Committee for Coordination Science and Technology Development of Republic of Uzbekistan (grant number FA-A6-T180 to ST); EGC-UT received targeted financing from Estonian Government SF0180142s08, Center of Excellence in Genomics (EXCEGEN), and University of Tartu (SP1GVARENG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Linguistic relatedness is frequently used to inform genetic studies [1] and here we take this path to reconstruct aspects of a major and relatively recent demographic event, the expansion of nomadic Turkic-speaking peoples, who reshaped much of the West Eurasian ethno-linguistic landscape in the last two millennia. Modern Turkic-speaking populations are a largely settled people; they number over 170 million across Eurasia and, following a period of migrations spanning the ~5th–16th centuries, have a wide geographic dispersal, encompassing Eastern Europe, Middle East, Northern Caucasus, Central Asia, Southern Siberia, Northern China, and Northeastern Siberia [2–4]. The extant variety of Turkic languages spoken over this vast geographic span reflects only the recent (2100–2300 years) history of divergence, which includes a major split into Oghur (or Bolgar) and Common Turkic [5, 6]. This period was preceded by early Ancient Turkic, for which there is no historical data, and a long-lasting proto-Turkic stage, provided there was a Turkic-Mongolian linguistic unity (protolanguage) around 4500–4000 BCE [7, 8]. The earliest Turkic ruled polities (between the 6th and 9th centuries) were centered in what is now Mongolia, northern China, and southern Siberia. Accordingly, this region has been put forward as the point of origin for the dispersal of Turkic-speaking pastoral nomads [3, 4]. We designate it here as an “Inner Asian Homeland” (IAH) and note at least two issues with this working hypothesis. First, the same approximate area was earlier dominated by the Xiongnu Empire (Hsiung-nu) (200 BCE–100 CE) and later by the short-lived Xianbei (Hsien-pi) Confederation (100–200 CE) and Rouran State (aka Juan-juan or Asian Avar) (400–500 CE). These steppe polities were likely established by non-Turkic-speaking peoples and presumably united ethnically diverse tribes. It is only in the second half of the 6th century that Turkic-speaking peoples gained control of the region and formed the rapidly expanding Göktürk Khaganate, succeeded soon by numerous khanates and khaganates extending from northeastern China to the Pontic-Caspian steppes in Europe [2–4]. Secondly, Göktürks represent the earliest known ethnic unit whereby Turkic peoples appear under the name Turk. Yet, Turkic-speaking peoples appear in written historical sources before that time, namely when Oghuric Turkic-speaking tribes appear in the Northern Pontic steppes in the 5th century, much earlier than the rise of Göktürk Khaganate in the IAH[9]. Thus, the early stages of Turkic dispersal remain poorly understood and our knowledge about their ancient habitat remains a working hypothesis. Previous studies based on Y chromosome, mitochondrial DNA (mtDNA), and autosomal markers show that while the Turkic peoples from West Asia (Anatolian Turks and Azeris) and Eastern Europe (Gagauzes, Tatars, Chuvashes, and Bashkirs) are generally genetically similar to their geographic neighbors, they do display a minor share of both mtDNA and Y haplogroups otherwise characteristic of East Asia [10–15]. Expectedly, the Central Asian Turkic speakers (Kyrgyz, Kazakhs, Uzbeks, and Turkmens), share more of their uniparental gene pool (9–76% of Y chromosome and over 30% of mtDNA lineages) with East Asian and Siberian populations [16, 17]. In this regard, they differ from their southern non-Turkic neighbors, including Tajiks, Iranians, and different ethnic groups in Pakistan, except Hazara. However, these studies do not aim to identify the precise geographic source and the time of arrival or admixture of the East Eurasian genes among the contemporary Turkic-speaking peoples. The “eastern” mtDNA and even more so Y-chromosome lineages (given the resolution available to the studies at the time) lack the geographic specificity to explicitly distinguish between regions within Northeast Asia and Siberia, and/or Turkic and non-Turkic speakers of the region [18, 19]. Several studies using genome-wide SNP panel data describe the genetic structure of populations in Eurasia and although some include different Turkic populations [15, 20–24], they do not focus on elucidating the demographic past of the Turkic-speaking continuum. In cases where more than one geographic neighbor is available for comparison, Turkic-speaking peoples are genetically close to their non-Turkic geographic neighbors in Anatolia [22, 25], the Caucasus [15], and Siberia [21, 23]. A recent survey of worldwide populations revealed a recent (13th–14th century) admixture signal among the three Turkic populations (Turks, Uzbeks, and Uygurs) and one non-Turkic population (Lezgins) with Mongolas (from northern China), the Daurs (speaking Mongolic language), and Hazaras (of Mongol origin) [26]. This study also showed evidence for admixture (dating to the pre-Mongol period of 440–1080 CE) among non-Turkic (except Chuvashes) East European and Balkan populations with the source group related to modern Oroqens, Mongolas, and Yakuts. This is the first genetic evidence of historical gene flow from a North Chinese and Siberian source into some north and central Eurasian populations, but it is not clear whether this admixture signal applies to other Turkic populations across West Eurasia. Here we ask whether it is possible to identify explicit genetic signal(s) shared by all Turkic peoples that have likely descended from putative prehistoric nomadic Turks. Specifically, we test whether different Turkic peoples share genetic heritage that can be traced back to the hypothesized IAH. More specifically, we ask whether this shared ancestry occurred within an historical time frame, testified by an excess of long chromosomal tracts identical by descent between Turkic-speaking peoples across West Eurasia and those inhabiting the IAH. To address these questions we used a genome-wide high-density genotyping array to generate data on Turkic-speaking peoples representing all major branches of the language family (Fig 1B). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 1. Geographic map of samples included in this study and linguistic tree of Turkic languages. Panel A) Non-Turkic-speaking populations are shown with light blue, light green, dark green, light brown, and yellow circles, depending on the region. Turkic-speaking populations are shown with red circles regardless of the region of sampling. Full population names are given in S1 Table Panel B) The linguistic tree of Turkic languages is adapted from Dybo 2004 and includes only those languages spoken by the Turkic peoples analyzed in this study. The x-axis shows the time scale in kilo-years (kya). Internal branches are shown with different colors. https://doi.org/10.1371/journal.pgen.1005068.g001

Discussion Our ADMIXTURE analysis (Fig 2) revealed that Turkic-speaking populations scattered across Eurasia tend to share most of their genetic ancestry with their current geographic non-Turkic neighbors. This is particularly obvious for Turkic peoples in Anatolia, Iran, the Caucasus, and Eastern Europe, but more difficult to determine for northeastern Siberian Turkic speakers, Yakuts and Dolgans, for which non-Turkic reference populations are absent. We also found that a higher proportion of Asian genetic components distinguishes the Turkic speakers all over West Eurasia from their immediate non-Turkic neighbors. These results support the model that expansion of the Turkic language family outside its presumed East Eurasian core area occurred primarily through language replacement, perhaps by the elite dominance scenario, that is, intrusive Turkic nomads imposed their language on indigenous peoples due to advantages in military and/or social organization. When the Turkic peoples settled across West Eurasia are compared with their non-Turkic neighbors, they demonstrate higher IBD sharing with populations from SSM and Northeast Siberia (Fig 3). There are, however, two non-Siberian populations that also demonstrate high IBD sharing with the tested Turkic peoples, the Kalmyks and Maris. These exceptions need careful consideration in light of historical data and previously published studies. For example, the Mongol-speaking Kalmyks migrated into North Caucasus from Dzhungaria (the northwestern province of China at the Mongolian border) only in the 17th century [37], while Maris stand out from other geographic neighbors due to unusually high recent admixture with Bashkirs: they demonstrate higher IBD sharing with Bashkirs for all IBD tract length classes (from 1–2 cM up to 11–12 cM) compared to other populations in the region (p < 0.05). This might be explained by the fact that we collected Maris samples in the Republic of Bashkortostan, where they seemingly intermarried with Bashkirs to some extent. Finally, some of the Siberian populations are in fact migrants in their current locations. For example, Yakuts, Evenkis, and Dolgans largely stem from the Lake Baikal region, which is essentially the SSM area [23]. It turns out that most of the populations showing a high signal of IBD sharing with the western Turkic populations originated from the SSM area or had admixture with one of the tested Turkic populations. The only exception is the Nganasans; they demonstrate unusually high IBD sharing with both western Turkic peoples (Fig 3) and randomly chosen non-Turkic populations (S3 Fig). Taking into account that SSM area populations (Tuvans, Mongols, and Buryats) can be reliably considered indigenous to their locations, and that other Siberian and non-Siberian populations (demonstrating high IBD sharing with western Turkic peoples in Fig 3) all have SSM origins, we suggest that ancestral populations from this area contributed recent gene flow into western Turkic peoples. We note that SSM matching IBD tracts were also observed with considerable frequency among some non-Turkic peoples, such as Adyghe, Maris, North Ossetians, and Udmurts (Fig 4 and S4 Fig) suggesting that gene flow from the SSM area also contributed to non-Turkic populations. In this regard, alternative explanations not related to Turkic and Mongolic migrations cannot be excluded, but these historical events remain the most likely scenario, since the high proportion of SSM matching tracts is a unifying hallmark of many western Turkic peoples and such a correlated signal of sharing with Siberian populations is not observed for any other group of populations (S3 Fig). Thus, it is likely that migrants of SSM origin interacted with many of the ancestors of contemporary West Eurasian populations, but it was the stronger interaction (reflected in higher IBD sharing) with migrant SSM ancestors that drove Turkicization. We performed a permutation test for each western Turkic population and the observed excess of IBD sharing (compared to non-Turkic neighbors) with the SSM area populations was statistically significant (Fig 4 and S4 Fig). Another important outcome of our IBD sharing analysis is the finding that two of the three SSM populations that we consider “source populations” or modern proxies for source populations are both Mongolic-speaking. This observation can be explained in several ways. For example, one may surmise that the Mongol conquests, starting in the 13th century, were accompanied by their demographic expansion over the territories already occupied, in part, by Turkic speakers, and this led to admixture between Turkic and Mongolic speakers. Alternatively, it is also probable that the ancestors of Turkic and Mongolic tribes stem from the same or nearly the same area and underwent numerous episodes of admixture before their respective expansions. The latter explanation is indirectly testified by a complex, long-lasting stratigraphy of Mongolian loan words in Turkic languages and vice versa [38]. The first explanation is unlikely from a historical perspective since although Mongolic conquests were launched by Genghis Khan troops in the early 13th century, it is well known that they did not involve massive re-settlements of Mongols over the conquered territories. Instead, the Mongol war machine was progressively augmented by various Turkic tribes as they expanded, and in this way Turkic peoples eventually reinforced their expansion over the Eurasian steppe and beyond [39]. Therefore, we prefer the second explanation, although we cannot entirely exclude the Mongol contribution, especially in light of admixture dates that overlap with the Mongol expansion period. Finally, our IBD sharing analysis suggested that the SSM area is the source of recent gene flow. This area is one of the hypothesized homelands for Turkic peoples and linguistically related Mongols. While the presence of the Mongol empire over this territory is well recorded, historical sources alone are insufficient to unambiguously associate this area with the Turkic homeland for several reasons: some of the Turkic groups speaking the Oghuric branch of Turkic were attested westerly in the Pontic-Caspian steppes in the mid-late 5th century CE. This is geographically distant from the SSM area, and temporarily much earlier than the Göktürk Empire was established in the SSM area. Thus, our study provides the first genetic evidence supporting one of the previously hypothesized IAHs to be near Mongolia and South Siberia. The gene flow from the SSM area that we inferred based on our IBD sharing analysis should also be detected using an alternative approach such as ALDER, which is based on the analysis of linkage disequilibrium (LD) patterns due to admixture. Using the ALDER method, we tested all possible combinations of reference populations in our dataset. LD decay patterns observed among western Turkic populations were consistent with admixture between West Eurasian and East Asian/Siberian populations (see detected reference populations in S4 Table). Admixture dating with the set of East Asian/Siberian populations (S4 Table) inferred admixture events ranging between 816 CE for Chuvashes and 1657 CE for Nogais. We chose these reference populations based on the highest LD curve amplitudes, as suggested by the authors of the method. It is notable that all the SSM populations that were inferred to be the source of SSM gene flow were either filtered out by an ALDER pre-test procedure because of the shared admixture signal with the tested Turkic populations or had a lower amplitude of the weighted LD curve compared to the non-admixed references. Indeed, as we show, SSM populations and the two Northeastern Siberian populations all demonstrated a statistically significant admixture signal between the same set of West Eurasian and East Asian populations as western Turkic peoples do (S2 Table and S4 Table). Therefore, the set of reference populations reported in S4 Table that demonstrate the highest LD curve amplitude, in fact represent the set of non-admixed reference populations (S2 Table) that passed ALDER’s filtering procedure. This filter removes any reference population that shows shared admixture signal with the tested population. It was important for our study that the range of ALDER-inferred admixture dates overlaps with the major Turkic migrations and later Mongolic expansion (Fig 5), both of which are known to trigger nomadic migrations to Medieval Central Asia, the Middle East and Europe. In addition, when linguistic classification and regional context is taken into account, we found parallels with large-scale historical events. For example, the present-day Tatars, Bashkirs, Kazakhs, Uzbeks, and Kyrgyz span from the Volga basin to the Tien-Shan Mountains in Central Asia, yet (Fig 5) showed evidence of recent admixture ranging from the 13th to the 14th centuries. These peoples speak Turkic languages of the Kipchak-Karluk branch and their admixture ages postdate the presumed migrations of the ancestral Kipchak Turks from the Irtysh and Ob regions in the 11th century [37]. There are exceptions, like the Balkars, Kumyks, and Nogais in Northern Caucasus, who showed either earlier dates of admixture (8th century) or much later admixture between the 15th century (Kumyks) and 17th century (Nogais). Chuvashes, the only extant Oghur speakers showed an older admixture date (9th century) than their Kipchak-speaking neighbors in the Volga region. According to historical sources, when the Onogur-Bolgar Empire (northern Black Sea steppes) fell apart in the 7th century, some of its remnants migrated northward along the right bank of the Volga river and established what later came to be known as Volga Bolgars, of which the first written knowledge appears in Muslim sources only around the end of the 9th century [40]. Thus, the admixture signal for Chuvashes is close to the supposed arrival time of Oghur speakers in the Volga region. Differences in admixture dates for the three Oghuz speaking populations (Azeris, Turks, and Turkmens) were notable and their geographical locations suggest a possible explanation. Anatolian Turks and Azeris, whose Central Asian ancestors crossed the Iranian plateau and became largely inaccessible to subsequent gene flow with other Turkic speakers, both have evidence of earlier admixture events (12th and 9th centuries, respectively) than Turkmens. Turkmens, remaining in Central Asia, showed considerably more recent admixture dating to the 14th century, consistent with other Central Asian Turkic populations and most likely due to admixture with more recent, perhaps recurrent, waves of migrants in the region from SSM. In summary, our collection of samples, which covered the full extent of the current distribution of Turkic peoples, shows that most Turkic peoples share considerable proportion of their genome with their geographic neighbors, supporting the elite dominance model for Turkic language dispersal. We also showed that almost all the western Turkic peoples retained in their genome shared ancestry that we trace back to the SSM region. In this way, we provide genetic evidence for the Inner Asian Homeland (IAH) of the pioneer carriers of Turkic language, hypothesized earlier by others on the basis of historical data. Furthermore, because Turkic peoples have preserved SSM ancestry tracts in their genomes, we were able to perform admixture dating and the estimated dates are in good agreement with the historical period of Turkic migrations and overlapping Mongols expansion. Finally, much remains to be learned about the demographic consequences of this complex historical event and further studies will allow the disentangling of multiple signals of admixture in the human genome and fine scale mapping of the geographic origins of individual chromosomal tracts.

Materials and Methods Ethics statement All subjects signed personal informed consents and ethical committees of the institutions involved approved the study. Samples, genotyping, and quality control In total, 322 individuals from 38 populations were genotyped on different Illumina SNP arrays (all targeting > 500,000 SNPs) according to manufacturers’ specifications. Our data was combined with published data from Li et al. [20], Rasmussen et al. [21], Behar et al. [22], Yunusbayev et al. [15], Metspalu et al. [29], Fedorova et al. [23], Raghavan et al. [41], Behar et al. [42], and covered all the Turkic-speaking populations (373 individuals from 22 samples) from key regions across Eurasia and their geographic neighbors (see details about sample source in S1 Table). Individuals with more than 1.5% missing genotypes were removed from the combined dataset. Only markers with a 97% genotyping rate and minor allele frequency (MAF) > 1% were retained. The absence of cryptic relatedness corresponding to first and second degree relatives in our dataset was confirmed using King [43]. The filtering steps resulted in a dataset of 1,444 individuals remaining for downstream analyses. It is important to note that in our dataset there are 312,524 SNPs that are common for Human1M-Duo and 650k, 610k, and 550k Illumina BeadChips. Different analyses have different requirements regarding marker density and we therefore prepared two datasets. For Admixture, three-population test, and ALDER analyses that require minimum background LD, LD pruning on the combined 1M-Duo and 650k, 610k, and 550k dataset was performed. The marker set was thinned by excluding SNPs in strong LD (pairwise genotypic correlation r2 > 0.4) in a window of 1,000 SNPs, sliding the window by 150 SNPs at a time. This resulted in a dataset of 174,187 SNPs. Another dataset with a dense marker set for IBD sharing and wavelet transform admixture dating analyses was prepared. For this, the 1M-Duo genotyped samples (S1 Table) were excluded to increase the SNP overlap among remaining samples up to 515,841 markers. Genetic distances between SNPs in centiMorgans were incorporated from the genetic map generated by the HapMap project [44]. Admixture analysis We inferred population structure in our dataset using a model-based clustering method implemented in the ADMIXTURE software. Because the ADMIXTURE algorithm expects SNPs to be unlinked, we used an LD pruned marker set of 174,187 SNPs. We ran ADMIXTURE assuming 3 to 14 (K = 3 to K = 14) genetic clusters or “ancestral populations” (see S1 Fig) in 100 replicates and assessed convergence between individual runs. For low values of K, all runs arrive at the same or very similar log-likelihood scores (LLs), whereas runs using higher K values have more variable LLs. Relying on the low level of variation in LLs (LLs < 1) within a fraction (10%) of runs with the highest LLs, we assume that the global log-likelihood maximum was reached at K = 3 to K = 11 and K = 13 to K = 14 (S6 Fig). ADMIXTURE provides an assessment of the “best” K by computing a cross-validation index (CV), which estimates the predictive accuracy of the model at a given K. In our setting clustering solutions at K = 5, 6, 7, and 8 showed better predictive power than other K values (S7 Fig). In choosing which model(s) (K) to discuss further, we draw from the CV results and restrict ourselves to the Ks that likely converged at the global LL maximum for the particular model. In addition, we acknowledge that clustering solutions at different Ks may reflect the hierarchical nature of human population structure. In this study, we are interested in distinguishing regional groupings, like Northeast Asia and Europe, and possible admixture between such groups. In sum, we found that the clustering solution that met our selection criteria best was K = 8. Correlation between longitude and proportion of “eastern” genetic components Pearson’s correlation coefficient between longitude and the proportion of genetic clusters k6 and k8 was computed using the cor.test() function in the R package. Three-population test (f 3 -statistics) We constructed all possible sets of target and source populations in our dataset f 3 (target, source1, source2) and then excluded combinations that contained matching source and target populations. Altogether, 97548 combinations were prepared and subjected to the three-population test implemented in the qp3Pop software [31]. This test produces a negative f 3 -statistic together with a corresponding negative Z-score when the target population tested is admixed with a source related to the given source populations. We considered the target population to be significantly admixed when its Z-score was less than -1.64 (i.e. p-value < 0.05 for a one-tailed test) (S2 Table). Assuming that the Z-score is approximately normally distributed, we performed the Bonferroni correction as follows: α was divided by the number of tests preformed α = 0.05/97548 = 5.125682*10−7 and then a quantile function for the corrected α was computed using the qnorm() function in the R package. The Z-scores that were significant after Bonferroni correction (Z-score < -4.89) were denoted with two asterisks in S2 Table. IBD sharing analysis We used the fastIBD algorithm [45] implemented in the BEAGLE 3.3 software to detect extended chromosomal tracts (> 1 cM in length) that are IBD between pairs of individuals. We ran the fastIBD algorithm ten times with different random seeds and called IBD tracts using a modified post-processing tool ‘plus-process-fibd.py’. The original post-processing tool developed by the BEAGLE authors was modified by [32]. They added an algorithm that minimizes the number of spurious breaks and gaps introduced into long segments due to low marker density [32]. Patterns of IBD sharing between modern populations can bear information about historical events [32]. According to known history, the Turkic migrations took place roughly 600–1600 years ago. Assuming a human generation time of 30 years, the immigrant chromosomes left during Turkic migrations have passed through 600 / 30 = 20 and 1600 / 30 = 53 generations/meiosis. The mean length of a single-path IBD tract passed through ~20–53 generations, is expected to be 100 cM / (2 * 20) = 2.5 cM and 100 cM / (2 * 53) = 0.94 cM [33]. We provide these estimates only as reference, and note that the true IBD tract length distribution for a given historical event is influenced by past demography. The fastIBD method that we use to search for IBD tracts has sufficient power (~0.7–0.9) to detect chromosomal tracts of 2–3 cM in length, and importantly, close to zero false discovery rate [45]. The power to detect IBD tracts of 1 cM in length varies from 0.2 to 0.5 depending on the chosen fastIBD score threshold. We used a fastIBD score threshold of 1e-10, which, in the trade-off between power-loss and minimizing false discovery rate favors the latter, keeping it close to zero. These parameter settings fit our purposes since we are interested in estimating the relative amount of IBD sharing between populations rather than the total amount of IBD sharing. Correlation between geographic distance and IBD sharing Chromosomal tracts that were IBD between two populations were first sorted into bins (classes) based on their length: 1–2, 2–3, 3–4, 4–5 cM, and then the total length was divided within each bin (class) by sample size to obtain the average IBD sharing for each population pair tested. From here onward, we refer to this statistic as IBD sharing. It was shown previously that IBD sharing between populations decays exponentially with distance between samples [32]. To test whether IBD sharing between populations in our dataset is consistent with this distance dependent pattern, we first converted the IBD sharing statistic between populations into an IBD sharing distance using—ln(IBD sharing statistics). For each pair of populations, we then calculated geodesic distances in kilometers using the ‘‘distonearth” R function [46]. Geographic coordinates for populations in our dataset were calculated using the central point from multiple sampling locations, or when such coordinates were not available, using coordinates for the country center (where sample was collected). Geographic coordinates for HGDP populations were computed as a central point in a range of longitude and latitude values given in [47]. After obtaining matrixes of geodesic and IBD sharing distances between populations, values were standardized in each of these matrixes using the maximum values observed. The Pearson's correlation coefficient between the obtained standardized distance matrixes was computed using the cor.test() function in the R package. Identifying deviations from the distance-dependent decay pattern in IBD sharing Even if there is a statistically significant correlation between IBD sharing and geographic distance in the data, spatial patterns of recent ancestry between real populations are unlikely to follow a distance-dependent decay pattern ideally. One way to detect the departure from the expected distance-dependent decay pattern is to compute parameters that describe the relationship between IBD sharing and distance in a population set where you do not expect any deviation. These parameters can then be used to compute the expected range of IBD sharing for a given pair of populations at a given distance and report deviations, if any. In our dataset, it was difficult to define a population subset that was completely devoid of samples with departures from the expected distance-dependent decay pattern. Therefore, a comparative approach was used, in which the IBD sharing pattern in our dataset with Turkic populations and without them was compared. Because departures from the distance-dependent decay pattern may already exist in the dataset without Turkic peoples, this test determines whether Turkic peoples have more extreme departure due to, for example, a long range migration from Asia, as suggested by our ADMIXTURE analysis. Therefore, the test dataset only included western Turkic populations from the Middle East, Eastern Europe, the Caucasus, and Central Asia. The overall goal is to find systematic (correlated) differences between these western Turkic populations and their geographic neighbors. The null hypothesis was that differences between western Turkic populations and their geographic neighbors are random. To perform this analysis, sets of geographic neighbors were defined for each of the 12 western Turkic populations (See S3 Table; the same sets are used for permutation test described below; See Fig 4) and IBD sharing that these populations demonstrate with other samples in our dataset was computed. A total of 63 Eurasian populations were included in our dataset (excluding 12 western Turkic populations). Thus, for each set of geographic neighbors, a vector of 63 ordered IBD sharing values with other populations in the dataset was obtained, including self-comparisons. For each of the western Turkic populations, the same vector of IBD sharing values was computed (the Turkic population versus 63 Eurasian populations). After performing element-wise subtraction of values in the “vector for geographic neighbors” from values in the “vector for Turkic population”, differences in IBD sharing were obtained. Provided that western Turkic populations have no systematic difference with their geographic neighbors, differences are expected to occur at random; that is, a vector of 63 random values is expected (positive when greater than average, negative when lower than average, and zero when equal to average). By contrast, if some of the 63 populations in the dataset, for example, Romanians, systematically demonstrate higher IBD sharing with all the tested Turkic populations, positive values are expected for Romanians. Such nonrandom signals can be detected by computing an element-wise sum over all the vectors (for all the 12 Turkic populations) containing differences. If they are random, there will be no accumulation. We call this procedure “subtraction and accumulation” throughout the text. Element-wise addition of vectors would accumulate positive IBD sharing values for Romanians (versus 12 Turkic populations) and when such values are plotted on a geographic map, a very high signal of (due to accumulation) IBD sharing should be observed for Romanians, compared to the other 63 samples. See the schematic representation of this “subtraction and accumulation” procedure in S2 Fig. An accumulated value (IBD sharing signal) for a given population was considered high when it exceeded the 0.90 sample quantile point. Finally, this “subtraction and accumulation” procedure was repeated multiple times by replacing each of the 12 Turkic populations with randomly chosen non-Turkic neighbors from respective sets of geographic neighbors (see S3 Fig for subset of results). Doing so demonstrates the kind of results expected when the “subtraction and accumulation” procedure is done with population sets that do not have systematic differences in IBD sharing. Permutation test A permutation test was designed to verify whether the excess of IBD sharing that western Turkic populations demonstrate with SSM populations is statistically significant. IBD sharing (as described previously) between a western Turkic-speaking population and each of the three SSM populations (Tuvans, Buryats, Mongols) that show high accumulated IBD sharing was calculated. A permutation procedure was then used to test whether observed excess of IBD sharing that a given Turkic population demonstrates can be expected by chance among its non-Turkic neighbors. For each Turkic population, their geographic neighbors were pooled and 10,000 random samples of the same size were generated (as for the Turkic population tested). For each random sample, IBD sharing with the three SSM populations and Evenkis was calculated. Obtained IBD sharing statistics from permuted samples were compared to that of Turkic populations from the same region and the number of tests showing equal or higher values was divided by the total number of permutations to obtain a p-value. Admixture dating using ALDER We used ALDER [34] to test whether the gene flow from SSM area suggested by our IBD sharing analysis left detectable trace in LD pattern among the Turkic populations, and date this admixture signal. ALDER has a functionality to perform a statistical test for the presence of admixture and dating admixture signal. We tested all possible combinations of populations in our dataset to be reference group for Turkic populations and report a pair that successfully passed all the pre-test steps and has significant p-value for admixture. By choosing the pair of reference populations with the highest LD curve amplitude we report ALDER-inferred admixture date for the admixed population. Dating admixture using wavelet transform method For each Turkic-speaking population, two parental populations were selected, so that one represented local ancestry and another represented immigrant SSM ancestry. Based on our IBD sharing analysis, Tuvans were used to represent the SSM parental population for all Turkic-speaking populations. Alternative SSM ancestors were also tested (See S5 Table). The parental population that represented local ancestry was chosen among geographic neighbors. The wavelet transform method as implemented in the StepPCO software was used [35]. This software projects an admixed population, in this case the Turkic-speaking population, along a principal component separating the two chosen ancestors, the Tuvans, representing the SSM ancestor and a geographic neighbor (see S5 Table). At this step, PC coordinates for each projected Turkic population were used to infer the proportion of admixture contributed by each parental population. Given that x1 and x2 are the average PC coordinates for the first and second parental populations and x3 is the average PC coordinate for the admixed population, the proportion of ancestry contributed by the first parental population is α = (x2—x3) / (x2—x1) [48]. Accordingly, the proportion of ancestry contributed by the second parental population is (1 - α). Admixture proportions inferred at this stage were used later to find a matching simulated dataset with the same admixture proportion. The StepPCO software infers which chromosomal tracts have local or immigrant SSM ancestry and uses observed length distributions of the ancestry tracts to compute a genome-wide wavelet transform (WT) coefficient [35]. These WT coefficients for each Turkic-speaking population are compared with those of simulated samples that have matching admixture proportions and for which the admixture history is known. Samples with known admixture history were generated using a forward simulation in the SPCO software. In this forward simulation, a population with effective population size of 1,000 individuals at time T 0 receives migrants from another population and grows for 300 generations until it reaches an effective population size of 10,000 individuals. We modeled different amounts of migrants, replacing 5%, 10%, 15%, 20%, …, 75% of the recipient population. Each admixture scenario was repeated 100 times and a random sample of 20 individuals was drawn to compute WT coefficients. The WT coefficients from these 100 independent runs were used to construct a 95% confidence interval for a given admixture scenario. Simulated WT coefficients with 95% confidence interval were plotted against the known number of generations since admixture (S5 Fig). Admixture time for tested populations was obtained by comparing point estimates with the curve of simulated WT coefficients. Coalescent simulations To test how different dating methods recover signals of admixture known to have occurred during historical periods, gene flow events occurring 5, 10, 20, 30, 40, 50, and 60 generations ago were simulated in 100 repetitions. 22 chromosomes were simulated with lengths and variation in recombination rates for each based on HapMap Phase 2 build 36 genetic maps [44]. The MaCS coalescent simulator [49] was used to generate a series of historical gene flow events between two parental populations whose demographic parameters imitate that of Asian and European populations. Demographic parameters (split time, growth rate, and bottlenecks) for the two simulated parental populations were taken from the study by Schaffner et al. [50]. In this study, a series of population genetic statistics were used to fit the demographic histories of simulated populations to those observed for African, Asian, and European populations. Here, these best-fitting demographic parameters were used to simulate samples that imitate sequences drawn from Asian and European populations. Each “historical time” gene flow event represents a mixture of the two parental populations (Asian and European) at equal proportions. From each simulated admixed population a sample of 30 sequences were drawn to construct 15 genotypes that were then subjected to the same data preparation and quality control steps as for real data during admixture dating. The only difference was that we applied the LD pruning step before ALDER and SPCO dating (while we performed no LD pruning with the real dataset before SPCO analysis) in order to make the data preparation steps identical. Before doing so ALDER and SPCO performance was tested using 100 simulated datasets with and without LD pruning. Root-mean-square error (RMSE) between the true and estimated values and bias (difference between sums of true and estimated values divided by the number of observations) were computed to compare the accuracy of ALDER and SPCO inferences with and without LD pruning. While LD pruning slightly improved the accuracy of the ALDER estimates for older admixture events, it had a minimal effect on the accuracy of the SPCO estimates. Thus, in case of ALDER dating, LD pruning slightly improved the accuracy of estimates (RMSE = 2.5, bias = -0.19) compared to the analysis without LD pruning (RMSE = 5.5, bias = 0.64), See also S8A Fig. When the same datasets were analyzed with the SPCO method, we observed minimal differences in accuracy (See also S8B Fig) when analyzing the datasets with LD pruning (RMSE = 7.6 and bias = 5.5) and without LD pruning (RMSE = 8.0 and bias = 6.1). This comparative analysis also showed that the ALDER method recovers true dates with less bias than the SPCO method under default settings. We note, however, that the SPCO estimates were generally close to the true dates for older admixture events. Accession numbers The genome-wide SNP data generated for this study can be accessed through the data repository of the Estonian Biocentre (www.ebc.ee/free_data), and the raw Illumina genotype data through the data repository of the National Center for Biotechnology Information—Gene Expression Omnibus (NCBI-GEO) under accession number GSE66157.

Acknowledgments We thank the individuals who provided DNA samples for this study. We thank the Estonian Biocentre core facility and Estonian Genome Center of the University of Tartu (EGC-UT) personnel, especially J. Parik, T. Reisberg, V. Soo, and S. Smit for sample handling and conducting the autosomal genotyping. We are thankful to Georgi Hudjashov for preparing autosomal data, and to Brian Browning and Sharon Browning for their helpful discussions on our preliminary IBD sharing results. Computations were performed on resources provided by the Core Computing Unit of the Estonian Biocentre and the High Performance Computing Centre of the University of Tartu, and we thank Lauri Anton and Martin Loginov for their technical support.

Author Contributions Conceived and designed the experiments: BY MM RVi. Performed the experiments: BY MM. Analyzed the data: BY MM AV. Contributed reagents/materials/analysis tools: AV SL RVa VA EB OB ST DD PN AB HS KT SF NB IK EMi RK LD MD BM LO MV LY EK RVi. Wrote the paper: BY MM RVi TK. Coorinated and prepared samples for genotyping and managed data generation: EMe. Discussed and critically edited the manuscript: TK.