Genetic differentiation and diversity in the ancient Levant

A total of 20 Peqi’in samples appear to be unrelated to each other to the limits of our resolution (that is, genetic analysis suggested that they were not first, second, or third degree relatives of each other), and we used these as our analysis set. Taking advantage of the new data point added by the Peqi’in samples, we began by studying how genetic differentiation among Levantine populations changed over time. We replicate previous reports of dramatic decline in genetic differentiation over time in West Eurasia24, observing a median pairwise F ST of 0.023 (range: 0.009–0.061) between the Peqi’in samples (abbreviation: Levant_ChL) and other West Eurasian Neolithic and Chalcolithic populations, relative to a previously reported median pairwise F ST of 0.098 (range: 0.023–0.153) observed between populations in pre-Neolithic periods, 0.015 (range: 0.002–0.045) in the Bronze Age periods, and 0.011 (range: 0–0.046) in present-day West Eurasian populations24. Thus, the collapse to present-day levels of differentiation was largely complete by the Chalcolithic (Supplementary Figure 1).

We also observe an increase in genetic diversity over time in the Levant as measured by the rate of polymorphism between two random genome sequences at each SNP analyzed in our study. Specifically, the Levant_ChL population exhibits an intermediate level of heterozygosity relative to the earlier and later populations (Fig. 2).

Fig. 2 Genetic diversity in the ancient Levant. Heterozygosity increases in ancient Levantine populations over time. The estimated statistic ± 3 standard errors is indicated Full size image

Both the increasing genetic diversity over time, and the reduced differentiation between populations as measured via F ST , are consistent with a model in which gene flow reduced differentiation across groups while increasing diversity within groups.

Genetic affinities of the individuals of Peqi’in Cave

To obtain a qualitative picture of how these individuals relate to previously published ancient DNA and to present-day people, we began by carrying out principal component analysis (PCA)33. In a plot of the first and second principal components (Fig. 3a), the samples from Peqi’in Cave form a tight cluster, supporting the grouping of these individuals into a single analysis population (while we use the broad name “Levant_ChL” to refer to these samples, we recognize that they are currently the only ancient DNA available from the Levant in this time period and future work will plausibly reveal genetic substructure in Chalcolithic samples over the broad region). The Levant_ChL cluster overlaps in the PCA with a cluster containing Neolithic Levantine samples (Levant_N), although it is slightly shifted upward on the plot toward a cluster corresponding to samples from the Levant Bronze Age, including samples from 'Ain Ghazal, Jordan (Levant_BA_South) and Sidon, Lebanon (Levant_BA_North). The placement of the Levant_ChL cluster is consistent with a previously observed pattern whereby chronologically later Levantine populations are shifted towards the Iran Chalcolithic (Iran_ChL) population compared to earlier Levantine populations, Levant_N (Pre-Pottery and Pottery Neolithic agriculturalists from present-day Israel and Jordan) and Natufians (Epipaleolithic hunter-gatherers from present-day Israel)24.

Fig. 3 Genetic structure of analyzed individuals. a Principal component analysis of 984 present-day West Eurasians (shown in gray) with 306 ancient samples projected onto the first two principal component axes and labeled by culture. b ADMIXTURE analysis of 984 and 306 ancient samples with K = 11 ancestral components. Only ancient samples are shown Full size image

ADMIXTURE model-based clustering analyses34 produced results consistent with PCA in suggesting that individuals from the Levant_ChL population had a greater affinity on average to Iranian agriculturalist-related populations than was the case for earlier Levantine individuals. Figure 3b shows the ADMIXTURE results for the ancient individuals assuming K = 11 clusters (we selected this number because it maximizes ancestry components that are correlated to ancient populations from the Levant, from Iran, and European hunter-gatherers)24. Like all Levantine populations, the primary ancestry component assigned to the Levant_ChL population, shown in blue, is maximized in earlier Levant_N and Natufian individuals. ADMIXTURE also assigns a component of ancestry in Levant_ChL, shown in green, to a population that is generally absent in the earlier Levant_N and Natufian populations, but is present in later Levant_BA_South and Levant_BA_North samples. This green component is also inferred in small proportions in several samples assigned to the Levant_N, but there is not a clear association to archaeological location or date, and these individuals are not significantly genetically distinct from the other individuals included in Levant_N by formal testing, and thus we pool all Levant_N for the primary analyses in this study (Supplementary Note 1)24.

Population continuity and admixture in the Levant

To determine the relationship of the Levant_ChL population to other ancient Near Eastern populations, we used f-statistics35 (see Supplementary Note 2 for more details). We first evaluated whether the Levant_ChL population is consistent with descending directly from a population related to the earlier Levant_N. If this was the case, we would expect that the Levant_N population would be consistent with being more closely related to the Levant_ChL population than it is to any other population, and indeed we confirm this by observing positive statistics of the form f 4 (Levant_ChL, A; Levant_N, Chimpanzee) for all ancient test populations, A (Fig. 4a). However, Levant_ChL and Levant_N population do not form a clade, as when we compute symmetry statistics of the form f 4 (Levant_N, Levant_ChL; A, Chimpanzee), we find that the statistic is often negative, with Near Eastern populations outside the Levant sharing more alleles with Levant_ChL than with Levant_N (Fig. 4b). We conclude that while the Levant_N and Levant_ChL populations are clearly related, the Levant_ChL population cannot be modeled as descending directly from the Levant_N population without additional admixture related to ancient Iranian agriculturalists. Direct evidence that Levant_ChL is admixed comes from the statistic f 3 (Levant_ChL; Levant_N, A), which for some populations, A, is significantly negative indicating that allele frequencies in Levant_ChL tend to be intermediate between those in Levant_N and A—a pattern that can only arise if Levant_ChL is the product of admixture between groups related, perhaps distantly, to Levant_N and A35. The most negative f 3 - and f 4 -statistics are produced when A is a population from Iran or the Caucasus. This suggests that the Levant_ChL population is descended from a population related to Levant_N, but also harbors ancestry from non-Levantine populations related to those of Iran or the Caucasus that Levant_N does not share (or at least share to the same extent).

Fig. 4 Genetic characteristics of the Levant_ChL. a The statistic f 4 (Levant_ChL, A; Levant_N, Chimpanzee) demonstrates a close relationship between the Neolithic and Chalcolithic Levant populations, as the Levant Neolithic shares more alleles with the Levant Chalcolithic than with any other populations. b The statistic f 4 (Levant_N, Levant_ChL; A, Chimpanzee) shows an asymmetrical relationship between Levant_N and Levant_ChL and other ancient West Eurasian populations. The statistic is most negative for populations from Iran and the Caucasus, indicating that Levant_ChL shares more alleles with them than does Levant_N. c The statistic f 3 (Levant_ChL; Levant_N, A) tests for signals of admixture in Levant_ChL. Negative f 3 -statistics indicate that the Levant_ChL population is admixed. Populations from Iran and the Caucasus produce the most negative statistics. The estimated statistic ± 3 standard errors is indicated Full size image

The ancestry of the Levant Chalcolithic people

We used qpAdm as our main tool for identifying plausible admixture models for the ancient populations for which we have data (see Supplementary Note 3 for more details)36.

The qpAdm method evaluates whether a tested set of N “Left” populations—including a “target” population (the population whose ancestry is being modeled) and a set of N − 1 additional populations—are consistent with being derived from mixtures in various proportions of N − 1 ancestral populations related differentially to a set of outgroup populations, referred to as “Right” populations. For all our analyses, we use a base set of 11 “Right” outgroups referred to collectively as “09NW”—Ust_Ishim, Kostenki14, MA1, Han, Papuan, Onge, Chukchi, Karitiana, Mbuti, Natufian, and WHG—whose value for disentangling divergent strains of ancestry present in ancient Near Easterners has been documented in Lazaridis et al.24 (for some analyses we supplement this set with additional outgroups). To evaluate whether the “Left” populations are consistent with a hypothesis of being derived from N − 1 sources, qpAdm effectively computes all possible statistics of the form f 4 (Left i , Left j ; Right k , Right l ), for all possible pairs of populations in the proposed “Left” and “Right” sets. It then determines whether all the statistics can be written as a linear combination of f 4 -statistics corresponding to the differentiation patterns between the proposed N − 1 ancestral populations, appropriately accounting for the covariance of these statistics and computing a single p value for fit based on a Hotelling T-squared distribution36. For models that are consistent with the data (p > 0.05), qpAdm estimates proportions of admixture for the target population from sources related to the N − 1 ancestral populations (with standard errors). Crucially, qpAdm does not require specifying an explicit model for how the “Right” outgroup populations are related.

We first examined all possible “Left” population sets that consisted of Levant_ChL along with one other ancient population from the analysis dataset. Testing a wide range of ancient populations, we found that p values for all possible Left populations were below 0.05 (Supplementary Data 2), showing that Levant_ChL is not consistent with being a clade with any of them relative to the “Right” 09NW outgroups. We then considered models with “Left” population sets containing Levant_ChL along with two additional ancient populations, which corresponds to modeling the Levant_ChL as the result of a two-way admixture between populations related to these two other ancient populations. To reduce the number of hypotheses tested, we restricted the models to pairs of source populations that contain at least one of the six populations that we consider to be the most likely admixture sources based on geographical and temporal proximity: Anatolia_N, Anatolia_ChL, Armenia_ChL, Iran_ChL, Iran_N, and Levant_N. Again, we find no plausible two-way admixture models using a p > 0.05 threshold (Supplementary Figure 2 and Supplementary Data 3). Finally, we tested possible three-way admixture events, restricting to triplets that contain at least two of the six most likely admixture sources. Plausible solutions at p > 0.05 are listed in Table 1 (full results are reported in Supplementary Figure 3 and Supplementary Data 4).

Table 1 Plausible models of Levant_ChL as a mixture of three sources Full size table

We found multiple candidates for three-way admixture models, always including (1) Levant_N (2) either Anatolia_N or Europe_EN and (3) either Iran_ChL, Iran_N, Iran_LN, Iran_HotuIIIb or Levant_BA_North. These are all very similar models, as Europe_EN (early European agriculturalists) are known to be genetically primarily derived from Anatolian agriculturalists (Anatolia_N)31, and Levant_BA_North has ancestry related to Levant_N and Iran_ChL26. To distinguish between models involving Anatolian Neolithic (Anatolia_N) and European Early Neolithic (Europe_EN), we repeated the analysis including additional outgroup populations in the “Right” set that are sensitive to the European hunter-gatherer-related admixture present to a greater extent in Europe_EN than in Anatolia_N (Supplementary Figure 4a)31 (thus, we added Switzerland_HG, SHG, EHG, Iberia_BA, Steppe_Eneolithic, Europe_MNChL, Europe_LNBA to the “Right” outgroups; abbreviations in Supplementary Table 2). We found that only models involving Levant_N, Anatolia_N, and either Iran_ChL or Levant_BA_North passed at p > 0.05 (Table 1). To distinguish between Iran_ChL and Levant_BA_North, we added Iran_N to the outgroup set (for a total of 19 = 11 + 8 outgroups) (Supplementary Figure 4b). Only the model involving Iran_ChL remained plausible. Based on this uniquely fitting qpAdm model we infer the ancestry of Levant_ChL to be the result of a three-way admixture of populations related to Levant_N (57%), Iran_ChL (17%), and Anatolia_N (26%).

The ancestry of late Levantine Bronze Age populations

It was striking to us that previously published Bronze Age Levantine samples from the sites of 'Ain Ghazal in present-day Jordan (Levant_BA_South) and Sidon in present-day Lebanon (Levant_BA_North) can be modeled as two-way admixtures, without the Anatolia_N contribution that is required to model the Levant_ChL population24,26. This suggests that the Levant_ChL population may not be directly ancestral to these later Bronze Age Levantine populations, because if it were, we would also expect to detect an Anatolia_N component of ancestry. In what follows, we treat Levant_BA_South and Levant_BA_North as separate populations for analysis, since the symmetry statistic f 4 (Levant_BA_North, Levant_BA_South; A, Chimp) is significant for a number test populations A (|Z| ≥ 3) (Supplementary Data 5), consistent with the different estimated proportions of Levant_N and Iran_ChL ancestry reported in24,26.

To test the hypothesis that Levant_ChL may be directly ancestral to the Bronze Age Levantine populations, we attempted to model both Levant_BA_South and Levant_BA_North as two-way admixtures between Levant_ChL and every other ancient population in our dataset, using the base 09NW set of populations as the “Right” outgroups. We also compared these models to the previously published models that used the Levant_N and Iran_ChL populations as sources (Table 2; Supplementary Figure 5; Supplementary Data 6). In the case of Levant_BA_South from 'Ain Ghazal, Jordan, multiple models were plausible, and thus we returned to the strategy of adding additional “Right” population outgroups that are differentially related to one or more of the “Left” populations (specifically, we added various combinations of Armenia_EBA, Steppe_EMBA, Switzerland_HG, Iran_LN, and Iran_N). Only the model including Levant_N and Iran_ChL remains plausible under all conditions. Thus, we can conclude that groups related to Levant_ChL contributed little ancestry to Levant_BA_South.

Table 2 Modeling Levant_BA_South and Levant_BA_North as a mixture of Levant_ChL and an ancient population, A Full size table

We observe a qualitatively different pattern in the Levant_BA_North samples from Sidon, Lebanon, where models including Levant_ChL paired with either Iran_N, Iran_LN, or Iran_HotuIIIb populations appear to be a significantly better fit than those including Levant_N + Iran_ChL. We largely confirm this result using the “Right” population outgroups defined in Haber et al.26 (abb. Haber: Ust_Ishim, Kostenki14, MA1, Han, Papuan, Ami, Chuckhi, Karitiana, Mbuti, Switzerland_HG, EHG, WHG, and CHG), although we find that the specific model involving Iran_HotuIIIb no longer works with this “Right” set of populations. Investigating this further, we find that the addition of Anatolia_N in the “Right” outgroup set excludes the model of Levant_N + Iran_ChL favored by Haber et al.26. These results imply that a population that harbored ancestry more closely related to Levant_ChL than to Levant_N contributed to the Levant_BA_North population, even if it did not contribute detectably to the Levant_BA_South population.

We obtained additional insight by running qpAdm with Levant_BA_South as a target of two-way admixture between Levant_N and Iran_ChL, but now adding Levant_ChL and Anatolia_N to the basic 09NW “Right” set of 11 outgroups. The addition of the Levant_ChL causes the model to fail, indicating that Levant_BA_South and Levant_ChL share ancestry following the separation of both of them from the ancestors of Levant_N and Iran_ChL. Thus, in the past there existed an unsampled population that contributed both to Levant_ChL and to Levant_BA_South, even though Levant_ChL cannot be the direct ancestor of Levant_BA_South because, as described above, it harbors Anatolia_N-related ancestry not present in Levant_BA_South.

Genetic heterogeneity in the Levantine Bronze Age

We were concerned that our finding that the Levant_ChL population was a mixture of at least three groups might be an artifact of not having access to samples closely related to the true ancestral populations. One specific possibility we considered is that a single ancestral population admixed into the Levant to contribute to both the Levant_ChL and the Levant_BA_South populations, and that this was an unsampled population on an admixture cline between Anatolia_N and Iran_ChL, explaining why qpAdm requires three source populations to model it. To formally test this hypothesis, we used qpWave36,37,38, which determines the minimum number of source populations required to model the relationship between “Left” populations relative to “Right” outgroup populations. Unlike qpAdm, qpWave does not require that populations closely related to the true source populations are available for analysis. Instead it treats all “Left” populations equally, and attempts to determine the minimum number of theoretical source populations required to model the “Left” population set, relative to the “Right” population outgroups. Therefore, we model the relationship between Levant_N, Levant_ChL, and Levant_BA_South as “Left” populations, relative to the 09NW “Right” outgroup populations (Table 3). We find that a minimum of three source populations continues to be required to model the ancestry of these Levantine populations, supporting a model in which at least three separate sources of ancestry are present in the Levant between the Neolithic, Chalcolithic, and Bronze Age.

Table 3 Determining the number of streams of ancestry in the Levant Full size table

We applied qpWave again, replacing Levant_ChL with Levant_BA_North, and found that the minimum number of source populations is only two. However, when we include the Levant_ChL population as an additional outgroup, three source populations are again required. This suggests that in the absence of the data from Levant_ChL there is insufficient statistical leverage to detect Anatolian-related ancestry that is truly present in admixed form in the Levant_BA_North population (data from the Levant_ChL population makes it possible to detect this ancestry). This may explain why Haber et al.26 did not detect the Anatolian Neolithic-related admixture in Levant_BA_North.

Biologically important mutations in the Peqi’in population

This study nearly doubles the number of individuals with genome-wide data from the ancient Levant. Measured in terms of the average coverage at SNPs, the increase is even more pronounced due to the higher quality of the data reported here than in previous studies of ancient Near Easterners24,26. Thus, the present study substantially increases the power to analyze the change in frequencies of alleles known to be biologically important.

We leveraged our data to examine the change in frequency of SNP alleles known to be related to metabolism, pigmentation, disease susceptibility, immunity, and inflammation in the Levant_ChL population, considered in relation to allele frequencies in the Levant_N, Levant_BA_North, Levant_BA_South, Anatolia_N and Iran_ChL populations and present-day pools of African (AFR), East Asian (EAS), and European (EUR) ancestry in the 1000 Genomes Project Phase 3 dataset39 (Supplementary Data 7).

We highlight three findings of interest. First, an allele (G) at rs12913832 near the OCA2 gene, with a proven association to blue eye color in individuals of European descent40, has an estimated alternative allele frequency of 49% in the Levant_ChL population, suggesting that the blue-eyed phenotype was common in the Levant_ChL.

Second, an allele at rs1426654 in the SLC24A5 gene which is one of the most important determinants of light pigmentation in West Eurasians41 is fixed for the derived allele (A) in the Levant_ChL population suggesting that a light skinned phenotype may have been common in this population, although any inferences about skin pigmentation based on allele frequencies observed at a single site need to be viewed with caution42.

Third, an allele (G) at rs6903823 in the ZKSCAN3 and ZSCAN31 genes which is absent in all early agriculturalists reported to date (Levant_N, Anatolia_N, Iran_N) and that has been argued to have been under positive selection by Mathieson et al.31, occurs with an estimated frequency of 20% in the Levant_ChL, 17% in the Levant_BA_South, and 15% in the Iran_ChL populations, while it is absent in all other populations. This suggests that the allele was rising in frequency in Chalcolithic and Bronze Age Near Eastern populations at the same time as it was rising in frequency in Europe.