Deep ancestry of R0a

R0a’b (of which R0a forms the major part: Fig. 1; Fig. S1), which dates to ~40 ka using ML, is the sole known sister clade to the major West Eurasian haplogroup HV, with the two together comprising haplogroup R0. R0 branches directly from macro-haplogroup R, which dates to ~59 ka [15]. Although haplogroup R predominates amongst West Eurasians, especially Europeans, continent-specific basal branches are also found amongst South Asians, East Asians, Southeast Asians and Oceanians [32]. Thus whilst haplogroup R is a global non-African founder clade, R0 is primarily West Eurasian.

Figure 1 Maximum-parsimony phylogenetic tree of 202 complete mtDNA sequences belonging to haplogroup R0a. Three R0b sequences are also included. Each circle represents a mitogenome and numbers are the same as those in Table S1. Mutations are shown on the branches (relative to rCRS); they are transitions unless the base change is explicitly indicated. Suffixes indicate: transversions (to A, G, C, or T), deletions (d), heteroplasmies (R and Y) and reversions (@). Insertions are also suffixed with a dot followed by a number indicating how many bases were inserted and the inserted nucleotide/s (.1C). Recurrent mutations are underlined. The variation at np 16519, in the number of Cs at nps 309 and 315 as well as the AC indels at nps 515–522 were not included in the phylogeny. All the samples are coloured according to their geographic origin as shown in the legend. ML age estimates are reported in ka for nodes encompassing at least three mitogenomes, except for R0a5 (two mitogenomes), which is extremely rare.

R0a, dating to ~30 ka using ML (Table 1), falls into at least five major subclades, three (R0a1, R0a2’3 and R0a4) already known [17,33] and two (R0a5 and R0a6) newly defined here (Fig. 1). Two further basal haplotypes (#201 and #202) are seen in Italy and Spain, respectively. Along with a third lineage basal to R0a1 known from control-region data to occur in Egypt (Fig. S1) and the distribution of the very rare R0b, these might suggest a pre-LGM Mediterranean/Near Eastern source for R0a and R0a1, 25–40 ka. Alternatively, they might represent relicts of Late Glacial or postglacial dispersals around the Mediterranean.

Table 1 Molecular divergence and age estimates (maximum likelihood and ρ) for haplogroup R0a’b and its subclades.

R0a4, R0a5 and R0a6 are all rare. A survey of the R0a5 HVS-I motif indicates a wide distribution across the Near East and Europe (Table S2) and a deep Glacial ancestry (36.9 ± 14.1 ka with HVS-I; the two mitogenomes diverge at 18.8 ± 6.6 ka). A similar assessment for haplogroup R0a6 is more difficult, because its only control-region mutation is the reversion of the 16126 transition, but its distribution appears to be largely restricted to Pakistan (mainly but not exclusively Kalash), with Palestinian, Iranian and Italian singletons (see also Fig. S1). Given its prevalence in the Kalash, further study of this lineage may help shed light on the origins of the Kalash people.
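The ρ-based ages quoted above (for example 36.9 ± 14.1 ka) can be illustrated with a minimal sketch of the ρ statistic and its standard error, computed branch by branch on a rooted tree. This assumes the commonly used branch-wise formulation; the function names, the toy tree and the value of `years_per_mutation` are illustrative assumptions and are not the data or the mutation-rate calibration used in this study.

```python
from math import sqrt

def rho_with_se(branches, n_samples):
    """Return (rho, sigma) for a rooted tree.

    branches: list of (n_descendants, n_mutations) pairs, one per branch,
              where n_descendants is the number of sampled sequences
              descending through that branch.
    n_samples: total number of sampled sequences in the clade.
    """
    # rho is the mean number of mutations separating samples from the root
    rho = sum(n_i * l_i for n_i, l_i in branches) / n_samples
    # the standard error weights each branch by the square of its descendants
    sigma = sqrt(sum(n_i ** 2 * l_i for n_i, l_i in branches)) / n_samples
    return rho, sigma

def age_estimate(branches, n_samples, years_per_mutation):
    """Convert rho and sigma to years with a caller-supplied rate (an assumption)."""
    rho, sigma = rho_with_se(branches, n_samples)
    return rho * years_per_mutation, sigma * years_per_mutation

# Toy tree with 4 samples (hypothetical, not from Table S1):
#   root -> A (1 mutation, 3 samples at or below A)
#   A -> tip1 (1 mutation), A -> tip2 (1 mutation)
#   root -> D (2 mutations, 1 sample)
toy_branches = [(3, 1), (1, 1), (1, 1), (1, 2)]

rho, sigma = rho_with_se(toy_branches, 4)
print(rho)  # 1.75

# Illustrative round-number rate only; not the calibration used in the paper
age, se = age_estimate(toy_branches, 4, 10000)
print(round(age), round(se))
```

Because the standard error grows with the square of the number of descendants on each branch, clades whose samples mostly pass through a few deep branches, or which contain few samples, carry wide intervals, as for the two R0a5 mitogenomes.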

In contrast, the frequency and distribution of R0a4 cannot be assessed from published datasets because it lacks any diagnostic control-region mutations. With this limitation in mind, Fig. 1 and Table S1 indicate that R0a4 encompasses mainly mitogenomes from Western Europe, Spain in particular, but also Iraq.

An Arabian source for the major R0a lineages

The great majority of R0a mitogenomes cluster within R0a1 and R0a2’3, dating to the LGM (~26 ka and 21 ka, respectively), each mainly represented by a single star-like subclade (R0a1a and R0a2, respectively). These subclades both coalesce to the Late Glacial: ~13 and 17 ka (Table 1). These are the two major expansion lineages in R0a, but although widespread, they are both overwhelmingly seen in Arabia, especially Yemen (Fig. 2). However, R0a1 also includes R0a1b, comprising mainly lineages from Arabia, and several possibly related lineages including a Bedouin from Arabia and a Moroccan. Given that the former have an Arabian origin and the latter are also from Arabic-speaking populations, which probably spread from Arabia during the Muslim conquests, the whole of R0a1 seems likely to have an Arabian origin, dating back to at least 26 ka and thus spanning the LGM. This implies that the several Iranian lineages and a single Syrian lineage within R0a1a were derived from an Arabian source. This is supported by the HVS-I network, in which Iranian lineages broadly represent a small subset of Arabian R0a1a diversity (Fig. S1). This is also the case for the few Syrian and Iraqi lineages, and the single branch shared by two Druze individuals diverged very recently. Moreover, an overall ρ estimate for Fertile Crescent lineages in the HVS-I network for R0a1a, as a simple, unbiased measure of diversity, is only 64.4% of that for Arabian lineages. Thus R0a1 most likely entered Arabia by 26 ka, with the few northern Near Eastern lineages due to recent gene flow from Arabia into the Fertile Crescent. We need to recall this when we consider the founder analysis, below.

Figure 2 Spatial frequency distribution maps of haplogroups R0a, R0a1a, R0a2b1 and R0a2b2. Dots indicate the geographical locations of the surveyed populations. Population frequencies (%) correspond to those listed in Table S2. The extremely high frequencies of R0a and R0a1a in the Socotra sample (38.5% and 24.6%, respectively) were not included in order to provide a correct representation of the much lower frequencies in the regions surrounding the island. We constructed spatial frequency distribution plots with the program Surfer 9 (Golden Software, http://www.goldensoftware.com/products/surfer).

Similarly, R0a2’3, at ~21 ka, most likely has an Arabian ancestry. R0a3 is a minor Late Glacial Arabian subclade that sits alongside a paraphyletic Iranian lineage (shared with an Egyptian in the HVS-I dataset). As with R0a1a, Iranian HVS-I lineages within the major R0a2 are broadly a subset of Arabian diversity, with a number of ancestral haplotypes at elevated frequencies (Fig. S1). This may be explained by sporadic gene flow across the Gulf, but some Iranian lineages (along with lineages found further east in Pakistan) may also represent gene flow along the maritime trading networks which intensified in the mid- to late Holocene [34]. There is also a subclade, R0a1a1a, dating to ~3.5 ka (part of a larger clade, R0a1a1, that is also largely restricted to Yemen, dating to 10.3 ka), associated with the settlement of the island of Socotra, which may also have been part of a wider process of increased maritime activity and exchange [35].

As with R0a1a, if we examine R0a2 lineages from the Levant as a potential source pool, although some are ambiguous, more than a third of the R0a Druze in the HVS-I network (Fig. S1) belong to a derived, largely European subclade (R0a2r), dating to ~12 ka (younger than the Arabian expansions); one belongs to a European cluster; and several to Arabian clusters. Again, of four Syrian lineages in the database, one belongs to the European/Druze R0a2r, one to the diverse Arabian subclade R0a2f (which also includes more than a third of Iraqi lineages at its tip) and one to R0a1a7, the most frequent in Yemen according to the HVS-I network, with derived lineages in Pakistan and possibly also Oman (Fig. S1). This phylogeographic pattern is markedly distinct from that in R0a5, for example. A comparison of overall ρ in HVS-I for putative R0a2 lineages (although much less clearly distinguished in the network) shows that the ρ value for the Fertile Crescent is below that of the Arabian lineages, albeit much closer (95.6%). Again, the best explanation is an Arabian source for the Levantine lineages, in some cases as a result of sporadic gene flow, but for the majority due to Late Glacial expansions through the Levant into Mediterranean Europe, as we discuss further below. This once again suggests a Glacial arrival in Arabia, by 26 ka, although in this case the existence of the Levantine/European R0a2r subclade may suggest that we should not completely rule out a source in a Levantine refugium and Late Glacial expansions into Arabia as an alternative possibility.

With this caveat, this overall pattern strongly suggests that R0a1 and R0a2’3 both entered Arabia before or around the LGM and that the R0a1b/R0a1* and R0a3/R0a2’3* lineages are relicts that were not caught up to the same extent in the Late Glacial expansions that followed the LGM. This conclusion is further supported by the Bayesian skyline plots (BSPs) and reciprocal founder analyses detailed below.

Expansions of R0a1 and R0a2’3 lineages

The conclusion is strengthened by the distribution of the remaining lineages within each subclade. R0a1a encompasses at least eight major subclades (R0a1a1–8; R0a1a5–8 newly reported) and many paraphyletic lineages. Levantine lineages belong mainly to Negev desert Bedouin and Palestinians. The Bedouin have an Arabian Peninsula ancestry, and genome-wide PCA and ADMIXTURE analyses indicate that Palestinians too are more similar to Arabian populations than to other Levantine populations and likely have substantial Arabian ancestry [36,37]. There is a single small Ethiopian subclade, R0a1a2, dating to ~5 ka but diverging directly from the R0a1a root, and there are several sporadic singleton Horn lineages, but the vast majority of African R0a lineages fall within R0a2.

The larger R0a2 dates to ~16 ka, with 18 derived subclades which coalesce mainly to the Late Glacial, ~13 and 15 ka (Table 1). The Bølling-Allerød interstadial began ~14.7 ka [38] and is associated with de-glaciation in Europe and a wet phase in the Near East/Arabia, which might have facilitated movements of hunter-gatherers into previously arid areas [39]. There are two major Eastern African subclades, R0a2b and R0a2g, dating to ~13 and ~11 ka respectively, and several minor ones, one of similar age and another dating to ~4 ka but again diverging basally from R0a2. There is also a major Late Glacial subclade, R0a2r, found in southern Europeans but with two basal Druze lineages (from Israel and Lebanon); and several very minor subclades pointing to dispersals into Eastern Europe and Iran/Pakistan.

The BSPs (Fig. 3) show that these coalescences correspond to two major phases of population growth amongst R0a lineages in both the Late Glacial – the Bølling-Allerød interstadial (R0a2) – and the immediate postglacial, after the Younger Dryas (R0a1a). The BSP for R0a as a whole points to a major episode of ~12-fold growth from ~16 ka until ~10 ka, with a more recent episode of ~20-fold growth at ~3 ka. The separate plots show that whilst the growth of R0a2 overlaps with R0a overall, R0a1a was involved in a subsequent population expansion, in the early postglacial warming period following the Younger Dryas glacial relapse, ~11.5 ka. The finding of distinct demographic histories for R0a1a and R0a2 suggests that they may at one time have characterized different populations, possibly even dispersing from separate glacial refugial areas.
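To give a sense of scale for the ~12-fold growth between ~16 ka and ~10 ka, the short calculation below converts a fold change over an interval into an average exponential growth rate and a doubling time. It is only a back-of-the-envelope sketch: it assumes constant exponential growth across the whole interval, which the skyline plot itself does not imply, and the function name is hypothetical.

```python
from math import log

def mean_growth_rate(fold_change, start_ka, end_ka):
    """Average exponential growth rate per year for a given fold change
    between start_ka and end_ka (thousands of years ago)."""
    years = (start_ka - end_ka) * 1000
    return log(fold_change) / years

# ~12-fold growth from ~16 ka to ~10 ka, as read from the R0a skyline plot
rate = mean_growth_rate(12, 16, 10)
doubling_time = log(2) / rate

print(f"{rate:.2e} per year")             # about 4.1e-04 per year
print(f"{doubling_time:.0f} years")       # roughly 1,700 years per doubling
```

On this simplified reading, the Late Glacial expansion corresponds to an effective population size doubling roughly every 1,700 years, a slow rate in absolute terms that nonetheless accumulates to an order-of-magnitude increase over six millennia.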

Figure 3 Bayesian skyline plots (BSPs) of haplogroups R0a, R0a1a and R0a2. The thick solid line is the median estimate and the shading shows the 95% highest posterior density limits. The time axis is limited to 25 ka, beyond which the curves remain flat. Hypothetical effective population sizes through time are based on the mitogenomes listed in Table S1.

BSPs based on geographic origin (Fig. S2) confirmed a primary Bølling-Allerød expansion, with an additional expansion restricted to the Arabian Peninsula ~3 ka (shadowed in Eastern Africa). The plots also suggest that the earliest major signal of Late Glacial expansion was in Arabia, beginning ~17 ka, rather than in the Fertile Crescent (~14 ka), once again supporting an Arabian source. There is no independent expansion signature in Eastern Africa.

Major dispersal episodes: founder analysis

In order to date and quantify the main dispersal episodes, we performed a founder analysis on the mitogenome data. This identifies “founder sequences” shared between two populations as potential evidence for gene flow between the two populations. In this case, however, this poses a problem, since we have seen above that we cannot uniquely identify a source population and that most if not all of the Levantine and Iranian lineages in the major subclades are likely due to subsequent gene flow. (This is almost certainly the case also for most of the Mediterranean and North African lineages within R0a1 and R0a2’3.) Nevertheless, we performed the analysis assuming a northern source, in order to provide the most conservative estimate for the age of Arabian lineages. Although this assumption almost certainly does not hold for R0a1a, and probably not for R0a2 either, the analysis can still provide a clear picture of the main expansion episodes, to complement the skyline plots.

We therefore assumed a source in the Fertile Crescent, including the Levant and Iraq, both with and without Iran, in order to explore further the pattern in Arabia and to quantify and date subsequent dispersals into the Horn of Africa, Europe and South Asia, including Arabia in the source when assessing dispersals into Eastern Africa (Tables S3–S9, Fig. 4). We included Levantine Bedouin and Palestinian lineages as part of the Arabian sample, as discussed above.
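The logic of screening founder candidates can be sketched in a few lines. The sketch assumes the usual reading of the f1 and f2 criteria, which this section does not spell out: a haplotype shared between source and sink is accepted as a founder only if it has at least one (f1) or at least two (f2) derived branches carrying source samples. The tree, the node names and the function `founder_candidates` are hypothetical, and the real analysis also handles inferred (unsampled) nodes and dating, which are omitted here.

```python
from collections import defaultdict

def founder_candidates(parent, samples, min_source_branches):
    """Return haplotypes observed in both source and sink that have at least
    `min_source_branches` derived branches containing source samples.

    parent:  dict mapping each node to its parent (None for the root)
    samples: dict mapping node -> collection of regions ("source"/"sink")
    """
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    def has_source(node):
        # True if this node or any of its descendants was sampled in the source
        if "source" in samples.get(node, ()):
            return True
        return any(has_source(child) for child in children[node])

    founders = []
    for node in parent:
        regions = samples.get(node, ())
        if "source" in regions and "sink" in regions:
            derived = sum(has_source(child) for child in children[node])
            if derived >= min_source_branches:
                founders.append(node)
    return sorted(founders)

# Hypothetical toy tree: haplotypes A and B both occur in source and sink
parent = {"root": None, "A": "root", "A1": "A", "A2": "A",
          "B": "root", "B1": "B", "B2": "B"}
samples = {"root": ["source"],
           "A": ["source", "sink"], "A1": ["source"], "A2": ["source"],
           "B": ["source", "sink"], "B1": ["source"], "B2": ["sink"]}

print(founder_candidates(parent, samples, 0))  # ['A', 'B'] any shared haplotype
print(founder_candidates(parent, samples, 1))  # ['A', 'B'] f1-style screening
print(founder_candidates(parent, samples, 2))  # ['A']      f2-style screening
```

In the toy example, haplotype B passes the looser criterion but not the stricter one, because only one of its derived branches is seen in the source. Under the stricter criterion its sink lineages would instead be assigned to a deeper founder, which is why f2 tends to yield fewer and older founders than f1.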

Figure 4 Founder analysis of R0a. Probabilistic distribution of founder clusters across migration times, with time scanned at 200 year intervals from 0 to 50 ka, using f1 (blue lines) and f2 (red lines) criteria. (A) from the Fertile Crescent, Caucasus, Iran and the Arabian Peninsula to Eastern Africa; (B) from the Fertile Crescent and Caucasus to Arabian Peninsula and Eastern Africa; (C) from the Fertile Crescent and Caucasus to the Arabian Peninsula; (D) from the Arabian Peninsula to the Fertile Crescent, Iran and Caucasus; (E) from the Arabian Peninsula to the Fertile Crescent and Caucasus; (F) from the Fertile Crescent, Iran, North Africa, the Arabian Peninsula and Caucasus to India and Pakistan and (G) from the Fertile Crescent, Caucasus, Iran, North Africa and the Arabian Peninsula to Europe.

First, we show Eastern Africa alone as the sink (Fig. 4A and Table S3), with the whole of Southwest Asia as the source. Here there is no Late Glacial peak, but rather a clear signal right at the start of the Holocene with both criteria: 11.8 ka with f2 and 10.8 ka with f1. With f2, this is the sole signal, but with f1 there is a second, more recent peak at 2.8 ka. The difference occurs in R0a2b, which is classed as a single African founder by the f2 criterion, whereas R0a2b2 is classed as a distinct founder dating to 2.9 ka with f1. This lineage has been elevated to high frequency (10.3–12.5%, the most frequent lineage) in Ethiopian Jews against a genome-wide background that is almost identical to that of other Ethiopians, and it is not seen in Yemeni Jews, where an Arabian lineage within R0a2c is seen at even higher frequency instead [22,40]. Because of this, despite the superficial confirmation of the ~3 ka migration inferred from autosomal studies, we should be cautious of taking the f1 result at face value. It may be that this population has subsequently experienced gene flow back towards the Levant and that this is the reason for identifying the migration with f1 that is screened out with the more stringent f2. However, given the inferences of substantial later northwards gene flow discussed above, we consider f2 the more plausible criterion for this dataset, at least regarding the settlement of Arabia. Nevertheless, some gene flow ~3 ka is possible, especially given the strong growth signal around this time in the Arabian BSP, and may also be indicated by mtDNA haplogroup HV1 (see Discussion).

We next show the results when Eastern Africa and Arabia are combined into a single sink population (Fig. 4B and Table S4). The f2 criterion indicates a single Late Glacial expansion at ~15.4 ka, involving all R0a lineages. The f1 criterion distinguishes an additional more recent, postglacial expansion for R0a1a, ~11.0 ka, but the above discussion has explained why an additional migration is an unlikely scenario in practice. It does highlight, however, that further expansion, involving R0a1a in particular, took place in the postglacial, as also shown in the skyline plots. There is no sign under either criterion of the more recent dispersal at ~3 ka, confirming that, if it occurred at all (and involved R0a), its source was within Arabia and not in the Fertile Crescent.

We next show the results with Arabia alone as the sink, with the Fertile Crescent (excluding Iran) as the source (Fig. 4C and Table S5). Here again we see the major dispersal with f2, ~15.6 ka. This represents our best estimate for the timing of the Late Glacial expansion of R0a. With f1 we again see both an even earlier Late Glacial peak at 17.6 ka and an additional episode at ~10.0 ka.

The reciprocal founder analysis, assuming Arabia as source and the Fertile Crescent as sink, including the Levant, Iraq and Iran (Fig. 4D and Table S6), shows a very slight early Holocene peak in f2 and major peaks towards the present for both f1 and f2. The picture is similar whether or not Palestinians are included within the Arabian source (not shown). Since the peaks are much more recent when Arabia is the source, this implies that any dispersals from Arabia towards the Fertile Crescent must have been much more recent than dispersals in the opposite direction. An analysis that excludes Iran (Fig. 4E and Table S7) differs in detail, yet retains the general features of more recent Holocene peaks especially towards the present for both f1 and f2. These results re-emphasise that the Fertile Crescent R0a variation seen today cannot be the main source for much of the diversity in Arabia, again confirming that Arabia is the most ancient reservoir of R0a variation. This in turn supports the arguments given above that the founder estimates for Arabia are in fact most likely expansion times within the Peninsula rather than dispersals from a Levantine refugium in the north.

Finally, we tested the migrations to South Asia (Fig. 4F and Table S8) and Europe (Fig. 4G and Table S9). As for the Horn of Africa and unlike for Arabia, we can safely interpret these results straightforwardly in terms of dispersals from an Arabian source. The results for the former show a small peak ~7.8 ka with both f1 and f2 criteria, based on very few sequences, and a stronger signal ~2 ka with f1, corresponding to R0a6. The mitogenomes yielding the ~2 ka signal mostly belong to the Kalash community, which is very isolated and carries low diversity in a number of mtDNA lineages of West Eurasian origin [41]. The 2 ka signal transposes to ~30 ka with f2, but examination of the tree shows clearly that this is an artefact: the lack of additional lineages deriving from the f2 founder candidate in South Asia, the low diversity within the Kalash and the presence of a Palestinian lineage in the clade all point to the more recent introduction of the rare R0a6, suggesting that it may have been insufficiently sampled in Southwest Asia.

The results for Europe also suggest a primary dispersal into Southeast and Mediterranean Europe at the end of the Pleistocene/early Holocene, mainly involving R0a2r, with the signal a little earlier with f2 than f1. This may have been via a Levantine refugium, given the presence of basal Druze lineages in the cluster (and a Syrian in the HVS-I data). It recalls the patterns detected in a much larger fraction of haplogroup J and T lineages that dispersed from an inferred Levantine refugium along the Mediterranean after the LGM [42]. Some lineages may have dispersed later in the Holocene, but this is unclear given the small sample size (R0a occurs amongst Europeans at a rate of only 0.8%).