Of the screened 2768 mtDNAs from 24 tribes of India the macrohaplogroup M accounted for 69.39 %, which is consistent with the earlier reports [5, 11, 14]. The frequency distribution of macrohaplogroup M varies significantly (P <0.0001) among studied tribes with a cline towards southern and eastern regions of India as shown in Table 1 and Figure 1. In tribes (MaThakur, KaThakur, Kathodi, Katkari) of western region, macrohaplogroup M frequency is significantly low (~50% or less; P <0.011) as compared to the other studied regions of India. Unexpectedly Dungri Bhil representing the north-westernmost region shows a high frequency of M (76.1%) as compared to its other western counterparts.

Table 1 Sampling details and mtDNA lineage distribution in India. Full size table

Figure 1 Map of the Indian subcontinent indicating approximate locations of studied populations and mtDNA haplogroup distribution. '*' Approximate location of the populations studied by Sun et al. [7], the mtDNA sequences of which were used in this study. Full size image

For the earliest settlers' component among the studied tribes, 1810 samples of macrohaplogroup M were screened for the motif that confirms haplogroup M2 within M as described in methods. Our results indicate that M2 is completely absent among the eight tribes of northeast India, expect one M2 in Sonowal Kachari. Avoiding northeast tribes, the M2 haplogroup frequency is about 13.86 % among the studied tribes. Its frequency is ~10 to ~20 % in tribes of western and central India. The frequency declines gradually to farther north and east. In southern region tribes, Betta Kuruba shows highest frequency (i.e. 39.13%) whereas the adjacent Jenu Kuruba tribe shows frequency of only 7.02 %. The distribution of subclade M2b varies greatly from complete absence among Indo-European speakers of western and central India to as high as 35.65 % among Betta Kuruba. Irrespective of region, its frequency is high (>50% of total M2) in all Dravidian speakers, except Madia tribe of central region whose linguistic affiliation is not very clear. Similarly it's frequency is high in Korku, an Austro-Asiatic tribe of central India. In eastern region M2b frequency remains low (<50% of M2).

Defining the M2 substructure

The reconstructed phylogenetic tree based on our newly sequenced 72 mtDNAs of M2 haplogroup and 4 additional M2 complete sequences from the literature [7] is given in Figure 2. Out of the four defining mutations of macrohaplogroup M, one transition at nucleotide position (np) 14783 shows reversion in one of our samples. Besides the commonly occurring 16319 transition, M2 in our samples is defined by the motif 447G-1780-8502-11083-15670-16274 as also described in Kivisild et al., Rajkumar et al., Sun et al. and Thangaraj et al. [6, 7, 12, 9]. Though one major branch in our tree lack mutation at np 16274 but due to its presence in most M2 samples of this study, as well reported elsewhere [6, 7, 11] we considered this mutation as a basal polymorphism of M2 as suggested in Sun et al. [7] and lack of the 16274 variant in some samples [[6, 12], this study] indicates a back-mutation event. Similarly lack of mutation at np 11083 in one of our samples is also treated as reversion event.

Figure 2 Phylogenetic reconstruction of 76 complete mtDNAs of M2 Lineage. Mutations were scored relative to the rCRS [58]. Sample details and population codes has been given in Table 1. Four additional complete mtDNA sequence of M2 lineage (labeled as R102, T3, T11 and T27) are acquired from published sources [7] has been used for tree reconstruction. Suffixes A, C, G, and T indicate transversions, "d" signifies a deletion and a plus sign (+) an insertion; recurrent mutations are underlined. The prefix "@" indicates back mutation. The coalescent estimates were calculated as per [16] and [17] presented as bold and Italic respectively. Full size image

The M2 tree shows an initial deep split into two sister clades M2a and M2b. No third clade, as indicated in Rajkumar et al. [12] has been found. The clade M2a is defined by transition at np 7961, 12810 and contains three independent basal branches M2a1, M2a2 and M2a3, in contrast to the earlier reports [6, 7, 11] where M2a defining motif largely constitute mutations of its sub-braches. M2a1 is defined by the motif 204-5252-8396-9758-16270-16352, in which transition at np 8396 show parallelism in two samples of M2a2 branch and transition at np 16352 shows a reversion event. M2a2 is defined by the motif of four diagnostic 7702-11041A-12657-13708 and two recurrent 16240C-16311 mutations. The branch M2a3 is defined by the motif of one recurrent np 146 and three specific 5426-5774-7762 mutations. The further divergence within these branches of M2a exhibit probable pattern of more shared haplotypes within populations of geographic proximity followed by population specific haplotypes and a few shared haplotypes among geographically apart populations.

Unlike M2a, M2b instead of early branching represented by a single deep root defined by the motif 152-182-195-522,523d-1453-2831T-3630-5744-6647-9899-13254-14766-16183C-16189-16193+C-16320 which, of late shows branching pattern similar to the sub-branches of M2a. The M2b1 defined by the transition at np 6260-5420 harbour population of eastern region. Whereas, M2b2 defined by transition at np 16295, harbours Dravidians. Other braches within M2b are more or less population specific. In this study, spread of M2b by enlarge restricted to Dravidians and tribes of eastern region. The root of M2b in our tree differs in two positions to the earlier definition of Sun et al. [7] i.e. transition at np 182 is present in all of our M2b samples so we treated this as basal mutation and lack of this in one sample of Sun et al. [7] could be better explained by reversion event, second our all M2b samples has poly 'A' at np 16180–16182 and twelve 'C's thereafter. Hence in our tree an additional 'C' at np 16184–16193 has been treated as insertion at np 16193, than transversion (A16182C) reported by Sun et al. [7].

Age estimates and Phylogenetic implications

Coalescent age estimates were calculated by Rho (ρ) statistics [15] using two different mutation rates [16] and [17] shows a marginal time difference when standard deviation is taken into account, the later has been considered because of robustness in view of natural selection [17]. The average sequence divergence of the 76 M2 coding-region sequences from the root of M2 calculated as per [17] corresponds to a coalescence time estimate of 36.5 ± 1.6 thousand years (ky). The founder age estimate for Indian mtDNA lineages using M2 data, 50.0 ± 1.5 ky is well within the lower bound range of earlier estimates (i.e. sometime before 50 kyBP) of modern human dispersal into Arabia and southern Asia [1, 2, 4, 17–21], and perhaps more close to the estimates of [17].

The two clades of M2 show differential branching patterns. M2a with coalescent age 21.6 ± 2.3 ky splits into its three deep rooting branches M2a1, M2a2 and M2a3. M2a2 is specific to Kathodi/Katkari tribe, whereas M2a1 and M2a3 encompass almost all the studied tribes. M2a1, M2a2 and M2a3 show coalescent estimates of 7 to 9 ky. The clade M2b doesn't show branching event earlier than estimated coalescence time of 12.6 ± 2.8 ky. In our samples we could not find M2b among Indo-European speakers of west and central India. The Dravidian speaking tribes of south extending up to central India and tribes of eastern region irrespective of linguistic affiliation, harbour both clades (i.e. M2a and M2b) of M2, presenting a time depth of ~37 ky.

Diversity indices

Diversity indices and demographic parameters estimated for studied tribes are given in Table 2. The M2 lineage, haplotype diversity among Indian tribes ranged from 0.40 to 1.00 and nucleotide diversity from 0.0001 to 0.002. Though four geographical regions of India did not differ significantly (Mann-Whitney U-test) in haplotype diversity it was comparatively higher in west (0.90–1.00) followed by central (0.83–1.00), eastern (0.83–1.00) and southern tribes (0.40–1.00). Nucleotide diversity in east (0.0010–0.0019) was significantly higher than west (0.0001–0.0009; Z = 2.24, P = 0.025) and central tribes (0.00016–0.0011; Z = 2.65, P = 0.039), intermediate nucleotide diversity values were observed in south India (0.0006–0.002), they were not significantly different from west and central India (Z = 1.71; P = 0.087) or east India (Z = 0.44; P = 0.662). These patterns of genetic diversity were further strengthened by the analysis of mean pairwise differences (MPD). MPD of west (1.67–16.00) and central tribes (2.67–18.00) were significantly lower (Z = 2.41, P = 0.016) than the MPD from east (17.17–32.67), whereas MPD from south (11.20–30.00) were not significantly different from east, west and central tribes (Z = 1.39, P = 0.166). Thus observed mtDNA diversity indicate to the fact that haplotype/haplogroup frequency is a poor parameter of deep rooting ancestry rather it is the product of recent population growth. Similarly, the diversity parameters are also influenced by the past demographic events and any phylogenetic inference drawn on such parameter should keep in view the past demographic events, particularly for India where such event has been predicted previously [22].

Table 2 Diversity and demographic parameters deduced from complete mtDNA sequences of M2 lineage in India. Full size table

Past population dynamics

As indicated in our results and previously [22] the demographic history of populations in different geographic regions might have played pivotal role in shaping the complex net of population phylogenetic relationships in Indian subcontinent. The demographic history of M2 lineage as a surrogate of the middle/early upper Palaeolithic component of Indian populations was reconstructed using Bayesian skyline plot (BSP) [23]. Figure 3 (panel 'A') shows the BSP of M2 lineage produced using 76 complete mtDNA sequences along with plot (panel 'B') using only coding region. Although the two analyses are very similar, the second is confined to slow evolving region of mtDNA [24] which is likely to define lineages that have existed in the population prior to a putative bottleneck, thus increasing the sensitivity of BSP to detect more complex demographic trends. As the analysis is based on only single lineage it provides insight into the demographic event limiting to the age of the lineage (i.e. 37–45 kyBP). Most striking is the population decline, observed during Last Glacial Maximum i.e. 23 to 19 kyBP [25] and Late Glacial Aridity i.e. 18 to 14 kyBP [26], followed by many fold population growth in a comparatively short period of time. If such demographic event had affected the earliest settlers of India it would have resulted in several implications of phylogenetic interest. Firstly, reduction of genetic diversity across all the lineages in which, lineages with a smaller population spread would have been affected the most. Second, it might have mechanized some sort of unifying effect where smaller lineages are eliminated or at least reduced to margins of extinction and lineages of larger spread remained among all the post bottleneck populations. The Post Glacial rapid population growth achieved some sort of plateau by 7 to 3 kyBP followed by another decline which was to its maximum around ~1000 to 1500 BP. Now the question is whether the observed demographic trend was uniform throughout India or it was as complex as reported by the earlier studies [22]. A similar analysis for each studied geographical region of India is presented in panel C to F of Figure 3. Due to the small sample size in each geographic region BSP produces low resolution; however rapid post glacial population growth is evident in east, south and central India, followed by a population decline from 3 to 1 kyBP. The rapid regain after this period has been observed in central region; however such regains are marginal in other two regions. The demographic past of ancient lineage among western tribes was quite different- a population growth from ~7–8 kyBP continued to present. The negative values of Fs that differ significantly from zero indicative of population's demographic expansion [27] also support the recent population expansion in western region (Fu's Fs = -7.07; P = 0.004).

Figure 3 Bayesian skyline plots showing demographic histories of earliest settlers' component. The thick solid line is the median estimate, and the grey area overlay show the 95% highest posterior density (HPD) limits. Panel 'A'- The Bayesian skyline plot (m = 10) for India total, derived from complete mtDNA sequences (n = 76). Panel 'B'- The Bayesian skyline plot (m = 10) for India total, derived from coding region (577–16023) mtDNA sequences (n = 76). The time estimates (yBP) were calculated as per [16]. For comparison, the cold and arid period around the Last Glacial Maximum are also indicated on panel A & B. Panel 'C to F' shows Bayesian skyline plots (m = 10) derived from complete mtDNA sequences of eastern (n = 11), central (n = 29), southern (n = 14) and western (n = 22) regions of India respectively. Full size image

Genetic Structure

The above results are indicative of some genetic structure in Indian populations, to investigate that, AMOVA was used (Table 3). In the total samples (model 1) 49.13% of the variance was found within populations and 50.87% among populations. Studied tribes were then grouped according to geographic proximity (model 2), linguistic affinities (model 3) and to the results suggested, namely two groups separating Indo-European speakers of west and central India from all others (model 4). Under the models 2 and 3, 45–48 % of the variance was found within populations, 36–39% among populations within groups and 13–18 % among groups. The model 4 more appropriately reflects the genetic structure with variance among groups 29.93% exceeds the variance among populations within groups 27.79%.