NRY and mtDNA diversity

We obtained approximately 500 kb of NRY sequence from the 623 males in the HGDP, and complete mtDNA genome sequences from these 623 males plus an additional 329 females from the HGDP. The average coverage of the NRY sequences was 14.5X (range, 5X-37.5X, Additional file 3: Figure S1), while for the mtDNA genome sequences the average coverage was 640X (range, 46X-4123X, Additional file 3: Figure S1). After quality-filtering, imputation, and removal of sites with a high number of recurrent mutations, there remained 2,228 SNPs in the NRY sequences. The mtDNA analyses here are restricted to the 623 males for which NRY sequences were obtained, for which there were 2,163 SNPs; results based on the mtDNA genome sequences from the entire set of HGDP samples (952 individuals) did not differ from those based on the subset of 623 males (for example, Additional file 3: Figure S2). More details about the results from each individual, including mtDNA and NRY haplogroups, are provided in Additional file 1: Table S1. The mtDNA sequences have been deposited in Genbank with accession numbers KF450814-KF451871. A datafile with the alleles at each of the NRY SNPs in each sample has been provided to the CEPH-HGDP and additionally is available from the authors. The NRY raw sequencing data are in the European Nucleotide Archive with the study accession number PRJEB4417 (sample accession numbers ERS333252-ERS333873).

Basic summary statistics for the mtDNA and NRY diversity in each population are provided in Additional file 3: Table S3. As the sample sizes for many of the individual populations are quite small, for most subsequent analyses we grouped the populations into the following regions (based on analyses of genome-wide SNP data [43, 47]): Africa, America, Central Asia, East Asia, Europe, Middle East/North Africa (ME/NA), and Oceania (the regional affiliation for each population is in Additional file 1: Table S1). The Adygei, Hazara, and Uygur were excluded from these groupings as they show evidence of substantial admixture between these regional groups [43, 47]. We stress that the use of regional names is a convenience to refer to these groupings of these specific populations, and should not be taken to represent the entirety of the regions (for example, ‘Africa’ refers to the results based on the analysis of the combined African HGDP samples, not to Africa in general).

Some basic summary statistics concerning mtDNA and NRY diversity for the regions are provided in Table 1. The π values we report are for the most part somewhat larger than reported in a previous study of eight Africans and eight Europeans [50], which is not unexpected given the much larger sampling in our study. Notably, we find substantial variation among geographic regions in amounts of mtDNA versus NRY diversity; this is shown further in the comparison of the mean number of pairwise differences (mpd) for mtDNA and the NRY (Figure 2A). The mtDNA mpd for Africa is about twice that for other regions, while the NRY mpd is greatest in the Middle East/North Africa region, and only slightly greater in Africa than in the other regions (with the exception of the Americas, which show substantially lower NRY diversity). Overall, there are striking differences in the ratio of NRY:mtDNA mpd (Table 1), with Africa, Central Asia, and the Americas having significantly less NRY diversity relative to mtDNA diversity, compared to the other regional groups. Moreover, differences in relative levels of NRY:mtDNA diversity are also evident in the individual populations (Additional file 3: Table S3), although the small sample sizes indicate that the individual population results must be viewed cautiously.

Table 1 Summary statistics for regional groups Full size table

Figure 2 Diversity and AMOVA results. (A) Mean number of pairwise differences (and SE bars) for the NRY and mtDNA sequences from each regional group. (B) AMOVA results for the entire worldwide dataset, and for each regional group of populations. Two comparisons are shown for the entire dataset; the left comparison includes regional groups as an additional hierarchical level, while the right one does not. * indicates that the among-population component of diversity does not differ significantly from zero (after Bonferroni adjustment of the P value for multiple comparisons). Full size image

NRY and mtDNA population differentiation

An outstanding question is whether or not there are differences in the relative amounts of between-population versus within-population diversity for mtDNA versus the NRY, as some studies have found much larger between-population differences for the NRY than for mtDNA [6] while others have not [7]. To address this question, we carried out an AMOVA; the results (Figure 2B) show that in the entire worldwide dataset, the between-population differences are indeed bigger for the NRY (approximately 36% of the variance) than for mtDNA (approximately 25% of the variance). However, there are substantial differences among the regional groups. The ME/NA, East Asia, and Europe regional groups follow the worldwide pattern in having bigger between-population differences for the NRY than for mtDNA. In contrast, Africa, Oceania, and the Americas have substantially bigger between-population differences for mtDNA than for the NRY, while for central Asia the between-population variation is virtually identical for the NRY and mtDNA. These regional differences likely reflect the influence of sex-biased migrations and admixture, as discussed in more detail below, and moreover indicate that focusing exclusively on the worldwide pattern of mtDNA versus NRY variation misses these important regional differences.

We also investigated the relationship between geography and genetic distance. Despite the small sample sizes at the population level, both mtDNA and NRY Φ ST distances are significantly correlated with geographic distances between populations (Mantel tests with 1,000 replications: mtDNA, r = 0.41, P <0.001; NRY, r = 0.36, P = 0.002) as well as with each other (r = 0.23, P = 0.025). Thus, NRY and mtDNA divergence are both highly associated with geographic distances among populations.

MtDNA and NRY phylogenies

Although the primary purpose of this study is to compare demographic insights from mtDNA and NRY sequences that were obtained free of the ascertainment bias inherent in haplogroup-based approaches, we recognize that there is also useful information in the haplogroups. In this section we therefore present some haplogroup-based results. We first used a Bayesian method to estimate the phylogeny and divergence times for both mtDNA and the NRY (Figure 3); for the latter, we used both a ‘fast’ mutation rate of 1 × 10−9/bp/year and a ‘slow’ mutation rate of 0.62 × 10−9/bp/year as there is currently much uncertainty regarding mutation rates [5, 40, 41, 51, 52]. The resulting phylogenies are in general consistent with the existing mtDNA and NRY phylogenies [31, 53], although there are some discrepancies, for example, in the mtDNA tree (Figure 3A) L1 sequences group with L0 sequences rather than on the other side of the root, while additional discrepancies can be found in the NRY trees. However, all of these discrepancies involve nodes that have low support values (red asterisks in Figure 3) and hence low confidence; the nodes that have strong support values are all in agreement with the existing mtDNA and NRY phylogenies. The inability of the Bayesian analysis to completely resolve the phylogenies has two causes: for the mtDNA phylogeny, frequent back mutations and parallel mutations at some sites confounds the analysis; for the NRY phylogenies, some branches in the accepted phylogeny are supported by only a few SNP positions that are not included in our sequence data.

Figure 3 Bayesian trees and divergence time estimates for mtDNA and NRY haplogroups. (A) mtDNA haplogroups; (B) NRY haplogroups with the fast mutation rate; (C) NRY haplogroups with the slow mutation rate. Red asterisks denote nodes with low support values (<0.95). F* in the NRY trees indicates a sample that was assigned to haplogroup F by SNP genotyping, but does not fall with other haplogroup F samples. Some NRY haplogroup K samples formed a monophyletic clade (labelled K in the trees) while others fell with haplogroup M samples (labelled KM in the trees); see also Additional file 3: Figure S8. Full size image

The age of the mtDNA ancestor is estimated to be about 160 thousand years ago (kya), and the ages of the non-African mtDNA lineages M and N are about 65 to 70 kya, in good agreement with previous estimates [54]. Our estimate for the age of the NRY ancestor is 103 kya based on the fast rate, and 165 kya based on the slow rate; however these estimates do not include the recently-discovered ‘A00’ lineage [41], which would result in much older ages for the NRY ancestor. The close agreement between the slow NRY ancestor age (165 kya) and the mtDNA ancestor age (160 kya) might be taken as evidence in favor of the slow NRY mutation rate. However, the slow NRY mutation rate gives an estimated age for the initial out-of-Africa divergence of about 100 kya, and an age for the divergence of Amerindian-specific haplogroup Q lineages of about 20 kya, while the fast rate gives corresponding estimates of about 60 kya for out-of-Africa and about 12.5 kya for Amerindian haplogroup Q lineages, in better agreement with the mtDNA and other evidence for these events [54–57]. Given the current uncertainty over mutation rate estimates, we have chosen to use either both estimates in further analyses (for example, Bayesian skyline plots) or an average of the fast and slow rates (for example, in simulation-based analyses); in Additional file 3: Table S4 we provide divergence time estimates and associated 95% credible intervals for the branching events shown in the phylogenies in Figure 3.

NRY and mtDNA haplogroup frequencies per population are shown in Additional file 3: Table S5 and Additional file 3: Table S6, respectively. The mtDNA haplogroups were called from the sequences determined here, while the NRY haplogroups were previously determined by SNP genotyping [58, 59]. The NRY haplogroup information we provide is taken only from these published data; we did not infer haplogroups from the sequences, in order to have an independent comparison of the NRY tree with the haplogroups. The phylogenetic relationships for the NRY sequences are generally concordant with the SNP-genotyping results (with some exceptions, discussed in the legends to Figures S3 to S12 in Additional file 3). The haplogroup frequencies provide further insights into some of the different regional patterns of mtDNA versus NRY diversity noted previously. For example, the comparatively low diversity and smaller differences among populations for the NRY in Africa is due to the high frequency of NRY haplogroup E (55% to 100% in the non-Khoisan groups; Additional file 3: Table S5). This haplogroup is widespread in western Africa, and specific subhaplogroups of haplogroup E are associated with the Bantu expansion [59–61]. The comparatively low NRY diversity in the HGDP Africa regional group thus likely reflects a ‘homogenizing’ effect of the Bantu expansion. NRY haplogroup E is also of interest because it occurs in some European and ME/NA groups, at frequencies of up to 17%, as well as in a few individuals from Central Asia (Additional file 3: Table S5). Inspection of the phylogeny of haplogroup E sequences (Additional file 3: Figure S7) reveals that all of the European and most of the ME/NA haplogroup E sequences form a clade distinct from the African haplogroup E sequences, and the age of this clade is about 18 kya. Moreover, all of the European haplogroup E sequences fall into a subclade that is about 14 kya. These results may reflect a migration from North Africa to Europe suggested from analyses of genome-wide SNP data [62], and would thus provide a timeframe for this migration.

In Oceania, the bigger differences between populations for mtDNA than for the NRY (Figure 2B, Table 1) probably reflect the high frequency of mtDNA haplogroup B in just one of the two Oceania populations (75% in the Melanesian population vs. 0% in the Papuan population; Additional file 3: Table S6). MtDNA haplogroup B is associated with the Austronesian expansion [63–65]. By contrast, NRY haplogroups associated with the Austronesian expansion, such as haplogroup O [63, 66, 67] are absent in the HGDP Oceania populations (Additional file 3: Table S5). This contrast further testifies to the larger maternal than paternal impact of the Austronesian expansion on Oceanian populations [63, 66–69].

In the Americas, there are dramatic differences in mtDNA haplogroup frequencies among populations (the Karitiana and Surui are 100% haplogroup D, the Pima are 100% haplogroup C, the Maya are 100% haplogroup A, and the Colombians are 50% haplogroup B and 50% haplogroup C; Additional file 3: Table S6), which are at least partly due to the small sample sizes but also in keeping with previous studies [70]. However, all NRY sequences from the Americas fall into haplogroup Q (with the exception of one Pima with a haplogroup G sequence that likely reflects recent European admixture), and overall NRY diversity is substantially reduced in the Americas, compared to mtDNA diversity (Table 1, Figure 2). While the small number of HGDP males from the Americas precludes any definitive statements, the apparently much greater mtDNA than NRY diversity in the Americas might indicate that fewer males than females were involved in the colonization of the Americas, and deserves further investigation.

We note some additional features pertaining to specific populations in the individual NRY haplogroup phylogenies provided in Figures S3 to S12 in Additional file 3, while the full mtDNA phylogeny for the HGDP samples is provided in Figure S13 in Additional file 3.

Demographic history

Sequence-based analysis of NRY variation permits demographic analyses that cannot be carried out with ascertained SNP genotype data, and which can then be compared directly to similar analyses of the mtDNA sequences. In the following demographic analyses, only the sequence data were used, and not any of the haplogroup information. We first estimated the history of population size changes via Bayesian skyline plots (BSPs) for the NRY and mtDNA sequences for each region (Figure 4). These results should be interpreted cautiously, both because of the small sample sizes for some of the regions (in particular, America and Oceania), and because grouping populations with different histories can produce spurious signals of population growth [71]. Moreover, the uncertainty concerning the NRY mutation rate makes it more difficult to compare the timing of population size changes for the NRY versus mtDNA. Nevertheless, both the mtDNA and NRY BSPs indicate overall population growth in almost all groups, but for mtDNA there is a more pronounced signal of growth at around 15,000 to 20,000 years ago than there is for the NRY, and during much of the past it appears as if the effective size for females was larger than that for males (Figure 4).

Figure 4 Bayesian skyline plots of population size change through time for regional groups. Two curves are shown for the NRY data, based on ‘fast’ and ‘slow’ mutation rate estimates. Full size image

To further investigate female and male demographic history, we used simulations and ABC to estimate the current and ancestral effective population size for females (N f ) and males (N m ) for Africa, Europe, East Asia, Central Asia, Oceania, and the Americas. We also estimated the ancestral N f and N m for the out-of-Africa migration. We first used the model in Figure 1 and the combined mtDNA and NRY sequences (using an average of the fast and slow mutation rates for the latter) to estimate the divergence times associated with this model (with the prior distributions for the divergence times given in Table 2). Table 2 also provides measures of the reliability of the resulting parameter estimation based on the pseudo-observed values: average R2 = 0.9, which exceeds the suggested threshold [72] of 10%; average coverage is 89% and factor 2 (proportion of estimated values for the statistics that are within 50% to 200% of the true value) is 90%; the average bias is 2% and relative mean square error (RMSE) is 9%. As these measures indicate satisfactory performance of the simulation [72], we retained the top 1,000 simulations (tolerance of 0.02%) for estimating the divergence times. In addition, the posterior distributions show a markedly improved fit to the summary statistics, compared to the prior distributions (Additional file 3: Table S7, Figure S14). The resulting estimates of divergence times for the model in Figure 1 are provided in Table 2, and are generally in good agreement with previous estimates for the divergence time among continental groups [45, 73, 74].

Table 2 Prior estimates of divergence time (all priors uniformly distributed) and the mean, mode, and 95% HPD (highest posterior density) intervals Full size table

Coverage is the proportion of times the true value for the parameter lies within the 90% credible interval around the parameter estimate; and Factor 2 is the proportion of estimated values that are within 50% and 200% of the true value.

We next carried out separate simulations based on NRY and mtDNA sequences, respectively, and obtained ABC estimates of current and ancestral N m and N f for each regional group and for the out-of-Africa migration. Although the reliability measures indicate greater variance in the simulation results (Tables 3 and 4), the posterior distributions still show a markedly improved fit to the summary statistics (Additional file 3: Tables S8 and S9; Figures S15 and S16). The distribution of the estimated current and ancestral N f and N m are shown for each regional group in Figure 5, and a pictorial summary is provided in Figure 6. The simulation results suggest a small founding size in Africa of about 60 females and 30 males (all population sizes are effective population sizes); migration out of Africa about 75 kya associated with a bottleneck of around 25 females and 15 males; migrations from this non-African founding population to Oceania 61 kya, to Europe 49 kya, to Central and East Asia 37 kya, and from East Asia to the Americas about 15 kya. These divergence times are in reasonable agreement with those in the mtDNA and NRY phylogenies, given the wide confidence intervals on both (Table 2, Additional file 3: Table S4). There was concomitant population growth in all regions (with the most growth in East Asia); however, throughout history the mtDNA and NRY results indicate consistently larger effective population sizes for females than for males (except, possibly, in the ancestors of East Asians).

Table 3 Current and ancestral estimates of male effective population size (N m ) based on simulations of the HGDP NRY sequences Full size table

Table 4 Current and ancestral estimates of female effective population size (N f ) based on simulations of the HGDP mtDNA sequences Full size table

Figure 5 Distribution of N f and N m values, based on simulations. The density of the top 1% of the posterior values obtained from simulations of the mtDNA and NRY sequences are shown. (A) ancestral effective population sizes; (B) current effective population sizes. The dashed line in each plot follows a 1:1 ratio. Full size image