Demographic change of human populations is one of the central questions for delving into the past of human beings. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region, discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely using a molecular clock. We found that all the Paleolithic divergences were binary; however, three strong star-like Neolithic expansions at ∼6 kya (thousand years ago) (assuming a constant substitution rate of 1×10 −9 /bp/year) indicates that ∼40% of modern Chinese are patrilineal descendants of only three super-grandfathers at that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be related to the development of agriculture.

To achieve sufficiently high coverage in the non-recombining regions of Y chromosome (NRY) and an adequate representation of individual samples, we selected 110 males, encompassing the haplogroups O, C, D, N, and Q which are common in East Eurasians, as well as haplogroups J, G, and R which are common in West Eurasians (see Table S1 ), and sequenced their non-repetitive segments of NRY using a pooling-and-capturing strategy.

According to the phylogenetic tree of Y chromosome, all the modern males could be categorized into 20 major monophyletic or paraphyletic groups (referred to as A to T) and their subclades [13] , [14] . Nearly all the Y chromosomes outside Africa are derivative at the SNP M168 and belong to any of its three descendent super-haplogroups – DE, C, and F [9] , [10] , [15] , strongly supporting the out-of-Africa theory. The time of the anatomically modern human's exodus from Africa has yielded inconsistent results ranging from 39 kya [16] , 44 kya [10] , 59 kya [17] , 68.5 kya [18] to 57.0–74.6 kya [19] .

The Y chromosome contains the longest non-recombining region (∼60 Mbp, in which ∼10 Mbp is unique sequence in the genome and easy to analyze) in the human genome [6] , [7] , making it an informative tool for reconstructing genetic relationship of human populations and paternal lineages, and dating important evolutionary and demographic events [8] , [9] , [10] , [11] . However, the sequencing data of Y chromosomes of human populations were insufficient and biased even for those of current 1000-genome project for which coverage on Y chromosome was low (on average <1.4× in East Asian samples) [12] .

Demographic change is one of the central questions in understanding human history, and strong population expansions may be linked to various events as climate changes, alteration of social structure, or technological innovations. The recent advent of next-generation sequencing technology enabled systematic analysis of the population history using the information from the whole genome with less ascertainment bias, so we can re-assess how the various factors have influenced the human population size and structure [1] , [2] . Recent analyses of mitochondrial genomes revealed that the expansions of female lineages of East Asians [3] and those of Europeans [4] started before the Neolithic Era, contradictory to the hypothesis that the agricultural innovation constitutes the primary driving force of population expansions [5] . These observations prompted this study to investigate expansions of male lineages.

This study shows that all the strongly expanding Y chromosomal haplogroups (i.e. O-M175 or C-M130) had already migrated to East Asia more than 20 thousand years before their Neolithic expansion, thus supporting a boom of local farmers in China, which is consistent with the independent origin of agriculture [31] , while differing from the case in Europe, where immigrant farmers from the Middle East contributed to the majority of modern Y chromosomes [32] .

The expansion dates are estimated 5.4 kya for Oα, 6.5 for Oβ, and 6.8 for Oγ ( Fig. 1 ), after the shift to intensive agriculture in North China (since 6.8 kya) [25] , [26] , in particular, during the Yangshao Culture (6.9–4.9 kya) in Central Yellow River Basin, Majiayao Culture (6.0–4.9 kya) in the Upper Yellow River Basin, and the Beixin (7.4–6.2 kya) – Dawenkou Culture (6.2–4.6 kya) in the Lower Yellow River Basin [27] . We therefore propose that in the late Neolithic Age, the three rapidly expanding clans established the founding patrilineal spectrum of the predecessors in East Asia. Since all the sequenced Han Chinese M117+ samples are under the Oα expansion, and M117+ subclade exists in moderate to very high frequency in many Tibeto-Burman ethnic groups [28] , [29] , [30] , it would be of interest to know when the M117+ individuals in other ethnic groups diverged with the ones in Han Chinese, and whether they are also under the Oα expansion, in order to trace the origin and early history of Sino-Tibetan language family.

The most surprising discovery in the tree is the three star-like expansions in Haplogroup O3-M324, i.e. under the M117 clade, the M134xM117 paragroup, and the 002611 clade. Here we denote the three star-like expansions as Oα, Oβ, and Oγ, respectively (see Discussion). Since the sample selection for high-throughput sequencing was intended for representing a wide variety of clades in East Asian populations, a star-like expansion indicates successful expansion of male lineages within a very short period (<500 years). These three clades are present with high frequency across many extant East Asian populations [23] , [24] and encompass more than 40% of the present Han Chinese in total (estimated 16% for Oα, 11% for Oβ, and 14% for Oγ) [20] . It is conspicuous that roughly 300 million extant males are the patrilineal progenies of only three males in the late Neolithic Age.

By using Bayesian method [21] with a constant mutation rate of 1×10 −9 substitution/base/year [7] , [19] (or one substitution per 256 years on 3.9 Mbp range, without considering the uncertainty of mutation rate), we calculated the date of each divergence event throughout the tree. The first divergence event out of Africa, i.e. between Haplogroup DE and the ancestor of C and F, is dated at 54.1 kya (95%CI 50.6–58.2), inside the range of previous estimations. Within the 3.9 Mbp range, only 3 SNPs were observed between the divergence events of DE/CF and C/F, indicating that DE, C, and F likely emerged subsequently in less than a thousand years. After diverged from Haplogroup C, no major split was observed in F for 18 thousand years, suggesting a strong bottleneck of F lineage. It should be noted that all the primary haplogroups (G, J, N, O, Q, and R) emerged before the last glacial maximum (LGM, ∼20 kya), and most of the presently known East Eurasian clades have branched off in the late Upper Paleolithic Age (before 10 kya). All divergences on this tree before 7 kya were binary, suggesting that during the Paleolithic Age, slow population growth and bottlenecks or drift eradicated most of the ever existing clades [22] .

The tree was constructed from 78 samples sequenced in this study, together with three published East-Asian genomes and a chimpanzee genome. The branch lengths (horizontal lines) are proportional to the number of SNPs on the branch. Numbers in red indicate the coalescence time (in years, considering the variation in SNP counting, but ignoring uncertainty in mutation rate) and 95% confidence intervals of the node. For more details, see Fig. S1 .

Overall ∼4,500 base substitutions were identified in all the samples from the whole Y chromosome, in which >4,300 SNPs that has not been publicly named before 2012 (ISOGG etc.). We designated each of these SNP a name beginning with ‘F’ (for Fudan University) (see Table S2 ). We obtained ∼3.90 Mbp of sequences with appropriate quality (at least 1× coverage on >100 out of 110 samples, see Table S3 ), and identified ∼3,600 SNPs in this region. A maximum parsimony phylogenetic tree of the 78 individuals with good coverage was reconstructed ( Fig. 1 and Fig. S1 ), the topology of which is congruent with the existing tree of human Y chromosome [13] , [20] . The tree contained samples from haplogroups C, D, G, J, N, O, Q, and R, and thus represented all the three super-haplogroups out of Africa – C, DE and F. In addition to the known lineages, many new downstream lineages were revealed. All the earlier divergences were found to be bifurcations, except for three star-like structures, i.e. multiple lineages branching off from a single node, were observed under Haplogroup O3a-M324, indicating strong expansion events.

Discussion

Although most of the sequences in this study were obtained from individuals in China, the haplogroup representation (C, D, G, J, N, O, Q, and R) already enabled us to calculate the times of most of the major divergence events outside of Africa, like G/IJK, NO/P etc., since the times were achieved using the hypothesis of molecular clock, and the results of divergence time between haplogroups would not be affected by from whichever continent or country the individuals were sampled. One good sequence from each of two haplogroups is enough for calculating their divergence time, and more sequences could only help to enhance the precision but would not greatly change the result.

The significant improvement of accuracy of dating in this study comparing to former East Asian studies is attributed to the large number of newly discovered SNPs. It is noted that the relative standard deviation of calculated divergence time is in inverse proportion to the square root of observed SNP occurrence (see Discussion S1). Furthermore, the average counts of SNPs from the common ancestor of CF/DE to a modern individual is 210 in this study, limiting the theoretical 95%CI to only ±13.6%, comparing to 9 SNPs on average in the previous study with the 95%CI over ±60% [16]. Considering that 3.9 Mbp range constitutes only less than half of 10 Mbp non-repetitive region in Y chromosome [7], the time resolution of east Asian Y chromosome phylogeny is expected to be doubled in the near future.

The determination of mutation rate is a crucial question in calculation of the absolute divergent times, which caused the most dating differences among the studies[7], [33]. As revealed by previous studies, this inconsistency of mutation rate was resulted from two aspects: among different regions of the chromosome, and between older and younger time scales. The former has been disclosed in a study of autosomes, that the base substitution rate of CpG bases is 9.5-fold that of non-CpG bases [34], as well as for mitochondrial DNA, the substitution rate was not only differentiated between coding and control regions, but also in a base-by-base manner [35]. It is worth to point out that recently, Wei et al. published a similar study about Y chromosome sequencing of 36 individuals (mainly Haplogroup R1b and E1b), in which 3.15 or 8.83 Mbp range was sequenced [19], and they achieved a time of out-of-Africa at 57–74 kya using various methods, which is slightly older than our result (54 kya), although the same mutation rate of 1×10−9 substitution/base/year were employed. The difference could be ascribed to the regions chosen for date estimation; we compared the regions that Wei et al. and we studied, and found that in their study, the SNP density in the region that was sequenced only in their study is significantly higher than that in the region that both studies have sequenced (P<0.005) (Table S4).

The difference between long-term (evolutionary) and short-term (genealogical) mutation rates has also been observed before. For calculating the divergence time using Y-chromosomal STR (short tandem repeat), the father-son mutation rate is about three times the “evolutionary” [36]; similar rate difference was also observed for mitochondrial nucleotide substitution rate [35], [37]. This controversy is usually explained by selection on deleterious mutations [38]. The autosomal genealogical substitution rate was estimated at 1.2×10−8 substitution/base/generation [34], [39], which is less than half of the rate we used in this study. However, due to that 80–85% of de novo mutations are attributed to the father's side [34], and that the Y chromosome contains the least genes among the chromosomes and thus underwent lessened purifying selection [40], the mutation rate used in this study is still compatible with the previous studies.

We also compared human and chimpanzee Y chromosomes (see SI Methods), and found ∼45,800 substitutions between the two species which fall into the range that we compared for human samples; roughly 1/4 intra-human SNPs have no homologous loci on chimpanzee Y chromosome. Assuming the divergence between human and chimpanzee was at ∼6,000 kya [41] and a constant substitution rate, the divergence time for DE and CF would be only 40 kya, which is younger than our result. This suggests that base-substitution rate between human and chimpanzee is higher than the rate inside human species, which can be explained with the huge interspecies difference of the Y-chromosome structures, and the observation that the chimpanzee Y chromosomal genes decayed faster than human [40].

To overcome the factors for uncertainty of mutation rate, a calibration with series of samples of comparable time scales might be used. For the case of mitochondrial DNA, a recent study, in which several C-14 calibrated ancient complete sequences (4–40 kya) were incorporated into the tree, made the absolute dates much more convincing [42], and we expect a parallel calibration for the Y chromosome in the near future.

Despite of the mutation rate uncertainty, we evaluate our calculation of absolute divergence time as acceptable. Firstly, our out-of-Africa date (54.1 kya) is still within the range of previous estimations (39–74.6 kya). Secondly, the out-of-Africa date is similar to the recent estimation of two great mitochondrial expansions outside Africa – M (49.6 kya) and N (58.9 kya) [43]. Thirdly, it is not contradictory to the emergence of earliest modern human fossil out of Africa (e.g. ∼ 50 kya in Australia) [44].

The accumulative substitution count from the DE/CF divergence to a modern individual varies from to 168 (YCH113) to 241 (YCH198). Despite of this variation, by testing the assumption of molecular clock for the tree, the null hypothesis of a molecular clock could not be rejected (P>0.05) using PAML package v4.4 [45] with the GTR model, unlike the mitochondrial tree from complete sequences, which showed violation to the clock assumption [43]. Part of the branch length variation may come from the false negative detection of SNPs, especially on a long terminal branch; however, this effect was mostly eliminated that we chose only the sequences with good quality for time estimation, so the branch length difference for these sequences should mainly reflects the real variation, and should have little effect in time estimation.

The current Y haplogroups were named according to the rule of Y Chromosome Consortium (YCC) [14]. Along with increasing clades being discovered, the present nomenclature became cumbersome in some cases, e.g. ‘R1b1b2a1a2d3a’ in the ISOGG tree 2010 (http://www.isogg.org), which is prone to frequent name changes and hard to remember. Another commonly used nomenclature such as ‘O-M117’ or ‘O-F46’ is also not suitable for a determined star-like expansion, since there are many SNPs found ancestral to the star point, while these SNPs may be found not all equivalent in the future, e.g. some individuals might be found M117+ but not belonging to the star expansion, then the name of the star point must be renewed. Therefore, here we propose a modification to the current nomenclature system: for any important star-like expansion that leads to large population (e.g. several millions) and multiple lineages (≥5) in short time as revealed by long-range sequencing (>1 Mbp was needed in order to limit the expansion within 1,000 years), a lineage name with lowercase Greek letter is applied directly after the Latin capital letter of the first-class haplogroup name. For example, the star-like expansions under M117, F46, and F11 are now named as Oα, Oβ, and Oγ, respectively, and their downstream lineages should still be named following the rule of YCC 2002, with Arabic number succeeding the Greek letter, e.g. Oα1a1. These names of the star-like expansions are not bound to any single defining SNP (e.g. M117), but to the expansion itself, i.e. the expansion names should always keep unchanged despite new side clades would be found to its upstream, in order to keep the nomenclature stable. For the currently equivalent SNPs on the branch leading to the expansion, we will know the occurring order only after vast amount of samples being genotyped for those SNPs.

Since all the Paleolithic divergences of Y chromosome lineages are binary, the three roughly contemporaneous star-like expansions revealed in this study indicate a remarkable demographic change in the late Neolithic Age. The earliest agriculture in North China emerged before 10 kya [46], however, no distinct Y chromosomal expansion could be related to this event. The three star-like expansions happened several thousand years later, thus are likely linked to middle Neolithic cultures such as Yangshao (6.9–4.9 kya) and Dawenkou Culture (6.2–4.6 kya) in the Yellow River Basin [27]. During this period, agriculture became mature and intensive, and the majority of human diet shifted from food collection into production [47], [48]. Crop harvest constituted a more stable food source than hunting and gathering, and enabled nourishing population at higher density. In addition, liberation of males from hazardous hunting might have enhanced male viability into adulthood, thus the effective population size of Y chromosome increased. Besides the progress in agriculture, changes in social structure might also contribute to the patrilineal expansion. In the middle and late phases of Yangshao and Dawenkou culture, the burial customs showed a gradual transition from an egalitarian matrilineal society into a hierarchical patrilineal one [49], [50]. Interestingly, the major maternal expansions in China shown by mitochondrial tree (among which are also several star-shaped expansions) occurred much earlier, at the late Paleolithic Age [3]. This immense non-synchrony between maternal and paternal expansion suggests a possible transition of social structure, that in the late Neolithic Age, a few paternal lineages achieved greater advantage on the existing basis of the population that started expansion since the Paleolithic Age. After the strong Neolithic expansions, the reproductive advantage of the farmers lasted for 4,000 years, until most of the gatherer-and-hunter tribes in the Yellow River Basin were absorbed by the farming societies of Huaxia, from which the Han ethnicity was formed.

Although without ancient DNA proofs, we cannot yet confirm the initial expanding regions of these three clans, whether they were original in the middle or lower reach of Yellow River Valley or migrated from the vicinity, we are now at least certain that a majority of Han Chinese did derive from just a few patrilineal ancestors in the Neolithic Age. Whether each of them could be related to the legendary Emperors Yan and Huang or their tribes, is to be solved with more prudence and with the help of interdisciplinary genetic, archeological, ethnical, and documentary studies.