Samples

DNA donors were recruited with informed consent (University of Leicester Research Ethics Committee reference: maj4-cb66). DNA was extracted from various sources, including lymphoblastoid cell lines, peripheral blood and saliva. A total of 340 samples from 17 populations (20 males each) across Europe and the Near East were included in the design. Samples from Greece, Serbia, Hungary, Germany (Bavaria), the Spanish Basque country, central Spain, the Netherlands (Frisia), Denmark, Norway, Finland (Saami), England38 (Herefordshire and Worcestershire), Orkney38, Ireland and Turkey were collected by the authors. Twenty random Palestinian male samples were purchased from the National Laboratory for the Genetics of Israeli Populations (www.tau.ac.il/medicine/NLGIP). Finally, samples from two HapMap39 populations were used, both to supplement the population data set and to provide data on externally analysed samples for validation purposes: the Centre d'Etude du Polymorphisme Humain (CEPH) collection in Utah, USA, with ancestry from Northern and Western Europe (CEU), and the Toscani in Italia (TSI). After the initial analyses, one English and one Spanish individual were identified as female and therefore removed from all downstream analyses in this study, reducing the final number of samples to 338. For further details on samples, see Supplementary Table 1.

Bait design for target enrichment

For target enrichment, Agilent SureSelect (Agilent Technologies Inc., CA, USA) hybridization capture was used. RNA baits were designed using Agilent eArray with default parameters for Illumina Paired-End Long Read sequencing (bait length: 120 bp; design strategy: centred; tiling frequency: 1×; avoid overlap: 20 bp; strand: sense; avoid regions: repeat masker) and the human reference sequence hg19/GRCh37 (February 2009). Boosting was used for ‘orphan’ (located >20 bp from flanking baits) and GC-rich (⩾63%) baits by direct replication (orphans 2×, GC-rich 3×).

In this study we focus on the eight X-degenerate regions31 of the Y chromosome which are likely to yield interpretable sequence data; other captured regions are discussed elsewhere13. The total length of targeted regions was ∼8.6 Mb, and following capture design and the necessary repeat masking, the designed baits covered 2.3 Mb. Coordinates of the eight targeted regions can be found in Supplementary Data 2.

Sequencing and data processing

Genomic DNA (3–5 μg) was used for library preparation and target enrichment with the Agilent SureSelectXT Target Enrichment System for Illumina Paired-End Sequencing Library kit (version 1.3). To obtain larger insert sizes, DNA samples were fragmented to ∼250–600 bp without size selection. This resulted in a mean insert size of 330 bp, which increases recovery of sequence data from bait-adjacent regions. Sequencing was done on an Illumina HiSeq 2000 instrument (Illumina, CA, USA) with a paired-end 100-bp run to high coverage. Library preparation, target enrichment and sequencing were carried out at the High-Throughput Genomics Centre at the Wellcome Trust Centre for Human Genetics, University of Oxford, UK.

Base calling was done using Illumina Bustard40 and quality control with FastQC41. Sequence data were mapped to the human genome reference (GRCh37) using Stampy v1.0.20 (ref. 42). Local realignment was done using The Genome Analysis Toolkit (GATK) v2.6-5 (ref. 43), followed by duplicate read marking with Picard v1.86 (ref. 44) and base quality score recalibration also with GATK. The individual steps and parameters used are listed in Supplementary Table 9.

Variant calling and filtering

Owing to the larger insert sizes (see above) and the high efficiency of sequence capture, high sequence coverage was obtained not only at baited regions but also at ∼300 bp flanking the enrichment baits. The original bait coordinates were therefore extended by 300 bp on either side of each bait, and overlapping coordinates were then merged, increasing the size of the analysed region to 4,433,580 bp.
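The padding-and-merging of bait coordinates described above can be sketched as follows (a minimal illustration with invented interval coordinates, not the actual bait design):

```python
def pad_and_merge(intervals, pad=300):
    """Extend each (start, end) interval by `pad` bp on both sides,
    then merge any intervals that overlap or abut."""
    padded = sorted((max(0, s - pad), e + pad) for s, e in intervals)
    merged = [list(padded[0])]
    for s, e in padded[1:]:
        if s <= merged[-1][1]:          # overlaps/abuts the previous interval
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(iv) for iv in merged]

# Two 120-bp baits 400 bp apart: after +/-300 bp padding they merge into one region.
print(pad_and_merge([(1000, 1120), (1520, 1640)]))  # [(700, 1940)]
```

Summing the lengths of the merged intervals over all baits would give the 4,433,580 bp analysed region.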

Data on the 338 male samples described above were co-analysed with simultaneously generated data on an additional 117 samples, described elsewhere13. Variant calling was done using the samtools mpileup v0.1.19 multi-sample option, calling all samples simultaneously (Supplementary Table 9). In total, 19,276 raw SNPs were called from the 455 male samples.

Raw variants were filtered using vcftools v0.1.11 (ref. 45) and in-house Perl scripts. The filters used for the final data are listed in Supplementary Table 9. In addition to the two females, seven samples (four from the European population set) were removed from the final data set because they were missing >5% of calls. The filtered data set included 13,474 variant sites from 448 samples, with 0 to 643 missing calls per individual and an average call rate of 99.8%.

To recover as many missing genotypes as possible for subsequent analyses, the missing calls were divided into three groups based on read depth (DP): DP 0, the genotype call was discarded; DP 2–6, the raw call was accepted; in all other cases, the site was re-called using a single-sample approach to obtain the DP4 field in the VCF, the BAM file was checked manually, and the most probable allele was inferred by comparing the BAM file with the information contained in the DP4 field. After this procedure, 213 of the 13,474 sites still lacked genotype calls, leaving a final total of 13,261 sites for further analyses.
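The read-depth triage above can be sketched as a simple decision function (a simplified illustration of the stated rules; the function name and labels are ours, not the in-house script):

```python
def triage_missing_call(depth):
    """Classify a missing genotype call by its read depth (DP)."""
    if depth == 0:
        return "discard"            # no reads: genotype stays missing
    if 2 <= depth <= 6:
        return "accept_raw"         # accept the raw call as-is
    return "recall_single_sample"   # re-call, then inspect BAM + DP4 manually

print([triage_missing_call(d) for d in (0, 1, 4, 12)])
# ['discard', 'recall_single_sample', 'accept_raw', 'recall_single_sample']
```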

Having applied the above filters to variant sites, it was necessary to apply the same criteria to non-variant sites. We calculated the depth per sample per site using the GATK DepthOfCoverage tool, filtered for base quality 20 and mapping quality 50, and then applied the criterion of ⩾6× coverage in ⩾95% of samples. This reduced the number of base pairs analysed from 4,433,580 to 3,724,156 bp. The corresponding coordinates (Supplementary Data 2) were used for all downstream analyses.
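The ⩾6× coverage in ⩾95% of samples criterion amounts to the following per-site check (a minimal sketch with toy depth values, not the actual GATK output format):

```python
def site_passes(depths, min_depth=6, min_fraction=0.95):
    """True if at least min_fraction of samples reach min_depth coverage at a site."""
    covered = sum(d >= min_depth for d in depths)
    return covered >= min_fraction * len(depths)

# Toy example with 20 samples: 19 covered at >=6x passes (95%); 18 does not (90%).
print(site_passes([6] * 19 + [3]))      # True
print(site_passes([6] * 18 + [3, 0]))   # False
```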

Mean raw sequence coverage per sample across the 3,724,156 bp of analysed regions was calculated using Picard v1.93. Sequence depth for the 448 samples varied from 25× to 85× per sample, with an average of 51×. Supplementary Table 1 shows sequence depth information for the 338 male samples described here. The final European data set used here, excluding cases with >5% missing calls, includes 334 individuals and 5,996 SNPs (Supplementary Data 3).

Validation

In silico validation of the 13,261 filtered SNP calls was done using two previously published data sets: genomes sequenced to high coverage with self-assembling DNA nanoarrays by Complete Genomics46, and Omni2.5 BeadChip genotype data produced at the Broad Institute as part of the 1000 Genomes Project47. Our samples included 4 and 39 HapMap individuals overlapping with these two data sets, respectively. A Perl script was written to compare the SNP calls in our variant set with those at overlapping samples and positions in the control sets.
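The logic of such a call-set comparison can be sketched as follows (a Python illustration of one reasonable definition of the error rates; the original used a Perl script, and the exact rate definitions are our assumption):

```python
def error_rates(test_calls, control_calls):
    """Compare variant (1) / reference (0) calls at matched sites and samples.

    False positive rate: variant in test but reference in control, as a
    fraction of control reference calls.  False negative rate: reference in
    test but variant in control, as a fraction of control variant calls.
    """
    fp = fn = ctrl_ref = ctrl_var = 0
    for t, c in zip(test_calls, control_calls):
        if c == 0:
            ctrl_ref += 1
            fp += (t == 1)
        else:
            ctrl_var += 1
            fn += (t == 0)
    return fp / ctrl_ref, fn / ctrl_var

# Toy example: one false positive among 4 control-reference calls,
# no false negatives among 4 control-variant calls.
print(error_rates([1, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1, 1]))  # (0.25, 0.0)
```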

Across the 888 variant sites shared between our data and the Complete Genomics data over four overlapping samples, the false positive and false negative error rates were both 0%. When compared with the Omni data across 241 variant sites and 39 overlapping samples, the error rates were 0.13% for false positives and 1.82% for false negatives; however, all the false calls originated from only 19 variant sites.

To shed light on these comparatively high error rates, we also compared the Complete Genomics and Omni data for regions corresponding to our final analysed regions. Across 263 variant sites and 49 overlapping samples, we obtained false positive and false negative rates of 2.85% and 2.36%, respectively. The false calls originated from 30 sites, 15 of which overlap with the sites producing high error rates in the comparison of our data with Omni. Since the Complete Genomics data set is generally considered to be of very high quality, this suggests problems in making correct calls from the Omni genotyping data. More detail is provided in Supplementary Table 2.

Phylogenetic inference

PHYLIP v3.69 was used to create a maximum parsimony phylogenetic tree48. Three independent trees were constructed with dnapars using randomization of the input order (seeds: 4941985, 62529981 and 38185313), each jumbled 10 times. The output trees of these runs were used to build a consensus tree with the consense programme included in the PHYLIP package.

The tree was rooted using two Y chromosomes belonging to haplogroups A and B, which were sequenced in the complete data set13. FigTree v1.4.049 was used to visualize the tree (tree.bio.ed.ac.uk/software/figtree/).

Haplogroup prediction

The presence of known markers was checked using AMY-tree v1.2 (refs 50, 51). This software was developed to predict MSY haplogroups from whole-genome sequence data using a list of known markers. Since our data cover only a proportion of the MSY rather than the whole of it, the software lacks sufficient information for haplogroup prediction. However, it can be used to deduce the presence and allelic states of known MSY markers in sequence data. The AMY-tree v1.2 conversion file contains a list of 1,453 known Y-SNPs, of which 490 are present in our data. These 490 sites were used to assign a standard haplogroup to all our samples according to the Y Chromosome Consortium phylogenetic tree20 and its subsequent updates (Supplementary Table 1).
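Deducing allelic states of known markers amounts to a table lookup, sketched below (positions, marker names and alleles are invented for illustration and do not reproduce the AMY-tree conversion file):

```python
# Hypothetical subset of a marker-to-haplogroup conversion table:
# position -> (marker name, haplogroup, ancestral allele, derived allele).
MARKERS = {
    2887824: ("M269", "R1b1a2", "G", "A"),
    21843490: ("M253", "I1", "C", "T"),
}

def assign_haplogroup(genotypes):
    """Return (marker, haplogroup) pairs whose known marker position
    carries the derived allele in this sample's genotypes."""
    hits = []
    for pos, (name, hg, anc, der) in MARKERS.items():
        if genotypes.get(pos) == der:
            hits.append((name, hg))
    return hits

# Sample derived at the first marker only.
print(assign_haplogroup({2887824: "A", 21843490: "C"}))  # [('M269', 'R1b1a2')]
```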

TMRCA and ages of nodes

The TMRCA of the tree and of nodes of interest were estimated via two approaches:

BEAST v1.8 (refs 17, 52): Markov chain Monte Carlo (MCMC) samples were based on 25,000,000 generations, logging every 1,000 steps, with the first 2,500,000 generations discarded as burn-in. Three runs were combined for analysis using LogCombiner. We used an exponential growth coalescent tree prior (growth rate prior: uniform(0–0.002)), an HKY substitution model and a strict clock with a substitution rate of 1.0 (95% CI: 0.92–1.09) × 10−9 mutations/nucleotide/year24. TMRCAs were estimated in a single run including all 17 European populations, assigning samples to specific clades in agreement with the MP tree shown in Fig. 1.

Rho: A Perl script was written to calculate TMRCA and its standard deviation for any given clade within a PHYLIP outfile, using the rho statistic53,54. A scaled mutation rate of 268.5 (246.3–291.9) years per mutation was used, based on a published rate of 1.0 (95% CI: 0.92–1.09) × 10−9 mutations/nucleotide/year24 and the number of nucleotides in our regions of interest (3,724,156).
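The rho calculation can be sketched as follows (assuming rho is the mean number of mutations separating each sampled chromosome from the clade's root, with a commonly used standard-error formula sqrt(total mutations)/n; both that formula choice and the toy mutation counts are our assumptions, and the original used a Perl script on a PHYLIP outfile):

```python
import math

def rho_tmrca(mutations_to_root, years_per_mutation=268.5):
    """TMRCA via the rho statistic: the mean number of mutations from each
    sampled chromosome to the clade root, scaled by the calibrated
    years-per-mutation rate (268.5 years per mutation for 3,724,156 bp)."""
    n = len(mutations_to_root)
    rho = sum(mutations_to_root) / n
    sigma_rho = math.sqrt(sum(mutations_to_root)) / n  # assumed SE formula
    return rho * years_per_mutation, sigma_rho * years_per_mutation

# Toy clade of four chromosomes carrying 10, 12, 9 and 11 mutations to the root.
age, se = rho_tmrca([10, 12, 9, 11])
print(round(age), round(se))  # 2819 435 (years)
```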

Bayesian skyline plots

Bayesian skyline plots were generated using BEAST v1.8 (refs 17, 52). MCMC samples were based on 100,000,000 generations, logging every 1,000 steps, with the first 10,000,000 generations discarded as burn-in. We used a piecewise linear skyline model with 10 groups, an HKY substitution model, and a strict clock with a mean substitution rate of 1.0 × 10−9 mutations/nucleotide/year24 and a generation time of 30 years, consistent with ref. 55. For the 13 populations showing a recent expansion in the BSP, the limits of the 95% CI of the mutation rate24 (0.92–1.09 × 10−9) were used to define the CI of the time estimate of the minimum effective population size before the expansion (grey shading in Fig. 2).

Intrapopulation diversity and geographical correlation

The number of polymorphic sites per population, Tajima's D56 and Fu's FS57 were calculated using Arlequin 3.5 (ref. 58). The number of singletons was calculated using vcftools v0.1.11. Correlation tests between measures of genetic diversity and latitude or longitude were run in R59 using the function cor.test from the stats package.
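The statistic behind cor.test can be illustrated with a self-contained Pearson's r (the toy latitude/diversity values are invented; R's cor.test additionally returns a P value, omitted here):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, the default statistic of R's cor.test."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: a diversity measure (e.g. polymorphic sites) declining with latitude.
latitude = [38.0, 41.9, 47.5, 52.4, 55.7, 60.2]
diversity = [210, 195, 180, 172, 160, 151]
print(round(pearson_r(latitude, diversity), 3))  # -0.994
```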

Approximate Bayesian computation

We generated one million simulated data sets for each tested model (Supplementary Fig. 2) with the programme FastSimcoal2 (ref. 60), simulating a single haploid locus of 3,724,156 bp. We summarized the data by means of the derived site frequency spectrum (-s -d flags in the command line), considering only categories with at least one observed polymorphic site. Ancestral states in the observed data were defined elsewhere13 using a custom script.

To compare models, we applied the logistic regression procedure61. Model parameters were estimated by locally weighted multivariate regression22 after a logtan transformation62 of the 10,000 best-fitting simulations from a specific model. To calculate the posterior probabilities of models and parameters, we used R scripts from http://code.google.com/p/popabc/source/browse/#svn%2Ftrunk%2Fscripts, modified by AB and SG.

We also estimated the power of our ABC procedure to correctly recognize the true model by calculating, for each model, the proportions of true positives and false positives. We evaluated 1,000 pseudo-observed data sets generated under each model, counting the number of times a specific model was correctly identified by the ABC procedure (true positives) and the number of times the same model was incorrectly selected as the true model (false positives).
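The power calculation can be sketched as follows (a toy stand-in: the classifier below mimics a simulate-then-select ABC step with a fixed 90% success rate; all names and numbers are illustrative, not the actual pipeline):

```python
import random

def model_choice_power(classify, true_model, models, n_pods=1000, seed=1):
    """Estimate model-choice power: generate pseudo-observed data sets (PODs)
    under `true_model` and count how often `classify` recovers it (true
    positives), and how often PODs from the other models are wrongly
    assigned to `true_model` (false positives)."""
    rng = random.Random(seed)
    tp = sum(classify(true_model, rng) == true_model for _ in range(n_pods))
    others = [m for m in models if m != true_model]
    fp = sum(classify(m, rng) == true_model
             for m in others for _ in range(n_pods))
    return tp / n_pods, fp / (n_pods * len(others))

models = ["M1", "M2", "M3", "M4", "M5"]

def toy_classifier(generating_model, rng):
    # Hypothetical stand-in for POD simulation + ABC model selection:
    # picks the generating model 90% of the time, otherwise a random other.
    if rng.random() < 0.9:
        return generating_model
    return rng.choice([m for m in models if m != generating_model])

tp, fp = model_choice_power(toy_classifier, "M3", models)
print(tp > 0.8, fp < 0.1)  # True True
```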

Demographic models and priors

We considered five models depicting different demographic histories, testing each model separately for each population (Supplementary Fig. 2). M1 is the simplest, in which the effective population size remains constant over time (uniform prior: 20–20,000). In M2, an ancient constant-sized population (uniform prior: 1,001–20,000) starts an exponential reduction T1 generations ago. The reduction spans LEX generations (uniform prior: 5–634), after which the population returns to a constant size (uniform prior: 20–1,000) T2 generations ago (uniform prior: 0–30). In M3, the reduction is followed by an expansion that starts T2 generations ago (with the effective population size, NER, drawn from a uniform prior: 20–1,000) and continues until the present (uniform prior for the current effective population size, NC: 1,001–20,000). M4 is parameterized in the same way as M2, but with an expansion instead of a reduction (NA uniform prior: 20–1,000; NC uniform prior: 1,001–20,000). In M5, the expansion ends at T2 and is followed by a reduction until the present (NEE uniform prior: 1,001–20,000; NC uniform prior: 20–1,000). We considered a generation time of 30 years. In all models, the Last Glacial Maximum (∼20,000 years ago) represents the upper bound on the time of the first demographic change (T1).

In each simulation, the per-generation, per-site mutation rate24 was drawn from a normal distribution with mean 3.01 × 10−8 and 95% confidence interval 2.77–3.26 × 10−8. DNA sequences were generated under a finite-sites mutational model with no transition/transversion bias.

Perl scripts used in the analysis are available upon request.