Samples and Genotyping

A total of 8,515 samples were analyzed (Supplementary Table S1) for their Y-chromosome genetics. All participants recruited and genotyped by our team had at least three generations of paternal ancestry in their country of birth and provided details of their geographical origin. A written informed consent was signed and obtained by each participant prior to recruitment for this study. The study protocol and the informed consent form were approved by the IRB of the Lebanese American University. The study methods were carried out in accordance with the principles of the Declaration of Helsinki. DNA was extracted from blood or buccal swabs using a standard phenol–chloroform protocol. Samples were genotyped for binary Y chromosome polymorphisms as reported previously23.

Our previously described samples (n = 2047)23,28,29 were further subtyped to achieve the same Haplogroup (Hg) differentiation as the new samples analyzed here23,28,29, with the additional 6 SNPs (J1e-P58, J2a-M410, J2b-M12, E1b1b1a-M78, E1b1b1b-M81 and E1b1b1c-M123). DNA samples were also typed for 11 microsatellite loci (DYS 388, 389I, 390, 391, 393, 19′, 437, 439, 389II, 392 and 438).

In addition, Y- Hg and haplotype data were also incorporated from prior studies (Supplementary Table S1), including 727 newly genotyped samples from Armenia (ARM - 402 samples), Armenians originating in SE Turkey (TUR - 126), Bahrain (BAH - 40), Iraq (IRQ - 70), the Kingdom of Saudi Arabia (KSA - 7), Cyprus (CYP - 38) and Libya (LIB - 44) (Study references included in the supplementary information.) Haplogroup markers employed by the studies were catalogued (Supplementary Table S2). The Y-STR Haplotype Reference Database (YHRD) tree and International Society of Genetic Genealogy (ISOGG 2011) tree were compared, with YHRD being used as the base nomenclature (Supplementary Table S2) for construction of most derived sets.

Autosomal data were obtained from a prior study30, which had selected 75 Lebanese samples from a pool of 1,341 by stratified random sampling. That dataset had included 994 samples from 48 populations spanning SW Asia, Europe, N. Africa, and into S. Asia. All of these samples were analyzed using Illumina 610 K or 660 K bead arrays. The results were filtered requiring 99% genotype success rate, and removal of sex-linked and mtDNA SNPs, yielding 505,859 SNPs. Further LD pruning (excluding r2 > 0.4) yielded 244,919 SNPs. These QC filtering steps were performed with PLINK31. From these, we retained 174 samples representing SW Asia populations.

Population Pooling

The initial populations were defined primarily based on modern national boundaries or sub-regions. Pooled populations were identified through preliminary BATWING (see Supplementary information) analysis as described below.

We sought to analyze the data without preconceptions of the history of southwest Asian populations, allowing the data to govern their own analysis. Our data preparations sought to ensure the STR loci and haplogroup derivations were uniform across populations and haplogroups, requiring factoring to a lowest common denominator across all haplogroup and population studies.

BATWING analyzes population splits of the form ((A,B),(C,D)) as two topologies instead of one, counting an (A,B) split prior vs. subsequent to the (C,D) split as two distinct topologies, inducing a spurious interaction between (A,B) and (C,D) split times, even though those splitting events are independent. Running BATWING with two conformations (AB, (C,D)), and ((A,B), CD), where AB represents A and B pooled, and CD represents C and D pooled removes the interaction. The two estimates of the first split provide a check for consistency. This process also establishes pooled populations. We adapted the pooled regions for the second phase of analysis when pooling of similar regions following similar expansion routes were shared by multiple haplogroups. The resulting pooled regional designations are Arabia (Saudi Arabia, Emirates, Qatar, Bahrain), Yemen (by itself), Armenia and Turkey (one region), Caucasus (Georgia), Mesopotamia (Iran, Iraq, Kuwait), Cyprus, Southern Levant (Jordan, Palestine), Northern Levant (Lebanon, Syria), Egypt, North Africa (Libya, Morocco, Tunisia, Algeria), and Ethiopia (the data from this preliminary analysis is not shown).

Since BATWING input only involved genetic data and population assignments, results might have been geographically random. Instead, the modal tree population splits easily lie on a map. The combinatorial enumeration of trees vs geography is non-trivial. However, simple limits may be estimated by considering the chances that an east-west split would have accurately classified Northern and Southern Levant (4 populations) on one side and Mesopotamia and Arabia (7 populations) by chance on the other, for example, which has a p-value = 0.003 by a Fisher exact test.

STR-based F ST computations, MDS, and complete linkage agglomerative clustering were not sufficient to resolve regions in a meaningful way. AMOVA (see supplementary information) also did not show significant regional organization. Haplogroup J1′s F CT = 0, with p-value = 0.097 that a random construction would be larger. Haplogroup J2′s F CT = 0.0097, with p-value = 0.486. Haplogroup E1b1b1 shows the greatest divergences, with F CT = 0.233, with p-value = 0.00067.

The parameter set for BATWING is described in the supplementary information. Error bar 95% CIs about the median range from 1/1.75 to ½ of the median to 1.75 to 2 times the median.

Analysis

Frequency and variance maps

Maps showing haplogroup frequency distributions (Supplementary Figure S1) were derived from data from our laboratory listed in Table S1. Frequency (Supplementary Figure S2) and variance (Supplementary Figure S3) contour map construction is described in the Supplementary information.

PCA and MDS

Principal component analysis (PCA) was applied to relative haplogroup frequencies. Multidimensional scaling (MDS), was applied to R ST distances on STR loci across haplogroups from the full derived set, as well as within J2 and J*(xJ2) haplogroups. Details on these methods are in the Supplementary information.

Network

NETWORK was used to compute Reduced Median (RM) networks (Supplementary information) for haplogroups J1e, J*(xJ1e), J2a, J2b, and J2.

Y TMRCA Calculation

BATWING parameters are described in the supplementary information. Times of Most Recent Common Ancestor (TMRCA) estimated using UEP time estimation on J subhaplogroups, J1, J1e, J2, J2a1, J2a1, J2a2, J2a2a, and J2b, for whole-population data drawn from Armenians, Caucasians, Iranians, Jordanians, Lebanese, Palestinians, South East Turks, and Turks, Syrians and Africans are listed in Table 1.

Table 1 Times to most recent common ancestors for each haplogroup, measured in ka, for each region. Full size table

Y Haplogroup expansions

BATWING population splitting was applied to individual haplogroups drawn from both whole-population and haplogroup-specific studies. BATWING samples STR haplotype phylogenies and population split phylogenies, and Bayesian prior distributions to estimate effective population sizes, and mutation rates, using a Metropolis-Hastings Markov-Chain Monte-Carlo simulation. BATWING assumes a single-step mutation model, and coalescence governed by an effective population size, and it models population splitting by dividing the total parent population effective population size among the child populations. The population splitting assumption is the weakest link in application to individual haplogroups. BATWING provides a fixed then expanding effective population size model, allowing for substantial flexibility in population coalescences that the expanding haplogroups will have evolved within as they migrated. Detailed considerations are offered in the supplementary information.

Genome-wide samples and analysis

We analyzed 174 samples from 9 Southwest Asian populations (Georgians, Armenians, Turks, Lebanese, Syrians, Palestinians, Jordanians, Saudis and Yemenis) using published genome-wide marker data29,30,32. A PCA removing outliers left 155 samples (supplementary information). ADMIXTURE (supplementary information) was applied to the reduced set to identify ancestral populations and to compute ancestral F ST s. Divergence dates were estimated from F ST estimates (see the supplementary information for details).