Abstract Iran, despite its size, geographic location and past cultural influence, has largely been a blind spot for human population genetic studies. With only sparse genetic information on the Iranian population available, we pursued its genome-wide and geographic characterization based on 1021 samples from eleven ethnic groups. We show that Iranians, while close to neighboring populations, present distinct genetic variation consistent with long-standing genetic continuity, harbor high heterogeneity and different levels of consanguinity, fall apart into a cluster of similar groups and several admixed ones and have experienced numerous language adoption events in the past. Our findings render Iran an important source for human genetic variation in Western and Central Asia, will guide adequate study sampling and assist the interpretation of putative disease-implicated genetic variation. Given Iran’s internal genetic heterogeneity, future studies will have to consider ethnic affiliations and possible admixture.

Author summary Based on genome-wide genotype data on over 1000 samples from eleven ethnic groups present in Iran and by comparison to reference data sets of both extant populations and ancient DNA samples, we show that the Iranian population comprises distinct genetic variation with respect to populations in close geographic proximity, a cluster of genetically largely overlapping ethnic groups as well as a number of strongly admixed groups. These observations, also corroborated by f3 migration statistics and other approaches, indicate genetic continuity of and limited influx into the cluster groups over several millennia, despite Iran’s geographic position at a crossroads in West Asia. They also suggest, correspondingly, several instances of language adoption instead of demic replacement in the past. Future human genetic studies, both with a focus on population and medical genetics, will have to consider differences in heterogeneity, consanguinity and degree of admixture between the ethnic groups for an adequate design and interpretation.

Citation: Mehrjoo Z, Fattahi Z, Beheshtian M, Mohseni M, Poustchi H, Ardalani F, et al. (2019) Distinct genetic variation and heterogeneity of the Iranian population. PLoS Genet 15(9): e1008385. https://doi.org/10.1371/journal.pgen.1008385 Editor: Giorgio Sirugo, University of Pennsylvania, UNITED STATES Received: February 6, 2019; Accepted: August 27, 2019; Published: September 24, 2019 Copyright: © 2019 Mehrjoo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: The quality-controlled data set comprising 1021 individuals and 829,779 autosomal markers is available from http://genotype.iranome.ir/. Data access is restricted exclusively to population genetic analyses. Data access requests to the dedicated Data Access Committee (DAC) should be send to: Shahin Akhondzadeh, Professor of Neuroscience, Department of Psychiatry, School of Medicine, Psychosomatic Medicine Research Center, Roozbeh Hospital, Tehran University of Medical Sciences, Tehran, Iran (s.akhond@neda.net). Requests have to include a description of the planned project and the planned use of the provided data set. The Data Access Committee comprises four members: Prof. Hossein Najmabadi, PhD (University of Social Welfare and Rehabilitation Sciences, Tehran, Iran) Prof. Shahin Akhondzadeh, PhD (Tehran University of Medical Sciences, Tehran, Iran) Prof. Dr. Michael Nothnagel (University of Cologne, Cologne, Germany) Dr. Holger Thiele (University of Cologne, Cologne, Germany). Prof. Akhondzadeh and Dr. Thiele did not co-author this manuscript. Funding: This work was supported by Deputy of Research, University of Social Welfare and Rehabilitation Sciences (grant number: 95/801/T/32058) to H.N., Iran National Science Foundation (grant number: 950022) to H.N., (grant number: 92035782) to K.K. We also thank Iran’s National Elites Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction The highlands of Iran have been at the crossroads of human migrations [1–6] since the dispersal of modern humans out of Africa due to their geostrategic position. While exercising a strong cultural influence on neighboring regions, Iran has also repeatedly received migratory influx in the past millennia. Among others, this includes the successive southward migration of groups of Indo-European (IE) language speakers (e.g. Scythians, Medes and Persians) [7], the Arab arrival in the 7th century CE and the later influx of Turkic-speaking people from Central Asia. As a result of migrations, internal splits, admixture and other movements, today’s Iranian population comprises numerous ethnic, religious and linguistic groups (S1 Appendix, S1 Fig), prominently including Persians (65% in 2008 [8]), Iranian Azeris (16%), Iranian Kurds (7%), Iranian Lurs (6%), Iranian Arabs (2%), Iranian Baluchis (2%), Iranian Turkmen (1%), Qashqai and other Turkish-language tribal groups (1%) as well as Armenians, Assyrians, Georgians, Jews, Zoroastrians (all <1%) and others, although definitions [9] and reported proportions vary between sources (e.g. [10–12]). Speakers of an Iranian, i.e. Indo-European, language or language dialect (e.g. Persian, Kurdish, Luri, Baluchi) by far outnumber speakers of either a Turkic or Semitic language. With Iran being located within a belt of countries where consanguineous marriages are widely practiced, Iranian samples have featured prominently in disease-related studies, facilitating the identification of genes involved in rare autosomal recessive diseases by linkage analysis and autozygosity mapping and contributing to a deeper etiological understanding also of complex disorders [13, 14]. Examples demonstrating the value of these populations for human genetic research are ample (e.g. [15–21] for Iran alone), likely moving from the study of few families to population-based studies in the future [22–25]. Still, consanguinity levels are not homogenous across the Iranian population. Early studies, based not on actual genetic data but on familial relation assessment, found these levels to vary between geographic regions and between ethnic groups [26, 27]. A recent study, also based on familial relation assessment, refined these results and reported differences in consanguinity by province, area of residence, birth and marriage cohort as well as with educational level [28]. Patterns of runs of homozygosity (ROHs) or haplotype sharing by descent (HBD) can be indicative of autozygosity, but vary between populations and across genomic locations [29–34], as do the frequency of consanguinity and the moderately correlated degree of genomic inbreeding [35]. Furthermore, autozygosity mapping is predominantly able to detect comparatively recent, local founder mutations [13]. Moreover, carrier frequencies of disease-predisposing variants have been reported to strongly differ between geographic regions in Iran, e.g. for mutations in the GJB2 gene [36] and for β-thalassemia [37], with different ethnic affiliations being the likely cause and possibly helping to determine the pathogenicity of those variants [38]. Finally, studies on copy-number variation (CNV) in the Iranian population (e.g. [39]) were so far disease-specific but not with respect to the general, healthy population. Perhaps somewhat surprisingly, Central Asia and parts of Western Asia have largely been a blind spot for non-medical genetic studies in the past decades. Until recently, dedicated genetic projects of extant human populations with a global or continental focus (e.g. [33, 40–51] only sporadically included samples, if any, from Iran and did not comprehensively cover the area. Of note, studies that did include Iranian samples frequently treated them as coming from or being representative of a single homogeneous population. Studies on sporadic ancient DNA (aDNA) samples from the Early Neolithic up to the Chalcolithic in Iran showed the existence of highly genetically differentiated populations that were not ancestral to Europeans but, in the case of specimen from the Zagros Mountains, exhibited some affinity to Zoroastrians [1, 2, 6, 52]. An early study on ABO blood groups found extreme differences between some of 21 considered ethnic groups in Iran [53], whereas another study, published a year later and additionally based on serum proteins and cell enzymes, presented evidence for population substructure between the six included groups (Iranian Turks, Kurds, Lurs, Zabolis, Baluchis and Zoroastrians) with an average F ST value of 0.02, based on blood groups, serum proteins and cell enzymes, and some degree of inbreeding [54]. More regionally focused studies on Iran, based on uniparental markers such as Y-chromosomal haplogroups and short tandem repeat (STR) marker haplotypes as well as mitochondrial (mtDNA) haplogroups, confirmed high degrees of genetic diversity in the Iranian population [3–5, 55–61]. These studies reported the respective variation to be predominantly of Western Eurasian origin, with only limited contributions from eastern Eurasia, South Asia and Africa most pronounced in the southern Iranian provinces. These studies also reported ancient and recent gene flow between Iran and the Arabian Peninsula, a surprisingly close relationship between Persians and Iranian Turkic-speaking Qashqai and generally high levels of variation comparable to those in the South Caucasus, Anatolia and Europe. These observations all support the notion of Iran forming a crossroads of human migrations. Notably, a study on Armenians, located to the North of Iran, also suggested multiple admixture events and a general role as bridge between different geographic regions [49]. Using genome- or exome-wide genotype data, a number of studies have analyzed samples of populations that can be considered proxies for ethnic groups in Iran from surrounding countries. In a study of 156 individuals, the population of Qatar was reported to comprise three distinct groups, with one (“Q2”) showing strong affinity to Persians and patterns of admixture [3, 62, 63]. A study on 22 Kuwaitis with Persian ancestry found comparatively high levels of genetic diversity for a non-African population, explicable by past admixture events [34]. A study of 43 individuals belonging to the Parsis, a Zoroastrian religious community in India and Pakistan, demonstrated a closer genetic affinity to today’s Iranian and Caucasus populations than to South Asian populations, but, quite remarkably, an even stronger similarity to Neolithic aDNA samples from Iran compared to modern Iranians, consistent both with the historic record of a southward migration induced by the 7th century’s Arab entry to Iran and more recent admixture events with the modern Iranian population [64]. Findings of increased homogeneity and the dating of past admixture events in further samples of Iranian and Indian Zoroastrians [65] complemented these results. Analysis of 24 individuals from the Indo-European speaking Kalash, a population isolate at the Hindu Kush, Afghanistan, indicated a genetically drifted ancient northern Eurasian population that split during the very early Neolithic and subsequently migrated southwards [66]. Finally, a recent study restricted to exome data merged 87 Iranian with 136 Pakistani samples and demonstrated a somewhat extreme or isolated position when compared to other populations from the Maghreb and from the Arabian Peninsula through Turkey [33]. Still, none of these studies has directly and comparatively studied ethnic groups in Iran. Correlation between genetic and linguistic proximity of populations has frequently been assumed to be the rule, while language adoption is usually considered as an exception to the rule of co-evolution (e.g. [67–69]), although such claims have repeatedly been disputed (e.g. [70]). Evidence for such correlation is ample in Europe, including autosomal and mitochondrial data [71–77], Y-chromosomal data [77–80] and even, with respect to the spread of Indo-European languages into Europe, ancient DNA data [81]. In-depth studies on other parts of the world found some correlation of language dispersal with Y-chromosomal lineages [82–87], although not in all parts [88]. Furthermore, some instances of male-mediated gene flow over major linguistic barriers have been inferred as well [89, 90]. An early study already observed close genetic relationship between Semitic-speaking and Indo-European-speaking groups in Iran [58]. Studies on neighboring Armenia found evidence for a language replacement [91] event, possibly facilitated by the mixing of multiple source populations during the Bronze Age [49]. However, the relationship between genetic and linguistic proximity has been rarely investigated for Iran and neighboring countries. While Iran appears to be destined to make further important contributions to human genetic research, an adequate design and interpretation of future medical and population genetic studies is mandatory to arrive at interpretable findings. Here, we comprehensively analyzed the genome-wide diversity of eleven ethnic groups in Iran, their relation to each other as well as with global and local reference populations. Furthermore, we investigated, stratified by ethnicity, levels of consanguinity, the distribution of homozygous and copy-number regions and CNVs as well as the extent of population stratification within Iran and the possible effects in association studies if not accounted for properly and the relationship between spoken language family and genetic proximity.

Discussion Our study, based on genome-wide data from a stratified ethnic-group sampling and also including groups previously not well covered, such as Iranian Gilaks, Kurds, Mazanderanis and Sistanis, revealed the distinct and rich genetic diversity of the Iranian population, corroborating previous reports based on uniparental markers. The majority of Iran’s ethnic groups comprise largely overlapping genetic autosomal diversity, implicating a shared and largely autochthonous ancestry, designated as the Central Iranian Cluster (CIC). Notably, the CIC also includes Iranian Arabs and Azeris (Fig 1) as well as the religious group of Zoroastrians (Fig 3), being consistent with the suggestion that Zoroastrians have lived in the area of present-day Iran for millennia and had formed an early group of Indo-European speakers. Still, the CIC comprised substantial internal structure, with pairwise F ST values up to an order of magnitude higher than those for more homogeneous populations of similar population size, such as Germany [117], but below the level of substructure reported for Europe, Central Asia, the Near East or Southeast Asia as a whole [45] and much lower than for neighboring Armenia in the Caucasus [118]. Iranian Baluchis, Sistanis, Turkmen and Persian Gulf Islanders showed strong admixture, with the CIC (or its ancestral population) consistently contributing to all of them and contributions from different respective ‘opposite’ ancestral populations, evidencing CIC’s strong impact on human demography in this world region. Since substantial proportions of the Iranian population belong to non-Persian ethnic groups or are admixed, more precise reference to the particular ethnic groups appears mandatory when conducting future genetic studies. In comparison with global and local reference data, the CIC represents a distinct entity comprising an autochthonous genetic component, clustering closely with geographically adjacent populations and assuming a location in the ‘genetic map’ that corresponds to its geographic location at the nexus between South, Central and West Asia, Northern Africa and the Caucasus. This observation is consistent with limited gene flow reported in previous uniparental marker-based studies and adding a further example on the correspondence between genetic diversity and geographic location, such as Europe [73, 119], explicable by genetic drift as well as admixture. The largely autochthonous development of CIC groups, consistent with an early branching from the Eurasian population before the Neolithic [6], is further corroborated by the distinctiveness of these groups in comparison to different time strata represented by aDNA samples, indicating a genetic continuity for at least several past millennia and eventually mirrored by Zoroastrian genomic diversity. Both, Early Neolithic farmers from West Iran and people from the Steppe appear to have made very limited contributions to CIC groups. In turn, the ‘African’ component shared between PG Islanders and some Sub-Saharan populations likely predates the beginning of the Neolithic and, thus, renders PG Islanders as an early autochthonous group that subsequently became strongly admixed with CIC groups. Notably, Iranian Arabs appear to be slightly genetically detached from other Arab populations in West Asia and Northern Africa. The small ancestry component shared between the CIC and Tuscans may mirror early migrations from the Near East although this requires further investigation. Correlating genetic affinity with spoken language yielded evidence for a number of language adoption cases in Iran. CIC’s distinct and autochthonous genetic variation indicates that Indo-European (IE) language(s) were likely adopted by some ancient population in Iran several millennia ago, although it remains unclear if this was driven by previously suggested aggressive warrior-bands migration [120] given the lack of Y-chromosomal data in our study. The observed close genetic proximity, based on genome-wide data, of Turkic-speaking Iranian Azeris as well as of Semitic-speaking Iranian Arabs to IE-speaking groups within the CIC, confirms previous reports on Semitic-speaking groups in Iran [58] and Turkic-speaking Azerbaijanis [91, 121–123]. Given their genetic vicinity to other Arab and Caucasian populations, respectively, this is well explained by admixture between some overwhelmingly contributing ancestral IE population(s) and a minor genetic contributor whose language was adopted in the course of past entries. Finally, the spread of IE-speaking Iranian Baluchis, Sistanis and PG Islanders from the other IE-speaking CIC groups is explicable by repeated admixture of some IE-speaking ancestral population(s) with ancient South or West Asian populations, such as Early Neolithic West Iranians, respectively, while retaining their language, causing its adoption by the admixed offspring. The heterogeneous levels of substantial population substructure as well as of elevated consanguinity in the Iranian population have profound implications for future human genetic studies. They corroborate previous reports on different predisposing variant frequencies across Iran (e.g. [36, 37]) and emphasize the need for an ethnicity-aware approach when performing human genetic studies or genetic counseling in Iran. Population-based association studies should focus on CIC groups to minimize biasing effects due to population stratification, applying to common single-marker analysis but in particular to rare-variant collapsing tests where regional and ethnic group-specificity is to be expected due to the average young age of these variants. Given the genetic diversity even within the CIC, ancestry correction appears mandatory while sample inclusion from the highly admixed groups may increase the risk of biased results. The observed elevated consanguinity in some ethnic groups is in line with previous reports on Iran and other West Asian populations, indicating past and ongoing consanguineous marriage practice and also possibly explaining reported differences between Iranian provinces and residential areas. Family-based linkage or homozygosity-mapping studies should preferentially target groups featuring increased consanguinity levels, namely Iranian Arabs, Baluchis and Sistanis, to increase power especially for studying autosomal-recessive diseases. When studying runs of homozygosity and copy-number variants in diseased individuals, for example in whole-exome and whole-genome sequencing studies, the frequent occurrence of such features in healthy individuals, as shown in this work, requires caution in the interpretation of these features. In summary, Iranians feature distinct genetic variability, resulting from long-standing genetic continuity, as well as substantial genetic heterogeneity and can, thus, not be treated as a single homogeneous entity. Future human genetic studies have to consider ethnic affiliations for sampling and analyses and should expect widespread admixture in both extant and ancient samples. The observed concordance between genetic diversity and geographic location and examples of lineage break up between language and genetic proximity are consistent with the archeological and historical evidence on Iran as occupying a stretch of land that has seen multiple migration and admixture events in the past millennia. By providing genome-wide population data for Western Asia, thereby filling a lack that has characterized this region for over a decade despite its known diversity and prominent place in human history, we hope to encourage future population genetic, evolutionary and medical studies in Iran and beyond.

Material and methods Ethics statement This study has been approved by the Research Ethics Committee of the University of Social Welfare and Rehabilitation Sciences (USWR), Tehran, Iran (approval number IR.USWR.REC.1395.376). Prior to gathering information on sex, ethnicity, demographic and health status, we obtained written informed consent from each individual, according to the guidelines of the Research Ethics Committee, University of Social Welfare and Rehabilitation Sciences (USWR), Tehran, Iran. Iranian study samples We included 1069 healthy unrelated individuals from 11 major Iranian ethnic groups, including 800 from the Iranome project [124] as well as 269 additionally sampled individuals in the study, comprising Iranian Arabs, Azeris, Baluchis, Kurds, Lurs, Gilaks, Mazanderanis, Sistanis, Persians, Turkmen and Persian Gulf Islanders living in Iran (Table 1). Prior to gathering information on sex, ethnicity, demographic and health status, we obtained written informed consent from each individual, according to the guidelines of the Research Ethics Committee, University of Social Welfare and Rehabilitation Sciences (USWR), Tehran, Iran. Individuals were required to have the same ethnic background for at least two generations. The majority of individuals were more than 40 years old at the time of recruitment, lowering the risk of manifesting genetic disorders in later life. All subjects were re-examined by a clinician. This study has been approved by the Research Ethics Committee of USWR, Tehran, Iran. Language family assignment was obtained from Glottolog 3.2 (http://glottolog.org/). Sample processing, genotyping and data quality control Venous blood was taken from individuals. DNA extraction from blood samples was done using the salting out method [125]. Samples were genotyped using the Axiom Precision Medicine Research Array (PMRA) by Life Technologies, comprising about 903,000 markers. Samples were randomly assigned to genotyping array probes without regard to ethnic affiliation in order to avoid batch effects. Life Technologies’ AxiomAnalysisSuite v2.0.0.35 was used for evaluating and genotyping CEL files. After removing low quality samples (quality < 97), genotypes of 1058 samples were assigned using the Axiom_PMRA.na35.annot.db annotation file. Further quality control was performed on those 1058 samples using PLINK [126] v1.9 and R v3.5.1 [127]. Variants were required to have call-rates ≥95% and deviations from Hardy-Weinberg equilibrium with p>10−5. Samples were required to have a call-rate of ≥97%, to not show excessive hetero- as well as homozygosity (<5 sd). Cryptic relatives (mean identity-by-descent [IBD] sharing π>0.4) were detected using PLINK’s—genome option and 20 samples were excluded from the study for representing parent-child pairs, sib pairs or identical individuals. After quality control, the cleaned data set comprised 1021 samples (Table 1) comprising genotypes for 829,779 autosomal markers. The overwhelming majority of sample pairs within an ethnic group (typically ~99% or more) were unrelated or only distantly related (π<0.04125), with only few pairs showing elevated IBD sharing (S11 Table). For some analyses, we additionally considered only markers with common alleles (minor allele frequency ≥5%; 311,262 markers), only markers in no strong linkage disequilibrium (LD; r2≤0.5, 500kb window size, 25 SNPs step size) by using PLINK’s—indep-pairwise option (475,665 markers), or both (203,495 markers). Human reference data sets In order to put the Iranian samples in a global as well as local context, we merged our cleaned data set with those of publicly available reference data sets, using only markers that were present in each of the datasets being merged. For a global comparison, we used 2492 unrelated samples assigned to 26 populations from the 1000 Genomes Project [41–43] (“1000G”; accessed May 2017). For a more localized comparison, we used samples from three different curated data sets, namely 120 samples from the Simons Genetic Diversity Panel (SGDP) [44], 1345 samples from Lazaridis et al. [2], partially including previously published samples, and 45 samples from Broushaki et al [6]. Notably, these reference data also included samples from a wide variety of ethnicities, such as Semitic groups (e.g. Arabs, Assyrians, Jews), Caucasian groups (e.g. Armenians, Georgians, Circassians), Zoroastrians and many others. We further grouped these samples for their corresponding geographic region (S1 Table) and language family (S2 Table). Only markers with genotypes in both the Iranian and the respective reference data set(s) were included in the analysis and underwent additional quality control using the same thresholds as before. Again, for some analyses, markers in strong LD or with infrequent alleles were removed. After QC, the ‘global data set’ (merger with 1000G) included 782,127 markers, while 232,138 common markers remained after additional LD pruning and frequency filtering. The ‘local data set’ (alternative merger with the other three reference data sets) included 59,837 markers in total and 43,198 common, LD-pruned markers, respectively. A growing number of human aDNA samples from Iran and beyond have been published. We compiled 798 aDNA samples from 21 different publications and one pre-print [2, 6, 81, 92–110] (S3 and S4 Tables) for spatial-temporal analysis. Population differentiation and admixture assessment We applied multidimensional scaling (MDS) analysis based on identity-by-state (IBS) allele sharing to the LD-pruned data sets using PLINK’s—mds-plot implementation. PCA analysis was independently performed for each of the considered, possibly merged, data sets, except for the aDNA samples which were projected onto the components obtained from the merged data set of our 1021 extant Iranians and 118 SGDP samples geographically co-localizing with the aDNA samples (S4 Table; S18 Fig). This data set underwent quality control, LD pruning, and frequency filtering using the same thresholds as before. We generated PCs of reference samples running TRACE from LASER [128] v2.04 in PCA mode (-pca 1) using default parameters. Then we projected each aDNA sample independently onto the reference PCA using TRACE [128] v1.03 with default parameters. The number of markers each aDNA sample shared with reference PCA ranged from 30,000 to 80,000. PLINK’s—fst option was used to estimate Weir & Cockerham’s F ST fixation index [129]. For an approximate assessment of the upper limit of the impact of population substructure on genetic association studies in the Iranian population, we deliberately assigned, for each pair of ethnic groups, case status to all samples from one group and control status to all from the other and subsequently calculated the genomic inflation factor [130] (GIF), where values of 1.0 correspond to no inflation, by using PLINK’s—adjust option. For exploratory admixture and migration analysis, we ran ADMIXTURE [111] v1.3.0 in parallel for K = 2, …, 20 using random seeds and TreeMix [112] v1.13 through the Treemix_bootstrap.sh script of the BITE R package [131] v1.1.0004 allowing for 0, 1, 2, 5, 10 and 15 migration events to be replicated 100 times and made consensus trees based on replications using PHYLIP [132] v3.697. The final tree was then plotted using treemix.bootstrap from BITE. The qp3Pop program of ADMIXTOOLS v5.0 package with default parameters was used to calculate f 3 statistics [113]. Language autocorrelation analysis Autocorrelation analysis for language family with respect to genetic distance based on the Euclidian distance in the first two principal components and Moran’s I [133], obtained from running the TRACE software from LASER v2.0 [128] on the local data set, was performed using the Moran.I function from R package ape v5.1 [134]. To this end, language families were assigned numeric class values (1,…,8). To avoid spurious effects due to this numeric (instead of categorical) coding, which would imply order and distance between classes, we performed 100 random permutations of the assigned values, thereby destroying potential biases introduced by arbitrary numbering, and report the respective distributional statistics. Selection analysis Tajima’s D was estimated with VCFtools v0.1.13 (https://vcftools.github.io/man_latest.html) using the—TajimaD option with a window size of 100 kb and the—from-bp/—to-bp commands for a sliding window analysis. We also performed an integrated Haplotype Score (iHS) analysis [115] of the LCT region on chromosome 2. To this end, haplotypes were estimated using ShapeIt v2.r790 [135] based on the 1,000 Genomes Phase 1 haplotype reference panel and genetic map of chromosome 2 (downloaded from: https://mathgen.stats.ox.ac.uk/impute/data_download_1000G_phase1_integrated.html). A total of 34,746 SNPs on chromosome 2 coincided between our data set and the reference panel. The iHS scan was performed using R package rehh v2.0.2 [136, 137] (downloaded from: https://cran.r-project.org/web/packages/rehh/index.html). Autozygosity and copy-number variant assessment Inbreeding coefficients (F I ) were estimated using PLINK’s—ibc option (‘Fhat3’; [138]) based on LD-pruned autosomal markers and separately for each ethnic group. Furthermore, we defined runs of homozygosity (ROHs) using PLINK v1.9 (LD-pruned autosomal markers; ethnic groups combined) and GARLIC [31, 139] v1.1.4 (autosomal markers; separately for each ethnic group) and autozygous genomic regions using IBDseq [32] (LD-pruned autosomal markers; separately for each ethnic group), using default options and applying them to separate data subsets containing only a single population or ethnic group, respectively. We used the Axiom Analysis Suite (AxAS v4.0) with default options in order to detect copy-number variants (CNVs). We divided the set of 1021 Iranian samples into 5 groups where each group comprised similar proportions of males and females from the 11 ethnic groups. Samples in each group were used to construct a reference for CNV detection, subsequently running the CNV detection for the same groups. CNVs were required to comprise at least 25 and 50 markers for homozygous and heterozygous variants, respectively. Statistical analysis All statistical analyses were performed and graphs were created using R with in-house scripts, unless noted otherwise. Two-dimensional kernel density estimates were obtained using the Hpi and kde functions from the ks package v1.11.3 [140] for R. Map data were obtained from GADM v2.8 (November 2015; www.gadm.org) and maps were plotted using functions in the sp package v1.3–1 [141, 142] for R. Bar plots were created using functions in ggplot2 v3.0.0 [143] for R.

Acknowledgments We thank J. Du for assistance. We also thank the Regional Computing Center of the University of Cologne (RRZK) for providing computing time on the DFG-funded High Performance Computing (HPC) system CHEOPS as well as technical support. We also thank the work groups who have made their ancient genome data publicly available. Data and materials availability. The quality-controlled data set comprising 1021 individuals and 829,779 autosomal markers is available from http://genotype.iranome.ir/. Data access is restricted exclusively to population genetic analyses. Data access requests to the dedicated Data Access Committee (DAC) should be send to: Mrs. Zohreh Fattahi, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran (zohrehfattahi@gmail.com). Requests have to include a description of the planned project and the planned use of the provided data set.