The Roma people, living throughout Europe and West Asia, are a diverse population linked by the Romani language and culture. Previous linguistic and genetic studies have suggested that the Roma migrated into Europe from South Asia about 1,000–1,500 years ago. Genetic inferences about Roma history have mostly focused on the Y chromosome and mitochondrial DNA. To explore what additional information can be learned from genome-wide data, we analyzed data from six Roma groups that we genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs). We estimate that the Roma harbor about 80% West Eurasian ancestry–derived from a combination of European and South Asian sources–and that the date of admixture of South Asian and European ancestry was about 850 years before present. We provide evidence for Eastern Europe being a major source of European ancestry, and North-west India being a major source of the South Asian ancestry in the Roma. By computing allele sharing as a measure of linkage disequilibrium, we estimate that the migration of Roma out of the Indian subcontinent was accompanied by a severe founder event, which appears to have been followed by a major demographic expansion after the arrival in Europe.

Funding: This work was supported in part by the OTKA 73430 and K 103983, and the microarray facility was supported by a core facility grant of the Medical Faculty of the University of Tübingen. PM, NP and DR were funded by U.S. National Science Foundation HOMINID grant 1032255, and by National Institutes of Health grant GM100233. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2013 Moorjani et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Here we analyze whole genome SNP array data from 27 Roma samples belonging to six groups sampled from 4 countries in Europe (three separate ethnic groups from Hungary, and one group each from Romania, Spain and Slovakia). Our aim was to address the following questions: (1) What is the source of the European ancestry in the Roma? (2) What is the relationship of the Roma to the present-day South Asian populations? (3) What is the proportion and timing of major gene flow into this population? (4) Can we characterize the founder events that have occurred in the history of this population?

Genetics provides a complementary source of information to data from history, archaeology and linguistics. Y-chromosome marker H1a-M82 and mitochondrial haplogroups M5a1, M18 and M35b, which are thought to be characteristic of South Asian ancestry, are present at high frequency in Roma populations [7], [8]. However, there is no consensus about the specific ancestral group or geographic region within South Asia that is most closely related to the ancestral population of the Roma. A recent study based on Y-chromosome markers showed that the Roma descended from southern Indian groups [11], which contradicts previous reports based on mtDNA haplogroups that have placed the origin of Roma in Northwest India. While mtDNA and Y chromosome analyses provide valuable information about the maternal and paternal lineages, a limitation of these studies is that they represent only one instantiation of the genealogical process. Autosomal data permit simultaneous analysis of multiple lineages, which can provide novel information about population history.

Anthropological and linguistic studies have documented striking similarities between the cultures and languages of various Indian groups and Roma. Social structure in Roma groups is similar to the castes of India, where the groups are often defined by profession [2], [3]. Like many Indian populations, the Roma practice endogamy: individuals of one Roma clan (sub-ethnic group) preferentially marry within the same group, and marriages across clans are proscribed [3]. Anthropological studies have also suggested a link between the Roma and Banjara (nomadic gypsy groups) residing in India [3] (even though linguistic analysis shows that Banjari or Lamani, the languages spoken by these Indian nomadic groups, have little similarity to Romani [6]). Comparative linguistic studies have further suggested that Northwestern Indian languages like Punjabi or Kashmiri or Central Indian languages like Hindi are most closely related to Romani [9], [10].

Historical studies have suggested that the Roma are originally from India, and that they migrated to Europe between the 5th and 10th centuries [3]. It has been argued that their migration route included Persia, Armenia, Anatolia, and Greece [3], [4]. The Roma then settled in multiple locations within Europe and were widespread in Europe by the 15th century; descendants of these migrants currently live primarily in the Balkans, Spain, and Portugal [5].

The Roma (also called Romani) are a unique and diverse population that lives in Europe, the Near East, the Caucasus, and the Americas. They speak more than 60 dialects of a rapidly evolving language called Romani and belong to various social and religious groups across Europe. Their census size has been estimated to be in the range of 10–15 million [1], with the largest populations in Eastern Europe [2]. They do not have written history or genealogy (as Romani does not have a single convention for writing) and thus most of the information about their history has been inferred based on linguistics, genetics and historical records of the countries where they have resided.

Results

Genome-wide Ancestry Analysis of the Roma

We applied Principal Component Analysis (PCA) using the SMARTPCA software [12] and the clustering algorithm ADMIXTURE [13] to study the relationship of Roma to other worldwide populations in a merged dataset of Roma and HapMap populations. In PCA, the Roma fall between the South Asians (Gujaratis) and Europeans, consistent with the Roma deriving ancestry from both South Asians and Europeans and in line with previous mtDNA and Y chromosome analyses [7], [8] (Figure 1). The ADMIXTURE software, which implements a maximum likelihood method to infer the genetic ancestry of each individual modeled as a mixture of K ancestral groups, produces very similar inferences [13]. At K = 6 (which has the lowest cross-validation error), we observe clustering based on major continental ancestry. Similar to the PCA results, the Roma individuals cluster with South Asians and Europeans (Figure 1, Figure S1). We also examined pairwise average allele frequency differentiation (F_ST) between Roma and major continental groups (see Table S1) and observed that the Roma have the lowest F_ST with European groups.

Figure 1. Relationship of Roma with other worldwide populations. We applied PCA and ADMIXTURE to study the relationship of Roma with the HapMap and South Asian populations. In PCA, each point represents an individual, and in ADMIXTURE, each line represents an individual. (a) PCA and ADMIXTURE results for clustering of Roma and HapMap populations. The population codes are as follows: Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Maasai in Kinyawa, Kenya (MKK), Utah residents with Northern and Western European ancestry (CEU), Toscani in Italia (TSI), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT), Chinese in Metropolitan Denver, Colorado (CHD), Gujarati Indians in Houston, Texas (GIH), African ancestry in Southwest USA (ASW) and Mexican ancestry in Los Angeles, California (MEX). (b) PCA and ADMIXTURE results for clustering of Roma and South Asian groups. We limit the sample size of all groups (except Roma) to 20 individuals. https://doi.org/10.1371/journal.pone.0058633.g001

Previous studies have shown that the HapMap Gujarati population is not an ideal surrogate for the variation in India, as this group is heterogeneous and has recent West Eurasian ancestry [14]. To study the relationship of Roma to South Asians, we repeated the clustering analysis with Roma, Europeans and 28 South Asian groups: 24 Indian groups from the India Project (we removed Siddis as they have recent African ancestry), Pathan and Sindhi from HGDP, and Punjabi and Gujarati from POPRES. As previously seen in PCA, we observed that all Indians fall on a cline of variable relatedness to Europeans and the indigenous Andamanese population (Onge) [14]. The Roma also fall on this cline, but they lie closer to the European cluster than any of the South Asian groups included (Figure 1b). Similar results were observed in our ADMIXTURE analysis (Figure 1b, Figure S1).
Based on the PCA and ADMIXTURE analysis, we excluded three Roma outlier samples from further analyses, as they appeared to have very recent admixture from neighboring non-Roma European populations (likely in the past few generations). We applied the 4 Population Test [14] to formally examine if the Roma have evidence of a mixture of European and South Asian ancestry. We used individuals of Northern European ancestry (CEU) and Andamanese (Onge) as surrogates for the European and South Asian ancestral populations respectively. We tested whether the phylogenetic tree (Africans, Europeans, South Asians, Roma) is consistent with the data. We chose Onge for this analysis since, unlike their distant relatives on the Indian mainland, they do not have any evidence of West Eurasian related admixture [14]. Applying the 4 Population Test, we observed highly significant violations of the expected phylogenetic tree topology, confirming that the Roma are admixed; that is, they have ancestry from both South Asians and Europeans (Table S2). We note that this test does not distinguish between European and West Asian ancestry, and qualitatively similar results would be observed if we replaced CEU with any other West Eurasian population (other groups from Europe, the Middle East, Central Asia or the Caucasus); hence we refer to this ancestry component as “Ancestral West Eurasian (AWE)”. To quantify the magnitude of the South Asian and West Eurasian ancestry in the Roma, we applied F4 Ratio Estimation [15] using the model shown in Figure S2, which can estimate admixture proportions in the absence of data from good surrogates of the ancestral populations. Here, we used CEU and Adygei (a population from the Caucasus) to represent the West Eurasian component, and Onge to represent the ancestral South Asian component (referred to as Ancestral South Indian (ASI)), as the Onge do not have any West Eurasian ancestry [14].
The F4 Ratio Estimation is known to work only if we have access to data from populations that form a clade with the unadmixed ancestral populations. Since all populations in mainland India are admixed, none are appropriate for this test [14]. To further evaluate our model of population relationships in Figure S2, we used an admixture graph [15] and found that this model provides a good fit to the data. Applying the F4 Ratio Estimation to Roma (pooling all samples together), we estimate that the Roma have on average 77.5±1.8% West Eurasian related ancestry (standard errors were computed using a Block Jackknife with a block size of 5 cM) (Table S2). As all Indian groups harbor ancestry from a West Eurasian related population (previously referred to as Ancestral North Indian (ANI) ancestry [14]), we note that some of the West Eurasian related ancestry we detect in Roma likely derives from India itself (from the ANI), while other parts may be from European or Middle Eastern admixture after the exodus from India.
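As a rough illustration of the kind of statistic involved, the following Python sketch computes an f4 statistic from per-SNP allele frequencies and a delete-one-block jackknife standard error. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and toy inputs are hypothetical, and the jackknife shown is the unweighted form, which assumes blocks of roughly equal size.

```python
import math

def f4(pa, pb, pc, pd):
    """f4(A, B; C, D): mean over SNPs of (pA - pB) * (pC - pD).

    Under a simple tree in which A and B form one clade and C and D
    form another, the expected value is zero.
    """
    terms = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    return sum(terms) / len(terms)

def f4_from_records(records):
    """records: list of (pA, pB, pC, pD) tuples, one per SNP."""
    pa, pb, pc, pd = zip(*records)
    return f4(pa, pb, pc, pd)

def block_jackknife(blocks, stat):
    """Delete-one-block jackknife (unweighted, an assumption of this sketch).

    blocks: list of blocks, each a list of per-SNP records
            (for example, SNPs grouped into 5 cM windows).
    stat:   function mapping a list of records to a number.
    Returns (full-data estimate, jackknife standard error).
    """
    g = len(blocks)
    full = stat([r for blk in blocks for r in blk])
    leave_one_out = []
    for i in range(g):
        kept = [r for j, blk in enumerate(blocks) if j != i for r in blk]
        leave_one_out.append(stat(kept))
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - mean_loo) ** 2 for x in leave_one_out)
    return full, math.sqrt(variance)

# Hypothetical toy data: three blocks of two SNPs each.
toy_blocks = [
    [(0.10, 0.20, 0.60, 0.30), (0.50, 0.40, 0.20, 0.70)],
    [(0.30, 0.30, 0.80, 0.10), (0.90, 0.60, 0.50, 0.40)],
    [(0.20, 0.70, 0.10, 0.20), (0.40, 0.10, 0.30, 0.90)],
]
estimate, std_err = block_jackknife(toy_blocks, f4_from_records)
print(estimate, std_err)
```

An admixture proportion from F4 Ratio Estimation is a ratio of two such f4 statistics, so the same jackknife wrapper could be applied to a function that returns that ratio.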

Estimating a Date of European Admixture in the Roma

To infer the date of the gene flow, we applied a modified version of ROLLOFF [16], which uses the decay of admixture linkage disequilibrium (LD) to estimate the time of admixture. ROLLOFF computes SNP correlations in the admixed population and weights the correlations by the allele frequency difference in the ancestral populations such that the signal is sensitive to admixture LD. While this method estimates accurate dates of admixture in most cases, we observed that it is noticeably biased in the case of strong founder events post admixture (Table S3). The bias is related to a normalization term that exhibits an exponential decay behavior in the presence of a strong founder event, thus confounding the admixture date (see details in Note S1, Figure S3). We propose a modification to the ROLLOFF statistic that removes the bias (Note S1, Table S3). In addition, the new statistic computes covariance instead of correlation between SNPs; this does not affect the performance of the method but makes it mathematically more tractable. Throughout the manuscript, we use the modified ROLLOFF statistic (R(d)) unless specified otherwise. Simulations show that this statistic gives accurate and unbiased results up to 300 generations (Note S2, Figure S4). A feature of ROLLOFF is that it uses allele frequency information in the ancestral populations to amplify the admixture signal relative to background LD. While data from the ancestral populations are not available for Roma, this information can be obtained by performing PCA using present-day Europeans and South Asians. Simulations show that PCA-based SNP loadings effectively capture the allele frequency differentiation between the ancestral populations and can be used for estimating dates of mixture (Note S2, Figure S5).
Applying ROLLOFF (using R(d)) to the Roma samples, with the SNP loadings estimated using PCA of Europeans (CEU) and 16 Indian groups (limited to groups that fall on the main cline of West Eurasian relatedness in PCA so that the signal is not confounded by other ancestry components), we estimate that the West Eurasian admixture in Roma occurred 29±2 generations or about 780–900 years ago, assuming one generation = 29 years [17] (Figure 2). This is consistent with mixture having occurred only after the historically recorded arrival of the Roma in Europe between 1,000 and 1,500 years ago [3].
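The conversion from generations to years in the estimate above is simple arithmetic, sketched here in Python. The function name is hypothetical; the inputs (29±2 generations, 29 years per generation) are the values quoted in the text.

```python
def generations_to_years(generations, std_err, years_per_generation=29):
    """Convert an admixture date in generations to years.

    Returns (point estimate, lower bound, upper bound), where the bounds
    are the estimate minus and plus one standard error.
    """
    point = generations * years_per_generation
    low = (generations - std_err) * years_per_generation
    high = (generations + std_err) * years_per_generation
    return point, low, high

point, low, high = generations_to_years(29, 2)
print(point, low, high)  # 841 783 899
```

The bounds of 783 and 899 years correspond to the rounded range of about 780–900 years given in the text.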

Figure 2. Admixture date estimation. We performed ROLLOFF (using R(d)) on the Roma samples (n = 24). We plot the weighted covariance as a function of genetic distance, and obtain a date by fitting an exponential function with an affine term: $y = A e^{-nd} + c$, where d is the genetic distance in Morgans and n is the number of generations since mixture. We do not show inter-SNP intervals of <0.5 cM since we have found that at this distance admixture LD begins to be confounded by background LD. https://doi.org/10.1371/journal.pone.0058633.g002

A potential complication is that the date we are estimating may also reflect earlier admixture with ANI in India and any gene flow from Middle Eastern populations that occurred after the Roma exodus from India. The allele frequencies of ANI and Middle Eastern populations are correlated with the allele frequencies of the Europeans used in the analysis, and hence the date of admixture inferred using a single exponential function should be interpreted as an average date of all West Eurasian related gene flow events. When we consider a two-pulse model of admixture (by fitting a sum of two exponential functions to infer the dates), we obtain dates of 37 and 4 generations. The older date corresponds to about 1,000 years before present, again consistent with the historical record, and both dates are much more recent than any estimates obtained by applying ROLLOFF in India. This suggests that the admixture we are detecting is genuinely related to events that occurred after the exodus from India.
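The single-exponential fit used for dating can be sketched as follows. This is an illustrative sketch only, assuming the curve has the form y = A·exp(−nd) + c, with d in Morgans and n in generations. The function name, the grid search over n, and the synthetic noise-free input are all assumptions of this example, not the ROLLOFF implementation.

```python
import math

def fit_exponential_decay(d, y, n_grid):
    """Fit y ~ A * exp(-n * d) + c by least squares.

    For each candidate n in n_grid, A and c are obtained in closed form
    by ordinary linear regression of y on x = exp(-n * d); the candidate
    with the smallest sum of squared errors is returned as (n, A, c).
    """
    best = None
    mean_y = sum(y) / len(y)
    for n in n_grid:
        x = [math.exp(-n * di) for di in d]
        mean_x = sum(x) / len(x)
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        a = sxy / sxx
        c = mean_y - a * mean_x
        sse = sum((a * xi + c - yi) ** 2 for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, n, a, c)
    return best[1], best[2], best[3]

# Synthetic, noise-free curve generated with n = 29 generations.
# Distances run from 0.5 cM to 50 cM, expressed in Morgans.
distances = [0.005 * k for k in range(1, 101)]
curve = [0.02 * math.exp(-29 * di) + 0.001 for di in distances]

n_hat, a_hat, c_hat = fit_exponential_decay(distances, curve, range(1, 301))
print(n_hat, a_hat, c_hat)
```

A two-pulse model would replace the single exponential with a sum of two exponentials with separate decay rates, which needs a search over two values of n rather than one.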

Source of the European Ancestry in Roma

To learn about the relationship of the Roma to European populations, we estimated the pairwise Identity-by-descent (IBD) sharing between each Roma individual and each non-Roma European individual. We grouped the European samples from POPRES, HapMap and HGDP into four major regional groups: Northern (n = 595), Southern (n = 649), Eastern (n = 82), and Western Europe (n = 241). IBD segments (>3 centimorgans (cM)) were detected using GERMLINE [18]. Next, we computed an average pairwise sharing distance between Roma and the European groups in each region (see Methods). We observed that Roma exhibit the highest IBD sharing with individuals from Eastern Europe (Figure 3a). When we performed a stratified analysis (where Roma individuals from each country were considered separately), we observed that the highest sharing for each Roma group is still with Eastern Europeans (even for Roma individuals from Spain) (Figure S6).

Figure 3. The European and South Asian sources of Roma ancestry. We computed a genome-wide average IBD sharing distance between Roma (all samples combined in one group) and other regional groups. Details of the regional grouping are described in Methods. (a) shows the average pairwise IBD sharing between Roma and Europeans (grouped into four regional categories), (b) shows the average pairwise IBD sharing between Roma and South Asians (grouped into 8 regional categories). https://doi.org/10.1371/journal.pone.0058633.g003

Source of the South Asian Ancestry in Roma

To learn about the source of the South Asian ancestry in Roma, we inferred the pairwise IBD sharing distance between Roma and various South Asian groups. Again, we performed GERMLINE analysis to compute the average pairwise sharing distance between Roma and 28 South Asian populations (from India Project, HGDP and POPRES). To simplify the analysis, we classified the samples into 8 groups based on geographical region within India: North (n = 38), Northwest (n = 225), Northeast (n = 8), Southwest (n = 16), Southeast (n = 29), East (n = 11), West (n = 32), and Andamanese (n = 16). We observe that the Roma share the highest proportion of IBD segments with groups from the Northwest of India (Figure 3b). Interestingly, the two Northwest Indian groups that show the highest relatedness to Roma (Punjabi, Kashmiri Pandit) are also the populations that have the highest proportion of West Eurasian-related (ANI) ancestry in our sample. To control for the possibility that the high IBD sharing could be an artifact related to high ANI ancestry, we recalculated the IBD sharing after regressing out the ANI ancestry proportion and observed that the Roma continue to share the most IBD segments with the Northwest Indian groups (Note S3). These findings are consistent with analyses of mtDNA that also place the most likely South Asian source of the Roma in Northwest India [8]. An important caveat is that we have large variation in the number of samples from each regional group, with some groups containing only a handful of samples. To control for sample size, we performed a bootstrap analysis, drawing a random sample of up to 30 individuals from each regional group and recomputing the IBD statistics. We repeated the process 100 times and estimated the mean and standard error (Note S3). We observed that Roma continue to share the most IBD segments with Northwest Indian groups.
There is very little variability across the 100 runs, suggesting that this analysis may also be picking up founder events shared between Roma and Indian groups (Note S3, Figure S7).
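The resampling procedure described above (up to 30 individuals per regional group, repeated 100 times) can be sketched in Python as follows. This is a hedged sketch rather than the authors' procedure: the function name and input layout are hypothetical, the input is assumed to be one precomputed IBD sharing value per individual, sampling is done with replacement (the text does not say which scheme was used), and the standard error is taken as the standard deviation across replicates.

```python
import random
import statistics

def resample_ibd_sharing(sharing_by_group, max_n=30, n_reps=100, seed=0):
    """Resample each regional group to control for unequal sample sizes.

    sharing_by_group: dict mapping a group name to a list with one IBD
        sharing value per individual (hypothetical input layout).
    Returns a dict mapping each group to (mean, standard error) of the
    replicate averages.
    """
    rng = random.Random(seed)
    results = {}
    for group, values in sharing_by_group.items():
        k = min(max_n, len(values))  # "up to" max_n individuals
        replicate_means = []
        for _ in range(n_reps):
            draw = rng.choices(values, k=k)  # assumption: with replacement
            replicate_means.append(sum(draw) / k)
        results[group] = (
            statistics.mean(replicate_means),
            statistics.stdev(replicate_means),
        )
    return results

# Hypothetical toy values, not data from the study.
toy = {
    "Northwest": [2.0 + 0.01 * i for i in range(50)],
    "Northeast": [1.0 + 0.01 * i for i in range(8)],
}
summary = resample_ibd_sharing(toy)
top_group = max(summary, key=lambda g: summary[g][0])
print(top_group, summary[top_group])
```

Fixing the seed makes the replicates reproducible. Groups with fewer than 30 individuals are resampled at their full size, which is one possible reading of "up to 30".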