Abstract

Previous genetic, anthropological and linguistic studies have shown that Roma (Gypsies) constitute a founder population dispersed throughout Europe whose origins might be traced to the Indian subcontinent. Linguistic and anthropological evidence points to Indo-Aryan ethnic groups from North-western India as the ancestral parental population of Roma. Recently, a strong genetic hint supporting this theory came from a study of a private mutation causing primary congenital glaucoma. In the present study, complete mitochondrial control sequences of Iberian Roma and previously published maternal lineages of other European Roma were analyzed in order to establish the genetic affinities among Roma groups, determine the degree of admixture with neighbouring populations, infer the migration routes followed since their first arrival in Europe, and survey the origin of Roma within the Indian subcontinent. Our results show that the maternal lineage composition in the Roma groups follows a pattern of different migration routes, with several founder effects and low effective population sizes along their dispersal. Our data confirmed a North/West migration route shared by Polish, Lithuanian and Iberian Roma. Additionally, eleven Roma founder lineages were identified and degrees of admixture with host populations were estimated. Finally, the comparison with an extensive database of Indian sequences allowed us to identify the Punjab state, in North-western India, as the putative ancestral homeland of the European Roma, in agreement with previous linguistic and anthropological studies.

Citation: Mendizabal I, Valente C, Gusmão A, Alves C, Gomes V, Goios A, et al. (2011) Reconstructing the Indian Origin and Dispersal of the European Roma: A Maternal Genetic Perspective. PLoS ONE 6(1): e15988. https://doi.org/10.1371/journal.pone.0015988

Editor: Robert C. Fleischer, Smithsonian Institution National Zoological Park, United States of America

Received: August 4, 2010; Accepted: December 2, 2010; Published: January 10, 2011

Copyright: © 2011 Mendizabal et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: IM was supported by a fellowship by the Basque Government (Hezkuntza, Unibertsitate eta Ikerketa Saila, Eusko Jaurlaritza, BFI107.4). Fundação para a Ciência e Tecnologia (FCT) supported CV, VG, AG and LA through grants SFRH/BD/63343/2009, SFRH/BD/36045/2007, SFRH/BPD/43646/2008 and SFRH/BPD/65000/2009, respectively. This work was partially financed by FCT through project PTDC/ANT/70413/2006 and POCI 2010, Programa Operacional Ciência e Inovação; Dirección General de Investigación, Ministerio de Educación y Ciencia, Spain (CGL2009-14944/BOS); Direcció General de Recerca, Generalitat de Catalunya (2009SGR1101) and the Austrian Science Fund (FWF): TR397. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The dispersion of the Roma (Gypsies) through Europe represents one of the most remarkable population movements in recent historical times. Current estimates of the total Roma population size in Europe range from 4 to 10 million, with the largest numbers concentrated in Central and South-eastern Europe [1], [2]. The Roma constitute a diasporic population without reliable written records, either historical or genealogical. Mainly of nomadic lifestyle and with endogamous social practices, the geographically dispersed Roma populations have been socially marginalized and historically persecuted [3]. Linguistic, anthropological, historical and genetic evidence points to India as the origin of the Roma populations, which may have left the subcontinent between approximately the 5th and 10th centuries [3]. After leaving India, the Roma migration route passed through Persia, Armenia, Greece and the Slavic-speaking parts of the Balkans [3]. It is generally accepted that the Roma became established in the Balkan region during the 11th and 12th centuries, where they remained for two centuries before they started spreading all over Europe [2], [3]. The dispersion throughout the continent was very fast: by the 15th century Roma had reached the Northern and Westernmost fringes of Europe. Indeed, historical documents testify that by the early 15th century Roma were present in Catalonia and by the end of the century they had spread all over Spain and Portugal. The most important gateway for the entrance of Roma into Iberia is believed to have been the Trans-Pyrenees route. Three more recent migration waves have to be taken into account in the formation of the present-day Roma populations of Western Europe.
First, the dispersion that occurred at the end of the 19th century, after the abolition of Roma slavery in the Romanian Old Kingdom [1], [3], [4]; second, the migration out of Yugoslavia during the 1960s and 1970s; and third, the migration during the last decade, following the political and economic changes in Eastern Europe [5]. Previous genetic studies have confirmed the Indian origin of the Roma and have also described differential admixture with the neighboring European groups [6], [7], [8], [9], [10]. However, these studies lacked accurate representation of Western Roma groups [6] and it was not until recently that genetic studies on Iberian Roma were published [11], [12]. Nonetheless, the specific origin of the Roma within the Indian subcontinent has not yet been elucidated. Linguistic evidence points to North-western India as the source of the proto-Roma population, specifically to the Indo-Aryan ethnic groups in that area [4]. Multilocus comparison of classical genetic markers [13] showed strong affinities of the Roma with Rajput and Punjabi populations from North-western India. Additional genetic evidence relating the Roma populations to this geographical area comes from the study of a private mutation causing primary congenital glaucoma in the Roma, which has also been described in a family belonging to the Jatt, an ethnic group of Indo-Aryan descent from the Pakistani Punjab province [14]. In previous studies, the selection of Indian/Pakistani populations was influenced by linguistic theories on the Roma origins and/or by the availability of genetic data from the Indian subcontinent [15]. Therefore, unbiased coverage of the Indian genetic data is needed to locate the place of origin of the Roma Diaspora in the subcontinent. The present study aims to survey the maternal genetic legacy in the Roma in order to achieve a deeper knowledge of their history.
We provide 214 additional complete mitochondrial DNA (mtDNA) control region sequences from Roma individuals from the Iberian Peninsula and analyze them in the context of previously published studies on other Roma populations. The non-recombinant nature and the phylogeographic resolution of mtDNA make it possible not only to survey the genetic affinities among different Roma groups and host populations, but also to study the migration routes followed by the Roma and their putative origin in the Indian subcontinent.

Materials and Methods

Ethics statement

Written informed consent was obtained from the participants and analyses were performed anonymously. The project obtained ethics approval from the Institutional Review Boards of the institutions involved in the sampling (Conselho Nacional de Ética para as Ciências da Vida (CNECV) in Portugal, and Comitè Ètic d'Investigació Clínica – Institut Municipal d'Assistència Sanitària (CEIC-IMAS) in Spain).

Sample collection

A total of 214 unrelated individuals from the Iberian Peninsula were analyzed. Of these, 138 individuals were sampled in Portugal from 18 different communities in 11 districts, whereas 76 subjects were sampled in the Barrio de la Mina neighborhood in Sant Adrià de Besòs, Barcelona, Catalonia, Spain. All the individuals self-declared as “ciganos/gitanos” (Portugal/Spain) and were asked about their family history in order to avoid close kinship.

Mitochondrial DNA amplification and sequencing

DNA was extracted from fresh blood by the standard phenol-chloroform method. The complete mitochondrial control region (positions 16024–576) was amplified by PCR using the primers L15997 (5′-CACCATTAGCACCCAAAGCT-3′) and H599 (5′-TTGAGGAGGTAAGCTACATA-3′). Both hypervariable segments were sequenced in both directions. For HVR-I (hypervariable region I, positions 16024-16569) the reverse primer was H17 (5′-CCCGTGAGTGGTTAATAGGGT-3′), whereas for HVR-II (positions 1-576) the forward primer was L16555 (5′-CCCACACGTTCCTAAAT-3′). In addition, in the Spanish Roma samples, five single nucleotide polymorphisms (SNPs) in the coding region of mtDNA (H10400, L10873, L12308, L12705 and L11719) were determined using the SNaPshot™ ddNTP Primer Extension Kit (Applied Biosystems) as described in Bosch et al. [16]. Two additional SNPs (L7028 and L11251) were genotyped in the sequences classified as HV/H and R/JT, respectively.
MtDNA variation was compared to the revised Cambridge Reference Sequence (rCRS) [17] and mtDNA sequences were classified into haplogroups according to Van Oven and Kayser [18]. Samples belonging to haplogroup H or with a dubious ascription to this haplogroup were further genotyped for a set of coding region SNPs [19] in order to refine the classification.

Statistical Analyses

In order to locate the Iberian Roma in the context of other European Roma and their corresponding host populations, a database of 1,890 hypervariable region I (HVR-I) sequences (positions 16090 to 16365) was built from previously published studies (hereafter referred to as the Roma-host database). In addition to the 138 Portuguese and 76 Spanish Roma from this study, the Roma-host database contained other sequences gathered from the literature: 39 Spanish Roma [6], [20], 232 Bulgarian and 18 Lithuanian Roma [6], 69 Polish Roma [9], and 205 Hungarian Roma [7]. To cover the corresponding European host populations, we collected 118 Portuguese individuals (unpublished data), 68 Spanish [21], 141 Bulgarian [22], 162 Lithuanian [23], 413 Polish [24], and 211 Hungarian [7]. The Bulgarian Roma populations from Gresham et al. [6] were grouped according to the classification in the original paper (“Bulgaria 1” stands for Roma groups who settled early in Bulgaria, whereas “Bulgaria 2” and “Bulgaria 3” stand for Roma groups settled in Bulgaria coming originally from Wallachia/Moldavia in the 17th–18th centuries and late 19th century, respectively). Intrapopulation genetic diversity parameters such as the number of different sequences (K), sequence diversity (Ĥ) [25], number of polymorphic sites (S) and nucleotide diversity (π) [25], [26] were calculated for the HVR-I using Arlequin software v3.1 [27]. Additionally, the weighted intralineage mean pairwise differences (WIMP) were computed; this statistic measures the mean pairwise differences within each lineage, weighted by the lineage's frequency [28].
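The intrapopulation diversity parameters described above (K, Ĥ, S and π) can be illustrated with a minimal Python sketch. It is not Arlequin's implementation: it assumes equal-length, gap-free aligned sequences, uses Nei's unbiased estimator for sequence diversity, and expresses π as mean pairwise differences per site. The function name `diversity_summary` and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def diversity_summary(seqs):
    """Return K, H, S and pi for a list of equal-length aligned sequences.

    Simplifying assumptions (not taken from the study): no gaps, no
    ambiguous bases, and every site difference counts as one difference.
    """
    n = len(seqs)
    length = len(seqs[0])
    counts = Counter(seqs)

    # K: number of different sequences (haplotypes) in the sample.
    k = len(counts)

    # H: Nei's unbiased sequence diversity, n/(n-1) * (1 - sum of p_i^2).
    h = n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))

    # S: number of alignment columns showing more than one state.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # pi: mean number of pairwise differences, divided by sequence length.
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    pi = total_diffs / len(pairs) / length

    return {"K": k, "H": h, "S": s, "pi": pi}


# Hypothetical toy alignment, for illustration only.
toy = ["ACGT", "ACGT", "ACGA", "TCGA"]
summary = diversity_summary(toy)
```

Real HVR-I data would additionally need a policy for indels and heteroplasmic or ambiguous positions, which this sketch deliberately leaves out.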
Finally, the female effective population sizes were assessed by computing the estimators $\theta_\pi$, $\theta_K$ and $\theta_S$ ($\theta = 2N_{fe}\mu$, where $N_{fe}$ is the female effective population size and $\mu$ is the mutation rate). Whereas $\theta_S$ is based on the number of segregating sites and $\theta_\pi$ on the mean number of pairwise differences, $\theta_K$ relies on the observed number of different lineages. Since the mutation rate for the HVR-I should be the same in all populations, differences in $\theta$ values reflect differences in the female effective population sizes among populations [29]. Pairwise differences between populations were represented in a non-metric multidimensional scaling (NMDS) plot using the STATISTICA 7 package (http://www.statsoft.com) with the default starting configuration. Population genetic structure was tested through analysis of molecular variance (AMOVA) [30] using Arlequin v3.1 software [27]. To shed light on the migration routes that Roma populations may have followed in Europe, groupings by country of residence were compared with groupings by putative migration route. Taking advantage of the phylogeographic information of the mitochondrial sequences and following the same approach as in Mendizabal et al. [31], admixture between Roma and European host populations was estimated. In addition, several Indian geographic areas were evaluated as possible ancestral homelands of the Roma. Two datasets were compiled for these purposes: an extended database of European host sequences comprising 5,096 individuals from Iberia, the Balkans, Hungary, Poland and the Baltic countries (from Additional File 1 in Mendizabal et al. [31]), and a set of Indian sequences collected from Dubut et al. [32] (n = 3,751, excluding Sri Lanka).
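The θ estimators introduced earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the Arlequin code: θ_S uses Watterson's formula, and θ_K is obtained by numerically inverting Ewens' expected number of distinct lineages with a simple bisection. The function names, the bisection bounds and the example values are hypothetical.

```python
def watterson_theta_s(s, n):
    """theta_S = S / a_n, with a_n = sum_{i=1}^{n-1} 1/i (Watterson 1975)."""
    a_n = sum(1.0 / i for i in range(1, n))
    return s / a_n


def expected_k(theta, n):
    """Expected number of distinct lineages in a sample of n (Ewens 1972)."""
    return sum(theta / (theta + i) for i in range(n))


def theta_k(k, n, upper=1e6, iterations=200):
    """Solve expected_k(theta, n) = k for theta by bisection.

    The search interval (0, upper) is an arbitrary choice for this sketch.
    """
    if not 1 < k < n:
        raise ValueError("theta_K is only defined here for 1 < K < n")
    lo, hi = 0.0, upper
    if expected_k(hi, n) < k:
        raise ValueError("upper bound too small for this sample")
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        # expected_k increases with theta, so move the bracket accordingly.
        if expected_k(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical sample: 4 sequences, 2 segregating sites, 3 distinct lineages.
example_theta_s = watterson_theta_s(2, 4)
example_theta_k = theta_k(3, 4)
```

Because θ = 2N_fe·μ and μ is taken to be shared across populations, comparing θ values between groups is enough to rank their female effective sizes without fixing a numerical value for μ.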
Each of the datasets was subdivided into subcontinental regions, and the probability of origin in each region was calculated from the following quantities: $n$, the number of Roma sequences with matches (≥1) in the whole subcontinental dataset of India; $k_i$, the number of times the sequence $i$ is found in the Roma sample; $p_{is}$, the frequency of the sequence $i$ in the specific region of India; and $p_{ic}$, the frequency of the sequence $i$ in the whole subcontinental Indian dataset. Standard deviations were computed for each of the estimates. A median-joining network was generated to infer phylogenetic relationships between European Roma and Indian mtDNA lineages (HVR-I, positions 16090–16365) using Network 4.5.0.0 software (http://www.fluxus-engineering.com/). Mutation weights were in accordance with Santos et al. [33], excluding insertions and deletions. The time to the most recent common ancestor (TMRCA) of the M5a1 subhaplogroup was estimated from the average number of mutations accumulated from an ancestral sequence, as a linear function of time and mutation rate. The age estimates were obtained with Network 4.5.0.0 by considering one transition per 18,845 years in the sequence range of 16090–16365 [34].
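The TMRCA estimate described above can be sketched in a few lines of Python. The calculation averages the number of transitions separating each sampled sequence from the ancestral sequence and multiplies that average by the calibration of one transition per 18,845 years. The function names and the example counts are hypothetical, and the sketch omits the standard error that dating software usually reports alongside the age.

```python
YEARS_PER_TRANSITION = 18_845  # calibration quoted in the text for 16090-16365


def mean_mutations_from_ancestor(mutation_counts):
    """Average number of transitions between each sample and the ancestor."""
    return sum(mutation_counts) / len(mutation_counts)


def tmrca_years(mutation_counts, years_per_mutation=YEARS_PER_TRANSITION):
    """Age estimate as a linear function of the average mutation count."""
    return mean_mutations_from_ancestor(mutation_counts) * years_per_mutation


# Hypothetical counts for five sequences of one lineage (not study data):
# two identical to the ancestor, two with one transition, one with two.
example_counts = [0, 0, 1, 1, 2]
example_age = tmrca_years(example_counts)
```

In this toy case the average is 0.8 transitions per sequence, so the age comes out at roughly 15,000 years. With real data the counts would be read off the median-joining network rather than supplied by hand.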

Discussion

The pattern of mtDNA diversity in the Roma from Europe retains remarkable signs of their recent demographic past. By the fourteenth century, many Roma groups are recorded as established in the Balkan Peninsula. Departing from this region, a chain of group fragmentation and migration events would have led to their spread throughout Europe, in a movement so large and fast that only one century elapsed until their presence was documented even in the most peripheral regions of Europe, from the North-east to the South-west corners of the continent. It is acknowledged that during this itinerant period, Roma usually travelled in small groups before arriving and settling in new places, from where new waves of branching and migration across the region often started [2], [42]. In agreement with these accounts, our results show that the maternal effective population sizes in the Roma are strikingly low in comparison to the host populations. Whereas host European populations and Indians [32] show strong molecular signatures of population expansion, Roma groups appear to have remained constant in size. Additionally, populations which have experienced one or several founder events are expected to show lower $\theta_K$ values than their source populations. Since Indian populations tend to exhibit much higher values than the European hosts considered here [32], the demographic parameters found in Roma testify to strong founder effects relative to both putative parental populations. Notably, among the European Roma, diversity decreases from Eastern groups (represented by the Bulgarian and Hungarian Roma) towards Western and Northern groups, fitting the expected accumulation of drift effects during successive population splits and migrations along the dispersal of Roma within Europe.
Despite the persistence of founder Romani maternal lineages in different Roma groups, bottleneck events profoundly shifted the frequencies of the haplogroups contained in the ancestral pool, contributing to strong differentiation between groups. Our results further suggest that the Iberian, Polish, and Lithuanian Roma were derived from the same migration wave, which, probably due to the low effective population size of fragmented groups, became strongly differentiated from the Central/Balkan Roma from which it originated. This differentiation process implied the loss of some lineages in parallel with a random, dramatic increase in the frequency of others. The random accumulation of founder effects does not permit the accurate identification of all possible founder lineages in the European Roma, since many of them may be present at low frequencies in the Balkan Roma but absent, due to loss, in Roma from other regions of Europe. Even so, the conservative identification of the founder lineages M5a1, M18, M25, M35b, U3, H7, J1b, J1b3, J1c1, X2e and X2d allowed us to obtain maximum admixture proportions with host populations. Overall, the incorporation of female lineages from non-Roma appears to have been low, since most of the sequences present in current Roma are rare in the European host populations, suggesting that the majority of lineages were already present in Roma before their arrival in Europe. The phylogeography of the Roma founder lineages demonstrates a broad West Eurasian origin (except for those belonging to macrohaplogroup M) not confined to Western Europe. In fact, haplogroups such as HV, pre-HV, J-T, U-K, I, W and X are present at their highest frequencies in the Anatolian/Caucasus and Iranian regions [38] and are moreover still present at relatively high frequencies in the Indus Valley and Central Asia [38], [43]. Given this distribution, higher phylogeographic resolution is needed to distinguish among lineages from such a broad geographical area.
The upper limits of admixture rates in the maternal genetic pool of the Roma range from low (11%) to moderate (50%). Unfortunately, similar studies on paternal lineages of European Roma populations are confined to the Iberian Peninsula [11]. Our estimates of maternal admixture in Iberian Roma (30%) are lower than estimates for the Y-chromosome (47%) reported by Gusmão et al. Anthropological records show that marriages with non-Roma are usually avoided in the Roma communities [1], although non-Roma females are more frequently accepted in the Roma groups than non-Roma males [44], [45]. Unexpectedly, we detect a lower admixture rate in the maternal pool than that reported for paternal lineages. However, the high values of both estimates show that the amount of admixture observed contradicts the stereotype of Roma constituting closed endogamous groups. Our results may indicate that the social rules practiced by the Roma have varied in time and space according to different social constraints. Nevertheless, the proportions of admixture in the maternal and paternal genetic pools have to be considered rough approximations, since they depend on the phylogeographic resolution of the mtDNA sequences and Y-chromosome haplotypes. Further studies providing better phylogeographic resolution and better coverage of Indian and European populations may give more accurate estimates of admixture rates. This would help confirm whether asymmetry exists between maternal and paternal lineages and whether different European Roma groups show similar patterns. In contrast, the more restricted phylogeography of haplogroup M points to the Indian subcontinent as the origin of a substantial fraction of Roma maternal lineages. A match analysis with the Roma M-founder lineages using a database of more than 3,700 Indian sequences allowed us to identify North-western India, and specifically the Punjab region, as the putative homeland of the Roma Diaspora.
This finding is in accordance with previous linguistic and cultural evidence [4], as well as with the recent genetic hint provided by the identification of a private mutation in the Roma shared by a Jatt family in the Punjab province of Pakistan [14]. To our knowledge, this is the first comprehensive study comparing different Indian subcontinental areas in order to assess the origin of the Roma. Better coverage of India and surrounding areas in future studies will make it possible to determine the contribution of different tribes or castes from the Punjab area to the ancient Roma population who left India. In summary, our findings confirm the high genetic heterogeneity of the Roma groups, which has been shaped by several founder events combined with low effective population sizes, creating a pattern that mirrors the migration routes the Roma followed within Europe. We show that most maternal Roma lineages are of non-European origin, pointing to limited admixture with surrounding populations. Finally, the phylogeographic information provided by the Indian female lineages found in the Roma led us to trace the ancient homeland of the European Roma back to the Punjab state, in North-western India, confirming previous linguistic and anthropological accounts.

Acknowledgments

We are very grateful to all the volunteers who contributed their DNA to this study. We thank Ana González-Neira (CNIO, Madrid) for her valuable help in the sampling of the Spanish Roma. We thank Mònica Vallés, Stéphanie Plaza and Roger Anglada (Universitat Pompeu Fabra) for technical assistance and Urko M. Marigorta (Universitat Pompeu Fabra) for very helpful comments.

Author Contributions

Conceived and designed the experiments: LG AA MJP DC. Performed the experiments: IM CV A. Gusmão CA VG A. Goios LA. Analyzed the data: IM CV FC WP MJP DC. Wrote the paper: IM MJP DC.