GPS implementation

The GPS method consists of two steps. In the first step, carried out once, we constructed a diverse panel of worldwide populations and analysed them using an unsupervised ADMIXTURE analysis. This analysis yielded allele frequencies for K hypothetical populations whose genotypes can be simulated to form putative ancestral populations. Next, from the data set of worldwide populations, we constructed a smaller data set of reference populations that are both genetically diverse and have resided in their current geographical region for at least a few centuries. These populations were next analysed in a supervised ADMIXTURE analysis that calculated their admixture proportions in relation to the putative ancestral populations (Fig. 1).

Figure 1: Admixture analysis of worldwide populations and subpopulations. Admixture analysis was performed for K=9. For brevity, subpopulations were collapsed. The x axis represents individuals from populations sorted according to their reported ancestries. Each individual is represented by a vertical stacked column of colour-coded admixture proportions that reflects genetic contributions from putative ancestral populations.

Looking at the resulting graph, we found that all populations exhibit a certain amount of admixture, with Puerto Ricans and Bermudians exhibiting the highest diversity and Yoruba the least. We further found distinct substructure among geographically adjacent populations that decreased in similarity with distance, suggesting that populations can be localized based on their admixture patterns. To correlate the admixture patterns with geography, we calculated two distance matrices between all reference populations based on their mean admixture fractions (GEN) and their mean geographic distances (GEO) from each other. Using these distance matrices, we calculated the relationship between GEN and GEO (Equation 1).
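The construction of the two distance matrices can be illustrated with a minimal sketch. It assumes that GEN is the Euclidean distance between mean admixture vectors, that GEO is the great-circle (haversine) distance between population centres, and that the relationship between them is fitted as a straight line; Equation 1 is not reproduced in this section, so the linear form, the function names and the toy populations below are all illustrative assumptions rather than the published procedure.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two admixture vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical toy data: name -> ((latitude, longitude), mean admixture fractions)
reference = {
    "pop_a": ((0.0, 0.0), [0.80, 0.10, 0.10]),
    "pop_b": ((0.0, 10.0), [0.60, 0.30, 0.10]),
    "pop_c": ((10.0, 0.0), [0.50, 0.10, 0.40]),
    "pop_d": ((30.0, 30.0), [0.10, 0.20, 0.70]),
}

# Collect one (GEN, GEO) pair per pair of reference populations.
gen, geo = [], []
for a, b in combinations(reference, 2):
    (lat_a, lon_a), adm_a = reference[a]
    (lat_b, lon_b), adm_b = reference[b]
    gen.append(euclidean(adm_a, adm_b))
    geo.append(haversine_km(lat_a, lon_a, lat_b, lon_b))

slope, intercept = fit_line(gen, geo)
print(f"GEO ~ {slope:.0f} * GEN + {intercept:.0f} (km)")
```

The fitted slope and intercept are what a later step would use to translate a genetic distance into an approximate geographic one.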

In the second step, GPS inferred the geographical coordinates of a sample of unknown origin by performing a supervised ADMIXTURE analysis for that sample with the putative ancestral populations. It then calculated the Euclidean distance between the sample’s admixture proportions and GEN. The shortest distance, representing the test sample’s deviation from its nearest reference population, was subsequently converted into geographical distance using the inferred relationships (Equation 1). The final position of the sample on the map was calculated by a linear combination of vectors, with the origin at the geographic centre of the best matching population weighted by the distances to the M nearest reference populations and further scaled to fit on a circle with a radius proportional to the geographical distance obtained by Equation 1 (see Calculating the bio-origin of a test sample in the Methods).
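The placement step described above can be sketched as follows. This is a simplified illustration, not the published algorithm: it treats positions as planar (x, y) coordinates in km instead of longitude and latitude, assumes a linear conversion from genetic to geographic distance, and uses the ratio of the smallest genetic distance to each neighbour's genetic distance as a hypothetical weighting scheme. The exact weights are defined in the Methods, which are not part of this section.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two admixture vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def locate(sample, references, slope, intercept, m=3):
    """Place a sample near its best-matching reference population.

    references: list of ((x_km, y_km), admixture_vector) tuples.
    slope, intercept: assumed linear conversion from genetic to geographic distance.
    m: number of nearest reference populations to use.
    """
    ranked = sorted(references, key=lambda ref: euclidean(sample, ref[1]))[:m]
    best_xy, best_adm = ranked[0]
    d_min = euclidean(sample, best_adm)
    if d_min == 0:
        return best_xy  # identical to a reference population
    radius = max(0.0, slope * d_min + intercept)
    # Sum the vectors pointing from the best match to the other neighbours,
    # giving genetically closer neighbours a larger pull (weights in (0, 1]).
    dx = dy = 0.0
    for xy, adm in ranked[1:]:
        w = d_min / euclidean(sample, adm)
        dx += w * (xy[0] - best_xy[0])
        dy += w * (xy[1] - best_xy[1])
    norm = math.hypot(dx, dy)
    if norm == 0:
        return best_xy
    # Rescale the combined vector so the sample lies on a circle of the given radius.
    return (best_xy[0] + radius * dx / norm, best_xy[1] + radius * dy / norm)

# Hypothetical two-population example.
refs = [((0.0, 0.0), [1.0, 0.0]), ((100.0, 0.0), [0.0, 1.0])]
print(locate([0.9, 0.1], refs, slope=100.0, intercept=0.0))
```

In this toy case the sample is pulled from the best match towards the second population by a distance that grows with its genetic deviation from the best match.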

Biogeographical prediction for worldwide individuals

We applied GPS to approximately 600 worldwide individuals collected as part of the Genographic Project and the 1000 Genomes Project and genotyped on the GenoChip (Supplementary Table 1). We included the highly heterogeneous populations of Kuwait22, Puerto Rico and Bermuda23, as well as communities from the same country, such as Peruvians from Lima and indigenous highland Peruvians. We tested the accuracy of GPS predictions using the leave-one-out procedure. The resulting figure bears a notable resemblance to the world’s geographic map (Fig. 2). Individuals from the same geographic regions clustered together, and populations from different countries were largely distinguished. Assignment accuracy was determined for each individual based on whether the predicted geographical coordinates were within the political boundaries of the country and regional locations. GPS correctly assigned 83% of the individuals to their country of origin, and, when applicable, ~66% of them to their regional locations (Fig. 3, Supplementary Table 2), with high sensitivity (0.75) and specificity (0.99). These results supported the known connection between admixture patterns and geographic origins in worldwide populations5,6,7,8.
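One way to picture the leave-one-out procedure at the individual level is to recompute a population's reference admixture vector without the individual being tested. The sketch below assumes that a reference population is summarised by the mean of its members' admixture fractions; the function names and numbers are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def leave_one_out_mean(vectors, index):
    """Mean admixture vector of a population after removing one individual."""
    rest = vectors[:index] + vectors[index + 1:]
    if not rest:
        raise ValueError("cannot leave out the only member of a population")
    return mean_vector(rest)

# Hypothetical population of three individuals with two admixture components.
population = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
for i, individual in enumerate(population):
    reference = leave_one_out_mean(population, i)
    print(individual, "is tested against", reference)
```

Each individual is then predicted against a reference that carries no information from that individual's own genotype.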

Figure 2: Geographic origin of worldwide populations. (a) Small coloured circles with a matching colour to geographical regions represent the 54 reference points used for GPS predictions. Each circle represents a geographical point with longitude and latitude and a certain admixture proportion. The insets provide magnification for dense regions. (b) GPS individual assignment based on 54 points. Individual label and colour match their known region/state/country of origin using the following legend: BE (Bermudian), BU (Bulgarian), CHB (Chinese), DA (Danish), EG (Egyptian), FIN (Finnish), GO (Georgian), GR (German), GK (Greek), I-S/N/W/E (India, Southern/Northern/Western/Eastern), IR (Iranian), ID/TSI (Italy: Sardinian/Tuscan), JPT (Japanese), LWK (Kenya: Luhya), KU (Kuwaiti), LE (Lebanese), M-O/B/N/D/T (Madagascar: Antananarivo/Ambilobe/Manakara/Andilambe/Toliara), X-G/H/M (Mexico: Guanajuato/Hidalgo/Morelos), MG (Mongolian), N-S/K/H/T (Namibia: Southeastern/Kaokoveld/Hereroland/Tsumkwe), YRI (Yoruba from West Africa), P-C/N (Papuan: Papua New Guinea/Bougainville-Nasioi), PH/PEL (Peruvian: Highland/Lima), PR (Puerto Rican), RO (Romanian), CA (Northern Caucasian), R-M/T/A (Russians: Moscow/Tatar/Altaian), S-J/U/S/K (RSA: Johannesburg/Underberg/Northern Cape/Free State), IBS (Iberian from Spain & Portugal), PT (Pamiri from Tajikistan), TU (Tunisian), UK (British from United Kingdom), VA (Vanuatu), KHV (Vietnam). Note: occasionally all samples of certain populations (for example, Vietnamese) were predicted to the same spot and thus appear as a single sample.

Figure 3: Accuracy of assigning populations to their origin, coloured dark blue for countries and light blue for regional locations. Populations for which regional data were available are marked with an asterisk. The average accuracy per population is shown in red and is calculated across populations given equal weights.

In terms of distances from the point of true origin, GPS placed 50% of the samples within 87 km from the origin with 80 and 90% of them within 645 and 1,015 km from the origin, respectively. GPS further discerned geographically adjacent populations known to exhibit high genetic similarity, such as Greeks and Italians. The prediction accuracy was correlated with both countries’ area (N=600, r=0.3, Student’s t-test P-value=0.03) and admixture diversity (N=600, r=−0.34, Student’s t-test P-value=0.01).
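Summary figures of this kind (the distance within which 50, 80 and 90% of samples fall) can be reproduced from a list of per-individual prediction errors. The sketch below uses a linear-interpolation percentile; the interpolation rule and the error values are assumptions for illustration, not the authors' data or method.

```python
import math

def percentile(values, q):
    """Percentile of values for q in [0, 100], using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def fraction_within(values, limit_km):
    """Fraction of prediction errors that do not exceed limit_km."""
    return sum(v <= limit_km for v in values) / len(values)

# Hypothetical distances (km) between predicted and true origins.
errors_km = [12.0, 40.0, 87.0, 150.0, 600.0, 980.0, 1500.0]

for q in (50, 80, 90):
    print(f"{q}% of samples lie within {percentile(errors_km, q):.0f} km")
print(f"share within 100 km: {fraction_within(errors_km, 100.0):.2f}")
```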

Individuals belonging to recently mixed populations, such as Kuwaitis and Bermudians, proved to be the most difficult to correctly predict, because their mixing was temporally brief and insufficient to generate a distinct regional admixture signature. As a result, such individuals are more likely to be placed within their ancestral countries of origin, which is incorrect according to our scoring matrix. For example, Kuwaiti individuals whose ancestors come from Saudi Arabia, Iran and other regions of the Arabian Peninsula22 were predicted to come from these regions rather than their current state.

To test GPS’s accuracy with individuals from populations that were not included in the reference population set, we conducted two analyses. We first repeated the previous analysis using the leave-one-out procedure at the population level. As expected, GPS accuracy decreased with 50% of worldwide individuals predicted to be 450 km away from their true origin. The predicted distance increased to 1,100 and 1,750 km for 80 and 90% of the individuals, respectively (Fig. 4a). Because GPS best localizes individuals surrounded by M genetically related populations, populations from island nations (for example, Japan and United Kingdom) or populations whose most related populations were under-represented in our reference population data set (for example, Peru and Russia) were most poorly predicted. Consequently, the median distances to the true origin were much smaller for individuals residing in Europe (250 km), Africa (300 km) and Asia (450 km) due to their being more commonly represented in the reference population data set compared with Native Americans and Oceanians. These results represent the upper limit of GPS’s accuracy when the specific population of the test individual is absent from the reference population data set.

Figure 4: Predicted distance from true origin for each individual using the leave-one-out procedure at the population level. Calculated for individuals of the Genographic (left) and the HGDP (right) data sets.

Next, we analysed over 600 individuals from the Human Genome Diversity Panel (HGDP) whose subpopulations and populations reside in countries that are not covered by our reference population data set (Supplementary Table 1). GPS’s accuracy further decreased with 50% of worldwide individuals predicted to be 1,250 km away from their true origin (Fig. 4b, Supplementary Data 1). As before, geographically remote populations were less accurately predicted with a higher error for regions that were poorly represented in the reference population data set. For example, the Brazilian Surui were predicted to be located 4,800 km away from their true origin. This was not surprising because the closest population in the reference population data set resided in Central America. By contrast, remote European populations that reside on islands or along ocean shores and are not surrounded by other populations (for example, Orcadians and French) were predicted to be ~1,200 km from their true origin, due to the higher density of nearby populations in the reference population data set. The results were also affected by populations with a history of recent migrations, such as Bedouins, Druze24 and the Pakistani Hazara, the latter being suspected of having some Mongolian ancestry25. Adding the HGDP populations to our reference population data set yielded similar results to those reported in Fig. 3. Overall, these results illustrated the dependency of GPS on the density of the reference population data set and indicated that accuracy improves with the inclusion of additional populations residing in geographically distant or isolated regions.

GPS applicability using a thinner marker set

PC-based applications have long aspired to provide accurate results down to the level of an individual’s village. However, due to different factors such as cohort effects26, these solutions have been mostly ad hoc. In fact, PC solutions were shown to discern only populations of selected cohorts, such as Italian villagers27 or Europeans11,12. When individuals of various ancestries are included in the cohort, the PCs are altered to the point where none of the individuals are correctly predicted to their country of origin or continental regions.

To test the precision of GPS’s predictions given finer regional annotation, we assessed 243 Southeast Asians and Oceanians and 200 Sardinians from 10 villages (4–180 km apart) using subsets of 40,000 and 65,000 GenoChip markers, respectively. We first tested whether admixture frequencies calculated over a smaller set of GenoChip markers provided sufficient accuracy. For this assessment, we carried out a series of admixture analyses in a supervised mode for nine worldwide populations from the 1000 Genomes Project using smaller sets of markers (95,000, 65,000 and 40,000) and compared the admixture proportions with those obtained using the complete marker set (Fig. 5). We found small differences in the admixture proportions that slowly increased for smaller GenoChip marker sets. The largest observed difference (3%) for the smallest number of markers used in our analyses (40,000) was within the natural variation range of our populations and did not affect the assignment accuracy. We were thus able to supplement the reference population set with the newly tested populations.
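The comparison between marker sets amounts to measuring how far each individual's admixture coefficients move when fewer markers are used. A minimal sketch, assuming the coefficients from the full and reduced marker sets are available as matching lists of per-individual vectors (the values below are invented):

```python
def admixture_bias(full, reduced):
    """Mean and maximum absolute difference between two sets of admixture vectors.

    full, reduced: lists of per-individual admixture vectors in the same order.
    """
    diffs = [
        abs(a - b)
        for vec_full, vec_reduced in zip(full, reduced)
        for a, b in zip(vec_full, vec_reduced)
    ]
    return sum(diffs) / len(diffs), max(diffs)

# Hypothetical coefficients for two individuals and two ancestral components.
full_set = [[0.50, 0.50], [0.90, 0.10]]
reduced_set = [[0.48, 0.52], [0.90, 0.10]]

mean_diff, max_diff = admixture_bias(full_set, reduced_set)
print(f"mean |difference| = {mean_diff:.3f}, max |difference| = {max_diff:.3f}")
```

Repeating this calculation for each reduced marker set gives one mean and one maximum per set, which is the kind of summary plotted in Fig. 5.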

Figure 5: Estimation of the bias in the admixture proportions of nine 1000 Genomes populations analysed over a reduced set of GenoChip markers. The mean (left) and maximum (right) absolute difference in individual admixture coefficients are shown.

Fine-scale biogeography down to home island

Next applied to Southeast Asians and Oceanians (Supplementary Table 3, Supplementary Fig. 1), GPS’s prediction accuracy was stringently estimated as the individual assignment to the region occupied by one’s population or subpopulation. The prediction accuracy for Han Chinese (64%) and Japanese (88%) obtained here using ~40,000 markers was the same as that obtained in the complete data set, as expected (Fig. 5).

GPS’s assignment accuracy for the remaining Southeast Asian and Oceanian populations (87.5%) and subpopulations (77%) (Fig. 6) was higher than that obtained for worldwide populations (Fig. 3). These results reflect GPS’s greatest advantage compared with alternative methods. Unlike PCA and SPA whose accuracy is lost with the addition of samples of various ancestries12, GPS predictions increase in accuracy when provided a more comprehensive reference set.

Figure 6: Prediction accuracy for Southeast Asian and Oceanian subpopulations and populations. Pie charts depict correct mapping at the subpopulation level (red), population level (black) and incorrect mapping (white).

A few populations stand out in that they are not reliably assigned to their region of origin (Fig. 6). Polynesians and Fijians in particular are not well predicted and incur the highest misclassification rates (47 and 40%, respectively) mainly to Nusa Tenggara and the Moluccas Islands. These results are not surprising given two main issues. First, Polynesian populations, East Polynesian populations in particular, are not well represented in the large databases from which the GenoChip’s ancestry informative markers were ascertained17, so a likely ascertainment bias exists for Polynesia. The second issue relates to the complex settlement history of the Oceania region. Interestingly, this aspect of population history is clearly reflected in the results produced.

The component identified in the admixture analysis (Supplementary Figure 1) as representing Oceania (pink) most likely represents the early migrants into the region some 50,000 years ago. This component is dominant in populations from New Guinea and Australia, which were joined together, making up the ancient landmass of Sahul, until approximately 11,000 years ago when they became separated due to rising sea levels. This Oceanic signature is also seen in Island Southeast Asia, such as Nusa Tenggara and the Moluccas, which indicates the likely pathway taken to Sahul. The Remote Oceanic settlement, represented here by Fiji and Polynesia, is much more recent and has been associated with the Neolithic expansion of peoples out of East Asia, through Island Southeast Asia and ultimately through Near Oceania and the rest of the Pacific including Polynesia.

The first people to arrive in Remote Oceania (the region east of the Solomon Islands) did so only about 3,000 years ago and are associated with the expansion of the Lapita cultural complex as far east as Fiji, Samoa and Tonga, on the edge of the Polynesian Triangle. Mitochondrial DNA and Y chromosomal data from Remote Oceanic populations, Polynesians in particular, indicate mixed ancestry28. MtDNA suggests primarily Island Southeast Asian ancestry for Remote Oceania, indicated by high frequencies of mtDNA haplogroup B4a1a and descendent lineages, with some Near Oceanic contributions (identified by haplotypes belonging to haplogroups P and Q). Y chromosome studies, however, show a stronger Near Oceanic component in Polynesian ancestry, with some Southeast Asian contribution29. Genome-wide studies are consistent with this mixed ancestry for Polynesian, Remote Oceanic and some Near Oceanic populations30. Our findings, therefore, represent the heterogeneity of Remote Oceanian populations due to their long history of expansions and settlements, which is reflected by their complex population structure (Supplementary Fig. 1) and GPS predictions (Fig. 6).

Fine-scale biogeography down to home village

The island of Sardinia (24,090 km²) was first settled 14,000 years ago and experienced a complex demographic history that includes low effective sizes due to plagues and wars and scant matrimonial movement, which accentuated stochastic effects. Interestingly, Sardinians have been described both as a genetic isolate with endogamy peaking in the central-southern and mountain areas with little internal mobility31 and as a heterogeneous population when microareas or close single villages are considered32,33,34.

Applied to Sardinian villagers (Supplementary Fig. 2), GPS correctly placed a quarter of the Sardinians in their village, as well as half within 15 km and 90% of individuals within 100 km of their homes (Fig. 7, Supplementary Fig. 3). As expected from the high percentages of matrilocal marriages35 and residence36,37 common to Sardinia, the locations of females were better predicted than those of males, with 30% placed in their exact village of origin compared with 10% of the males.

Figure 7: The geographical location of the examined Sardinian villages. The mean predicted distances (km) from the village of origin are marked by bold (females) and plain (males) circles.

Our findings revealed the Sardinians to have a genetic microheterogeneous structure affected by both altitude and physical location. The prediction accuracy, measured as the distance to the village of origin (Supplementary Fig. 3), is detailed in Supplementary Table 4. The correlations between altitude and the distance from the village of origin are shown in Supplementary Table 5. The average predicted distances from the villages roughly corresponded to Sardinian subregions (Fig. 7). Unsurprisingly, the most precise positioning was obtained for individuals coming from Ogliastra (east Sardinia), since this area is characterized by high altitude, high endogamy and relative cultural isolation, whereas populations from the western shores are considered to be more admixed. We found a significantly negative correlation between altitude and the predicted distance to villages for males (N(coastal)=96, r(coastal)=−0.21, Student’s t-test P-value(coastal)=0.019; N(inland)=27, r(inland)=−0.38, Student’s t-test P-value(inland)=0.024). The results for females were marginally significant for all villages (N=126, r=−0.14, Student’s t-test P=0.06) and inland villages (N=29, r=−0.27, Student’s t-test P=0.08), but not for coastal villages (N=97, r=−0.1, Student’s t-test P=0.14). These results are expected from the high proportion of endogamy (64.1% in the plains, 82.8% in the mountains), which is correlated with increasing altitude35. This correlation was particularly high in inland compared with coastal villages. Our results not only fit with the genetic and demographic characteristics of Sardinians but also resolve conflicting findings due to the matrilocal matrimonial structure36,38,39,40. Finally, because GPS carries out a sample-independent analysis, predictions for worldwide individuals were largely identical to those previously reported (Fig. 3) in both analyses.
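For readers who want to relate the reported correlation coefficients to a Student's t-test, the standard test statistic for a Pearson correlation is t = r·sqrt((N−2)/(1−r²)) with N−2 degrees of freedom. The sketch below computes it for one of the reported pairs of N and r. Whether the published P-values are one- or two-sided is not stated in this section, so the sketch stops at the statistic; the function name is hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic and degrees of freedom for testing a Pearson correlation r
    observed in n samples against the null hypothesis of no correlation."""
    if n < 3:
        raise ValueError("at least three samples are required")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return t, n - 2

# Values reported above for males from coastal villages.
t, df = correlation_t_statistic(-0.21, 96)
print(f"t = {t:.2f}, df = {df}")
# A P-value would then come from the t distribution with df degrees of
# freedom, for example via scipy.stats.t.sf if SciPy is available.
```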

Comparing the performances of GPS with SPA

The SPA tool explicitly models the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space12. SPA can model the spatial structure over a sphere to predict the spatial structure of worldwide populations and was designed to operate in several modes. In one mode, when the geographic origins of all or some of the individuals are known, these individuals can be used as a training set. Using the latter approach, Yang et al.12 trained the SPA model on 90% of the individuals to predict the locations of the remaining 10%. SPA was reported to predict the geographic origins of individuals of mixed ancestry, which cannot be done with PCA12.

When analysing a Europeans-only data set, SPA was successful in assigning close to 50% of the individuals to their correct country of origin. However, when worldwide individuals were analysed, SPA distorted the distances between continents and failed to assign even a single individual to his home country (Fig. 8a), the misclassification of Melanesians as Indians12 being the most obvious example.

Figure 8: A comparison of SPA and GPS prediction accuracy for continental regions. The mean longitude and latitude for each population were calculated by averaging individual spatial assignments (N=596). After assigning populations to continental regions, the mean and s.d. were calculated based on the predicted coordinates for each region. Dashed lines mark s.d. (a) SPA prediction accuracy for continental regions obtained from the results of Yang et al.12 (their Supplementary Table 1). The mean coordinates are marked with a triangle (expected) and square (predicted by SPA). (b) Comparing the results for worldwide populations analysed here for SPA (square), GPS (circle) and for the real coordinates (triangle).

We compared the accuracy of GPS with that of SPA by providing SPA with conditions favourable to its operation. We used the more rigorous application involving a training data set and ran SPA in two steps, as described by Yang et al.12. First, we provided SPA with the genotype file of worldwide populations (596 individuals, 127,361 SNPs) with their complete geographic locations (Supplementary Table 2) without any missing data. When executed, SPA produced the model file that would be later used to predict the geographical locations. Next, we provided SPA with the model file and the same genotype file. Because of the absence of missing geographical coordinates, SPA was expected to yield geographic coordinates that closely resemble those it received in the first step. However, SPA failed to assign 98% of individuals to their countries and placed most individuals in oceans or in the wrong continental region (Fig. 8b). By comparison, GPS was used in the leave-one-out individual mode, in which the geographic coordinates of the populations in the reference population data set were recalculated without the test individual. GPS assigned nearly all individuals to their continental regions, countries and regional locations with a high degree of accuracy.

We tested SPA with four additional data sets and calculated the assignment accuracy for each one. When providing SPA with the combined data set of worldwide individuals and Southeast Asian and Oceanian individuals (~40,000 SNPs) with their complete geographical coordinates (Supplementary Table 3), we obtained a similar assignment accuracy of 2% for the worldwide individuals, although with different coordinates, and an assignment accuracy of 1.5% for the remaining individuals. When testing only Southeast Asian and Oceanian individuals, the assignment accuracy was 4.8%. Unfortunately, we were unable to estimate the prediction accuracy for about 20% of these samples because SPA’s results (e.g., Latitude=−2, Longitude=256) fell outside the valid coordinate range of a sphere. Finally, we calculated the assignment accuracy for worldwide individuals and Sardinian individuals (~65,000 SNPs), again by providing complete geographical data (Supplementary Table 4). SPA coordinates for worldwide individuals varied from our previous analyses, being accurate for three individuals (0.5%) but completely inaccurate (0%) for Sardinians whether they were tested with the worldwide individuals or separately.

We suspect that the inaccuracy of SPA predictions in the tested mode of operation results from the predictions for test individuals being affected by other individuals in the cohort. As such, it suffers from the same limitations as PCA when analysing a diverse cohort. Even if a single individual from a different continent is included in the data set, SPA’s accuracy drops to 0%. In other words, for SPA to correctly assign every other European individual to his country, the individuals need to be a priori confirmed as Europeans, which makes SPA impractical.

A comparison of the runtime and CPU timings was done on a Linux machine (x86_64) with a 2.66 GHz Intel(R) Xeon(R) E5430 processor and 8 GB memory. The SPA runtime (wall time) was well over 3 h, compared with 6 min for GPS, including the initial step of calculating admixture proportions using ADMIXTURE.