The Australian continent holds some of the earliest archaeological evidence for the expansion of modern humans out of Africa, with initial occupation at least 40,000 y ago. It is commonly assumed that Australia remained largely isolated following initial colonization, but the genetic history of Australians has not been explored in detail to address this issue. Here, we analyze large-scale genotyping data from aboriginal Australians, New Guineans, island Southeast Asians and Indians. We find an ancient association between Australia, New Guinea, and the Mamanwa (a Negrito group from the Philippines), with divergence times for these groups estimated at 36,000 y ago, and supporting the view that these populations represent the descendants of an early “southern route” migration out of Africa, whereas other populations in the region arrived later by a separate dispersal. We also detect a signal indicative of substantial gene flow between the Indian populations and Australia well before European contact, contrary to the prevailing view that there was no contact between Australia and the rest of the world. We estimate this gene flow to have occurred during the Holocene, 4,230 y ago. This is also approximately when changes in tool technology, food processing, and the dingo appear in the Australian archaeological record, suggesting that these may be related to the migration from India.

Genetic and archaeological evidence suggests that anatomically modern humans expanded from Africa (1, 2) and colonized all corners of the world, replacing with limited gene flow local archaic Homo populations, such as Neanderthals (3) and the Denisovans (4, 5). The expansion of modern humans apparently proceeded via two routes: the northern dispersal that gave rise to modern Asians 23,000–38,000 y ago (6, 7) and an earlier southern dispersal, which followed the coast around the Arabian Peninsula and India, to the Australian continent (5, 7). It has been suggested that the ancestors of aboriginal Australians and Papua New Guineans diverged from the ancestral Eurasian population 62,000–75,000 y ago (7) and, based on archaeological evidence, reached Sahul (the joint Australia–New Guinea landmass) by at least 45,000 y ago (8⇓–10). Whereas coastal New Guinea (but not the highlands) subsequently experienced additional gene flow from Asia (associated with the Austronesian expansion) (9), the extent of isolation of Aboriginal Australians following initial colonization is still debated. The prevailing view is that until the arrival of the Europeans late in the 18th century, there was little, if any, contact between Australia and the rest of the world (7, 11, 12), although some mtDNA and Y chromosomal studies suggested some gene flow to Australia from the Indian subcontinent during the Holocene (13⇓–15). Here, we analyze genome-wide SNP data and find a significant signature of gene flow from India to Australia, which we date to about 4,230 y ago.

We assembled genome-wide SNP data from aboriginal Australian samples from the Northern Territories (AUA) (5, 13), highlanders of Papua New Guinea (NGH) (16), 11 populations from island Southeast (SE) Asia (5), and 26 populations from India (17), including Dravidian speakers from South India (5, 18). We also included data from the Yorubans from Ibadan; Nigeria (YRI); individuals of northern and western European ancestry living in Utah (CEU); Han Chinese individuals from Beijing, China (CHB); and Gujarati Indians from Houston, TX (GIH) (19). The final dataset comprised 344 individuals (Table S1 and Fig. 1); and after data cleaning and integration, we had 458,308 autosomal SNPs for the analysis.

Results

Genetic Relationships Between Populations. First, to place aboriginal Australians into a global context, we carried out principal component analysis (PCA) (20). The first two principal axes are driven by genetic differentiation between Africans, Australians/Papua New Guineans, and Europeans/Indians/Asians (Fig. S1A). AUA are close to NGH but extend toward the European/Indian/Asian grouping, suggesting a common origin with the former and admixture with the latter. AUA and NGH are separated along PC4, after the separation of CEU and CHB along PC3 (Fig. S1B). The prior separation of CEU and CHB could suggest that AUA and NGH diverged after European and Asian populations, which, according to archaeological evidence (21) and estimates based on various genetic markers, happened between 37 and 60 kya (6, 7, 22). Alternatively, this result could suggest smaller Ne/stronger drift in AUA and NGH, or reflect ascertainment bias because most of the SNPs on the Affymetrix arrays were ascertained in individuals of European and African ancestry. To better understand the relationships among AUA, NGH, and neighboring populations from Island SE Asia, we carried out PCA on these populations only (Fig. S1C). PC1 separates NGH and AUA from the other groups, whereas, interestingly, PC2 separates AUA and the Mamanwa (MWA) (a Negrito group from the Philippines) from NGH and the other SE Asian groups. PC3 groups the MWA with NGH but separates the Australians (Fig. S1D). The almost-identical eigenvalues for PC2 and PC3 suggest that the Mamanwa are equidistant from AUA and NGH (Fig. S1 C and D); overall, these results are consistent with previous indications of shared ancient ancestry among Australians, NGH, and the Mamanwa (5, 23).

Divergence-Time Estimation. We next examined genome-wide patterns of linkage disequilibrium (LD) to estimate the divergence times among populations and investigate past population size changes (24, 25). Because LD is a property of genomic regions and not of individual SNPs, it is not expected to be strongly affected by ascertainment bias (25, 26). For this analysis, we binned the genome-wide data (588,335 SNPs) into 50 evenly spaced recombination distance categories (0.005–0.25 cM) from AUA, NGH, MWA, Dravidian speakers from South India, and the CEU, CHB, GIH, and YRI populations. Genetic distances were interpolated from genome-wide recombination rates estimated as part of the HapMap project (27). For each population and for every pair of SNPs within each distance category, we calculated the squared correlation (rLD2) in allele frequencies (25) by randomly selecting 10 individuals from each population and adjusting the measurement for each pair of SNPs by sample size (to account for missing data) (25); in total, around 150 million pairwise LD observations were made. The results (Fig. 2) show that LD increases with increasing geographic distance from Africa, as found previously (25, 26). The most extreme LD values over the shortest genomic distances (up to 0.075 cM) are seen in NGH, followed by MWA, whereas the most extreme LD values over the longest genomic distances (0.075–0.25 cM) are seen in the MWA, followed by the NGH. LD between SNPs separated by short genomic distances (short-range LD) is informative about older population size, relative to LD observed between SNPs separated by greater genomic distances (long-range LD) (12, 24, 28). Because decay of LD is inversely related to changes in effective population size (Ne) over time, our results suggest serial bottlenecks associated with the expansion of modern humans out of Africa, with the strongest ancient bottleneck being observed in the NGH. Furthermore, the MWA seem to have experienced a more recent bottleneck, possibly associated with the Austronesian expansion, as suggested previously from analyses of mtDNA sequences (29). In comparison with the NGH and the MWA, Australians exhibit the least extreme LD values, suggesting either a weaker bottleneck or less isolation experienced by this population. Fig. 2. LD measured for each population and each pair of SNPs within the 50 evenly spaced recombination distance categories. Shortest genetic distances between the SNPs are represented on the left and progress toward the largest genetic distances on the right. The AUA, the NGH, and the MWA have all experienced ancient admixture with the Denisova hominins (5), and although admixture in general is known to decrease genome-wide LD (30), ancient admixture has been shown to increase long-range LD (30, 31). It is possible that long-range LD values in these three populations are inflated because of this ancient gene flow from the Denisovans. However, because the population bottlenecks, associated with the expansion of modern humans out of Africa increase genome-wide LD (25, 26), including the long-range LD, and we do expect to observe a strong effect of these bottlenecks in the Australian, the NGH and the MWA populations, it is hard to distinguish here the signature of bottlenecks from the possible signal of ancient admixture. We, therefore, can only conclude that the Denisova gene flow might have contributed to the increase in the long-range LD we observe in these three populations. Importantly, however, because the Denisova gene flow occurred into the common ancestor of these three populations (5), the differences in LD we observe between them cannot be attributable to this ancient admixture. The correlation in LD patterns between populations can be used to estimate their time of divergence (25). The rationale behind this calculation is that immediately after two populations diverge, genome-wide LD in the two daughter populations should be perfectly correlated, but the correlation will decay exponentially over time, with the rate of decay dependent only on the recombination distance between the markers, not on Ne (25). The correlation in LD between populations is independent of Ne, because recombination events essentially behave like new neutral mutations: there will be more of them in a big population but fewer of them will fix via drift than in a small population, and as these two processes cancel each other out exactly, the rate of LD decay is not influenced by the population size. We computed the correlation between the LD values for each pair of populations and for each recombination distance category and estimated the time of divergence from the rate of decay of the correlation in LD values with recombination distance (25). To be able to compare our results to previous studies (25), and to exclude the effect of potential later admixture (32), for this analysis, we used only the first 20 recombination distance categories, i.e., only SNP pairs located at distances of up to 0.1 cM from each other. We estimate the average time of divergence for the main continental groups as follows: European (CEU) and Asian (CHB) populations and populations of greater Australia (AUA and NGH) have diverged from the African populations (YRI) 66 kya, and the split between CEU and CHB is estimated to have occurred 43 kya. These dates are in good agreement with previous studies, based on different types of data and using different methods (6, 22, 25). The divergence times among the AUA, NGH, and MWA (the putative descendants of the early southern route migration) were 36 kya, roughly in concordance with the date of divergence estimated based on the distribution of the bacterium Helicobacter pylori (33) but too recent given the purported date of the dispersal into Sahul at 45 kya (8⇓–10). Despite LD being a measure expected to be relatively unaffected by ascertainment bias (25, 26), this may reflect some effect of this bias on the estimation; because a smaller number of SNPs included into the genotyping platform is expected to be polymorphic in these populations relative to the populations in which these SNPs were discovered, a smaller number of pairwise LD observations could be made. This will make the observed correlation in LD measurements between any two populations appear higher, reducing the rate of decay of the correlation in LD values and resulting in the time of divergence being underestimated. A previous study that estimated the time of divergence for the YRI, CEU, and CHB populations, using the same LD measure but half the number of markers, also reported low divergence dates (25). Because the Denisova gene flow occurred into the common ancestor of the AUS, NGH, and MWA (5), which is before the divergence time, it should not have any effect on the time of divergence estimation. In sum, these results confirm a common origin but an ancient split (at least 36 kya) for the Mamanwa, Australians, and NGH, supporting the view that these populations represent the descendants of an early southern route migration out of Africa and that Australians and New Guineans diverged early in the history of Sahul, when they were still one landmass, and not when the lands were separated by rising sea waters around 8,000 y ago.

Admixture with India. The PCA results clearly indicate some signal of admixture in the Australians (Fig. S1A). This could be attributable to recent European admixture, as reported previously (12, 34). To investigate this signal of admixture, we first carried out a PCA of AUA, NGH, Europe, and India (Fig. 3A). PC1 separates AUA and NGH from the other groups, whereas PC2 separates the Andamanese Onge at one end and CEU at the other, with mainland Indian populations spread roughly along a north-to-south cline, as observed previously (17). Apart from two outliers, the Australians are distributed toward the middle of the Indian cline and not toward Europe. Thus, these results do not indicate that Europeans are the source population for the signal of admixture and suggest, instead, that the signal comes from the Indian subcontinent. Fig. 3. Results of the PCA and ADMIXTURE analyses. (A) PCA of AUA, NGH, CEU, and 26 Indian populations. PC1 is driven by differences between the populations of Sahul and Eurasia. PC2 reflects a north-to-south gradient of European ancestry observed in Indian groups, with the southernmost group being the Onge, a Negrito population from the Andaman islands. (B) Population structure estimated using ADMIXTURE for K = 4. Each vertical bar represents an individual and each color describes the proportion of each individual’s genome that comes from one of the four hypothetical ancestral populations (K). The asterisk indicates the two individuals from the Srivastava group. It has been shown previously that uneven sampling has a strong influence on the results of PCA (35). Although the sample sizes of populations used in this analysis were unequal, by making them equal, we would introduce a bias in that the analysis will cease to be blind to population labels (i.e., we have to know how to group individuals into populations to make population sizes equal). Therefore, to test the robustness of these results to uneven sampling, we repeated the analysis 10 times, each time randomly sampling 70% of the samples. Although slight differences in the results were present, the overall results and conclusions remain unchanged (Fig. S2). To further investigate this result, we then analyzed genetic ancestry using the maximum-likelihood–based clustering algorithm ADMIXTURE (36). Briefly, this method considers each person’s genome as having originated from a specified number (K) of hypothetical ancestral populations and then describes the proportion of each individual’s genome that comes from each of these ancestral populations. To avoid potential problems caused by existing LD between markers, we first used the PLINK tool to thin the dataset by excluding from the analysis SNPs in strong LD. Our initial experiments showed that LD pruning, based either on correlations between SNPs or on correlations between linear combinations of SNPs, did not have any noticeable effect on the results of the ADMIXTURE analysis. Nevertheless, to save computational time, we used the pruned dataset, comprising nearly 170,000 markers, for all of the subsequent runs of ADMIXTURE. We tested K = 2 through K = 10 and performed 10 independent runs for each value of K. We monitored consistency between the runs and used ADMIXTUREs cross-validation procedure to establish the value of K that fits the data best (Fig. S3). Although the lowest cross-validation error is exhibited by K = 3 (Fig. S3), the Indian component we are interested in is identified only at K = 4; because the difference between the CV error for K = 3 and K = 4 is quite small, at least four times smaller than the difference between K = 3 and any other value of K (Fig. S3), we report here the results for K = 4. At K = 4 (Fig. 3B), Australians are assigned a component that is present at high frequency in mainland India and is shared exclusively between Australia and India [with the exception of one NGH individual, who is an outlier relative to the other NGH samples and, according to PCA results (Fig. S1A), is closer to AUA than to other individuals in the NGH population]. Moreover, this component is observed in similar proportions in all of the Australians, suggesting that it is uniform throughout the genome. By contrast, the “European” ancestry component is present in only a few Australians and in varying amounts, as expected, for very recent admixture such as observed in African Americans (20, 37). These AUA individuals showing evidence of recent European ancestry were excluded from further analyses. Thus, the Indian admixture signal revealed in AUA by this analysis does not exhibit the same characteristics as recent European admixture. Identical results were obtained using another maximum-likelihood–based software frappe (38) with the full set of 460,000 markers (Fig. S4). Next, to be more confident that the Indian component we observe in AUA is indeed Indian and does not reflect some unsampled ancestry, we repeated the ADMIXTURE analysis with individuals of African, European, Asian, and SE Asian ancestry, including the Mamanwa. For the Indian group, we used genotypes from the Chenchu and Kurumba (tribal Dravidian-speaking populations) and from the nontribal Dravidian speakers from south India, because these groups are closest to the axis of admixture in the PCA and have the highest frequencies of the shared Australia–India ancestry component in the previous ADMIXTURE analysis. After the dataset was thinned for SNPs in LD, we had 187,470 SNPs remaining for this analysis. This time, the lowest cross-validation error is exhibited by K = 5, whereas the Indian component is identified at K = 7 (Fig. S5). At K = 5, the proportion of Australian ancestry not shared with the New Guineans most closely resembles the ancestry profile of the three Indian populations at this value of K. Additionally, at K = 7, six runs with the highest log-likelihood scores ascribe 11% of Australian ancestry to India, whereas an additional 9% is shared with the Mamanwa (Fig. S5). To further verify the signal of Indian admixture, we used TreeMix (39) to find a population graph that best describes the relationship between populations in the dataset by testing for gene flow between them. This method uses the genome-wide allele frequency data to first find the maximum-likelihood tree of populations and then infer migration events by identifying populations that poorly fit this tree. Because it has been shown previously that migrations inferred for Oceanian populations differ depending on whether the SNPs involved in the analysis were ascertained in a Yoruban or a French individual (39), we have excluded YRI and CEU individuals from this analysis. (For the results of the analysis that included these populations, see Fig. S6.] After removal of SNPs in LD, the resulting dataset comprised 150,000 markers. We first inferred the maximum likelihood tree of the nine populations included in the analysis (Fig. 4A) and then analyzed the residuals (Fig. 4B) to identify pairs of populations that are more related to each other than is captured by this tree. We then sequentially added migration events to the tree, until we found a graph with the smallest residuals (Fig. 4 C and D). The graph that best fits the data has four inferred migration edges: Chenchu to CHB (weight, 4%), Onge to India (17) (weight, 6%); one of the edges captures shared ancestry between NGH, AUA, and MWA (5, 23) (weight, 15%); and one of the edges provides evidence for the gene flow from India to Australia. The weight for this migration edge is estimated to be 11%, in agreement with the admixture proportion obtained in the ADMIXTURE analysis. The P value (which here describes how much a particular inferred migration improves the fit to the data) for all migrations is estimated to be at least 1 × 10−5. Fig. 4. Results of the TreeMix analysis. (A) The maximum-likelihood tree of nine populations included in the analysis. (B) Residual fit from the tree. Residuals above zero indicate pairs of populations that are candidates for admixture events. (C) Population graph that best fits the data, based on the smallest residuals (D). Finally, to further test the robustness of this inference, we used the 4 Population Test statistic f4 (17). The four populations considered in this analysis were AUA, NGH, India, and YRI. For the Indian group, we again used genotypes from the Chenchu and Kurumba, as well as from the nontribal Dravidian speakers from south India. YRI were chosen as an outgroup that is equally distant from the other three groups, and the allele frequencies in the YRI were used for normalization, where we weighted each SNP by a quantity proportional to its expected genetic drift in the ancestral group (YRI) (17). We calculated allele frequency differences at each SNP between all pairs of populations, restricting the analysis to SNPs that were polymorphic in all of the groups to minimize any effect of the ascertainment bias. This reduced the dataset to 250,000 SNPs. The expectation is that if there was no gene flow from India into Australia, then the allele frequency differences observed between YRI and India should be uncorrelated with the allele frequency differences observed between AUAs and NGH. However, if this correlation deviates from zero, then this suggests that there was gene flow from India into either AUA, NGH, or both. A Weighted Block Jackknife approach, where the genome was divided into nonoverlapping 5-cM blocks and each block was dropped sequentially (17, 40), was used to correct for nonindependence of SNPs and to assess statistical significance via a Z score (17). In our analysis, the f4 statistic has a Z = −1.93 (P = 0.026), allowing us to reject the simple tree (YRI(India(AUA,NGH))) and suggesting, instead, that the data are best described by a mixture of two trees: (YRI(India(AUA,NGH))) and (YRI(NGH(AUA, India))). The fact that the Z score has a negative sign is important here, because it indicates gene flow between India and AUA (or NGH and Yoruba) and not between India and NGH. We repeated this analysis, substituting an Asian population (CHB) and a Negrito population of the Andaman Islands (Onge) for India; for both analyses, the resulting f4 statistic had much higher P values (Z = −0.11, P value = 0.45; and Z = −0.28, P value = 0.38, respectively). Thus, the f4 statistics indicate a signal of gene flow from India to Australia and, furthermore, that the source population is more closely related to present-day Dravidian-speaking Indian groups than to Onge. In sum, four analyses (PCA, ADMIXTURE, TreeMix, and f4 statistics) all indicate gene flow from India to Australia. Although previous analyses based on a limited number of markers (41) or uniparental data (13, 14) also suggested genetic relationships between Australia and India, neither a previous study of genome-wide SNP data from Australians (12) nor the analysis of a genome sequence of an aboriginal Australian (7) reported any such gene flow. However, the genome-wide SNP study (12) did not include any populations from India, and although the analysis of the Australian genome sequence did find indications of genetic relationships with groups from India, they concluded that this represented some genetic ancestry in the Australian genome sequence that could not be assigned to any existing population (7). Based on the results above, it is likely that the signal of Indian genetic ancestry in the Australian genome sequence does, in fact, reflect the same gene flow from India that we detect in our analyses.