The question of national and individual origins has a corporeal and concrete dimension, and a mythic and symbolic one. This is evident in the religious traditions which most of the world’s populations adhere to. Israel is both literally and figuratively a descent group. They issue from the tribes descended from the sons of Jacob. Those who convert into the Jewish religion customarily also convert into the Jewish nation, and so figuratively share the same descent. Similarly, among Muslims there is a particular prestige given over to the descendants of Muhammad, the Sayyids. Within Hinduism the importance of descent groups manifests generally in terms of the endogamy prevalent among South Asians, and also in specific cases, such as with gotras. The fundamental atomic basis of Confucian religious morality is arguably filial piety. Confucius’ descendants still play a prominent role in modern China promoting his ideas.

But descent also has a scientific and concrete aspect. Sometimes the mythic and scientific align. It does seem that the notional male line descendants of Genghis Khan are actually descended from one individual who flourished ~1,000 years ago. In other instances the connection is complex. Jews do seem to share common descent, but it is also evident that they have mixed greatly amongst the nations. And sometimes the inferences generated by science may warrant a reconsideration of treasured myths. Most reasonable people will probably accede to the clear overwhelming descent of South Asian Muslims from the native people of the Indian subcontinent, but the genetics clinches that. True, there is quite often a clear trace of Middle Eastern and African ancestry among the Muslims of South Asia above and beyond what may be found amongst non-Muslims, but often this component is dwarfed by a minor East Asian element which seems to warrant no cultural memory! In this post I will not address specific cases as much as a general framework. I have been talking about genetics, and to a lesser extent South Asian genetics, since 2004 on this weblog. But we know so much more now than we did then. I thought it was time for me to sit down and actually condense the current state of knowledge as best as I can. I will not address the biomedical dimension of human population genetics in this post, only the historical ones.

First, a few notes. I understand that this is a controversial and fraught topic. One major issue I have when I bring up this area of knowledge in a South Asian forum is that people accuse me of promoting models which I barely understand. What I mean is that often I have to go and look things up to figure out what people are actually accusing me of implying. I didn’t grow up in South Asia, so I don’t know the political-cultural battles too well. Please be explicit and clear in your comments, and don’t assume I can connect the dots! Also, I’m going to apologize to some of you ahead of time for deleting your comments. I am going to track this thread and actually answer questions from interested parties, which means that I will need to shave off the noise. I won’t apologize to the people whose comments I delete because they address my comment moderation policy. Finally, I am going to use the word “Indian” from this point onward where in other cases I’d use “South Asian.” On the historical time scales that I’ll be addressing our ancestors were considered Indians (“Hindus”) by the rest of the world, and this seems a time where this clarity of terminology should trump contemporary geopolitical valences.

Why does any of this history matter? I have a hard time addressing this insofar as I have weak conditional effects based on my ancestry. By this, I mean that the details of my ancestry don’t matter much to me, except as a source of amusement or interest. I hope you don’t view me any differently if you find out that I seem to have a close genetic relationship to South Indian Dalits! (I do, probably far closer than you) You can also download a raw text file of my 23andMe v3 genotype if you want to poke around (I’ve made it public domain). But this sort of information matters for other people a great deal. I am, for example, kind of tired of listening to brown people talk about their non-Indian ancestry, whether it be Syrian Christians who claim Jewish antecedents, Jatts who claim Scythian antecedents, or Muslims who claim Arab, Turk, or Persian origins. From what I can tell reviewing the genetic data there is a grain of truth to many of these claims, but most brown people have ancestry that is overwhelmingly…brown. That’s pretty evident on our faces.

Second, I do know that finding ancestry from various groups can change how people view themselves. To give a personal example I have a friend who is a white American whose maternal grandparents were very racist against black people. After a detailed inspection of his genome it’s pretty clear that he’s ~5% African in ancestry. Some of his paternal relatives have been genotyped. This black ancestry doesn’t show up on that branch of his family tree, so by elimination it seems likely that it was his anti-black side which had black ancestry (my friend told me that as a child he thought his maternal grandfather did look a touch black, an observation triggered by their vocal racism). A story is here, which he is only beginning to explore. There is something similar in my own family. My maternal grandmother comes from a family with some distant Middle Eastern ancestry. This is obviously a point of pride. But a closer look at my mother’s genome makes two things clear: first, she does have a very small proportion of Middle Eastern ancestry. This could be noise, but it seems associated with a smaller African component, which is not uncommon among people of Muslim origin in the Indian subcontinent from what I have seen. Second, a much larger fraction of my mother’s genome exhibits clear derivation from Southeast Asia, perhaps from an Austro-Asiatic or Tibeto-Burman group. But there is no mention of this in my family’s oral history.

But enough! Brass tacks, who are we as brown folk? The map at the top of this post gets at a big part of the answer. It was generated by the blogger behind The Jatt Gene using results from the Harappa Ancestry Project. It shows the rough distribution of a genetic element associated with the peoples of the Andaman Islands, and found from Pakistan to Vietnam to Indonesia. What does it mean? The Harappa Ancestry Project has thousands of individuals from hundreds of populations, and hundreds of thousands of genetic markers per person. This data set was then run through the program ADMIXTURE, which breaks apart the ancestry of individuals contingent upon the variation you throw into the program and the number of ancestral populations you want it to generate, the latter defined by the parameter “K.” This is just software, a dumb algorithm, so it needs to be used with care. But to give a concrete example, consider that you have three populations in your data set:

White Americans

Black Americans

Nigerians

You tell ADMIXTURE to break apart the genomes of the individuals in your data set into at most two components. Two clusters if you will. The result in this case is going to be straightforward:

The White Americans will be in one cluster

The Nigerians in the other

The Black Americans will be a mix, with average admixture fractions of about 80% from the Nigerian cluster and 20% from the White American cluster

The program is easy to interpret in this case, as we have a history, as well as other lines of evidence, to interpret these results. One component is clearly African ancestry, and the other is European. African Americans are on average 80% African and 20% European. So ADMIXTURE nicely popped out with that result.
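To make the logic concrete, here is a minimal Python sketch of the two-cluster case. It is not ADMIXTURE's actual algorithm: the real program has to learn the cluster allele frequencies and the individual fractions together, whereas this toy assumes the two source frequencies are already known and only estimates one individual's fraction by a grid search over the likelihood. All of the data are simulated, and the function names, the marker count, and the 80% figure fed into the simulation are hypothetical choices for illustration.

```python
import math
import random

def simulate_genotype(freqs, rng):
    """Draw a diploid genotype (0, 1 or 2 copies of an allele) at each marker."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]

def log_likelihood(q, genotype, freqs_a, freqs_b):
    """Log-likelihood of a genotype if a fraction q of ancestry is from source A."""
    total = 0.0
    for g, pa, pb in zip(genotype, freqs_a, freqs_b):
        p = q * pa + (1 - q) * pb  # expected allele frequency in the mixture
        total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total

def estimate_fraction(genotype, freqs_a, freqs_b, steps=100):
    """Grid search for the q in [0, 1] that maximises the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotype, freqs_a, freqs_b))

rng = random.Random(42)
N_MARKERS = 2000      # hypothetical; real data sets use hundreds of thousands
TRUE_FRACTION = 0.8   # share of source A ancestry used in the simulation

# Made-up allele frequencies for the two source clusters, kept away from
# 0 and 1 so that the logarithms above are always defined.
freqs_a = [rng.uniform(0.05, 0.95) for _ in range(N_MARKERS)]
freqs_b = [rng.uniform(0.05, 0.95) for _ in range(N_MARKERS)]

# An admixed individual draws each allele from the blended frequencies.
mixed = [TRUE_FRACTION * pa + (1 - TRUE_FRACTION) * pb
         for pa, pb in zip(freqs_a, freqs_b)]
individual = simulate_genotype(mixed, rng)

estimate = estimate_fraction(individual, freqs_a, freqs_b)
print("estimated fraction from source A:", estimate)
```

With enough markers the estimate lands close to the fraction used in the simulation, which is the same reason the real program recovers the African American admixture so cleanly when the two source populations are well represented in the data.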

What does ADMIXTURE tell us about South Asians? First, it depends on what reference populations you use and how many clusters you want it to generate. I’ve addressed this detail before. But the Harappa Ancestry Project has lots of Indian populations. What you immediately see is that at higher K values a “South Asian” cluster breaks out. This cluster has the highest frequencies in southern and eastern India. It drops off as one moves west to Iran and east to Southeast Asia. Case closed?

Not quite. ADMIXTURE is a computer program. It can give strange results. It does not tell us reality, it tells us the result of an algorithm. The “South Asian” cluster exhibits some peculiarities in terms of how it relates to other groups which cannot be easily explained by history. I won’t get into the details of that, but move to the main issue: deeper analytic techniques, as well as moving up K’s, allow the “South Asian” cluster to fractionate into two dominant components. The major insight was unveiled nearly two years ago in a paper published in Nature, Reconstructing Indian population history:

India has been underrepresented in genome-wide surveys of human variation. We analyse 25 diverse groups in India to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the ‘Ancestral North Indians’ (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, whereas the other, the ‘Ancestral South Indians’ (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71% in most Indian groups, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the indigenous Andaman Islanders are unique in being ASI-related groups without ANI ancestry. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years owing to endogamy. We therefore predict that there will be an excess of recessive diseases in India, which should be possible to screen and map genetically.

(ungated copy of the paper)

Using public data sets multiple bloggers have replicated the general shape of these results. The Harappa Ancestry Project has several populations from the Andaman Islands, and at K = 11 a component which is fixed in the Onge tribe correlates almost perfectly with the ANI/ASI ratios from the above paper.

Here’s the short of it: Indians are hybrids between two ancient and very distinctive groups. If you want to know more details, I posted about it on my science blog. The top line is that the ANI is very much like Middle Eastern and European populations. In fact, ANI seems no closer to the ASI than these two other groups. Who were the ASI? The Andaman Islanders are their distant cousins, separated for tens of thousands of years. But the most current genomics shows a clear submerged substrate from the Indian subcontinent into Southeast Asia. Coincidentally Southeast Asia has been strongly influenced by Indian culture. The ASI were closer to the populations of East Asia than to those of West Eurasia. Probably in part because East Asian populations are daughter groups from the modern humans who entered the Indian subcontinent from Africa tens of thousands of years ago. But the ASI are also quite distinct from East Asians. In some ways they represent a southern Eurasian population which seems to have been submerged within the last 10,000 years.

You can see shadows of their influence in this three dimensional visualization of genetic variation. Each point below is an individual projected onto a three dimensional space which is generated by the three largest components of variance within the data. The geographical clustering is pretty straightforward, but notice the “kink” in the South and Southeast Asians. That’s ASI’s shadow:

I just threw a lot out there for you to process. These results are pretty robust though. They’re based on hundreds of thousands of markers and there’s good population coverage. But their interpretation is more problematic. That’s because we don’t have records from prehistory. We are literally grappling with shadows. So let me address a few possibilities, and give my own take. All of these assertions are far less robust than what has come before because they are synthetic. They go beyond genomics, though they operate within the constraints that the new genomics imposes upon us.

Who were the ANI? I think they derive from a set of farming populations from between the Black Sea and the Caspian. The reason I think this is that there are suggestive associations between Indian groups and populations around the Caucasus, stronger even than those with Iranians! This sort of “geographic leapfrog” requires a macrohistorical explanation.

Were the ANI Aryans? I don’t think so. The admixture event with ASI is very old. Likely within the last 10,000 years, but probably older than 4,000 years (I know this from personal communication with one of the researchers who attempted linkage disequilibrium decay based time-from-admixture tests). Some of the Caucasian groups which have an affinity with Indians are not Indo-European speaking.
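For readers curious how a linkage disequilibrium decay test can date an admixture event, the general idea is that the LD created by mixing decays roughly exponentially with the genetic distance between markers, at a rate set by the number of generations since the mixing. The Python sketch below fits that rate on a synthetic, noise-free curve. It is only an illustration of the principle and not the researchers' method; the 150 generations, the curve amplitude, and the 29 years per generation are all made-up assumptions, not results.

```python
import math

def fit_generations(distances_morgans, ld_values):
    """Least-squares fit of log(LD) = log(A) - t * d; returns t in generations."""
    xs = distances_morgans
    ys = [math.log(v) for v in ld_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the slope of the log curve is minus the generation count

TRUE_GENERATIONS = 150     # hypothetical value used to build the synthetic curve
YEARS_PER_GENERATION = 29  # assumed generation time, not taken from the post

# Marker separations from 0.1 to 2 centimorgans, expressed in Morgans.
distances = [0.001 * i for i in range(1, 21)]
# Synthetic admixture LD curve: amplitude * exp(-generations * distance).
ld_curve = [0.05 * math.exp(-TRUE_GENERATIONS * d) for d in distances]

generations = fit_generations(distances, ld_curve)
print("fitted generations:", round(generations))
print("approximate years:", round(generations * YEARS_PER_GENERATION))
```

Real data are noisy and the curve flattens toward the background LD of the source populations, which is part of why very old events like the one discussed here are hard to pin down precisely.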

So why did ANI arrive in India? I think it has to do with farming. Recent evidence is now pointing to massive reconfigurations of genetic variation across the world in the past 10,000 years. We have semi-historical evidence for nearly total replacement in Japan and Africa. But there is now a great deal of circumstantial evidence that the same occurred in Europe, at least once, and probably more than once. The ANI were one of the great farming Diasporas to pulse out of the Near East.

But why didn’t they replace ASI? I am not an archaeologist, so I am on weak ground here insofar as I’m relying heavily on others who know this stuff. But I suspect that the indigenous populations of the Indian subcontinent themselves had started an independent transition to farming. The ANI-ASI synthesis, both genetic and cultural, was that of two incipient farming toolkits. In contrast the relatives of the ASI in Southeast Asia did not enter into an independent phase of farming, and were marginalized to a far greater extent by populations from southern China (the exceptions being the Papuans). The Andaman Islanders then are exceptions, and not representative in their hunter-gatherer lifestyle.

What about the Aryans? The data from Europe is far thicker than from the Indian subcontinent, and in Europe there is evidence for multiple movements and cultural influences. I believe that the Indo-Aryans arrived later, and are a minor overlay upon the ANI-ASI synthesis (South Indian tribals have 30-40% ANI, indicating how old and thoroughgoing the synthesis was). Some speculative suggestions can be made from the genetic data in regards to a post-ANI West Eurasian influence which does not seem Middle Eastern. I will leave that for now because we just don’t have much to go on, though I do suggest that one keep track of The Jatt Gene. I think the answers we’ve long been waiting for will be coming soon, especially with the imminent release of Indian populations from the 1000 Genomes.

The northwest-southeast axis is the dominant genetic story of India, but not the only one. There is a northeast-southwest axis. It seems probable that the Munda are relative newcomers as well. Though mostly Indian, there is an element of ancestry in these populations which suggests relatively recent affinities with East Asians. This is probably at least part of my personal story, so I take an interest in this “third wheel” component of our heritage.

South Indian Brahmins claim northern Indo-Aryan origins. The genetics certainly bear this out, albeit with some probable admixture with the local substrate. There are many specific questions which can be asked and answered. The Cochin and Bene Israel Jews of the west coast of India clearly do have highly elevated Middle Eastern components of ancestry, though they are highly admixed with the native populations. My own question: do the Nasrani Christians truly descend from Jews? I would have dismissed this outright a few months ago, but I am not so sure now. The western coast of India seems to have long-standing connections to southern Arabia, so we need to flesh out these patterns in more detail.

What’s the biggest surprise from these results? For me I think it is the deep and incredibly thorough biological synthesis which characterizes the Indian subcontinent. We all know that there is a big difference between a Kashmiri Pandit and an Adivasi from South India. But about one third of the Pandit’s ancestry is “Ancestral South Indian,” which is almost absent outside of the subcontinent. And about one third of the Adivasi’s ancestry is “Ancestral North Indian,” which connects this individual with the populations which span from the Atlantic, to the Urals, to the Sahara. The past is a strange and mysterious land. But the veil of ignorance is slowly lifting….

Note: Some might wonder why I didn’t address uniparental lineages. The post is long, that’s why. The short of it is that ASI seems to have a much stronger impact on maternal lineages, while ANI is more dominant in paternal ones. Additionally, among the Munda the East Asian element is far more frequent on the paternal lineages than the maternal ones. This indicates a consistent trend of deep time events of sex-biased migration.