One of the many frustrations I have faced when trying to understand South Asia is the near-total lack of recent data on which languages are spoken and where. The lack of interest in South Asian languages is stunning, especially given that South Asia is home to some of the most spoken languages in the world. The language everyone has heard of is Hindi/Urdu (essentially one language with two scripts), which is spoken by over 300 million people, even if the closely related Rajasthani and Bihari languages are excluded. In the West, though, awareness of the other South Asian languages is low. Just to give an idea of how large many of these languages are, here are some comparisons: as many people speak Punjabi as Japanese; roughly as many people speak Bengali as German, French, and Italian combined; as many people speak Oriya as Ukrainian; Pashto has as many speakers as Polish; Marathi, Telugu, and Tamil each have more than three times as many native speakers as Dutch. I have searched for up-to-date statistics on language in India, but haven’t been able to find anything more recent than the 1931 Census. I was able, however, to track down tehsil-level data for Pakistan from the 1998 Census. In Pakistan, tehsils are the third level of administrative divisions, after provinces and districts. The data set I found isn’t perfect (more on that later), but it has most of what I was looking for. The data can be downloaded here, and the site also has a link to a pretty cool interactive map.

Before I post the map, I’m going to give a quick rundown on language in Pakistan. English and Urdu are the national languages, and are widely understood, at least by the educated. English, obviously, is nobody’s first language in Pakistan, and Urdu is the first language of about 7% of the population, mostly descendants of immigrants from north India who arrived in 1947. The most widely spoken tongue by far is Punjabi, which is the first language of slightly less than half the population. When Saraiki and Hindko, two Punjabi dialects that are sometimes classified as separate languages, are included, well over half of Pakistanis speak Punjabi or a closely related language. As anyone who read my post on the partition of Punjab will know, a large population of Punjabis (about 35 million) live across the border in India. The second most widely spoken language is Pashto, which, unlike Punjabi (an Indo-Aryan language related to Hindi), is an Iranian language. This makes it a relative of Farsi and Kurdish, although Pashto’s closest relatives are a cluster of minor languages known as the Pamir languages, which are spoken on the mountainous border between eastern Tajikistan and northeastern Afghanistan. Pashto, like Punjabi, is split between two countries. It is the dominant language in southern Afghanistan, but the majority of Pashtuns live in Pakistan. About 15% of Pakistanis speak Pashto as a first language. Right behind Pashto at 14% is Sindhi, which is a relative of Punjabi. There are a few million Sindhi speakers in India as well, some right on the opposite side of the border, and some Hindus who fled Sindh after Partition. The other major regional language is Balochi, spoken by about 4%. Balochi, like Pashto, is an Iranian language, though the two are not particularly closely related. Balochi is actually closer to Kurdish, leading to the theory that the Baloch may have migrated to their current location fairly recently from the Middle East. Balochi is also spoken in southern Afghanistan and eastern Iran. There are some other minor languages, which I’ll discuss later, but those are the major languages. Below is the Pakistan language map.

Note that I colored the Saraiki- and Hindko-speaking areas shades of blue because it remains undetermined whether they are separate languages or dialects of Punjabi. Since I don’t speak any of these languages, I can’t make a determination for myself, so I split the difference by making them different shades of the same dark blue. I should also mention two other problem areas. One is the central Balochistan area, which is traditionally considered the Brahui zone. Brahui is a fascinating language. It is Dravidian, which means that it is related to the major South Indian languages, such as Tamil and Telugu, but it is spoken far away from the other Dravidian languages. Brahui barely registers in these data. There are several possible explanations. One is that Brahui has lost ground to Balochi. Another is that the Brahui learn both Balochi and Brahui and are equally comfortable in both, leading most to identify the dominant Balochi language as their native tongue. According to some sources, the Brahui have a complicated system of code-switching in which people use Brahui in some situations and Balochi in others. Apparently, even within families, there are times when Balochi is used (elder son addressing father) and other times when Brahui is used (younger son addressing father). The father speaks to the children in the language of the mother, and wives address their husbands in Balochi. This all seems crazy, but if true it could explain why many Brahui would feel comfortable calling Balochi their native language. In any case, it seems that almost all Brahui are fluent in Balochi. Just as a side note, Ethnologue (and Wikipedia) say Brahui is spoken by four million people. This is a ludicrous number, implying that Balochistan, which has 7 million people, is majority Brahui-speaking.

The other problem area was the far north, including northern Khyber-Pakhtunkhwa, Gilgit-Baltistan, and Azad Kashmir. I couldn’t find Census data on Gilgit-Baltistan and Azad Kashmir (which combined make up Pakistan’s part of Jammu and Kashmir). As a result I had to look around the internet for information on these areas. In Azad Kashmir, I had to distinguish between Hindko and Pothohari, another Punjabi dialect being pushed as a separate language (included with Punjabi in this map). It’s a bit difficult to figure out where one begins and the other ends, but it seems that Hindko is spoken in Muzaffarabad, and south of that it is Pothohari. In Gilgit-Baltistan, I was able to use this survey from the early 1990s, which goes into some detail about the northern languages. The other problem is that the languages of northern K-P (Hindko, Khowar, and Kohistani) are all grouped under “other” in my data set. Luckily the geographic ranges of these languages are fairly well known and distinct, so it was easy to figure out which “other”-speaking areas belonged to which language.

I have already mentioned Hindko, but I’ll quickly go through the other six languages that show up on the map in the north. Three of the languages, the aforementioned Khowar and Kohistani, as well as Shina, are Dardic languages, the most northwestern branch of the Indo-Aryan language family. The Dardic languages form an arc in the far north of South Asia. To the southeast in Indian Kashmir, Kashmiri is the most spoken Dardic language. On the other side, in Afghanistan, Pashayi is spoken by perhaps half a million people south of Nuristan province. The other languages in the north are Burushaski (in brown), a language isolate with no known relatives; Wakhi (light purple), which is related to Pashto; and Balti (orange), which is related to Tibetan and is also spoken in Indian Kashmir, though the dialect there is called Ladakhi. The Baltis are almost exclusively Shia; the Ladakhis are split between Shia and Buddhist.

Hopefully this map underscores how linguistically diverse Pakistan is, and possibly explains why the country is so fragmented. Two other features are worth noting. The first is the huge swath of northern Balochistan that is Pashto-speaking. The 1998 statistics put Pashto speakers at around 30% of Balochistan’s population, but with high birth rates and a surge of refugees from Afghanistan in the last decade, the Pashtun and Baloch populations in the province may be approaching parity. The second is the tiny presence of Urdu, the national language. While most educated people in Pakistan can speak Urdu, and almost everyone has at least a rudimentary knowledge of it, very few people speak it as a first language. Only the Sindhi cities of Hyderabad and Karachi are majority Urdu-speaking. Hyderabad and Karachi were among the only significant Hindu-majority areas of British India that went with Pakistan, and it is possible that the Urdu speakers leaving India went there simply due to the availability of real estate once the Hindus left. Punjab would have been a more logical destination given Lahore’s traditional position as the most important city in northwest India, but Punjab was already overrun with Muslim refugees from India. Sindh wasn’t partitioned, which means it had to absorb fewer refugees. That might explain why the powerful Urdu-speaking community chose the cities of this arid backwater province as their new home.

This map also highlights two large movements for new provinces. The southern Saraiki-speaking Punjab has long had advocates for severing it from the north and creating a separate province centered on Multan. It is unclear how popular this demand is with the average citizen, but the movement has been active since the 1960s and shows no sign of going away. The other potential province would be in the non-Pashto speaking north of K-P. This province would be called Hazara and would be majority Hindkowan (the ethnic group that speaks Hindko).

The final interesting aspect of Pakistan’s linguistic mix is that the border between the Indo-Aryan languages of north India and the Iranian languages runs right through it. This fact, plus the detailed data set I found, gives us the unusual opportunity to investigate the boundary between two major language groups. The Indo-Iranian languages form the largest branch of the Indo-European language family. They are typically split into the Iranian branch (Pashto, Farsi, Kurdish and others) and the Indo-Aryan branch (Hindi, Punjabi, Bengali, Marathi and many others). The Iranian and Indo-Aryan languages diverged about 4000 years ago. While South Asia and Iran share many cultural similarities, they are markedly different civilizations. Most of the Iranian peoples share a basic history and culture, as do most Indo-Aryans. Below is the map of the border between the Iranian languages and the Indo-Aryan ones.

To me, there are two notable features of this map. The first is the intrusion of Indo-Aryans into central Balochistan. These people are a mix of Sindhi, Saraiki, and Punjabi, which explains why they didn’t register on the first map, since Balochi speakers remain the plurality. Added up though, several tehsils have an Indo-Aryan majority. That corridor between northern Sindh and Quetta is pretty important, because it connects Quetta, and ultimately Kandahar, to the Pakistani heartland. It is also a major gas producing area for Pakistan. I wonder if the non-Baloch people there are workers who are employed in the gas fields and related industries. That area is also a hotspot for militancy. Perhaps Baloch militants strike there to get at the “foreign occupiers” who are stealing Balochistan’s resources (a common complaint of Balochistan’s active separatist movement).
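To make the "added up" point concrete, here is a minimal sketch of how a tehsil can show Balochi as its plurality language on the first map while still having an Indo-Aryan majority on the second. The language-to-branch mapping, the function names, and the sample percentages are all made up for illustration; they are not taken from the census file, whose actual column names and values I am not reproducing here.

```python
# Hypothetical mapping from census language categories to language branch.
FAMILY = {
    "Punjabi": "Indo-Aryan",
    "Saraiki": "Indo-Aryan",
    "Sindhi": "Indo-Aryan",
    "Urdu": "Indo-Aryan",
    "Pashto": "Iranian",
    "Balochi": "Iranian",
}


def plurality_language(shares):
    """Return the single language with the largest share (first-map logic)."""
    return max(shares, key=shares.get)


def family_totals(shares):
    """Sum the percentage shares by branch; unmapped categories go to 'Other'."""
    totals = {}
    for language, pct in shares.items():
        family = FAMILY.get(language, "Other")
        totals[family] = totals.get(family, 0.0) + pct
    return totals


def majority_family(shares):
    """Return the branch holding more than 50%, or None if no branch does."""
    totals = family_totals(shares)
    top = max(totals, key=totals.get)
    return top if totals[top] > 50.0 else None


# Invented percentages for an imaginary tehsil, NOT real census figures.
example_tehsil = {
    "Balochi": 40.0,
    "Sindhi": 25.0,
    "Saraiki": 20.0,
    "Punjabi": 10.0,
    "Other": 5.0,
}

print(plurality_language(example_tehsil))  # largest single language
print(family_totals(example_tehsil))       # shares summed by branch
print(majority_family(example_tehsil))     # branch with a majority, if any
```

In this invented case Balochi is the largest single language, so the tehsil would be colored Balochi on the first map, but Sindhi, Saraiki, and Punjabi together exceed half, so it would fall on the Indo-Aryan side of the second.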

The second, more macro, feature is the sharp line between the Indo-Aryan languages and the Iranian ones. There are very few parts of Pakistan with mixed communities. This is not at all what I expected. Given that all of these languages, except Urdu, are poorly standardized, I expected the distinctions between them to be hazy. Instead, we see many instances where a 95% Pashto district borders a 95% Punjabi district. This is fairly similar to Western Europe, where the language boundaries tend to be sharp. One doesn’t find many mixed German and Polish towns, or French and Italian. In Europe, most languages are highly standardized and the national boundaries were made to coincide with language borders often through ethnic cleansing. Neither of these is the case in Pakistan. I expected Pakistan’s language map to look a bit more like Southeast Asia’s.

Pakistanis (and Indians) do have very strong ethnic identities. Sindhi speakers know that they are Sindhi and care about the distinction between themselves and the Baloch. The same is true of Punjabis and Pashtuns. The lack of ethno-linguistic mixing could explain why Pakistan has had such a hard time constructing a national identity. It also could be one of the reasons Pakistan has been so slow to react to the threat of radical Islamic militancy. The vast majority of terrorist attacks in Pakistan happen in Pashtun-dominated areas. Since there are few Punjabis or Sindhis living near Pashtuns, those attacks are out of sight and out of mind for the majority of Pakistanis.