As some of my readers know, I’m finishing writing a book on cosmopolitanism in a digital age. There are lots of ways to think about cosmopolitanism; in my case, I’m thinking of the ways in which people build ties of friendship and information sharing across borders of language, nation and culture. People who have a lot of these ties are cosmopolitan, by my definition, while those whose ties are more locally bound are less cosmopolitan. One of the central questions of the book is whether the rise of the internet is leading towards higher levels of cosmopolitanism. (The answer: not necessarily, and not automatically.)

All well and good, but can we quantify these ideas?

I’ve been running a few experiments, looking for ways to quantify cosmopolitan connections. Some experiments have to do with media consumption. One that I discussed in my TED talk involved examining whether people in a given country read online news from outside the country. My data came from Google Ad Planner – I looked at the 50 most popular online news sites in a country (which usually represents more than 90% of online news consumption) and hand-coded whether they were domestically or internationally produced.

The results varied widely for the set of nations Google had data for. Virtually none of the news Chinese internet users read was produced outside the country (censorship is a likely contributing factor here), while residents of the United Arab Emirates gave 78% of their attention (as measured by pageviews) to sites outside the country. Other nations ranked between those extremes.
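The pageview-weighted share of attention described above boils down to a simple ratio. Here is a minimal sketch in Python; the site names, the pageview counts and the `international_share` function are all invented for illustration and are not taken from the Ad Planner data.

```python
# Hypothetical sketch: share of news pageviews going to sites produced abroad.
# Each entry is (site, pageviews, is_domestic); all values are made up.

def international_share(sites):
    """Return the fraction of total pageviews that go to non-domestic sites."""
    total = sum(pageviews for _, pageviews, _ in sites)
    if total == 0:
        raise ValueError("no pageviews to measure")
    abroad = sum(
        pageviews for _, pageviews, is_domestic in sites if not is_domestic
    )
    return abroad / total


# Stand-in for a hand-coded list of a country's most popular news sites.
example_sites = [
    ("local-daily.example", 600, True),
    ("national-tv.example", 250, True),
    ("foreign-wire.example", 150, False),
]

print(f"{international_share(example_sites):.2f}")
```

The hand-coding step (deciding whether a site is domestic or international) is the `is_domestic` flag here; everything after that is arithmetic.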

There are a couple of possible explanatory factors for this distribution of news attention. China’s a really big country, whereas UAE’s pretty small – smaller nations may look outside their borders for news more often than large nations. (True, but there are some tiny nations with heavily domestic news consumption, like Croatia, where over 99% of news pageviews are domestic.) Nations with a large migrant population often read news from abroad, as migrant workers want to keep up with news from home. (This helps explain UAE’s apparent cosmopolitanism, as there are lots of Indian and Pakistani workers reading news from home, but doesn’t explain why Pakistan is so drawn to news from abroad, which accounts for 44% of its news pageviews in our set.)

One factor we were having trouble getting a handle on was the role of language. If you speak a language that’s spoken primarily in your country, like Finnish, you’ve got an obstacle to reading international news that an English or Spanish speaker doesn’t have. This seems to help explain some of the distribution, particularly the problem of understanding smaller nations. Smaller nations that share a language with a larger neighbor (Hong Kong or Taiwan with China, for instance) read lots of international news, while small nations with unique languages don’t read much international news (Hungary, Bulgaria, Finland.)

I came back to this problem recently and decided to figure out a way of quantifying this sort of “linguistic isolation”. I’m using the term a bit differently than the US Census does – they use it to refer to households where no one over the age of 14 speaks English, a situation that leaves them isolated from some sources of public information. I’m using it to ask a different question – how much does the dominant language of your nation affect your ability to engage with information produced in other countries? In my definition, someone in Hungary would be highly linguistically isolated, as Hungarian is spoken mostly in Hungary, while someone in Jordan would have very low linguistic isolation, as Arabic is spoken in many nations, including some nations much larger than Jordan.

Language data is a tricky thing to obtain. Most scholars rely on the World Bank and the UNDP for large, reliable data sets about economic and development issues, like national population or wealth. But neither UNDP nor the World Bank releases information on what languages are spoken where. The main source for that sort of information is Ethnologue, a remarkable resource that makes best efforts at determining where thousands of global languages are spoken. Their data is extremely rich and nuanced, but can be hard to use – it’s incomplete in some places, and so detailed in others that it can be hard to navigate.

I ended up using data from Worldmapper, a marvelous project that produces cartograms that reflect hundreds of different data sets. Cartograms distort a map to show a particular variable, expanding or shrinking a nation’s area to reflect the factor in question – a map of global population shrinks Canada and Russia, and expands India and Bangladesh, for example. The folks behind Worldmapper produce dozens of maps based on language, using data from Ethnologue and other sources… and god bless them, they’ve released their data as a single, very complicated Excel spreadsheet. Want to know what different languages are spoken in Canada and who speaks them? They’ve got that… and they’ve got the same data for almost 200 nations and almost 100 languages.

With a little bit of work, Worldmapper’s data turns into the statistics I need. Let’s consider France for a moment. Worldwide, 69 million people speak French as their first language. (Many, many more speak it as a second language, but for this metric, I’m restricting the conversation to first languages.) 48 million of the 59 million people who live in France speak French as their first language – because French is both the official language and the overwhelming majority language, I’m going to assume for a moment that it’s the only language spoken in France. If French speakers want to speak internationally using French, they’ve got an issue – roughly 70% of the people, globally, who speak French as a first language (48 million of 69 million) are in France. So we give France a 0.70 score on a linguistic isolation index. By contrast, Spain, where 27 million of 41 million citizens speak Castilian Spanish as a first language, gets a linguistic isolation index of 0.08. That’s because there are massive Spanish-speaking populations in Mexico, Colombia and Argentina. There are more people who speak Spanish as a first language in the US than there are in Spain!
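For readers who want to see the arithmetic, here is a minimal sketch of the single-language index in Python. The function name `isolation_index` is my own label for illustration; the only data used is the pair of France figures quoted above.

```python
# Sketch of the single-language isolation index: the share of a language's
# first-language speakers worldwide who live in the country in question.

def isolation_index(speakers_in_country, speakers_worldwide):
    """Return in-country speakers divided by worldwide speakers (0 to 1)."""
    if speakers_worldwide <= 0:
        raise ValueError("worldwide speaker count must be positive")
    if speakers_in_country > speakers_worldwide:
        raise ValueError("in-country speakers cannot exceed worldwide speakers")
    return speakers_in_country / speakers_worldwide


# France: 48 million first-language French speakers in the country,
# 69 million first-language French speakers worldwide.
france = isolation_index(48_000_000, 69_000_000)
print(f"{france:.2f}")
```

A score near 1 means nearly all of the language's speakers are inside the country; a score near 0 means most of them live elsewhere.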

Giving France a score of high linguistic isolation and Spain a score of low isolation is an oversimplification, of course. French is a very popular second language, and is often the language newspapers are published in, in countries where France used to have a colonial presence. This data set doesn’t help us beyond first languages, which gives it an interesting bias, away from multilingual elites and towards the broader population – Senegal shows up as a Wolof-speaking nation, not a French-speaking nation in this analysis.

For some countries, it’s clearly a mistake just to consider one primary language. Belgium has large populations of French and Dutch speakers. In both cases, there are many more speakers of that language outside the country than within it. I calculated linguistic isolation indices for both groups, weighted them proportionally and gave Belgium a 0.21 score. I conducted similar calculations for South Africa, Canada and Singapore. Singapore comes up as one of the least linguistically isolated countries worldwide, as it has large percentages of its population speaking popular global languages, including Chinese, English, Tamil, Thai and Malay.
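A multi-language score along these lines can be sketched as a weighted average of per-language indices. Two caveats: I’m assuming here that “weighted proportionally” means weighting each language by its share of the in-country speakers counted, and the speaker counts below are invented round numbers, not Belgium’s actual figures. The name `weighted_isolation` is likewise just for illustration.

```python
# Hypothetical sketch: isolation index for a country with several major
# first languages, weighting each language by its in-country speaker share.

def weighted_isolation(groups):
    """groups: list of (speakers_in_country, speakers_worldwide) per language."""
    total_in_country = sum(in_country for in_country, _ in groups)
    if total_in_country == 0:
        raise ValueError("no in-country speakers given")
    return sum(
        (in_country / total_in_country) * (in_country / worldwide)
        for in_country, worldwide in groups
    )


# Made-up example: language A has 4M speakers here out of 20M worldwide,
# language B has 6M speakers here out of 24M worldwide.
made_up_country = [(4_000_000, 20_000_000), (6_000_000, 24_000_000)]
print(f"{weighted_isolation(made_up_country):.2f}")
```

With a single language group, this reduces to the plain in-country over worldwide ratio.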

Some of the results of this method end up being deeply counterintuitive. I would assume that Mongolia is one of the more linguistically isolated countries in the world. As it turns out, there are many more Mongolian speakers in China than in Mongolia – only 37% of Mongolian speakers worldwide live in Mongolia. (The isolation index is the percentage of a language’s speakers worldwide who live in the country in question, where it is a dominant language.) In other cases, the index seems to help explain the possible isolating role for language – since 93% of the people who speak Turkish live in Turkey, we would expect Turks to be more likely than Spaniards to read domestic newspapers.

Does linguistic isolation explain consumption of international news? It seems to help – looking at data from 31 countries, there’s some correlation (R²=0.38) between linguistic isolation and low international readership. But there are exceptions – Argentina and Chile both have very low isolation scores, but they don’t read a lot of Mexican or Spanish news… or even each other’s news. South Africans show high linguistic isolation (languages like Zulu and Afrikaans aren’t widely spoken outside South Africa), but read a lot of international media in English, though it’s a minority language. I’m looking forward to examining a larger set of media consumption data and trying this linguistic isolation score alongside other factors, like total population (small nations might read larger nations’ news) and migrant population (the desire to read news from home).
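For anyone who wants to try the same comparison on their own data, here is a minimal sketch of an R² calculation in Python. With a single predictor, R² from a straight-line fit equals the squared Pearson correlation, which is what the `r_squared` function (my own name) computes. The four data points are invented placeholders, not the 31-country data set.

```python
# Hypothetical sketch: R-squared between isolation scores and the share of
# news pageviews going abroad, computed as the squared Pearson correlation.

def r_squared(xs, ys):
    """Return the squared Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov * cov / (var_x * var_y)


# Made-up points: isolation index vs. international share of news pageviews.
isolation = [0.1, 0.4, 0.7, 0.9]
international_readership = [0.50, 0.35, 0.20, 0.05]
print(f"R^2 = {r_squared(isolation, international_readership):.2f}")
```

Because R² is a squared quantity, it doesn’t show the direction of the relationship; checking the sign of the covariance tells you whether higher isolation goes with lower international readership.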

I’m writing about this not because I think this is an especially novel or helpful idea, but because I’m wondering if someone else has done a better job of solving this problem. If you know of a data set or methodology out there that attempts to calculate the role of language in making it easier or harder to access information (news, culture) across borders, please let me know about it in the comments.