Geographically close to each other

Similar in their phoneme inventories

Similar in their lexicon

Closely related historically (but this effect disappears when controlling for geographic proximity)

We also used Random Forests analyses to show that a language is more likely to be guessed correctly if it is often mentioned in literature, is the main language of an economically powerful country, is spoken by many people or is spoken in many countries.

We visualised the perceptual similarity of languages by using the inverse probability of confusion to create a neighbour net:

This diagram shows a kind of subway map for the way languages sound. The shortest route between two languages indicates how often they are confused for one another – so Swedish and Norwegian sound similar, but Italian and Japanese sound very different. The further you have to travel, the more different two languages sound. So French and German are far away from many languages, since these were the best-guessed in the corpus.
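As a rough illustration of the distance step behind a diagram like this, here is a minimal Python sketch. It assumes that "inverse probability of confusion" means one minus the averaged probability of confusing the two languages in either direction; that reading, the function name `confusion_distances` and the toy counts are all assumptions made for illustration, not the study's data or code.

```python
from itertools import combinations


def confusion_distances(counts):
    """counts[target][guess] = how often a clip of `target` was labelled `guess`.

    Returns {(a, b): distance} for every unordered pair of languages,
    with language names in alphabetical order inside each pair.
    """
    langs = sorted(counts)

    # Turn each row of raw counts into probabilities P(guess | target).
    probs = {}
    for target in langs:
        total = sum(counts[target].values())
        probs[target] = {g: n / total for g, n in counts[target].items()}

    distances = {}
    for a, b in combinations(langs, 2):
        # Symmetrise: average "a heard as b" and "b heard as a".
        p = (probs[a].get(b, 0.0) + probs[b].get(a, 0.0)) / 2
        # Assumption: "inverse" is read as 1 - p, so pairs that are
        # rarely confused end up far apart.
        distances[(a, b)] = 1.0 - p
    return distances


# Invented counts, purely for illustration.
toy_counts = {
    "Swedish": {"Swedish": 60, "Norwegian": 35, "Japanese": 5},
    "Norwegian": {"Norwegian": 55, "Swedish": 40, "Japanese": 5},
    "Japanese": {"Japanese": 90, "Swedish": 5, "Norwegian": 5},
}

dist = confusion_distances(toy_counts)
for pair, d in sorted(dist.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```

A table of pairwise distances like `dist` is the kind of input a neighbour-net algorithm takes. Pairs that are often confused get small distances and end up close together in the network.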

The labels we’ve given to some of the clusters are descriptive, rather than being official terms that linguists use. The first striking pattern is that some languages are more closely connected than others. For example, the Slavic languages are all grouped together, indicating that people have a hard time distinguishing between them. Some of the other groups are based more on geographic area, such as the ‘Dravidian’ or ‘African’ cluster. The ‘North Sea’ cluster is interesting: it includes Welsh, Scottish Gaelic, Dutch, Danish, Swedish, Norwegian and Icelandic. These diverged from each other a long time ago in the Indo-European family tree, but have had more recent contact due to trade and invasion across the North Sea.

The whole graph splits between ‘Western’ and ‘Eastern’ languages (we refer to the political/cultural divide rather than any linguistic classification). This probably reflects the fact that most players were Western, or at least could probably read the English website. That would certainly explain the linguistically confused ‘East Asian’ cluster. There are also a lot of interconnected lines, which indicate that some languages are confused with languages from several groups. For example, Turkish is placed halfway between the ‘Western’ and ‘Eastern’ languages.

It was also possible to create neighbour nets for responses from specific parts of the world. While the general pattern is similar, there are also some interesting differences. For example, respondents from North America were quite likely to confuse Yiddish and Hebrew. They come from different language families, but both are spoken by mainly Jewish populations, and this may form part of players’ cultural knowledge of these languages.

In contrast, players from Africa placed Hebrew with the other Afro-Asiatic languages.

Results like this suggest that perception may be shaped by our linguistic history and cultural knowledge.

We also did some preliminary analyses on the phoneme inventories of languages, using a binary decision tree to explore which sounds made a language distinctive. The trees identified some rare and salient features as critical cues to distinctiveness.
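To give a feel for how a binary decision tree can pick out distinctive sounds, here is a toy Python sketch. It is not the analysis we ran: the languages (`LangA` to `LangD`), their inventories and the helper names `build_tree` and `cues_for` are all hypothetical. Each split asks whether a phoneme is present or absent, and the sketch greedily prefers the most even split.

```python
def build_tree(inventories):
    """inventories: {language: set of phonemes}.

    Returns a leaf (sorted list of language names) or a node
    (phoneme, subtree_with_phoneme, subtree_without_phoneme).
    """
    langs = sorted(inventories)
    if len(langs) <= 1:
        return langs

    phonemes = sorted(set().union(*inventories.values()))
    best, best_balance = None, None
    for ph in phonemes:
        n_with = sum(1 for lang in langs if ph in inventories[lang])
        if n_with == 0 or n_with == len(langs):
            continue  # shared by all or none: separates nothing
        # 0 means a perfectly even split. With one example per language,
        # the most even split is also the one with the highest information gain.
        balance = abs(2 * n_with - len(langs))
        if best is None or balance < best_balance:
            best, best_balance = ph, balance

    if best is None:
        return langs  # identical inventories cannot be told apart

    with_ph = {l: inv for l, inv in inventories.items() if best in inv}
    without = {l: inv for l, inv in inventories.items() if best not in inv}
    return (best, build_tree(with_ph), build_tree(without))


def cues_for(tree, language, path=()):
    """List the present (+) and absent (-) phonemes that lead to `language`."""
    if isinstance(tree, list):
        return list(path) if language in tree else None
    phoneme, with_ph, without = tree
    found = cues_for(with_ph, language, path + ("+" + phoneme,))
    if found is not None:
        return found
    return cues_for(without, language, path + ("-" + phoneme,))


# Hypothetical inventories, invented for illustration only.
toy_inventories = {
    "LangA": {"p", "t", "k", "click"},
    "LangB": {"p", "t", "k", "th"},
    "LangC": {"p", "t", "k"},
    "LangD": {"p", "t", "th"},
}

tree = build_tree(toy_inventories)
print(cues_for(tree, "LangA"))  # ['-th', '+click']
```

In this toy example the sounds shared by every language ("p" and "t") never appear in the tree, while a sound found in only one language ("click") ends up as a deciding cue.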

The future

The analyses were complicated because we knew little about the individuals playing beyond the country of their IP address. However, Hedvig and I, together with a team from the Language in Interaction consortium (Mark Dingemanse, Pashiera Barkhuysen and Peter Withers), have created a version of the game called LingQuest that does collect people’s linguistic background. It also asks participants to compare sound files directly, rather than use written labels.

You can download LingQuest as an Apple app, or play it online here.