The final data includes more than 15 million guesses (after taking out guesses made by players with IP-addresses in a country where the target was official or de facto-official). The overall probability of guessing a language correctly was 70%. The language pair most likely to be confused is Punjabi and Kannada (Kannada is mistaken for Punjabi in 55% of trials where Punjabi is an option), and the language pair least likely to be confused is French and Vietnamese (Vietnamese is mistaken for French in 0.9% of trials where French is an option). French is the language that was identified correctly most often. The supporting information S2 Data includes a full list of confusion rates.
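As an illustration of how a confusion rate such as "Kannada is mistaken for Punjabi in 55% of trials where Punjabi is an option" can be read, the following minimal sketch counts, among trials with a given target and a given distractor on offer, how often the distractor was chosen. The record layout (`target`, `options`, `guess`) and the toy trials are assumptions made for illustration and are not the format of the actual game data.

```python
def confusion_rate(trials, target, mistaken_for):
    """Share of trials with `target` as the correct answer and `mistaken_for`
    among the options in which the player guessed `mistaken_for`.

    Returns None when no trial offers that combination.
    """
    eligible = [
        t for t in trials
        if t["target"] == target and mistaken_for in t["options"]
    ]
    if not eligible:
        return None
    confused = sum(1 for t in eligible if t["guess"] == mistaken_for)
    return confused / len(eligible)


# Hypothetical toy records, not real game data.
trials = [
    {"target": "Kannada", "options": ["Kannada", "Punjabi"], "guess": "Punjabi"},
    {"target": "Kannada", "options": ["Kannada", "Punjabi", "French"], "guess": "Kannada"},
    {"target": "Kannada", "options": ["Kannada", "French"], "guess": "French"},
    {"target": "Vietnamese", "options": ["Vietnamese", "French"], "guess": "Vietnamese"},
]

print(confusion_rate(trials, "Kannada", "Punjabi"))    # 0.5
print(confusion_rate(trials, "Vietnamese", "French"))  # 0.0
```

Note that the third toy trial does not count towards the Kannada/Punjabi rate, because Punjabi was not among the options.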

Fig 4 shows the probability of guessing the target language correctly given the number of alternative candidates, and demonstrates that players are performing above chance. Performance decreases until there are 6 alternatives, but then increases. Players presented with 11 candidate answers actually guess correctly more often than players guessing from 2 candidates. This is probably due to the structure of the game: context size keeps increasing, until the player has made 3 incorrect guesses, at which point the game stops. Therefore, only expert players experience a high number of alternatives. The impact of expert players on the data as a whole is small: only 2% of trials involve more than 6 alternatives.

The proportion of trials guessed correctly as a function of the number of alternative choices. The size of the circles indicates the relative number of trials that had a given number of candidates. The red line indicates the proportion of correct responses expected by chance.
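The comparison with chance can be sketched as follows, under the assumption that a player choosing uniformly at random among n candidates is correct with probability 1/n. The input layout (pairs of candidate count and correctness) and the toy values are hypothetical.

```python
from collections import defaultdict


def accuracy_by_candidates(trials):
    """Map number of candidates -> (proportion correct, chance level, n trials).

    `trials` is an iterable of (n_candidates, was_correct) pairs.
    Chance level is taken to be 1 / n_candidates (uniform random guessing).
    """
    counts = defaultdict(lambda: [0, 0])  # n_candidates -> [correct, total]
    for n_candidates, was_correct in trials:
        counts[n_candidates][0] += int(was_correct)
        counts[n_candidates][1] += 1
    return {
        n: (correct / total, 1 / n, total)
        for n, (correct, total) in sorted(counts.items())
    }


# Hypothetical toy trials, not real game data.
toy_trials = [(2, True), (2, True), (2, False), (2, True), (4, True), (4, False)]
for n, (accuracy, chance, total) in accuracy_by_candidates(toy_trials).items():
    print(n, accuracy, chance, total)
```

The third element of each tuple corresponds to what the circle sizes in the figure convey: how many trials contribute to each point.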

In the following sections, we present the results of our investigations in relation to our research questions. The first questions relate to confusion while the latter relate to accuracy. The section ends with a summary of the results.

Q1: Which languages are confused for which?

In order to visualise the results, a Neighbor-Net was generated based on the confusion matrix for all guesses. Fig 5 shows the Neighbor-Net graph of the distance matrix from the GLG. The graph has been annotated for additional information; the colored highlighting of language names corresponds to language families and the polygons to geographical areas. First, we will introduce the concept of Neighbor-Net.

Each language is represented as a tip on a graph (next to a label with the language’s name). Tips are connected by a series of lines. The network is drawn so that for any two languages the length of the shortest path that connects them is approximately the distance between them in the input matrix. Therefore, two nodes that are connected by a short path are confused for each other more than two nodes that are connected by a long path. For example, there is a short distance between Tongan and Samoan, but a long distance between Tongan and Welsh. This means that Tongan is confused for Samoan more often than it is confused for Welsh. Languages like French and German are connected to the rest of the network by long lines, indicating that they have a high distance to any language, meaning that they are often guessed correctly. It is important to note that it is distance along the lines, not along the perimeter of the entire cluster that is key to reading a Neighbor-Net (i.e. the distance between the positions of two tips without travelling along lines is not meaningful).
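The reading rule described above (distance is measured along the lines, not between tip positions) amounts to a shortest-path computation on a weighted graph. The sketch below uses Dijkstra's algorithm on a tiny hypothetical network; the internal node names and edge lengths are invented for illustration and are not taken from Fig 5.

```python
import heapq


def make_graph(edges):
    """Build an undirected weighted graph {node: {neighbour: length}}."""
    graph = {}
    for a, b, length in edges:
        graph.setdefault(a, {})[b] = length
        graph.setdefault(b, {})[a] = length
    return graph


def shortest_path_length(graph, start, goal):
    """Length of the shortest path between two nodes (Dijkstra), or None."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length in graph[node].items():
            candidate = dist + length
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


# Hypothetical edge lengths; "hub_a" and "hub_b" are made-up internal nodes.
network = make_graph([
    ("Tongan", "hub_a", 1),
    ("Samoan", "hub_a", 1),
    ("hub_a", "hub_b", 5),
    ("Welsh", "hub_b", 3),
])

print(shortest_path_length(network, "Tongan", "Samoan"))  # 2
print(shortest_path_length(network, "Tongan", "Welsh"))   # 9
```

In this toy network the short Tongan to Samoan path corresponds to frequent confusion, and the long Tongan to Welsh path to rare confusion.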

One advantage of a Neighbor-Net is that it can display more complex relations between nodes, for instance a language being confused for two different groups of languages. The Neighbor-Net represents ‘confusability’ with a web of lines. For example, Armenian (Indo-European) and Turkish (Altaic) are often confused and have parallel lines connecting them, making the total number of nodes that they share fewer than, for example, those shared by Turkish and Northern Sami (Uralic).

To appreciate the results, it is helpful to consider the extreme shapes that the graph could take. If every player guessed languages correctly almost all the time, then there would be little confusion and the Neighbor-Net would look like a star, with each language connected by an equally long line to a single central hub (similar to how German and French now stand out). If players guessed randomly, then the Neighbor-Net would look more like a web, with large areas of confusion and no clear meaningful clusters. In Fig 5, we see what would appear to be a moderate amount of confusion (delta score = 0.3, q-residual = 0.03), suggesting that players are not perfect at guessing languages, but not completely random in their confusion, since there are also clusters. For example, if players were reasonably good at guessing specific languages, but better at identifying Slavic languages in general from non-Slavic languages, then the Slavic languages would form their own ‘branch’ or cluster in the tree. In fact, that is exactly what we find! If the players were all historical linguists, making their judgments based on linguistic history (e.g. always guessing within the correct language family), then the clusters in the Neighbor-Net would reflect language families. We see some evidence for this, but also some counter-evidence. Similarly, if players were making judgments based only on phonology, then the Neighbor-Net would look like a Neighbor-Net produced with phonological distances, and if players were making judgments based on geographical location (e.g. they confuse languages that come from the same part of the world), then the Neighbor-Net would reflect geographical distances and, ideally, look similar to a geographical map of the world.
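To make the notion of a delta score more concrete, here is a minimal sketch of one common quartet-based definition: for every four taxa, the three pairwise sums of distances are sorted, and the score is the gap between the two largest sums divided by the gap between the largest and the smallest. A score of 0 means the quartet is perfectly tree-like, and values towards 1 indicate more conflicting signal. That this is the exact variant behind the figure reported above is an assumption, and the toy matrices are invented.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """Delta score of a single quartet from a symmetric distance matrix `d`."""
    sums = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]],
        reverse=True,
    )
    largest, middle, smallest = sums
    if largest == smallest:
        return 0.0  # all three sums equal: treated as tree-like (star)
    return (largest - middle) / (largest - smallest)


def mean_delta(d):
    """Average quartet delta over all quartets (needs at least 4 taxa)."""
    scores = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    return sum(scores) / len(scores)


# Toy distances from a perfect tree ((A,B),(C,D)) with all edge lengths 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Toy distances that no tree can represent well.
conflicting = [
    [0, 3, 2, 2],
    [3, 0, 2, 2],
    [2, 2, 0, 3],
    [2, 2, 3, 0],
]

print(mean_delta(tree_like))    # 0.0
print(mean_delta(conflicting))  # 1.0
```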

We can contrast the visualisation of the Neighbor-Net with a binary branching tree. Fig 6 shows a tree calculated from the confusion distance matrix. The two visualisations complement each other, since the different properties of the graphs bring out different relationships. Trees are undirected rooted graphs, showing exactly one relationship per node, whereas a Neighbor-Net is a directed un-rooted graph able to show multiple relationships. For example, the Neighbor-Net places Northern Sami, Latvian and Estonian relatively close (indicating conflicting signal), while the tree places Latvian and Estonian together, but Northern Sami in a more distant branch.

In the rest of this section on confusion, we explicitly test whether the distance matrix of player judgments correlates with genealogical, phonological or geographical distance, but from this very first visualization we can already make a few interesting observations.

The first major split that we see in the binary tree (Fig 6), and also, less clearly, in the Neighbor-Net (Fig 5), is between mostly languages of Europe on one side and languages mainly from Asia and Oceania on the other. The few languages of Africa present in the game (Amharic, Hausa, Arabic, Shona, Swahili, Northern Ndebele, Tigrinya and Dinka) are found on both sides. The languages of the Atlantic-Congo family (Shona, Swahili, Northern Ndebele) appear on the “Asia-Oceanic” side, but the Afro-Asiatic languages and in particular the Semitic ones are more complicated. The Afro-Asiatic and Semitic languages of Africa (Tigrinya, Arabic, Assyrian, Amharic) are found on the “European” side, together with Semitic Maltese and Hebrew (note that we are using the language names as they appear as an option in the game, i.e. “Hebrew”, and we are referring to what is otherwise labeled as “Modern Hebrew”, not Ancient/Classical Hebrew). The other two Afro-Asiatic languages, Hausa (Chadic) and Somali (Cushitic), are found on the “Asia-Oceanic” side, close together with the Atlantic-Congo languages.

The east-west split can also be interpreted as a result of the bias in the pool of players and the global fame of the languages, or more accurately the lack thereof, considering that lesser-known languages of Africa, Oceania and Asia are being lumped together.

The first split on the “European side” is between Slavic languages and the rest. The Indo-European languages, in particular French, German and Spanish, are often guessed correctly, so there is less confusion there (see the supporting information S3 Data).

The Slavic languages form a coherent cluster, but the Romance languages do not in the same way. This is clearly illustrated both in the Neighbor-Net and the binary tree visualisations. There are only five Romance languages in our sample (Portuguese, Romanian, French, Spanish and Italian). French is most often guessed correctly overall (94%) and is therefore not particularly close to any of the other languages in the sample (the same applies to German). Spanish and Italian are also very often guessed correctly; if anything they are confused with each other. However, Portuguese and Romanian are associated with Greek and Albanian and also Slavic languages. In the case of Romanian this is most likely due to similarities in phonology with Slavic due to extensive contact (e.g. [47], p. 11–12), but perhaps also players’ cultural knowledge of Eastern Europe as a region of much cultural contact, both historically and in recent times (Soviet Union).

In the case of Portuguese, however, the similarities with the Slavic languages cannot be due to contact or culture as with Romanian, but rather to coincidental phonological similarity. Portuguese shares 58% of its segments with Croatian and 53% of its segments with Polish (data from PHOIBLE). This is within the top 1% of most similar segment inventories in our data (mean shared segments = 26%). Besides the fact that Portuguese and Romanian both appear close to the Slavic languages, as does Albanian, they are also confused for each other. This is not surprising to the authors of this paper. One possible explanation is that Romanian and Portuguese are two Romance languages that are often perceived as “sounding Slavic”, due, at least in part, to the aforementioned contact (Romanian-Slavic) and the similarities in phonology (Portuguese-Slavic).
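A statement such as "Portuguese shares 58% of its segments with Croatian" can be read as a set computation: the proportion of one language's segment inventory that also occurs in the other's. The sketch below assumes exactly that reading (the denominator is the first language's inventory, so the measure is not symmetric). The inventories are tiny invented examples, not PHOIBLE data.

```python
def shared_segment_share(inventory_a, inventory_b):
    """Proportion of the segments in `inventory_a` that also occur in `inventory_b`."""
    a, b = set(inventory_a), set(inventory_b)
    if not a:
        raise ValueError("inventory_a must contain at least one segment")
    return len(a & b) / len(a)


# Hypothetical toy inventories for two unnamed languages.
language_x = ["p", "t", "k", "a", "i"]
language_y = ["p", "t", "a", "u"]

print(shared_segment_share(language_x, language_y))  # 0.6
print(shared_segment_share(language_y, language_x))  # 0.75
```

The two calls differ because the inventories have different sizes, which is why the direction of the comparison needs to be stated alongside the percentage.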

The Indo-European languages of Europe are more often associated with the unrelated languages of the Uralic or Afro-Asiatic families than with their Indo-European cousins in South Asia: Hindi, Bangla, Sinhalese, Gujarati and Punjabi. In fact, we also find Turkish and Basque closer to the Indo-European languages of Europe. Basque is an isolate language spoken in Spain and France; it has no known relatives. It is unclear whether this confusion is triggered by similarity due to contact or by cultural knowledge of where it is spoken. In our comparison of phoneme inventories, Turkish and Basque are similar to German and Spanish, respectively.

The Indo-European languages of the Indic and Iranian genera are usually clustered with other languages in their geographical vicinity, i.e. in Southern and Western Asia. Farsi and Dari stand out among the Indo-European languages spoken in Asia as being slightly closer to the western Indo-European languages and to Semitic languages than to other eastern Indo-European languages. This can also be seen in the binary tree (Fig 6) where Farsi and Dari are more tightly clustered with Hebrew and Armenian than with languages of South Asia.

The four Uralic languages appear close in both visualisations, with the interesting addition of Latvian. Latvian is an Indo-European language of the Baltic branch, spoken in close contact with Estonian (Uralic). This pattern might be due to contact influence in the phonology of Latvian. It is also important to take into account that there are no other Baltic languages in the sample that Latvian could be associated with; Lithuanian, for example, is not included. Based on genealogy one might expect Latvian as a Baltic language to appear closer to the Slavic languages [48] (page 222) [49]. However, this connection is more than 3,000 years old and the shared features are most likely not salient to players in these short speech samples. The cultural knowledge that Latvian is spoken close to Uralic languages and/or contact influence from these languages is probably a stronger factor.

On the whole, there is a lot of conflicting signal in the Slavic cluster, indicating that players confuse Slavic languages for one another. However, this is not symmetrical: every Slavic language is more often confused with Russian than the other way around (this is discussed in more detail later in our paper), resulting in Russian being more removed from the other Slavic languages in the Neighbor-Net visualisation. This indicates that Russian is the ‘prototypical’ Slavic language, or at least the one that players associate most with Slavic sounds.
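The asymmetry described here can be expressed as a difference between the two directions of a confusion matrix. The sketch below assumes a nested mapping in which `confusion[x][y]` is the rate at which target x is mistaken for y. The rates are invented for illustration and are not results from the game.

```python
def asymmetry(confusion, a, b):
    """Positive when `a` is mistaken for `b` more often than `b` is for `a`."""
    return confusion[a][b] - confusion[b][a]


def drawn_towards(confusion, hub):
    """Languages that are mistaken for `hub` more often than the reverse."""
    return sorted(
        lang for lang in confusion
        if lang != hub and asymmetry(confusion, lang, hub) > 0
    )


# Hypothetical rates: confusion[target][guess].
toy_confusion = {
    "Polish": {"Russian": 0.5, "Croatian": 0.25},
    "Croatian": {"Russian": 0.5, "Polish": 0.25},
    "Russian": {"Polish": 0.25, "Croatian": 0.125},
}

print(asymmetry(toy_confusion, "Polish", "Russian"))  # 0.25
print(drawn_towards(toy_confusion, "Russian"))        # ['Croatian', 'Polish']
```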

It is important to note that Slavic is the most well-represented subfamily/genus in the entire language sample; there are 11 Slavic languages, compared to 7 Germanic and 5 Romance (take into account that, as mentioned earlier, Bosnian, Serbian, and Croatian are classified as 3 different languages in Ethnologue, Glottolog and in the Great Language Game, but that this is not necessarily done everywhere else). For more information on the representation of families and genera, see supporting information S1 Data.

Another interesting observation is the clustering of the Austronesian languages. In both Figs 5 and 6, they are split between Austronesian languages of Oceania (Maori, Tongan, Samoan, Fijian and South Efate) and Western Austronesian languages (Malay, Indonesian and Tagalog). The Oceanic languages are associated with some of the languages of Africa in the sample (Dinka of the Nilo-Saharan family, Northern Ndebele, Shona and Swahili of the Atlantic-Congo family and one member of the Afro-Asiatic family: Hausa). The Western Austronesian languages are found to be closer to languages of East Asia and India, particularly Dravidian languages.

In contrast to a Neighbor-Net, a binary tree forces a language to appear with one particular group, which can lead to strange clusters. In the binary tree, Fijian (Oceanic, Austronesian) appears alongside Malayic Austronesian languages and languages of the Indian subcontinent. This is likely to be an artefact of the binary tree being unable to handle the complex relationships in the data; in the Neighbor-Net, Fijian appears with the other Oceanic Austronesian languages, which intuitively makes more sense.

Now we turn to Korean and Japanese. The possibility of a genealogical connection between Korean and Japanese has long been a topic of debate in linguistics [50]. It is widely accepted that Japanese forms a small family together with the languages of the Ryukyu islands (these are however not present in the game)—but is Korean to be included there, viewed as an isolate language, or related to another grouping? There are researchers who group Korean and Japanese as part of an even greater family—Altaic. The macro-Altaic hypothesis (we use the term ‘macro-Altaic’ to refer to the Altaic hypothesis that includes Tungusic, Mongolic, Turkic, Japanese, Ainu and Korean, as opposed to simply ‘Altaic’ or ‘micro-Altaic’ that only covers Tungusic, Mongolic and Turkic) also includes Mongolian and Turkish [51]. Glottolog, which was used as the information source for linguistic genealogy in this paper, does not place Korean and Japanese in the same family, nor does the Ethnologue. It is unclear if the confusion in the GLG between Korean and Japanese is due to phonological similarity, geographical proximity, genealogy or players’ cultural knowledge. Either way, the players of the game have made a connection between Japanese and Korean. In fact, they confuse several languages of East Asia, South Asia and Mainland Southeast Asia with each other, despite these languages being of different language families. Players are however not confusing Japanese and Korean with the only other macro-Altaic member present in the game—Turkish—particularly often.

This cluster of languages from East Asia (Cantonese, Mandarin, Japanese and Korean), South Asia (Central Tibetan) and Mainland Southeast Asia (Khmer, Burmese, Lao and Thai) is quite clearly visualised in the Neighbor-Net. This cluster is made up of languages from at least 4 different families: Sino-Tibetan, Austro-Asiatic, Tai-Kadai, Japonic and the isolate Korean. Several of these languages are in a well-known contact area and share many features, in particular the salient features of having tone and sesquisyllabicity [52, 53]. Khmer and Korean are the only languages in the cluster that lack tone (there are recent studies showing that varieties of Khmer and Korean are developing tone [54, 55], but it is unclear how salient these distinctions are to a non-native speaker as it is a change in progress with restricted distribution). Besides sharing similar features, they might also be lumped together because players (in particular those from Europe, the US and Australia) may perceive these languages and people as a cultural unit (“Asian”).

Languages of Africa are less coherent. The Atlantic-Congo languages cluster together, as do some of the Afro-Asiatic languages, but the two groups are far apart from each other.

At a macro-scale, the clustering appears to be partly based on geography. For example, the graph as a whole splits into languages from the ‘East’ and ‘West’, with coherent clusters for languages of Europe, India and South-East Asia. Within the European languages there appears to be a sub-group composed of languages bordering the North Sea (Swedish, Norwegian, Danish, Icelandic, Dutch, Scottish Gaelic and Welsh). Since there are both Germanic and Celtic languages in this group, geography seems a more parsimonious explanation than genealogy. One possible explanation for the cluster is similarity of phonologies due to contact along sea trading routes (e.g. [56]).

There are some patterns that are inconsistent with language family history. Yiddish for example (an Indo-European language) shares a branch with Hebrew (an Afro-Asiatic language). Yiddish and Hebrew may be of different language families, but they are both languages spoken by a mainly Jewish population and this may form part of players’ cultural knowledge of these languages. In addition, during the revival of Hebrew, Yiddish had a considerable influence on the phonology [57, 58].

If players are using cultural knowledge such as shared culture, history and religion of speakers of Yiddish and Hebrew, then we might expect Neighbor-Nets constructed from judgments by players from different parts of the world to look different. We constructed Neighbor-Nets for several geographic regions (continents, according to the ISO convention) and we do find some differences: players from North America associated Yiddish and Hebrew more with each other than players from Africa or Asia do. In fact, African players confused Hebrew and other Afro-Asiatic languages more often than Hebrew and German—which is more in line with language history. Neighbor-Nets constructed from responses from each separate continent can be found in supporting information S4 Data.

We also calculated a confusion matrix for responses from each country with more than 50,000 data points (42 countries). All pairs of countries were significantly correlated in their confusions (Mantel p < 0.01, adjusted for multiple comparisons using Holm’s method [59]), but the correlation strengths vary from r = 0.25 to r = 0.95. This suggests that the judgements of players from different countries are more similar than one would expect by chance, but there are still subtle differences. The country pairs with the highest similarity are the UK with the USA, and Argentina with Mexico. The country pairs with the lowest similarity are France with China, and Poland with Slovenia. From these examples, one might make two predictions about what would predict similarity in judgements: the amount of cultural similarity and geographic proximity. However, preliminary post-hoc analyses found no statistical support for this. Confusion distance actually has a weak negative correlation with geographic distance between country capitals (Mantel r = −0.13, p = 0.09). Also, the proportional migration between two countries does not predict the similarity of judgements between them (data from [60], Mantel r = −0.08, p = 0.24). Supporting information S4 Data includes a Neighbor-Net produced from the judgement differences between countries.
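For readers unfamiliar with the two procedures named above, the following is a minimal standard-library sketch of a permutation-based Mantel test and of Holm's step-down adjustment. It is not the authors' implementation: the two-sided test, the number of permutations, the seed and the toy matrices are all assumptions made for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Off-diagonal upper-triangle entries after relabelling rows/columns by `order`."""
    return [matrix[order[i]][order[j]]
            for i, j in combinations(range(len(order)), 2)]


def mantel(a, b, permutations=999, seed=0):
    """Two-sided Mantel test: correlation of two distance matrices, with a
    p-value from jointly permuting the rows and columns of `a`."""
    rng = random.Random(seed)
    identity = list(range(len(a)))
    b_flat = upper_triangle(b, identity)
    observed = pearson(upper_triangle(a, identity), b_flat)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if abs(pearson(upper_triangle(a, order), b_flat)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(sorted(range(m), key=lambda k: p_values[k])):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical symmetric distance matrices for four unnamed countries.
toy_a = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
toy_b = [[2 * value for value in row] for row in toy_a]

r, p = mantel(toy_a, toy_b, permutations=99)
print(round(r, 6), p)
print(holm([0.01, 0.04, 0.03]))
```

Permuting rows and columns together, rather than shuffling individual cells, preserves the dependence between entries that share a row, which is the reason a Mantel test is used instead of an ordinary correlation test on the flattened matrices.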