Just How Weird Are the World’s Weirdest—and Least Weird—Languages?

Oct 26, 2014 by Asya Pereltsvaig

[This post was originally published in September 2013]

In a recent post on idibon.com, Tyler Schnoebelen asked which language is the “weirdest of all”. The most intuitive definition of language weirdness involves comparing languages to the native language of the person doing the comparing, most typically English. Here I must agree with Schnoebelen that “that’s a pretty irritating definition”. Nor is it particularly enlightening: any language that appears ordinary in comparison with English (a fixed-word-order, case-less language with articles and numerous vowels but a fairly modest consonant inventory and relatively bland consonant clusters) is bound to look weird to me as a native speaker of Russian, which has a free word order, a rich system of case marking, no articles, a much smaller vowel inventory, and a more complex consonant system. Schnoebelen therefore takes a different approach: instead of comparing languages to one arbitrary standard (English!), he compares numerous languages to each other.

In order to do so, Schnoebelen drew data from the World Atlas of Language Structures (WALS), which evaluates 2,676 languages in terms of 192 different linguistic features. “These features include word order, types of sounds, ways of doing negation, and a lot of other things”, Schnoebelen explains. Some languages had to be excluded from consideration because of the paucity of typological information about them. Similarly, most of the WALS features were excluded either because few languages have been analyzed with respect to the given feature or because the feature reproduces the information found in some other feature (for example, from three features that describe the relative placement of subjects, objects, and verbs, Schnoebelen chose feature 83A: Order of Object and Verb). Altogether the full dataset includes 21 typological features and 1,693 languages (although the Weirdness Index was calculated for 239 languages for which a significant number of features is specified in WALS). Each language was evaluated in terms of how unusual it is in regard to each feature. Here’s Schnoebelen’s explanation:

“For example, English word order is subject-verb-object—there are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7% of languages start with a verb—like Welsh, Hawaiian and Majang—so cross-linguistically, starting with a verb is unusual. For what it’s worth, 41.0% of the world’s languages are actually SOV order. … For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included subject-object-verb order then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea).”
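
As a rough illustration of the calculation described in this quote, here is a minimal Python sketch. The language names and feature values are invented toy data, not WALS records, and the function name commonness_scores is my own label. The sketch also leaves out the entropy normalization mentioned in the quote, because its exact form is not given there; it simply averages, for each language, the relative frequency of its value among the other languages coded for the same feature.

```python
from collections import Counter

# Toy data, invented for illustration only (NOT taken from WALS).
# None means the language is not coded for that feature.
TOY = {
    "LangA": {"object_verb": "OV", "tone": "none"},
    "LangB": {"object_verb": "OV", "tone": "none"},
    "LangC": {"object_verb": "VO", "tone": "none"},
    "LangD": {"object_verb": "OV", "tone": "complex"},
    "LangE": {"object_verb": "VO", "tone": None},
}


def commonness_scores(data):
    """Average relative frequency of each language's feature values.

    For every feature a language is coded for, take the share of the OTHER
    coded languages that have the same value, then average those shares.
    Lower averages mean "weirder". No entropy normalization is applied;
    that simplification is an assumption of this sketch.
    """
    features = sorted({f for values in data.values() for f in values})
    scores = {}
    for lang, values in data.items():
        freqs = []
        for feat in features:
            own = values.get(feat)
            if own is None:
                continue  # language not coded for this feature
            others = [v[feat] for name, v in data.items()
                      if name != lang and v.get(feat) is not None]
            if others:
                freqs.append(Counter(others)[own] / len(others))
        if freqs:
            scores[lang] = sum(freqs) / len(freqs)
    return scores


if __name__ == "__main__":
    ranked = sorted(commonness_scores(TOY).items(), key=lambda kv: kv[1])
    for lang, score in ranked:
        print(f"{lang}: {score:.3f}")
```

In this toy sample, the languages with the lowest averages are the ones whose values are rare among the languages they are compared with, which is the sense of “weird” used here. Note that a language coded for only one feature is scored on that single feature alone.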

The results of Schnoebelen’s calculations may surprise some people: among the more “weird”, typologically unusual, languages are familiar European tongues such as German, Dutch, Norwegian, Czech, and Spanish. English ranked #33 on Schnoebelen’s Language Weirdness Index. Even more unexpected is the placement of Mandarin Chinese in the top 25 weirdest, while Cantonese falls in the bottom 10. The top 25 “weirdest” languages are: Chalcatongo Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, Zoque, Paumarí, and Trumai (all spoken in the Americas); Pitjantjatjara and Lavukaleve (spoken in Australia and the Solomon Islands, respectively); Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and Khoekhoe (all spoken in sub-Saharan Africa); Nenets, Eastern Armenian, Abkhaz, Ladakhi, Mandarin, German, Dutch, Norwegian, Czech, and Spanish (all spoken in Eurasia). Thus, there is no clear geographical or genetic pattern to typological weirdness, as defined by Schnoebelen.

Similarly, the “least weird” (that is, typologically average) languages are also found all over the world and in different language families. Examples include Cantonese (Sino-Tibetan), Hungarian (Finno-Ugric), Chamorro (Austronesian), Imbabura Quechua (Quechuan), as well as a couple of isolates: Ainu and Basque. Surprisingly, the list contains several languages that few readers have probably heard of (I haven’t!): Bororo, a Macro-Ge language from Bolivia; Usan, a Trans-New Guinea language from central Papua New Guinea; and Purépecha, a Tarascan language from Mexico. While six fairly familiar Indo-European languages are found among the 25 weirdest languages (Eastern Armenian, German, Dutch, Norwegian, Czech, and Spanish), another Indo-European language—Hindi—tops the list of the “most normal” ones. In fact, Hindi has only a single weird feature among the 21 in Schnoebelen’s final analysis: it expresses predicative possession via a locational construction. For example, the Hindi equivalent of the English I have a dog is something like ‘At me there’s a dog’. This locational construction is found in 48 languages in the WALS sample (including Russian, Egyptian Arabic, Japanese, Uzbek, and Hungarian), compared to 63 languages that use the verb ‘have’ to express possession, such as English or French.

Schnoebelen’s conclusions are also surprising in one major conceptual way. Given that “weirdness” is defined as being typologically very distinct from the rest, we would expect that the “least weird” languages would be largely similar to each other and to a prototypical “normal” language (i.e. a language that has the most frequent value for each feature; such a language may not exist in actuality, but Hindi apparently comes close). Or to paraphrase the Anna Karenina principle, “normal languages are all alike; every weird language is weird in its own way”. A quick glance at the top-10 list of most “normal” languages, however, suggests that this is not so. Chamorro is a VSO (Verb-Subject-Object) language with little affixation and no case marking; Cantonese is an SVO language with obligatory classifiers and a complex tone system; Basque is an SOV language with a system of over 10 cases, organized on a split-ergative model; Hungarian has no dominant word order in a clause and, like Basque, features more than 10 cases, except it is organized on a nominative-accusative model; and so on.
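
To make the “quick glance” slightly more concrete, the following Python sketch counts how often these four languages agree with each other. It is only an illustration: the profiles record nothing beyond what is stated in the paragraph above, anything not mentioned there is left as None rather than guessed, and the feature labels are my own shorthand rather than WALS codes.

```python
from itertools import combinations

# Values are limited to what the paragraph above states; None = not stated.
# The keys and value labels are shorthand for this sketch, not WALS codes.
PROFILES = {
    "Chamorro": {"word_order": "VSO", "cases": "none", "alignment": None},
    "Cantonese": {"word_order": "SVO", "cases": None, "alignment": None},
    "Basque": {"word_order": "SOV", "cases": "10+",
               "alignment": "split-ergative"},
    "Hungarian": {"word_order": "no dominant order", "cases": "10+",
                  "alignment": "nominative-accusative"},
}


def agreement(a, b):
    """Return (shared, compared) for two language profiles.

    compared = number of features coded for both languages;
    shared   = number of those features on which the values are equal.
    """
    compared = [f for f in a if a[f] is not None and b.get(f) is not None]
    shared = sum(1 for f in compared if a[f] == b[f])
    return shared, len(compared)


if __name__ == "__main__":
    for x, y in combinations(PROFILES, 2):
        shared, compared = agreement(PROFILES[x], PROFILES[y])
        print(f"{x} vs {y}: {shared} of {compared} comparable features agree")
```

With so few features and so many gaps, this is a sanity check rather than an analysis, but it shows the shape of the comparison: the only agreement among these four languages is the shared “10 or more cases” value of Basque and Hungarian.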

In order to examine this issue in a more systematic fashion, I have compared the ten “least weird” languages from Schnoebelen’s list with respect to 20 typological features taken from WALS, some of which are included in Schnoebelen’s analysis and others that seemed typologically important to me. (I admit that this selection of features is based not on a statistical analysis but on a professional judgment, but that’s just the point.) The features are 1A: Consonant Inventories; 2A: Vowel Quality Inventories; 9A: The Velar Nasal; 11A: Front Rounded Vowels; 13A: Tone; 23A: Locus of Marking in the Clause; 26A: Prefixing vs. Suffixing in Inflectional Morphology; 33A: Coding of Nominal Plurality; 49A: Number of Cases; 55A: Numeral Classifiers; 57A: Position of Pronominal Possessive Affixes; 64A: Nominal and Verbal Conjunction; 65A: Perfective/Imperfective Aspect; 71A: The Prohibitive; 78A: Coding of Evidentiality; 81A: Order of Subject, Object, and Verb; 87A: Order of Adjective and Noun; 89A: Order of Numeral and Noun; 98A: Alignment of Case Marking of Full Noun Phrases; and 117A: Predicative Possession. (Only thirteen of these features are discussed below, for reasons of space. The full dataset is available upon request.)

With respect to all these features, the ten languages that are expected to be largely the same actually exhibit a great deal of variation from each other or from the “typologically average” pattern (or both). Consider, for example, the issue of consonant and vowel inventories (it should be pointed out that feature “2A: Vowel Quality Inventories” refers to vowel qualities alone, not to the number of vowel phonemes, an issue that led Quentin D. Atkinson astray in a different study). An average language is obviously expected to have average-sized consonant and vowel inventories. Yet, this is not so in regard to Schnoebelen’s list: Bororo and Usan both have small consonant inventories, Hungarian and Purépecha have moderately large consonant inventories, and Hindi has a large one. (This point highlights the importance of the selection of features: according to Schnoebelen’s choice of features, Hindi is “normal” except for one feature, but as we see here and will see further below, if different features are selected, Hindi becomes much less typical.) As for vowel (quality) inventories, at least three out of ten languages—Bororo, Cantonese, and Hungarian—have strikingly large inventories of 7-14 vowel qualities (WALS contains no information on the Imbabura Quechua vowel system).

Feature “23A: Locus of Marking in the Clause” is instructive because the majority of the ten “most normal” languages do not exhibit the typologically most common value. This feature refers to where notions such as subject, object, indirect object, etc. are tracked. The highest proportion of languages in the WALS sample (30%) exhibit the “head marking” pattern, meaning that the subject, object etc. are tracked via agreement morphemes on the verb; 27% of languages exhibit the “dependent marking” pattern, tracking subjects, objects, etc. via case marking on them; and 25% of languages do both head- and dependent marking. (Other languages have either no marking or do something else.) However, among the top-10 “most normal” languages, head-marking is not the most common pattern: only 2 out of 10 languages, Bororo and Usan, track grammatical functions via agreement on the verb. The most common pattern among those ten languages is dependent-marking, found in Quechua, Chamorro, Hungarian, and Purépecha. Double marking is found in Basque and Hindi (perhaps related is the fact that both are split-ergative languages, as discussed in more detail below); information on Cantonese is unavailable.
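
The tally in this paragraph can be reproduced with a few lines of Python. This is just a sketch under stated assumptions: the percentages and language values are copied from the paragraph above, the labels are my own shorthand, and the two languages for which the paragraph gives no value (Cantonese, which is unavailable, and Ainu, which is not mentioned) are recorded as None and left out of the count.

```python
from collections import Counter

# Shares of the WALS sample as quoted in the paragraph above
# (remaining languages have no marking or do something else).
WALS_SHARE = {"head": 0.30, "dependent": 0.27, "double": 0.25}

# Values as given in the paragraph; None = no value given there.
TOP10_LOCUS = {
    "Bororo": "head",
    "Usan": "head",
    "Quechua": "dependent",
    "Chamorro": "dependent",
    "Hungarian": "dependent",
    "Purépecha": "dependent",
    "Basque": "double",
    "Hindi": "double",
    "Cantonese": None,  # information unavailable
    "Ainu": None,       # not mentioned in the paragraph
}


def tally(values):
    """Count feature values, skipping languages with no recorded value."""
    return Counter(v for v in values.values() if v is not None)


def modal_value(counts):
    """Return the value with the highest count (or share)."""
    return max(counts, key=counts.get)


if __name__ == "__main__":
    counts = tally(TOP10_LOCUS)
    print("Most common in WALS sample:", modal_value(WALS_SHARE))
    print("Most common among the ten: ", modal_value(counts))
    for value, n in counts.most_common():
        print(f"  {value}: {n}")
```

The point of the comparison is the mismatch between the two modal values: the pattern that is most frequent in the WALS sample is not the one that is most frequent among the ten supposedly most average languages.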

Another crucial “design” morphological feature is the use of prefixes vs. suffixes. Most of the world’s languages prefer suffixes over prefixes, and in many languages it is a strong preference. Among the top-10 “most normal” languages, only four fall in the “strongly suffixing” category: Quechua, Hungarian, Purépecha, and Hindi. Bororo shows a weak preference for suffixes; Basque and Ainu rely on prefixes and suffixes in equal measure, while Cantonese and Chamorro have little affixation of any kind. Chamorro uses productive patterns of full and partial reduplication, and Cantonese is largely isolating.

When it comes to the encoding of specific morphological concepts, such as plurality, case, gender, and so on, languages in the top-10 “most normal” list differ quite a bit as well. For example, only five languages use the most common strategy for encoding nominal plurality—a plural suffix (cf. English dog-s). Basque uses a plural clitic instead and Chamorro a plural word, whereas Cantonese has no morphological marking of nominal plurality at all. (No information is available for Bororo and Usan.)

As for case, the typologically most common option is to not have morphological case marking at all, yet at least three of the top-10 “most normal” languages feature morphological case marking systems and rich ones at that: Quechua falls into the “8-9 cases” category, whereas Basque and Hungarian are found in the “10 cases or more” category. In fact, Basque has at least 12 cases and Hungarian has at least 18 (depending on the analysis). Case systems in various languages are also organized differently, the most common pattern in the WALS sample being “neutral”, in which subjects of transitives, subjects of intransitives, and objects are all marked the same way (Usan, Chamorro, and Ainu belong to this category). But other languages in the top-10 “most normal” list vary widely as to the organization of their case systems. Quechua and Hungarian have nominative-accusative systems, marking the subjects of intransitives (e.g. He left) the same as subjects of transitives (e.g. He drank a beer) and different from objects (e.g. Mary loves him). In contrast, Basque has a case system organized largely on the ergative-absolutive model, marking subjects of intransitives the same as objects and different from subjects of transitives. But it is more complicated than that: although subjects of intransitives are typically marked the same as objects, intransitives of a certain type (such as ‘laugh’, ‘work’, ‘run’) are rendered as a combination of a noun and a light verb meaning ‘make’ and hence are syntactically transitive. For example, the Basque counterpart of ‘He laughed’ is literally ‘He made a laugh’, and therefore its subject is marked as the subject of a transitive would be. Besides Basque, such a case system is found in Georgian, a language that some have tried to connect to Basque genetically.
Hindi, on the other hand, has a “tripartite” (or “three-way”) case system in which subjects of transitives, subjects of intransitives, and objects are all marked differently: subjects of transitives are marked with the ergative marker -ne, objects are marked with the accusative marker -ko, and subjects of intransitives have no overt case marking. (To complicate matters further, case marking in Hindi also depends on the tense/aspect of the verb.)

Another important cross-linguistic difference is in the use of classifiers. While in English (certain) nouns can be counted directly (one boy, two boys, three boys, and so on), in some languages, such as Thai, the only way to count something is by using a special marker, called a classifier, which makes things countable. Another way to look at this is to analyze all nouns in Thai-style languages as mass nouns, similar to English rice or salt: they can be counted only if a word like grain is used (e.g. one grain of rice, two grains of rice, etc.). In this respect, English belongs to the unmarked category of languages without classifiers. Yet not all languages in the top-10 “most normal” list belong to the same category; in fact, only four do. Another four languages make use of classifiers, either optionally (Hungarian, Ainu) or obligatorily (Cantonese, Purépecha). (Once again, information on Bororo and Usan is missing.)

Similarly, the “least weird” languages also differ with respect to their use of evidential markers, which express the evidence a speaker has for his/her statement. For example, if a Turkish speaker witnesses a murder committed by a butler, he can say the literal counterpart of “The butler it did-di” (the object ‘it’ comes before the verb in Turkish, but that is irrelevant for the matter at hand). If the same Turkish speaker inferred the butler’s guilt from some indirect evidence or hearsay, he would say the literal counterpart of “The butler it did-miş”. Those two little bits at the ends of these sentences, -di and -miş, are called evidential markers. Other languages, such as Yukaghir, have a dedicated evidential marker only for indirectly inferred events. As with classifiers, the most common option is to not use evidential markers at all, but only three out of ten “most normal” languages fall into this category: Usan, Hungarian, and Hindi. Basque incorporates evidentiality marking into its tense system; Quechua, Ainu, and Purépecha use a verbal affix/clitic to mark evidentiality; while Cantonese and Chamorro both use a separate evidential particle (there is no information on Bororo).

The marking of predicative possession is another way in which even the “most normal” languages differ. As mentioned above, Hindi uses a locational construction to express predicative possession, an analog of ‘At the man there’s a dog’. A similar construction is also used in Hungarian. Only three of the top-10 “least weird” languages use the most cross-linguistically common pattern involving the verb ‘to have’ (cf. English The man has a dog): Basque, Quechua, and Ainu. Bororo and Usan both instantiate yet another option, explained in the corresponding WALS chapter as follows:

“the possessor NP… is construed as the topic of the sentence. As such, the possessor NP indicates the “setting” or “background” of the sentence, that is, the discourse frame which restricts the truth value of the sentence that follows it. Its function can thus be paraphrased by English phrases such as given X, with regard to X, speaking about X, as far as X is concerned, and the like.”

In other words, the abovementioned English sentence would be rendered in Bororo and Usan as ‘As far as the man is concerned, there is a dog’.

Finally, let’s consider some typological features pertaining to word order. When it comes to the order of major clause constituents, I will examine the relative ordering of subjects, objects, and verbs, unlike Schnoebelen, who considers only the relative placement of objects and verbs. While the most common option is SOV, many languages in the top-10 “most normal” list fall into different categories: Cantonese and Purépecha are SVO, Chamorro is VSO, and Hungarian has no dominant word order (its word order is heavily dependent on what is known from the preceding discourse and what is new to the conversation). As for the order of elements inside noun phrases, here too the top-10 “most normal” languages depart from the cross-linguistically most common patterns. For example, adjectives more commonly follow the noun they modify rather than precede it, yet six out of the ten languages (Quechua, Cantonese, Hungarian, Chamorro, Ainu, and Hindi) exhibit the less common “Adjective-Noun” pattern (as does English, as in clean water, rather than *water clean). Similarly, “Noun-Numeral” is the cross-linguistically more common order. Yet the majority of the top-10 “most normal” languages exhibit the opposite “Numeral-Noun” pattern, as does English, as in three dogs (rather than *dogs three). Thus, with respect to these last two features, although the top-10 “most normal” languages cluster together, they do not exhibit the expected pattern.

All of this goes to show that the choice of typological features predetermines which languages appear “weird” or “normal”, and thus the definition of “weirdness” proposed by Schnoebelen is as subjective and relative as the anglocentric one that he rejects to begin with.
