We hypothesize that language structures are subjected to different evolutionary pressures in different social environments. Just as biological organisms are shaped by ecological niches, language structures appear to adapt to the environment (niche) in which they are being learned and used. The proposed Linguistic Niche Hypothesis has implications for answering the broad question of why languages differ in the way they do and makes empirical predictions regarding language acquisition capacities of children versus adults.

We conducted a statistical analysis of >2,000 languages using a combination of demographic sources and the World Atlas of Language Structures, a database of structural language properties. We found strong relationships between linguistic factors related to morphological complexity and demographic/socio-historical factors such as the number of language users, geographic spread, and degree of language contact. The analyses suggest that languages spoken by large groups have simpler inflectional morphology than languages spoken by smaller groups, as measured on a variety of factors such as case systems and complexity of conjugations. Additionally, languages spoken by large groups are much more likely to use lexical strategies in place of inflectional morphology to encode evidentiality, negation, aspect, and possession. Our findings indicate that just as biological organisms are shaped by ecological niches, language structures appear to adapt to the environment (niche) in which they are being learned and used. As adults learn a language, features that are difficult for them to acquire are less likely to be passed on to subsequent learners. Languages used for communication in large groups that include adult learners appear to have been subjected to such selection. Conversely, the morphological complexity common to languages used in small groups increases redundancy, which may facilitate language learning by infants.

Languages differ greatly both in their syntactic and morphological systems and in the social environments in which they exist. We challenge the view that language grammars are unrelated to social environments in which they are learned and used.

Funding: GL was supported by an Integrative Graduate Education and Research Training (IGERT) award to the Institute for Research in Cognitive Science, University of Pennsylvania. RD was supported by National Science Foundation BCS-0720322 and BCS-0826825. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

Although the largest languages are spoken by millions of people spread over vast geographic areas, most languages are spoken by relatively few individuals over comparatively small areas. The median number of speakers for the 6,912 languages catalogued by the Ethnologue is only 7,000, compared to the mean of over 828,000 [1]. Similarly, for the 2,236 languages in our sample (Figure 1), the median area over which a language is spoken is about the size of Luxembourg or San Diego, California (948 km²). The mean area is about the size of Austria or the US state of Maryland (33,795 km²). Languages also differ dramatically in the ratio of individuals who speak the language natively (L1 speakers) to those who learned it later in life (L2 speakers) (Table S1). Although there are numerous counter-examples (Text S1), languages spoken by millions of people have a greater likelihood of coming into contact with other languages and of having numerous nonnative speakers compared to languages spoken by only a few thousand people. This is not surprising: a language spoken by more people is more likely to encompass a larger and more diverse area and include speakers from varying ethnic and linguistic backgrounds. Conversely, languages spoken by a thousand or even fewer individuals tend to be spoken in highly circumscribed locales (Text S2). Overall, languages with smaller speaker populations are more likely to be spoken by more socially cohesive groups [2] than languages that have millions of speakers.
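The gap between the median and the mean number of speakers reflects a strongly right-skewed distribution: a handful of very large languages pull the mean far above the typical value. The following minimal sketch illustrates the effect. The speaker counts and the helper name summarize are hypothetical and chosen only for illustration; they are not Ethnologue data and are not part of the analyses reported here.

```python
from statistics import mean, median


def summarize(speaker_counts):
    """Return (median, mean) for a list of speaker counts."""
    return median(speaker_counts), mean(speaker_counts)


# Hypothetical speaker counts (NOT Ethnologue data): four small languages
# and one very large language.
hypothetical_counts = [500, 2_000, 7_000, 50_000, 4_000_000]

med, avg = summarize(hypothetical_counts)
print(f"median = {med:,}, mean = {avg:,}")
```

With these made-up values the median is 7,000 while the mean is 811,900, so a single large language is enough to separate the two statistics by two orders of magnitude.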

Just as there are socio-historical and demographic differences among the world's languages, there are also vast differences among languages in morphology and syntax [3]. For example, languages differ in the devices used to convey syntactic relations: who did what to whom. Some languages rely on a fixed word order (Subject-Verb-Object in the case of English), while other languages (e.g., German, Polish) allow much more flexibility in word order and rely on case markings to signal which noun fills the role of subject, object, etc. [4]. More generally, languages differ in the amount of information conveyed through inflectional morphology compared to the amount of information conveyed through non-morphological devices such as word order and lexical constructions. For example, compare the morphological marking of aspect in Russian “Ya vypil chai” (I PERFECTIVE+drank tea) to the English lexical strategy, “I finished drinking the tea.” Some other domains exhibiting such differences between lexical and morphological strategies include tense, aspect, evidentiality, negation, plurality, and expressions of possibility.

Languages with richer morphological systems are said to be more overspecified [5]–[7]. For instance, of the languages that encode the past tense inflectionally, about 20% have past tenses that explicitly mark remoteness distinctions. For example, Yagua, a language of Peru, has inflections that differentiate 5 levels of remoteness. A verb denoting an event that happened only a few hours ago takes the suffix -jásiy; an event that happened a day before the utterance requires a different suffix, -jay; an event that occurred a week to a month ago, a still different suffix, -siy, etc. [8]. Of course, languages without these grammatical distinctions can express them lexically, as in English: “I broke my foot a few years ago.” On the other hand, when semantic distinctions are encoded grammatically, speakers are generally obligated to make them [9]; hence, a sentence concerning the past will have its remoteness specified even when it may not be relevant to the discourse. In the English example above, speakers have the option to omit remoteness information, but are obligated to express the grammatically encoded past tense (which leaves remoteness to context). In Mandarin or Thai, which express both tense and remoteness lexically, speakers have the option of omitting the past tense entirely. Of the 222 languages in our corpus for which tense information is available, 40% do not encode past tense inflectionally [10].

The degree and specificity of morphological encoding can reach astounding levels. For example, Karok, a language of N.W. California, has morphological suffixes for forms of containment: pa:θ-kirih “throw into fire”, pa:θ-kurih “throw into water”, pa:θ-ruprih “throw in through a solid” (the affixes are unrelated to the lexemes for water, fire, etc.) [11]. Clearly, such elaboration does not arise from communicative necessity. Researchers have long been puzzled by the reasons why some languages abound in such overspecification, while others (sometimes closely related ones) eschew it. For example, in comparing English and German we find that where the surface structures of English and German contrast, English is less specified, leaving more to context [6]; thus, “…German speakers are forced to make certain semantic distinctions which can regularly be left unspecified in English” ([6], p. 28). For example, German obligatorily specifies the direction of motion in the place adverbs here/there/where. Compare: hier/her; dort/hin; wo/wohin. English can specify direction using to and from (“where to” versus “where from”), but such specification is optional and is generally omitted [12], [9]. Grammatical divergence between languages has typically been attributed to drift: as a population speaking an ancestral Germanic language splits into separate groups, their language gradually diverges, with one branch becoming English and the other German [13]. Such accounts do not explain why English came to shed much of its morphology while German retained it.

Attempts to establish relationships between social and linguistic structure date back at least a century [14]–[16]; see [17] for a review. Recent work has provided some support for the idea that extralinguistic factors (e.g., degree of ecological risk) play a role in some aspects of language, such as varying levels of linguistic diversity in different parts of the world [18], [19]. A number of researchers have investigated correlations between social environments and the phonological structure of languages [20]–[22] and, intriguingly, have also found correlations between physical aspects of the environment, such as temperature, and phonological inventories [23], [24]. It has also been argued that the physical environment [25] and historical developments that affect language transmission can shape the syntactic and morphological structure of languages [2], [5], [26], [27].

Languages with histories of adult learning have been argued to be morphologically simpler, less redundant, and more regular/transparent [2], [7], [28]–[30]. This argument has been made most forcefully and convincingly for Creole languages [26], but it has been speculated that in any situation in which a language is learned by a substantial number of adults, it becomes simplified due to the “lousy language learning abilities of the human adult” [28]. The evidence for such linguistic simplification has been largely descriptive, consisting of selected examples and grammatical inventories of small numbers of languages [17], [14], [29], [7], [5]. Thus, at present, there is little convincing evidence of global relationships between linguistic structure and non-linguistic factors and limited theoretical frameworks within which to understand such relationships [e.g., 20 for the case of phonological inventories]. An additional limitation of previous work is that it fails to explain why morphological complexity and grammatical overspecification arise in the first place. That is, why aren't all languages as morphologically simple as those that have been argued to be heavily shaped by adult learning, e.g., English [12]?

The primary goal of the present work is to examine whether non-spurious relationships exist between social and linguistic structure by using large-scale demographic and linguistic databases. A secondary goal is to provide a tentative framework within which to understand the reported results—the Linguistic Niche Hypothesis—which provides a nomothetic account for understanding relationships between linguistic and social structure (Text S3).

In assessing the relationship between social and linguistic structure, it is useful to distinguish two main contexts (niches) in which languages are learned and used: the exoteric and the esoteric [2], [31]. The exoteric linguistic niche contains languages with large numbers of speakers, which must therefore serve as interfaces for communication between strangers. In reality, the esoteric and exoteric niches form a continuum and are represented as such in our analyses (see also Text S4). Compared to speakers of esoteric languages, speakers of languages in the exoteric niche are more likely to (1) be nonnative speakers or have learned the language from nonnative speakers, and (2) use the language to speak to outsiders, that is, individuals from different ethnic and/or linguistic backgrounds. The exoteric niche includes languages like English, Swahili, and Hindi, while the esoteric niche includes languages like Tatar, Elfdalian, and Algonquin.