What Is Linguistic Complexity and How to Measure It?

Nov 19, 2015 by Asya Pereltsvaig

[Note to Readers: the following is my translation from Russian of Alexander Piperski’s lecture posted on PostNauka.ru, posted here with his permission.]

We all have some ideas about which languages are simpler and, conversely, which languages are more complex. If you ask a man on the street which languages are the most complex, you would usually get a standard set: the most difficult languages are Chinese, Korean, Japanese, Arabic. It’s clear that this is primarily because these are languages whose writing systems are unfamiliar to us. And it is also clear that this is absolutely not what linguists find interesting, because writing in general is secondary in relation to the spoken language. In addition, stereotypes about a given language’s complexity are often associated with the idea that languages closely related to ours are simple, while those that are not related to our language are complex. For example, a speaker of Russian would consider the Serbian language very simple. He can come to Serbia and within a week he would somehow begin to understand what is happening around him and to communicate. The Estonian language, by contrast, would be considered very difficult: nothing can be understood, and a Russian speaker cannot learn it in a week. But for the Finns, whose language is closely related to Estonian because it is also one of the Finno-Ugric languages, Estonian may seem simple and Serbian complicated, so the opinions [of a Russian speaker and a Finnish speaker] would be diametrically opposed.

Linguists are, of course, interested in some objective assessment of complexity. In general, it would be nice to find out whether there are actually simpler or more complex languages, regardless of how they are written and who learns them. Suppose a Martian came to our planet and had to learn several different languages in their oral form, since that is the primary form of language: would it be more difficult for him to learn Finnish or Serbian or Chinese or Hindi? That’s the question, in fact, that the linguistic study of linguistic complexity tries to answer.

This area of science is relatively young; it began to develop actively only in the last 20-25 years. Before that, linguists held as an axiom the notion that all languages are of equal complexity. It was in some ways helpful because it allowed linguists not to extol some languages over others, that is, not to make value judgments. But once the whole linguistic community had finally understood and accepted that all the 6,000-7,000 languages that exist on our planet are equivalent as objects of study, we were able to ask ourselves this question: “So which languages are more or less complex?”

How to measure linguistic complexity is not at all obvious and not completely understood. Here, linguistics uses ideas that come from information theory. The Russian mathematician Andrei Kolmogorov introduced a formal definition of complexity, which is now called “Kolmogorov complexity”. The complexity of an object is the length of the most economical description of the object in some formalized description language. I am, of course, simplifying, without going into the mathematical details of the wording, but that’s the way it works. For example, if we have the sequence of characters ABBVABBVBVBABA, then this sequence cannot be described in any economical way. If we have the sequence ABABABABABAB, it is easy to describe economically: AB six times. Accordingly, the first sequence is the more complex one and the second is the simpler one. But this does not apply very well to reality: in order to compare the grammars of natural languages this way, we would need to have such grammars, written for that hypothetical Martian and using some common principles, and it is obvious that such grammars do not exist. There are no clearly formalized description languages that would be applicable to all of the world’s languages, so we have to look for some correlates of language complexity that can be measured in order to determine which languages are more complex and which are simpler.
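The “AB six times” idea can be tried out in a few lines of Python. This is only a toy sketch: true Kolmogorov complexity cannot be computed in general, so the code below assumes a deliberately tiny description language in which a string may only be described as “unit repeated N times”. The function names are made up for this illustration.

```python
def shortest_repeat_description(s: str) -> tuple[str, int]:
    """Return (unit, count) such that s == unit * count, with the shortest unit."""
    n = len(s)
    for size in range(1, n + 1):
        # A unit of this size can only tile s if size divides the length.
        if n % size == 0 and s[:size] * (n // size) == s:
            return s[:size], n // size
    return s, 1  # only reached for the empty string


def description_length(s: str) -> int:
    """Size of the toy description: characters in the unit plus digits in the count."""
    unit, count = shortest_repeat_description(s)
    return len(unit) + len(str(count))


regular = "ABABABABABAB"
irregular = "ABBVABBVBVBABA"
print(shortest_repeat_description(regular))    # the unit "AB" and the count 6
print(shortest_repeat_description(irregular))  # the whole string, repeated once
print(description_length(regular), description_length(irregular))
```

Under this toy scheme the regular sequence gets a much shorter description than the irregular one, which is the sense in which the second sequence in the text is simpler than the first.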

There are many such correlates of complexity that we can find. Firstly, there is the diversity of elements. For example, if a certain language has 8 consonants, and some other language has 60 consonants, it is obvious that the consonant system of the first language is simpler than that of the second. Secondly, an important factor is the lack of one-to-one matching between form and meaning at the level of grammatical rules. For example, if the same grammatical form is built in ten different ways in a certain language, that is more complex than if it is built in just one way. In English the plural form of 99% or more of nouns is formed regularly using the same ending, whereas in German there are many different declension patterns. For example, the word Baum (‘tree’) has the plural Bäume, the word Vater (‘father’) has the plural Väter, and the word Land (‘land’) has the plural Länder. All this variety of plural formations, of course, leads to the conclusion that the plural of nouns in German is more complex than its counterpart in English.

Another correlate of complexity is the lack of one-to-one correspondence between form and meaning, not just at the level of grammar but at the level of a text, for example, if the same meaning is expressed in the text several times. This is how agreement works in languages. Take the English phrase the new car, whose plural is the new cars. Here, plurality is expressed once, in the ending of the noun. In Russian, plurality is expressed twice: in the ending of the adjective and in the ending of the noun. That is, Russian, in this sense, turns out to be more complex than English because there is no one-to-one correspondence between the meaning of plurality and its expression in the text.

Why is all this necessary? It is clear that language is a product of evolution. It is about 100,000 years old, and if all this were just excessive complexity, it would have been eliminated by now. On the contrary, such complexities sometimes emerge and persist. It turns out that the complexity of language is somehow beneficial for the speaker and for the listener. Different aspects [of complexity] are beneficial to different participants in communication. For example, a richer variety of elements allows the speaker to produce shorter texts. If a language has 8 consonants, then it usually has longer words than a language which has 60 consonants. A good illustration is the number system: if we write one and the same number in the binary system, which has only two symbols, and in the decimal system, which has ten, the decimal notation would usually be about three times shorter than the binary notation. That is, a variety of characters makes it possible to produce shorter texts. The same applies to the usual irregularity [in language]. If, for example, we go back to the English plurals and look at how irregular English plural forms are constructed, we would see that irregular forms are usually shorter than they would be if they were regular. Consider tooth: if its plural were something like tooths, it would have been one sound longer than teeth. Or take mouse: if it had the same kind of plural as house, the form mouses would have been longer than mice. Thus, irregularity is another way to make the text a little bit shorter.
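The “about three times shorter” figure for decimal versus binary notation is easy to check. The following is a minimal sketch; the helper name digits_in_base and the example number are arbitrary choices made for this illustration.

```python
import math


def digits_in_base(n: int, base: int) -> int:
    """Count the digits needed to write the non-negative integer n in the given base."""
    if base < 2:
        raise ValueError("base must be at least 2")
    count = 1
    while n >= base:
        n //= base
        count += 1
    return count


number = 1_000_000  # arbitrary example value
binary_len = digits_in_base(number, 2)    # 20 digits
decimal_len = digits_in_base(number, 10)  # 7 digits
ratio = binary_len / decimal_len          # a little under 3 for this number
limit = math.log2(10)                     # about 3.32, the ratio that large numbers approach
print(binary_len, decimal_len, round(ratio, 2), round(limit, 2))
```

The ratio varies from number to number, but for large numbers it settles near the base-2 logarithm of 10, which is where the rough factor of three comes from.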

The lack of one-to-one correspondence between form and meaning at the level of text produces some redundancy, and this can be very useful to the listener because, of course, interference constantly occurs in communication. For example, if we listen to the English phrase the new cars and there is some noise nearby at the moment of the last sound, we cannot tell what the [intended] number is, and there is nothing we can do about it. If we hear a Russian phrase such as novye mašiny ‘new[plural] cars’, we can still somehow understand what is happening because we also have the adjectival ending [-ye that expresses plurality], if we assume that the ending of the noun sounds minimally distinctive [i.e. if we assume that it sounds different from other possible adjectival endings, something that is not entirely true]. But even if we do not discern it, we can still understand everything. Thus, this redundancy makes language more complex, but it is beneficial to the listener.

A single quantitative measure of complexity has not been developed. You can, of course, take different parameters: count the number of sounds, the number of cases, the number of verb tenses and so on, and try to find some single weighted measure that would take all this into account and show which languages are the simplest and which are the most complex. Typically, that’s how it is done, although there are more non-trivial approaches. Scales of this kind still make it fairly easy to see which languages are simpler and which are more complex, at least to a first approximation (since the data about the number of cases, the number of sounds, etc. have already been collected). For example, the well-known American scholar Johanna Nichols has created such a rating. Somewhere close to the top of the complexity rating are, for example, the Akkadian language and the Mangarayi language spoken in Australia, and somewhere at the bottom of the complexity rating are such languages as the Mixtec language spoken in Mexico, the Nivkh language spoken in the Russian Far East, and Chinese. For a non-linguist this may be a little surprising because non-linguists are usually inclined to believe that since it has characters, Chinese must be very difficult. But generally speaking, if you look at the grammatical system [of Chinese], it is easy to understand why [it is considered to be simple]. In Chinese, there is practically no morphology, which is why Chinese appears low on the scale, and, generally speaking, if a Martian came here and began to learn Chinese without the characters, he would certainly learn it quickly enough.
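To make the idea of a single weighted measure concrete, here is a rough sketch. It is not Nichols’s method or anyone’s published metric: the scoring scheme (min-max normalisation followed by a weighted average), the function name, the feature names, and all the numbers are assumptions invented for this illustration, and the three “languages” are placeholders rather than real data.

```python
def complexity_scores(features, weights):
    """Weighted average of min-max-normalised feature counts, one score per language."""
    names = list(weights)
    lo = {f: min(lang[f] for lang in features.values()) for f in names}
    hi = {f: max(lang[f] for lang in features.values()) for f in names}
    total_weight = sum(weights.values())
    scores = {}
    for language, counts in features.items():
        score = 0.0
        for f in names:
            span = hi[f] - lo[f]
            # Rescale each count to the 0..1 range so features are comparable.
            normalised = (counts[f] - lo[f]) / span if span else 0.0
            score += weights[f] * normalised
        scores[language] = score / total_weight
    return scores


# Hypothetical placeholder inventories, not measurements of any real language.
toy = {
    "Language A": {"consonants": 8, "cases": 0, "tenses": 2},
    "Language B": {"consonants": 60, "cases": 10, "tenses": 6},
    "Language C": {"consonants": 34, "cases": 5, "tenses": 4},
}
equal_weights = {"consonants": 1.0, "cases": 1.0, "tenses": 1.0}
scores = complexity_scores(toy, equal_weights)
for language, score in sorted(scores.items(), key=lambda item: item[1]):
    print(language, round(score, 2))
```

The choice of features and weights is exactly where such scales become debatable: a different weighting can reorder the ranking, which is one reason no single agreed measure exists.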

Just to rank languages by complexity is, of course, not very interesting. It is more interesting to correlate complexity with other parameters. And recent studies show that the complexity of a language, this absolute linguistic complexity, is closely linked to the social situation in which the language has existed, that is, to sociolinguistics. It turns out that simpler languages are usually languages with larger numbers of speakers, languages of inter-ethnic communication, while the more complex languages are exactly the languages that have fewer speakers, languages whose range of speakers is limited. When linguists tried to understand why that is so, it was suggested that the explanation lies in the fact that widely spoken languages are learned by many people as adults. If it is a language like English, many of us must learn it not as children but later in life. And if it turns out that such people [i.e. adult learners] have not learned something properly (because adults usually do not learn complex linguistic phenomena very well and their period of acquiring a mother tongue has long ended), and if these adults go on to transmit the language to their children, then the language is passed on in a simplified form, and this is precisely how major languages of international communication get simplified. By the way, this contradicts the traditional idea that languages with a developed culture are complex and developed, whereas the tongues of a single village are something primitive and simple. In fact, this is usually not the case. It is actually languages spoken in one village that are typically more complex in their structure than the languages of big nations.

This is just one of the problems facing the study of linguistic complexity: how is the complexity associated with other parameters? But the most pressing problem is probably the question of how to measure complexity. And linguists do not yet have an answer to this question, but it may very well be that as we study this problem using the methods of other sciences, we will have an answer after all.
