Simple questions often yield complex answers. For instance: what is the difference between a language and a dialect? If you ask this of a linguist, get comfortable. Despite the simplicity of the query, there are a lot of possible answers.

The distinction might depend on one’s point of view. From a political perspective, a language is simply that which is standardly spoken by a nation. From about 1850 to 1992, for instance, there was a language known as Serbo-Croatian, which had several dialects including Serbian, Croatian and Bosnian. But since Yugoslavia dissolved into several independent countries in the early 1990s, those dialects have come to be recognised as distinct languages. This political definition works to some extent, though it creates more problems than it solves: there are languages that extend across different countries, notably Spanish in Latin America. Nobody would claim that Mexican Spanish and Colombian Spanish are different languages. Perhaps Spanish, as spoken in some parts of Spain, is different enough from the Latin American varieties that it deserves to be called a separate language, but that isn’t clear.

Perhaps the distinction between language and dialect can be made in terms of mutual intelligibility? Unfortunately, there are immediate problems with this approach. A Dane will understand Swedish somewhat better than a Swede will understand Danish. Similarly, someone speaking a peculiar, rural dialect of British English will understand an American from Los Angeles far better than the other way around. Mutual intelligibility often depends on exposure, a fairly uncontrollable variable, rather than anything intrinsic to language.

So perhaps we need to take a more purely linguistic approach. Imagine that we could measure a difference, D, between two speech varieties in a systematic way. Then we could let a certain value of D define the cut-off between what would be two dialects and two languages. Such a measure should be attainable since there are lots of things to compare between two languages, such as their sound inventories, grammatical characteristics or lexicon.

But what if the differences between speech varieties are gradual, such that the probability of finding a given value of D is as high as finding some other value? We would then have to choose an arbitrary value of D as our cut-off point, and an arbitrary value would throw us back into considerations of a political or practical nature, where we don’t want to go. Do we want our cut-off to lie at a level where Serbian and Croatian are the same or different languages? If we want to catalogue the world’s languages, how many thousand languages do we want to pigeon-hole: 5,000? Or 7,000? Or maybe 10,000?

Recently, two major obstacles in distinguishing language from dialect have been overcome. The first is how to measure differences between speech varieties – finding a value for D. In 2008, a number of linguists came together to form the Automated Similarity Judgment Program (ASJP), of which I am the daily curator and a founder. The ASJP painstakingly assembled a systematic, comparative dataset of languages that now contains 7,655 wordlists from what would be two-thirds of the world’s languages, if we assume for our purposes that languages are defined as in the ISO 639-3 code standard. Since each wordlist contains a fixed set of 40 concepts and is transcribed in a uniform manner, the wordlists can easily be compared, and a measure of difference can be obtained. The measure of the difference between two words that has become most used is a version of the Levenshtein distance, named after Vladimir Levenshtein, a Soviet computer scientist who in 1965 devised an algorithm to compare two strings of symbols. He defined ‘distance’ as the number of substitutions, insertions and deletions needed to turn one string into the other. The Levenshtein distance can usefully be divided by the length of the longer of the two strings, because this puts all the distances on a scale from 0 to 1. This has become known as the normalised Levenshtein distance, or LDN.
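For readers who like to see the arithmetic, here is a minimal sketch of the two measures just described: the Levenshtein distance and its normalised form, LDN. The function names and the example strings are my own illustrations and are not taken from the ASJP software or data; the code compares plain character strings, whereas real wordlists use a uniform phonetic transcription.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of substitutions, insertions and deletions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 symbols of a
    # and the first j symbols of b.
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(
                previous[j] + 1,         # delete symbol_a
                current[j - 1] + 1,      # insert symbol_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string,
    giving a value between 0 (identical) and 1 (nothing in common)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # assumption: two empty strings count as identical
    return levenshtein(a, b) / longer


# Invented example strings, not ASJP transcriptions.
print(levenshtein("kitten", "sitting"))  # 3
print(round(ldn("kitten", "sitting"), 3))  # 0.429
```

Turning "kitten" into "sitting" takes two substitutions and one insertion, so the distance is 3; dividing by 7, the length of the longer string, gives an LDN of about 0.43.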

The second obstacle is that perhaps ‘language’ and ‘dialect’ are concepts that can be defined only arbitrarily. Here, there is some more promising news. If we look at all the language families in the ASJP database for which database contributors have included a healthy portion of close varieties, we can begin to look for different behaviours of languages and dialects. An intriguing picture emerges: the distances tend to hover around either a relatively small value or a relatively large one, with a valley in between. As it turns out, the valley tends to lie in a narrow range around a mean of 0.48 LDN. Without losing significant precision, we can say that speech varieties tend to not be halfway similar in their basic vocabulary. Either they will tend to be more similar, in which case they can be defined as different dialects, or less similar, in which case they can be defined as different languages. Herein lies the distinction between language and dialect.

The phenomenon is probably a result of social circumstance. Dialects will drift apart as people settle in new places and shape new identities but, if there is still some contact, convergence can also be present so that speech varieties remain more than halfway similar (and therefore the same language). A small push in the direction of divergence, however, might cause the varieties to drift apart relatively rapidly, raising their Levenshtein distance, thereby qualifying them as distinct languages. Possibly there is a connection between the cut-off for distances between words on the standard list used by ASJP and corresponding distances in other parts of language structure that make for a point of serious loss of mutual intelligibility. In other words, the threshold for mutual intelligibility might correlate with the threshold between languages and dialects. We don’t know that yet, but it’s something to look into.

Having come up with an objective and non-arbitrary criterion for separating languages from dialects, we can apply it to the world’s languages. Some pairs of speech varieties that are considered national languages, such as Bosnian and Croatian, fall way below the cut-off of LDN = 0.48 (the same language, regardless of Yugoslavia’s existence). Some fall not far below it, such as Hindi and Urdu (still the same language, though by a smaller margin). And varieties of Arabic and Chinese, both of which are often thought of as single languages, soar above LDN = 0.48 (the varieties are themselves different languages). There are also a few pairs of varieties that are normally considered distinct languages but which are on the borderline: Danish and Swedish, for instance, score LDN = 0.4921.
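The rule applied in this paragraph is simple enough to write down in a few lines. The sketch below is only an illustration: the function name is hypothetical, the treatment of a value lying exactly on the cut-off is my assumption, and the only real figure used is the Danish and Swedish score quoted above.

```python
CUTOFF_LDN = 0.48  # mean position of the valley reported in the text


def classify(distance: float, cutoff: float = CUTOFF_LDN) -> str:
    """Label a pair of speech varieties from their LDN value."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("an LDN value must lie between 0 and 1")
    # Assumption: a value exactly on the cut-off counts as 'languages'.
    return "dialects" if distance < cutoff else "languages"


# Danish and Swedish, using the score quoted in the text.
print(classify(0.4921))  # languages

# Placeholder values chosen only to show both outcomes; not measurements.
print(classify(0.20))  # dialects
print(classify(0.80))  # languages
```

A score such as 0.4921 lands on the 'languages' side, but only just, which is why pairs like this are best described as borderline rather than settled.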

Finally, a technique derived from the datasets, called ASJP chronology, can be applied to establish the amount of time it takes for dialects to drift far enough apart to qualify as separate languages. The answer we have found, ignoring some margin of error, is 1,059 years. These findings can be corroborated by looking at how long it typically takes for an ancestral language of a language family to break up into daughters that subsequently become ancestors of subfamilies. This requires other techniques, but the results are similar: it takes about a millennium for dialects to become languages. We know this because we can now distinguish the two.

Søren Wichmann is a Danish linguist affiliated with Leiden University in the Netherlands, Kazan Federal University in Russia, and Beijing Language University in China.

This article was originally published at Aeon and has been republished under Creative Commons.