
On the web site of the Proceedings of the National Academy of Sciences, in the "Early Edition" section, is an article by Mark Pagel, Quentin D. Atkinson, Andreea S. Calude, and Andrew Meade: "Ultraconserved words point to deep language ancestry across Eurasia". The authors claim that a set of 23 especially frequent words can be used to establish genetic relationships of languages that go way, way back — too far back for successful application of the standard historical linguistics methodology for establishing language families, the Comparative Method. The idea is that, once you've determined that these 23 words are super-stable (because they're used so often), you don't need systematic sound/meaning correspondences at all; finding resemblances among these words across several language families is enough to prove that the languages are related, descended with modification from a single parent language (a.k.a. proto-language).

This is the latest of many attempts to get around the unfortunate fact that systematic sound/meaning correspondences in related languages decay so much over time that even if the words survive, they are unrecognizable as cognates (sets of words descended from the same word in the parent language). This means that word sets that have similar meanings and also sound similar after 15,000 years are unlikely to share those similar sounds as the result of inheritance from a common ancestor; if they were really such ancient cognates, they would almost surely not look much alike at all. (See "Scrabble tips for time travelers", 2/26/2009, for a discussion of some earlier work.)

I'm not qualified to judge Pagel et al.'s statistics, although I remain skeptical of their basic claim that words that haven't been replaced often in a handful of language families with vastly different time depths can be predicted to be super-stable in all language families. But there are problems with their premises in this article, in which their goal is to compare words from seven different language families and to show that, according to their statistics, all seven should be grouped together into a single super-family. I think they have a serious garbage in, garbage out problem.

Pagel et al. used their statistical method to compare reconstructed words for the seven language families they identify: Altaic, Chukchi-Kamchatkan, Dravidian, Eskimo, Indo-European, Kartvelian, and Uralic. One problem is that Eskimo is not a language family; it's part of the Eskimo-Aleut language family, and any effort to find deeper genetic relationships for Eskimo that doesn't take Aleut data into account is not likely to be useful.

A more serious problem is that Altaic is at best highly controversial as a proposed language family. The hypothesized Altaic family comprises three well-established families — Turkic, Mongolian, and Tungus — plus Korean and Japanese. It's a very old idea, but efforts to provide convincing evidence that all these languages belong in a single Altaic family have failed to convince most specialists. A prominent recent exchange appeared in the journal Diachronica (2004, 2005), starting with Stefan Georg's devastating review of Sergei Starostin et al., Etymological dictionary of the Altaic languages, and continuing with Starostin's reply and Georg's reply to the reply. In his reply, Starostin commented plaintively that he had hoped `that the publication of more than 2000…Altaic etymologies would put an end' to the dispute about whether an Altaic language family exists. To this Georg responds, though not in these words, that 2000+ unconvincing etymologies do not add up to any convincing etymologies at all.

In his review, Georg criticizes Starostin et al. for erroneous reconstructions of words in the individual language families and for a very loose standard of semantic "matching". The latter may be the most common criticism of word comparisons in efforts to establish very distant linguistic relationships; the other major criticism is a very loose standard of phonetic "matching". Given enough semantic and phonetic latitude, it's possible to amass a large number of "matching" sets of words for any set of two or more randomly selected languages. (If you don't believe me, try it: take bilingual dictionaries and search for similar-looking words that have vague semantic connections. It's an easy exercise.)
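The exercise can also be imitated with made-up data. What follows is a toy sketch, not anyone's actual method: it builds two unrelated "languages" out of random consonant-vowel-consonant words, then counts "matches" under a strict standard and under a loose one. Everything in it is an assumption made for illustration: the consonant classes, the word shape, the lexicon size of 200, and the crude modeling of semantic latitude as a window of neighboring meaning slots.

```python
import random

# Hypothetical consonant classes: under the loose standard, any two
# consonants in the same class count as a phonetic "match".
CLASSES = ["pbfvm", "tdsznl", "kgxh", "rwj"]
CONSONANTS = "".join(CLASSES)
VOWELS = "aeiou"


def sound_class(consonant):
    """Return the index of the class that contains this consonant."""
    for index, group in enumerate(CLASSES):
        if consonant in group:
            return index
    raise ValueError(consonant)


def random_word(rng):
    """Make a random consonant-vowel-consonant word."""
    return rng.choice(CONSONANTS) + rng.choice(VOWELS) + rng.choice(CONSONANTS)


def random_lexicon(rng, n_meanings):
    """One random word per meaning slot; slot i stands for meaning i."""
    return [random_word(rng) for _ in range(n_meanings)]


def strict_match(a, b):
    """Both consonants must be identical."""
    return a[0] == b[0] and a[2] == b[2]


def loose_match(a, b):
    """Only the first consonants must fall in the same class."""
    return sound_class(a[0]) == sound_class(b[0])


def count_matches(lex_a, lex_b, match, window):
    """Count words in lex_a that match some word in lex_b whose meaning
    slot lies within `window` slots (a stand-in for semantic latitude)."""
    n = len(lex_a)
    hits = 0
    for i, word in enumerate(lex_a):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        if any(match(word, lex_b[j]) for j in range(lo, hi)):
            hits += 1
    return hits


rng = random.Random(0)  # fixed seed so the run is repeatable
lex_a = random_lexicon(rng, 200)
lex_b = random_lexicon(rng, 200)

strict = count_matches(lex_a, lex_b, strict_match, window=0)
loose = count_matches(lex_a, lex_b, loose_match, window=3)
print("strict matches:", strict, "of", len(lex_a))
print("loose matches: ", loose, "of", len(lex_a))
```

The two lexicons are unrelated by construction, so every match is an accident. The only thing that changes between the two counts is how much latitude the matcher is given.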

So I went to the website from which Pagel et al. got their data, the Languages of the World Etymological Database, and checked their 23 words in the Altaic database, which is presumably derived from Starostin et al.'s three-volume etymological dictionary. Only two of the 23 words have a single "Proto-Altaic" etymon each in the database, `what' and `spit (verb)'. All the others (except perhaps `I', `we', and `ye', which I couldn't find due to problems with the search function) have 2-7 "Proto-Altaic" forms each, and at least nine of the words have five or six each. How did Pagel et al. decide which "Proto-Altaic" word to compare to their other six reconstructed proto-languages? They apparently examined all of the possible words for each translation, e.g. five "Proto-Altaic" words for `that', four for `hear', five for `flow', four for `hand', and so forth; they then chose just one proto-word for each meaning, namely, the one `that the LWED proposed as cognate between language families', and used that one for their statistical analyses. This is a puzzling procedure, for two reasons. First, the Altaic database (and the Indo-European database too, and perhaps others as well) often lists more than one proto-word as cognate with words in some of the other six proposed language families. Pagel et al. do not say how they decided which set of putative cognates to select. Second, while acknowledging that linguists often `propose more than one proto-word for a given meaning', they observe that these proposals `can reflect synonyms in the proto-language or, more likely, uncertainty as to which of the words used among a language family's extant languages are most likely to be cognate to the ancestral word.' But if they believe (erroneously!) that synonyms are unlikely in proto-languages, and that apparent synonyms probably reflect linguists' uncertainty, how can they be confident that any selection from one of several options for a given meaning for the proto-language is the genuine one and only word for that meaning in the proto-language? What does this indeterminacy do to their claim that words for certain meanings are super-stable, unlikely to be replaced over thousands of years? And doesn't it introduce an element of circularity into their statistical calculations when they choose the set of proto-words to be compared according to its putative match with other language families and not according to an independent criterion?
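A back-of-the-envelope calculation shows why the number of candidate proto-words matters. This is a sketch under loudly stated assumptions, not a reconstruction of Pagel et al.'s statistics: it supposes that each candidate has the same probability of resembling a word in another family purely by accident, and that the candidates are independent of one another. The value 0.1 is a hypothetical figure chosen only for illustration.

```python
def p_any_chance_match(p_single, n_candidates):
    """Probability that at least one of n independent candidates matches
    by accident, when each does so with probability p_single."""
    return 1 - (1 - p_single) ** n_candidates


P_SINGLE = 0.1  # hypothetical per-candidate chance of an accidental match

for n in (1, 2, 5, 7):
    print(n, "candidate(s):", round(p_any_chance_match(P_SINGLE, n), 3))
```

Under these assumptions the chance of finding some accidental match rises steadily with the number of candidates. If the candidate that gets used is then the one that matched, the selection step itself is doing part of the work.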

There are other serious problems too. Unlike Altaic, most of the other families in the LWED databases are genuine language families. But if the "Proto-Altaic" reconstructions are representative of the quality of the reconstructions for the established families, it would be rash to rely on them. This is in spite of the fact that some of the reconstruction databases (e.g. Indo-European and Dravidian) are based on standard etymological dictionaries. The "Altaic" database contains variables in numerous reconstructions — usually V for an unspecified vowel, but also optional and alternate consonants — that make phonetic "matching" even easier (and therefore less reliable). This is a feature of many reconstructions carried out by people engaging in long-range comparison of languages, including efforts to establish a Nostratic super-family. In at least some of the individual LWED databases, the reconstructions based on standard sources have been `revised and significantly modified' (quoting George Starostin, Dravidian database) by others, and those others are believers not only in Altaic but in the super-family Nostratic. Reconstructions carried out by true believers in Nostratic are all too likely to be influenced by knowledge of words with vaguely similar meanings and/or forms in other proposed Nostratic languages — namely, in the LWED databases, the seven families compared by Pagel et al.

I also checked Pagel et al.'s supposedly super-stable words in the LWED's Indo-European (IE) database. One notable fact is that, of these 23 words, English retains only 6 or 7, assuming that the LWED's database is accurate. That fact might be expected to limit Pagel et al.'s confidence in the reliability of their 23 words as an indicator of genetic relatedness. The count for English depends in part on whether the IE database has accurate reconstructions: `spit' in particular is dubious, because this IE database disagrees with the Oxford English Dictionary (OED) here and the sounds don't match well enough to be convincing. I haven't checked all of the relevant LWED etymologies, but it looks as if there's a reasonable Proto-Indo-European etymology for the English words give, man, mother, fire, flow, and worm, in their current meanings.

The IE database has a sizable number of eyebrow-raising etymologies; like the database for "Altaic", it does not inspire confidence, although there is of course no question about the relatedness of the IE languages. There are many variables in the reconstructions, and many of the forms themselves bear little resemblance to mainstream Indo-Europeanists' reconstructions. The semantic looseness is often extreme. For instance, the database glosses a reconstructed form *(a)den@gh- (where @ = schwa) as `to reach, to seize, to have time'. Among the proposed descendants of this form are a Tocharian B form meaning `rise, raise oneself up', an "Old Indian" (Sanskrit?!) form meaning `reach, strike', an "Old Greek" (Ancient Greek?!) form meaning `with the teeth, biting together', and an Old Irish form meaning `repress, oppress, suppress, crush, put down'. This is typical of the semantic latitude. Formally, too, there are problems. The proposed "Old Indian" descendant of this proto-word is given as daghnoti, possibly on the assumption that the nasal of the reconstructed root metathesized with the gh; but the nasal of the Sanskrit form is a present tense suffix, not part of the root at all. So Sanskrit (by whatever name) doesn't match the database's proto-word phonetically.

If the reconstructions used by Pagel et al. for their statistical analyses are not reliable in either form or meaning, then the statistical results of comparing these reconstructions cannot provide any evidence for distant relationships among the seven groups they compare. If the selection procedure for choosing among several candidate proto-words to use for the statistical analysis is flawed, then there may be problems with the statistics as well. But even if there are no statistical flaws, the Pagel et al. paper is yet another sad example of major scientific publications accepting and publishing articles on historical linguistics without bothering to ask any competent historical linguists to review the papers in advance.

There is a larger moral here too. Early in their paper, Pagel et al. report, correctly, that after 5,000-9,000 years, `most words are thought to suffer from too much semantic and phonetic erosion to allow secure identification of true cognates', in particular (though they don't emphasize this point) because of the decay and loss of `the sound and meaning correspondences…which are thought to indicate that they derive from common ancestral words.' The authors intend their statistical method to provide evidence for relatedness of languages that are beyond the reach of the Comparative Method. Like other long-rangers with dreams of discovering bigger and bigger family groupings — maybe even the ur-human language, what the late Joseph Greenberg called Proto-Sapiens — Pagel et al. believe that abandoning the one method that is known (not just "thought") to be reliable can achieve the goal. But you still can't make a silk purse out of a sow's ear.

Update — also see Asya Pereltsvaig and Martin Lewis, "Do 'Ultraconserved Words' Reveal Linguistic Macro-Families?", GeoCurrents 5/10/2013.
