I have a new paper out today in the Proceedings of the Royal Society B (Biology). It’s based on a project that I’ve been working on for some time now (most of it, though not this paper, is joint work with my student Tyler Lau). I’ve been aware of the Tasmanian wordlist data in Plomley (1976) for many years, of course, but it was only after getting more familiar with computational phylogenetics that ideas for work with the dataset came up.*

The paper has some (now) fairly standard phylogenetic analyses using tools that will be familiar to most people who know this area. NeighborNets are now a common sight in historical linguistics, and Bayesian frameworks for tree building are also increasingly well known (if not always accepted). But admixture models are less well known, so let me explain a bit about them here. One criticism commonly leveled at work that straddles evolutionary biology and linguistics is that the tools are adopted wholesale, without regard for whether they are appropriate for analyzing linguistic data; I would like to head off that criticism here.

STRUCTURE is a clustering algorithm (and associated software). It’s designed to solve the problem of how to assign individuals to groups when we don’t know what features characterize each group. For example, say we have a new wordlist of a European language. It’s easy for us to tell whether that wordlist is of a language we know already (French, German, English, Russian, etc.), or whether it comes from a language we’ve never seen before. That’s because we have a lot of high-quality, independent data from those languages. In the case of Tasmanian data, however, we don’t have those independent sources. We don’t know in advance how to assign wordlists to languages.

And it gets more complicated. Not only do we not know how many languages are represented in the dataset, we also have a suspicion that at least some of the wordlists may contain words from more than one language. For example, the “Norman” vocabulary represents material recorded over a number of years from many different people. Other lists were recorded on the Flinders Island mission, where there were Tasmanians from all over the island. That means that similarities between two wordlists could be due to them belonging to the same language, or it could be that only part of the list belongs to the same language.

This is where STRUCTURE comes in. We need a way to simultaneously infer the number of meaningful groups represented in the data, and what the signals of each group are, and STRUCTURE provides that. STRUCTURE was designed to work with genetic data; it uses allele frequencies at unlinked loci to assign individuals to populations. What does that mean? Genes have variants, called alleles. Some variants are much more common in some populations than others, and so we can use the information on those frequencies to work out what the frequency signatures of different groups look like if we look at lots of different genes.
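
To make the frequency idea concrete, here is a toy sketch in Python. It is not STRUCTURE and not the code used in the paper: STRUCTURE estimates the cluster frequencies and the memberships jointly, whereas this sketch assumes the frequencies are already known. The clusters "A" and "B", the three loci, and every number are invented for illustration. The first function gives the probability that an individual belongs wholly to each cluster; the second uses a simple EM loop to estimate what share of a mixed individual comes from each cluster.

```python
import math

# Invented variant frequencies for two hypothetical clusters at three loci.
# FREQS[cluster][locus][variant] = frequency of that variant in that cluster.
FREQS = {
    "A": [{1: 0.9, 2: 0.1}, {1: 0.8, 2: 0.2}, {1: 0.7, 2: 0.3}],
    "B": [{1: 0.2, 2: 0.8}, {1: 0.1, 2: 0.9}, {1: 0.3, 2: 0.7}],
}


def assignment_posterior(individual, freqs):
    """Probability of each cluster for one individual, assuming no admixture
    and a uniform prior over clusters. None marks a missing observation."""
    log_lik = {}
    for cluster, loci in freqs.items():
        log_lik[cluster] = sum(
            math.log(loci[i][v]) for i, v in enumerate(individual) if v is not None
        )
    top = max(log_lik.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(value - top) for c, value in log_lik.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def admixture_proportions(individual, freqs, iterations=200):
    """EM estimate of the share of an individual's loci drawn from each
    cluster, with the cluster frequencies held fixed."""
    clusters = list(freqs)
    q = {c: 1.0 / len(clusters) for c in clusters}
    observed = [(i, v) for i, v in enumerate(individual) if v is not None]
    if not observed:
        return q  # nothing to learn from; keep the uniform starting point
    for _ in range(iterations):
        counts = {c: 0.0 for c in clusters}
        for i, v in observed:
            # Responsibility of each cluster for this locus.
            resp = {c: q[c] * freqs[c][i][v] for c in clusters}
            norm = sum(resp.values())
            for c in clusters:
                counts[c] += resp[c] / norm
        q = {c: counts[c] / len(observed) for c in clusters}
    return q


posterior = assignment_posterior([1, 1, 2], FREQS)
mixture = admixture_proportions([1, 2, 1], FREQS)
```

An individual carrying the variants that are common in cluster "A" gets a high probability for "A", while an individual with a mixed profile gets intermediate proportions. That intermediate case is the pattern of interest for wordlists that may combine material from more than one language.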

In the Tasmanian case, instead of looking at genes and allele frequencies, I’m looking at words and translation frequencies. The English glosses (‘hand’, ‘water’, and so on) play the role of the genes, and the different translations recorded for each gloss are the equivalent of alleles in the original model; how often each translation turns up is the counterpart of an allele frequency.

STRUCTURE makes some assumptions about the data it is grouping. For example, it assumes that the loci sampled are independent (that is, that the alleles at one locus aren’t conditioned by those at another). This assumption holds for the language data too. Changing your word for ‘hand’ doesn’t mean you also need to change your word for ‘water’: they vary independently. STRUCTURE also assumes that the genetic data are in Hardy-Weinberg equilibrium, which implies that allele frequencies stay constant across generations (that is, that the loci chosen for study are not under selection pressures). This is a controversial assumption in genetics, and lexical selection pressures are not widely studied in historical linguistics in a comparable way. What this most likely means for the data analyzed here is that we should treat the inferred clusters as synchronic groups only: that is, we can’t use STRUCTURE to infer the number of languages or families in the data. There are other tools for that, though. Once we’ve identified the wordlists with mixed data, we can exclude them from initial analysis and use the remaining wordlists to study the number of languages and families represented in the dataset.
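
The word-as-gene analogy can be sketched as a small coding step. This is an illustration under stated assumptions, not the coding scheme used in the paper. The list names and the placeholder forms ("hand_a", "water_b", and so on) are invented and are not Tasmanian data. Identical strings are treated as the same variant, whereas real coding depends on a linguist’s judgment about which recorded forms count as the same word. The value -9 is assumed as the marker for a gloss with no recorded form.

```python
MISSING = -9  # assumed marker for "no form recorded for this gloss"

# Invented toy wordlists: gloss -> recorded form (placeholders, not real data).
wordlists = {
    "list_1": {"hand": "hand_a", "water": "water_a", "fire": "fire_a"},
    "list_2": {"hand": "hand_a", "water": "water_b", "fire": "fire_a"},
    "list_3": {"hand": "hand_b", "water": "water_b"},
}


def encode(wordlists):
    """Turn wordlists into rows of integer codes, one column per gloss.

    Each gloss is treated as a locus; each distinct form for that gloss
    gets its own integer code, numbered from 1 in order of first appearance.
    """
    glosses = sorted({gloss for forms in wordlists.values() for gloss in forms})
    codes = {gloss: {} for gloss in glosses}
    matrix = {}
    for name, forms in wordlists.items():
        row = []
        for gloss in glosses:
            form = forms.get(gloss)
            if form is None:
                row.append(MISSING)
            else:
                # Reuse the existing code for a known form, else assign the next one.
                row.append(codes[gloss].setdefault(form, len(codes[gloss]) + 1))
        matrix[name] = row
    return glosses, matrix


glosses, matrix = encode(wordlists)
for name, row in matrix.items():
    print(name, *row)
```

Each row is one wordlist and each column one gloss, which is the same shape as a table of individuals by loci in a genetic dataset.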

That’s where the tree-building and network algorithms come in. SplitsTree has an implementation of the NeighborNet algorithm, which is very useful both for inferring language clusters and for detecting conflicting signal. For building a tree, I used BEAST. The tree allows some provisional dating and gives an idea of the degree of confidence in higher-level clusters. In this case, I see little (if any) evidence for a single Tasmanian family. Remember that the tree-building programs build a tree out of all items in the analysis, so the fact that all nodes are linked in a tree doesn’t mean that all the languages are related.
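
NeighborNet is a distance-based method, so it needs some measure of how different each pair of wordlists is. The sketch below shows one simple possibility. It is an assumption for illustration, not necessarily the measure used in the paper. It takes the proportion of glosses, among those attested in both lists, for which the two lists have different forms. The coded rows are invented toy data, and -9 is again assumed as the missing-data marker.

```python
from itertools import combinations

MISSING = -9  # assumed marker for "no form recorded for this gloss"

# Invented toy data: one row of integer-coded forms per wordlist.
coded = {
    "list_1": [1, 1, 1],
    "list_2": [1, 1, 2],
    "list_3": [MISSING, 2, 2],
}


def distance(row_a, row_b):
    """Share of glosses with different forms, counting only glosses
    attested in both lists. Returns None if the lists share no glosses."""
    shared = [(a, b) for a, b in zip(row_a, row_b) if a != MISSING and b != MISSING]
    if not shared:
        return None
    return sum(a != b for a, b in shared) / len(shared)


distances = {
    pair: distance(coded[pair[0]], coded[pair[1]])
    for pair in combinations(sorted(coded), 2)
}

for (first, second), value in distances.items():
    print(first, second, value)
```

Skipping glosses that are missing from either list matters for patchy data like these. Otherwise short lists would look artificially distant from everything else.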

That’s where analysis of the cognate data comes in, and where we return to the comparative method. The supplementary materials provide some discussion here. Arguments about cognacy and similarity ultimately come down to the judgments of the linguists involved. In the supplementary materials, I take all 26 (yes, 26, out of more than 3000) words which Plomley judged to belong to the same word-family and find problems with just about all of them. Most are either data errors or clear loanwords.

I hope this paper will show some of the ways in which computational tools can be useful in historical linguistics. I also hope that it encourages linguists not to give up on sketchy data. Study of Tasmanian languages has been written off because the data are too messy to work with, and indeed, it is a fragile dataset. Eighteenth-century Tasmania is pretty high on the list of places that linguists will want to visit when time machines are invented. However, I’ve shown here that we can get more information about the languages than we’ve previously assumed. Now that we have internally coherent information about the language clusters on the island, we can start to identify systematic differences between the clusters. Tyler Lau and I have projects in progress on this topic.

*Part of working on Tasmanian involved digitizing all the wordlists in Plomley (1976). I haven’t made the file generally available yet, because I am still waiting to hear back from Brian Plomley’s literary executors about whether it is OK to do so.