A couple of years ago I wrote a blog post about Professor Phillip M Parker PhD, a Professor of Marketing in France who had established a website called Webster’s Online Dictionary that contained materials on endangered languages taken from copyrighted sources. Parker also published a set of books based on materials taken from copyrighted websites, such as Webster’s Kamilaroi-English Thesaurus Dictionary.

Well, it looks like someone else is also harvesting data on languages from copyrighted sources without attribution. This is the PanLex project, funded by the Utilika Foundation, that:

“gathers knowledge about all the words in all the languages of the world, so that any word may be translated into any language, a step toward panlingual communication. For this work we consult multilingual, bilingual, and monolingual resources named “dictionaries”, “thesauri”, “lexical databases”, “wordnets”, “glossaries”, “terminologies”, “vocabularies”, and “word lists”, as well as individuals.”

Although the website gives a list of “the resources we are now consulting”, a simple search using the TerraDict tool shows that in fact unlisted materials are also being used. I searched for “left” in the Dieri (Diyari) language (which I have worked on for the past 35 years) and got the following result:

This can only have come from the vocabulary list in my 1981 book A Grammar of Diyari, South Australia (published by Cambridge University Press) because it is only in that book that I used the letter d for the trill sound — in later publications I used rrh. This word would now be spelled as warrangantyu in the orthography that the Dieri Aboriginal community prefers. QED, the Diyari material has been nicked without attribution from my copyrighted book.

Jonathan Pool, President of the Utilika Foundation, realises they are playing fast and loose here, as the following statements from the minutes of their 2011 Annual Meeting, held just last month, make clear (note the last sentence in particular):

“intellectual-property obstacles to the expansion of PanLex have not yet been a major problem. If they prevented us from using one resource, we could move on to the next. The creators of many resources assert rights that, taken literally, would prohibit a person reading a resource from later making use of what he or she had learned from it. From the beginning of the project, I have considered such usage prohibitions unenforceable, and I have considered our use of any resource to be the recording of facts asserted by it, in a novel form, not the creation of a copy of it and thus not copyright infringement. … I believe that our normalization, structuring, and selective use of published data, combined with our provision of links to the original data, will satisfy most content creators. However, the inclusion of funds for legal services in the 2012 budget reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as resource deployment progresses.”

Well, as far as I can see there is no “complex[ity]” surrounding “intellectual-property issues” here: the Diyari materials (and possibly lots more on lots more languages) are under copyright, and any use of them is limited to fair dealing. Anything else is theft.

PS: Thanks to David Nathan for passing on pointers to the PanLex project, including the Annual Meeting minutes quoted here. He bears no responsibility for the content of this blog post.
