What's the generic term for a carbonated beverage?

Your answer to this question will be determined by where you live. Answers vary wildly, from soda, pop, and coke (even if the soda pop is not Coca-Cola) in various states of the US to 'soft drink' or 'fizzy drink' on the other side of the Atlantic.

So it will come as no surprise to anyone who has travelled and tried to order a carbonated beverage that a language can vary from region to region. Similarly, anyone who knows how 'to google', heard a 'vuvuzela', used 'the grid' or witnessed a 'bromance' will know that a language can change over time as well; new words regularly appear in our vernacular, words change their meaning and old words slowly die away.

Dictionaries try to keep on top of changes; around 2,000 new words were added to the New Oxford American Dictionary in 2010 alone.

But beyond this there is no comprehensive, systematic analysis of vocabulary or word formation - why do some words thrive, while other combinations of letters will never be accepted into a language?

An interdisciplinary team from the University of Würzburg in Germany is addressing this problem by building a 'metadictionary' containing the core units of all the words used in the last 500 years in the German language, including those specific to regional dialects. The goal, the researchers said, is to develop and test methods and algorithms for detecting and understanding variance.

The core units of words, the smallest parts that convey meaning, are called 'morphemes'. For example, the word 'craftsmanship' can be broken down into three morphemes plus a linking 'gap element': craft + s + man + ship, with each morpheme modifying the meaning of the whole word.

When the researchers chopped up the words into their morphemes, they built a network where each morpheme was a node and each connection denoted that two morphemes were adjacent in a word. (See image, below)
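
As a rough illustration of that construction, the Python sketch below builds such a network from words that have already been split into morphemes. It is not the Würzburg team's code: the function name `build_morpheme_network` and the hand-segmented sample words are assumptions made for this example, and the segmentation step itself is taken as given.

```python
from collections import defaultdict


def build_morpheme_network(segmented_words):
    """Return an undirected adjacency map: morpheme -> set of neighbouring morphemes."""
    network = defaultdict(set)
    for morphemes in segmented_words:
        # Register every morpheme as a node, even in single-morpheme words.
        for morpheme in morphemes:
            network[morpheme]
        # Connect each pair of morphemes that sit next to each other in the word.
        for left, right in zip(morphemes, morphemes[1:]):
            network[left].add(right)
            network[right].add(left)
    return dict(network)


# Hypothetical, hand-segmented sample; not taken from the project's data.
words = [
    ["craft", "s", "man", "ship"],
    ["friend", "ship"],
    ["man", "kind"],
]
network = build_morpheme_network(words)
print(sorted(network["man"]))
```

In this toy sample, 'man' ends up linked to 's', 'ship' and 'kind', because it sits next to each of them in at least one word.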

"Not surprisingly, all networks showed the signposts of small world, scale free, hierarchical networks," said Joerg Schultz from the University of Würzburg.

"A key feature of such networks is the existence of highly connected nodes called hubs. When we compared these nodes between the networks, we identified a change over time, that is, from 15th century over the 18th to today, but also between different regions at the same time. Thus, we could recall cultural changes by analyzing the morpheme networks," Schultz said.

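The hub comparison Schultz describes can be sketched as follows. This is only an illustration under stated assumptions: a hub is taken to be a morpheme with many distinct neighbours, the function name `top_hubs` is hypothetical, and the two toy networks are invented for the example and do not reflect the team's findings.

```python
def top_hubs(network, k=3):
    """Return the k morphemes with the most distinct neighbours (ties broken alphabetically)."""
    ranked = sorted(network, key=lambda m: (-len(network[m]), m))
    return ranked[:k]


# Invented toy networks for two periods; each maps a morpheme to its neighbours.
older = {
    "heit": {"frei", "schön", "klug"},
    "frei": {"heit"},
    "schön": {"heit"},
    "klug": {"heit"},
}
newer = {
    "ung": {"bild", "wirk", "lös"},
    "heit": {"frei"},
    "frei": {"heit"},
    "bild": {"ung"},
    "wirk": {"ung"},
    "lös": {"ung"},
}

old_hubs = set(top_hubs(older, k=1))
new_hubs = set(top_hubs(newer, k=1))
lost = old_hubs - new_hubs    # hubs that no longer rank at the top
gained = new_hubs - old_hubs  # hubs that are new in the later network
print(lost, gained)
```

Comparing the sets of top-ranked morphemes is the simplest possible measure; a fuller analysis would presumably look at the whole degree distribution rather than a fixed cut-off.
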
New words begin at the edge

The researchers also looked at how the morphemes connected in words - and how these connections were gained and lost over time. They also saw how a new meaning is integrated into a language.

"We found that evolution of a meaning usually happens at the border of the network. Still there were exceptions, morphemes which were lost from the language though highly connected as well as morphemes which 'invaded' the language," Schultz said.

The team working on the project includes computer scientists and linguists, unsurprisingly, but Schultz is actually a biologist. "For us as biologists, the morpheme concept was quite familiar as there is a similar structure in biology: a protein is frequently composed of more than one functional part. These so-called domains can be seen as analogous to the morphemes which compose words," said Schultz.

The team has already seen that models for understanding variance in language are similar to those that describe variance in genomics.

"Preliminary tests have shown that the corresponding networks have similar properties. This could be due to the fact that the generative processes behind evolution of language and genome might be comparable," Dietmar Seipel, the computer scientist on the project, reported earlier this year at the International Symposium on Grids and Clouds and the Open Grid Forum in Taipei, Taiwan.

Constructing a metadictionary

The researchers used publicly available digitized dictionaries, such as the Middle High German Dictionaries, the early High German Dictionaries and dictionaries on regional dialects, and developed methods to transform the information from the dictionaries into code, while retaining all information, such as the part of speech, gender and inflectional detail. Analyzing different dictionary entries for sentence structure and parts of speech is difficult, because "there is a lot of structural variance," Seipel said.

Nevertheless, the team managed to develop an annotation tool, using the declarative programming language Prolog, so that keywords and other data could be extracted from dictionary entries. The annotated dictionary entries are stored in XML format, in accordance with the Text Encoding Initiative (TEI) Guidelines for Electronic Text Encoding and Interchange.
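
To give a sense of what such an annotated entry might look like, the sketch below wraps one dictionary entry in XML. It is written in Python rather than Prolog and is not the team's tool. The element names (`entry`, `form`, `orth`, `gramGrp`, `pos`, `gen`, `sense`, `def`) follow the TEI dictionary vocabulary, but the exact structure the project uses is not described here, so the layout, the function name `entry_to_xml` and the sample entry are all assumptions.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(headword, pos, gender, definition):
    """Build a TEI-style <entry> element for one dictionary entry (illustrative layout)."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = pos
    if gender:
        # Only nouns carry grammatical gender, so the element is optional.
        ET.SubElement(gram, "gen").text = gender
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition
    return entry


# Hypothetical sample entry, not copied from any of the dictionaries named above.
entry = entry_to_xml("Haustür", "noun", "feminine", "front door of a house")
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping part of speech and gender in their own elements is what lets later analysis steps query them directly instead of re-parsing the free text of each entry.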

"We can apply the results of our dictionary analyses in a second step to text corpora of Middle High German texts and early new High German texts - starting with Luther and the mass of German literary texts - available soon in the TextGrid digital repository," the linguist on the project, Werner Wegstein, said. TextGrid is a technically oriented long-term archive embedded in a grid infrastructure.

From this, "we expect new insights into the combinability of basic units, qualitatively as well as quantitatively, because dictionaries of the German language do not register complex morpheme structures, they only list entries with complex structures showing specific additional semantic features," said Wegstein.

For example, a German dictionary might not explain the regular compounds, such as 'haus-dach' meaning rooftop, while including the irregular compounds such as 'haus-tür' meaning house-door, or the door by which people enter a house, specifically.

The researchers' algorithms for analyzing texts and detecting variance can likely be applied to other languages as well, because morphemes are "common in all Indo-European languages and in one way or another in other types of languages as well," said Seipel.

The team has plans to use the British National Corpus and Dr Johnson's Dictionary of English from 1755 in a future project. "It would be interesting to see how the relations we research have developed in English."