State-of-the-art Multilingual Lemmatization

An analysis of state-of-the-art lemmatizers that work for tens of languages

I find lemmatization an intriguing NLP task, in which we must find ways to deal with the richness and some arbitrariness of human languages. In this post, I summarize the impressive current state-of-the-art for multilingual lemmatizers, with some useful hints for practitioners and newcomers to the field, and make some points on their limitations.

Stemming: low cost, low return

When we are dealing with text data, we sometimes need to reduce the vocabulary: if you are searching for “used car prices”, documents containing either car price or price of cars are likely to be relevant.

A common strategy for this issue is stemming, the process of removing prefixes and suffixes from a word until we are left with its stem, which carries most of its meaning. So, cars is stemmed to car, replayed to play, and taxi remains taxi. It is carried out purely by string editing, without any kind of preprocessing or machine learning.

Stemming has been used successfully by many search engines due to its simplicity. This same simplicity, however, is a limiting factor: there is not even a precise definition of what the stem of a word is. The stem of replay might be play, but the stem of rehearse is certainly not hearse. It makes sense to think that the stem of caring is care, not car; but to accomplish that, we would need a more complex algorithm that knows when to insert a final -e. This can get even more confusing for other languages.
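To make the limitation concrete, here is a minimal sketch of a purely string-based stemmer. The affix lists and the minimum stem length are assumptions chosen only to reproduce the examples in this section; it is not any published stemming algorithm.

```python
# Toy affix lists: assumptions for illustration, not a real stemmer's rule set.
PREFIXES = ("re",)
SUFFIXES = ("ing", "ed", "s")
MIN_STEM = 3  # never cut a word down to fewer than 3 characters


def naive_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix by string editing."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[: len(word) - len(suffix)]
            break
    return word


for w in ["cars", "replayed", "taxi", "caring", "rehearse"]:
    print(w, "->", naive_stem(w))
# cars -> car, replayed -> play, taxi -> taxi, caring -> car, rehearse -> hearse
```

The first three outputs are what we want; the last two show how blind string editing produces car for caring and hearse for rehearse.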

The bottom line is that any method based only on string editing lacks linguistic consistency. When this is an issue, we turn to lemmatization.

Lemmatization

Lemmatization is the process of determining the lemma (i.e., the dictionary form) of a given word. Returning to the previous examples, the lemma of cars is car, and the lemma of replay is replay itself. This is a well-defined concept but, unlike stemming, it requires a more elaborate analysis of the text input.

It is reasonably simple for English (as is stemming), a language with little word inflection. But for most languages, lemmatization is both more important and more difficult. See this example sentence in Czech:

Studie proveditelnosti odhaduje, že by v kabině cestujícím trvalo překročit řeku Potomac asi čtyři minuty.

Replacing each word by its lemma gives us the following:

Studie proveditelnost odhadovat, že být v kabina cestující trvat překročit řeka Potomac asi čtyři minuta.

(in case you’re wondering, it means “The feasibility study estimates that it would take passengers about four minutes to cross the Potomac River on the gondola.”)

In our Czech example, 8 out of 15 words differ from their lemmas! That’s almost all the content words in the sentence. Most research in NLP uses English data, but as you can see, English is a poor choice for exemplifying lemmatization, since it does not reflect the real challenges of the task.

With the CoNLL 2017 and then 2018 shared tasks, however, we have a lot of data published for over 50 languages with lemma annotation. Participants were asked to, among other things, come up with lemmatizer models that worked for this wide range of languages. These data provide a great benchmark for evaluating lemmatizers.

Difficulties

Word inflection has a lot of regularities, which might make you think of developing a set of rules for recovering the lemma of a word. But that can be surprisingly difficult because of some factors:

Irregular forms. You’d have to treat them one by one.

Words that look like they’re inflected, but are actually not. E.g., bring is not an inflected form of the inconceivable verb to bre.

Inflected forms shared by different lemmas. E.g., in German, gehört can be a participle of hören (to hear) or of gehören (to belong), and only context can disambiguate.

The number of inflection rules might be too high, or you may not have enough knowledge of the language you’re working with.

As you can see, designing hand-written lemmatization rules can be troublesome even for English. So, as usual, we turn to machine learning for help. More specifically, to the ever-present neural networks.

How do neural lemmatizers work?

There is more than one way of lemmatizing words with neural networks, but let’s stick to the one with the best results, which also happens to be the simplest. It uses sequence-to-sequence (seq2seq) neural networks, which read words character by character and then output their lemmas, also character by character. This is the model used, among others, by the Turku NLP group, which had the best lemmatization results in the CoNLL 2018 shared task.

I shall not go into details of how seq2seq works. My focus here is on the practical usage of lemmatizers; there are a lot of great posts explaining seq2seq out there. One of them amusingly calls it the clown car of deep learning because it can fit a lot of information in a seemingly small number of parameters.

An overview of the seq2seq lemmatizer architecture

The end symbol, also represented as </s>, is used to indicate when the model finished producing its output. Without it, a seq2seq would generate characters indefinitely.

The model input doesn’t need to be only words. We can also include some metadata, such as POS tags, which can help determine which is the correct lemma for ambiguous forms. Remember that each character is encoded as an embedding vector, which is learned together with other neural network parameters. So, as long as we can encode our tags in vectors of the same dimensionality, we are good to go.

Let’s see an example where POS tags are important. In Portuguese, olho can be a noun, meaning eye, or an inflected verb, meaning to look; in the latter case, it should be lemmatized to olhar.

Additional information about the inputs can help disambiguate

This allows a lot of flexibility. The model can learn that certain word forms look nothing like their lemmas (e.g., was / be), some never change (grammar words such as prepositions and conjunctions), while for most others it is a matter of changing prefixes, infixes or suffixes. It can also learn to associate certain inflection patterns with certain POS tags.
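As a rough sketch of how a word and its tags can be fed to the model as one sequence, the snippet below builds the input symbols for the two readings of olho. The angle-bracket tag format, the position of the tags after the characters, and the function names are assumptions made for illustration; actual systems may arrange their inputs differently.

```python
END = "</s>"  # end symbol, as described above


def encode_input(word: str, tags: list[str]) -> list[str]:
    """Turn a word and its tags into the symbol sequence read by the encoder."""
    return list(word) + [f"<{tag}>" for tag in tags] + [END]


def build_vocab(sequences) -> dict:
    """Map every distinct symbol (character or tag) to an integer id.

    Each id would index one row of a single embedding table, which is why
    characters and tags end up as vectors of the same dimensionality.
    """
    vocab = {}
    for seq in sequences:
        for symbol in seq:
            vocab.setdefault(symbol, len(vocab))
    return vocab


noun_input = encode_input("olho", ["NOUN"])
verb_input = encode_input("olho", ["VERB"])
vocab = build_vocab([noun_input, verb_input])

print(noun_input)  # ['o', 'l', 'h', 'o', '<NOUN>', '</s>']
print([vocab[s] for s in noun_input])
print([vocab[s] for s in verb_input])
```

The two id sequences differ only at the tag position, and that single difference is what lets the decoder produce olho in one case and olhar in the other.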

Treating ambiguity

Most current state-of-the-art implementations, however, don’t completely solve the problem of lemma ambiguity. Models such as StanfordNLP and the Turku Parser lemmatize each word independently, taking into account its POS and morphology tags (such as gender, number, tense, case, etc.), but not the other words in the sentence. This is enough to solve cases like the example above, in which noun and verb have different lemmas, but not cases in which the same combination of word, POS tag and morphological inflection can still be mapped to more than one lemma, like gehört, which I mentioned earlier.

By the way, do not confuse lemma ambiguity with word sense ambiguity! In lemmatization, we are only interested in the written lemma, not its meaning. Thus, you don’t need to know if bats refers to the flying animal or the club used in baseball as long as you know its lemma is bat.

But why are these models like that? Basically, because these cases are very rare. In this paper, researchers from Turku argue that for most treebanks, this phenomenon happens for less than 1% of the tokens. Even for languages in which it is common, such as Spanish (14%), Hindi (22%) and Urdu (36%), there is one lemma much more frequent than the others. For example, the form fue appears dozens of times tagged as AUX (auxiliary verb) and having ser (to be) as lemma, against a few occurrences with the same tag but the lemma ir (to go).

On top of that, many of these ambiguities might be annotation errors. I checked the Spanish treebank and found that many ambiguous word/tag combinations actually had typos or were not correctly lemmatized. Unfortunately I don't know any Urdu to repeat the check, but I wouldn't be surprised to find the same.
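One way to measure this kind of ambiguity is to group tokens by (form, POS tag, morphological tags) and look for groups with more than one lemma. The sketch below is a minimal version of that idea under my own assumptions (forms are lowercased, and the input is already a list of tuples); it is not the script mentioned at the end of this post, and the data is a hand-made toy sample, not treebank counts.

```python
from collections import Counter, defaultdict


def ambiguity_report(tokens):
    """tokens: iterable of (form, upos, feats, lemma) tuples.

    Returns the share of tokens whose (form, upos, feats) key is seen with
    more than one lemma, plus a dict of those keys with their lemma counts.
    """
    lemma_counts = defaultdict(Counter)
    for form, upos, feats, lemma in tokens:
        lemma_counts[(form.lower(), upos, feats)][lemma] += 1

    ambiguous = {k: c for k, c in lemma_counts.items() if len(c) > 1}
    total = sum(sum(c.values()) for c in lemma_counts.values())
    affected = sum(sum(c.values()) for c in ambiguous.values())
    return (affected / total if total else 0.0), ambiguous


# Toy sample (invented for illustration only).
FEATS = "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"
toy_tokens = [
    ("fue", "AUX", FEATS, "ser"),
    ("fue", "AUX", FEATS, "ser"),
    ("Fue", "AUX", FEATS, "ser"),
    ("fue", "AUX", FEATS, "ir"),
    ("casa", "NOUN", "Gender=Fem|Number=Sing", "casa"),
]

share, ambiguous = ambiguity_report(toy_tokens)
for key, counts in ambiguous.items():
    print(key[0], key[1], counts.most_common())
```

Listing the minority lemmas for each ambiguous key is also a quick way to spot likely typos and annotation errors, since those tend to show up as lemmas with a single occurrence.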

This doesn’t mean it is not worth detecting the difference between the two — on the contrary. The problem is that with so few examples, statistical models have a hard time learning the correct lemmas. So if we really want to prepare our models to disambiguate all the fue, gehört and the like, we need to prepare datasets targeted at those words.

There are implementations capable of disambiguating them, though — at least theoretically. The UDPipe Future and the Combo parser are two of them, and despite not using seq2seq, their rationale is compatible with it. The figure below depicts the UDPipe architecture, but the important thing here is the usage of a recurrent neural network (RNN), also employed by Combo.

The lemmatizer architecture in UDPipe Future. A bidirectional LSTM encodes words in context-sensitive vectors.

The bidirectional LSTM, a common choice of RNN, reads the whole input sentence and produces context-sensitive vectors to encode each word. After that, a lemmatizer MLP classifies each word into one of the automatically generated lemmatization rules, which consist of removing, adding and replacing substrings.
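To give an idea of what an automatically generated lemmatization rule can look like, here is a simplified sketch that derives a suffix-only rule from a (form, lemma) pair. The (strip, add) rule format and the function names are assumptions for illustration; the rules actually used by UDPipe Future are richer than this.

```python
import os


def suffix_rule(form: str, lemma: str) -> tuple:
    """Derive a (strip, add) rule: drop `strip` trailing characters, append `add`."""
    common = len(os.path.commonprefix([form, lemma]))
    return (len(form) - common, lemma[common:])


def apply_rule(form: str, rule: tuple) -> str:
    """Apply a (strip, add) rule to a word form."""
    strip, add = rule
    return form[: len(form) - strip] + add


# Pairs taken from the Czech example sentence earlier in this post.
pairs = [("řeku", "řeka"), ("kabině", "kabina"), ("minuty", "minuta"),
         ("trvalo", "trvat"), ("asi", "asi")]
for form, lemma in pairs:
    print(form, "->", lemma, suffix_rule(form, lemma))

# A rule extracted from one word applies to another word of the same class.
print(apply_rule("kabině", suffix_rule("řeku", "řeka")))  # kabina
```

Because řeku, kabině and minuty all yield the same rule, the set of distinct rules stays small, and lemmatization becomes a matter of picking the right rule for each word, which is a classification problem.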

This kind of representation is not exclusive to their models; it is pretty standard in NLP. The Turku and Stanford systems do have BiLSTMs of their own, but they only use them for the other CoNLL tasks (POS tagging, morphological tagging, and parsing; I intend to talk more about them in an upcoming post).

It turns out, however, that context awareness doesn’t necessarily translate into better lemmatization performance. In the CoNLL evaluation, the Turku and Stanford systems got the best results in the treebanks that had a reasonable amount of data. Even for the treebanks in which ambiguous tokens are more common, results aren’t very clear:

For Urdu and Hindi, the Turku model has the best results, followed very closely by UDPipe Future and Combo

For Spanish, UDPipe Future and Combo have the best results, with Turku next

Considering how rare ambiguous token/tag combinations are, the capability of distinguishing them played a very marginal role in the results. It is more likely that the seq2seq architecture is simply more efficient in general for this kind of task, in comparison to the classifier used in UDPipe. We still could, of course, combine the best of both worlds and design a seq2seq lemmatizer that includes a context representation in the same way as the extra tag metadata:

The context-sensitive output of an RNN can help disambiguate when even POS and morphological tags are the same

The first sentence in the figure above means I haven’t heard anything, and the second It belonged to the firm. In each case, a bidirectional LSTM would be capable of capturing the particular sentence context, providing more information to the seq2seq lemmatizer, which would, in turn, be able — at least theoretically — to predict the correct lemma. Without the context vector, the lemmatizer could never produce two different lemmas for this same word.

But again, we would hit the data bottleneck. With ambiguous examples being so rare, this improved model wouldn’t have enough examples to learn from and take advantage of its architecture.

How good are neural lemmatizers?

Now that we’ve seen this neat lemmatizer architecture, the next natural question is how good it actually is. The CoNLL 2018 scoreboard I mentioned before gives us a good overview, but these numbers should be read with some caution.

The first table, All treebanks, includes results for treebanks that have only a couple of thousand words. This is insufficient to learn proper lemmatization rules, and performance on these treebanks can have a high variance depending on random initialization.

The second table, Big treebanks only, filters out those extremely small treebanks (indeed, the big here can be misleading, as some of these treebanks are still quite small; but Not extremely small treebanks doesn’t sound like a good name). Lemmatizers trained on these remaining treebanks are more viable for production usage.

The numbers there look pretty good: they are in the high 90s for most languages. You can also notice that although the Turku model is the best overall, for some languages other models do better. Keep in mind, though, that these values are inflated by the large number of words in every language that do not inflect at all, such as conjunctions and prepositions. Remember the Czech example at the beginning of this post? Well, Czech has a lot of inflection, but in that sentence 7 out of 15 words already looked exactly like their lemmas.
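A quick way to see how much of a score comes for free is to measure a baseline that returns every word unchanged. The sketch below does this for the Czech sentence from the beginning of the post (punctuation removed); the function name is my own.

```python
def identity_accuracy(forms, lemmas):
    """Accuracy of a 'lemmatizer' that returns every word unchanged."""
    hits = sum(form == lemma for form, lemma in zip(forms, lemmas))
    return hits, len(forms)


forms = ("Studie proveditelnosti odhaduje že by v kabině cestujícím trvalo "
         "překročit řeku Potomac asi čtyři minuty").split()
lemmas = ("Studie proveditelnost odhadovat že být v kabina cestující trvat "
          "překročit řeka Potomac asi čtyři minuta").split()

hits, total = identity_accuracy(forms, lemmas)
print(f"{hits} of {total} correct ({hits / total:.1%})")  # 7 of 15 correct (46.7%)
```

Running the same baseline on a treebank's test set tells you how much of a reported accuracy is actually earned on inflected words.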

At any rate, the performance of the lemmatizers I mentioned here seems great for production usage, at least if the texts you are working with look like the treebanks the lemmatizers were trained on. So here are a few more words of caution and practical hints if you need to use one of them.

External knowledge

The biggest challenge for machine learning based lemmatizers is irregular forms. That’s no surprise: they are challenging for humans as well, sometimes even in their native language, exactly because of their unpredictability. So if your model never saw in the training data that forbade is the past tense of forbid, it may well conclude that forbade is a good lemma for itself.

If you really want to improve your lemmatizer’s performance, having a backoff list of irregular words in the language you’re working with is a good idea. However, this list should have tuples of (inflected form, POS tag, lemma), or you may run into ambiguity issues. While word/POS/lemma ambiguity is very rare, as I mentioned before, ambiguity at the word/lemma level alone is an issue even in English: think of the noun-verb ambiguity in saw, thought and shot, for example.
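A minimal sketch of such a list, keyed by (inflected form, POS tag), could look like the snippet below. Here the list takes precedence over the model's output, which is one possible design among others; `model_lemmatize` is a hypothetical stand-in for whatever lemmatizer you are using, not the API of any of the systems mentioned in this post.

```python
# Hand-made irregular forms, keyed by (lowercased form, POS tag).
IRREGULAR = {
    ("forbade", "VERB"): "forbid",
    ("saw", "VERB"): "see",
    ("thought", "VERB"): "think",
    ("shot", "VERB"): "shoot",
}


def lemmatize_with_backoff(form, upos, model_lemmatize):
    """Consult the irregular list first; otherwise defer to the model."""
    key = (form.lower(), upos)
    if key in IRREGULAR:
        return IRREGULAR[key]
    return model_lemmatize(form, upos)


def fake_model(form, upos):
    """Stand-in for a real lemmatizer (assumption): returns the form unchanged."""
    return form


print(lemmatize_with_backoff("saw", "VERB", fake_model))   # see
print(lemmatize_with_backoff("saw", "NOUN", fake_model))   # saw
print(lemmatize_with_backoff("Forbade", "VERB", fake_model))  # forbid
```

Without the POS tag in the key, the noun saw would be wrongly rewritten to see.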

Unseen inflections

Now, some inflection rules might be perfectly regular, but if they never appear in the training data, no machine learning based system will ever learn them. Since many corpora in the Universal Dependencies (UD) datasets come from newspaper texts, verbs in the second person are pretty rare overall. In the German training treebank, for example, with almost 30 thousand verbs, only 13 are in the second person. One of the Finnish treebanks, roughly the same size, fares slightly better with 462, but that’s only 1.5% of the total!
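If you want to check how well a given inflection is covered in a treebank, counts like these can be obtained directly from the CoNLL-U files. The sketch below counts words with a given POS tag and a given value of the Person feature. It only looks at the VERB tag by default, which is my assumption (the counts above may or may not include auxiliaries), and the sample is a hand-written toy, not a treebank excerpt.

```python
def count_person(lines, upos="VERB", person="2"):
    """Count words tagged `upos`, and how many of them have Person=`person`.

    `lines` is any iterable of CoNLL-U lines (e.g. an open file).
    """
    total = matching = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if cols[3] != upos:
            continue
        total += 1
        feats = dict(f.split("=", 1) for f in cols[5].split("|") if "=" in f)
        if person in feats.get("Person", "").split(","):
            matching += 1
    return matching, total


def row(*cols):
    """Build one tab-separated CoNLL-U line from its 10 columns."""
    return "\t".join(cols)


sample = [
    "# text = Du gehst.",
    row("1", "Du", "du", "PRON", "_", "Case=Nom|Number=Sing|Person=2|PronType=Prs", "2", "nsubj", "_", "_"),
    row("2", "gehst", "gehen", "VERB", "_", "Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin", "0", "root", "_", "SpaceAfter=No"),
    row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
    "# text = Sie geht.",
    row("1", "Sie", "sie", "PRON", "_", "Case=Nom|Number=Sing|Person=3|PronType=Prs", "2", "nsubj", "_", "_"),
    row("2", "geht", "gehen", "VERB", "_", "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", "0", "root", "_", "SpaceAfter=No"),
    row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]

print(count_person(sample))  # (1, 2)
```

For a real treebank, pass an open file instead of the list, and try other features such as Tense or Case to look for gaps relevant to your own texts.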

So, if you use a lemmatizer trained on CoNLL data and notice it performs very badly on some specific inflection, it’s likely that it didn’t have enough examples to learn from. If that is an issue, you will have to either come up with more representative data to retrain the model or implement some manual rules to override the automatic output.

Conclusion

I've shown that state-of-the-art lemmatizers have developed very interesting and efficient architectures. Cases with ambiguous combinations of word/POS tag/morphological tags are still an issue, but they are so rare as to be hardly a concern in practice. Dealing with irregular words and inflections absent in the training data is more worrying, though.

Still, if you speak a language with rich morphology, try one of the systems I mentioned here just to be amazed at their capacity for undoing inflections!

The script I wrote to check the Spanish UD treebank is available on GitHub. You can use it to check other treebanks for ambiguous lemmas and possible annotation errors.