Figuring out Rules for German Noun Genders with Simple Machine Learning and Statistics

2019/04/21

When learning German, one of the most confusing features of the language is the noun gender system. In German, every noun has one of three genders (masculine/feminine/neuter), but unlike many other languages, these genders are seemingly not assigned based on any logical rule. Despite this, native German speakers as well as experienced German students are able to intuitively “guess” noun genders correctly. This led me to the logical conclusion that some underlying rules must exist. Furthermore, if humans can have an intuitive model of these rules, perhaps we can create a computer-based model, and figure out what these rules actually are!

This post is about my initial exploration of modeling German noun genders with some simple machine learning and statistics. I attempt to find some rules that might aid German learners (including myself) in figuring out noun genders.

I'll try to strike a middle ground between technical details and accessibility. If you'd like to skip all the technical stuff, you can jump to the juicy German gender rules!

What might the rule be?

The first step was to guess at what such a rule might be. When learning German, you're taught about certain suffixes which allow you to guess a noun's gender. For example, nouns ending in -keit, -heit, -ei, -ie, -ion are always feminine, while nouns ending in -ant are usually masculine, and those ending in -chen, -lein, -li are always neuter.

To generalise this, I decided to investigate the correlation between the last syllable of a word and that word's gender.

Gathering data

['Gegenwörter', 'pl', 'Ge-gen-wör-ter'],
['Gegenzahl', 'f', 'Ge-gen-zahl'],
['Gegenzahn', 'm', 'Ge-gen-zahn'],
['Gegenzauber', 'm', 'Ge-gen-zau-ber'],
['Gegenzeichen', 'n', 'Ge-gen-zei-chen'],
['Gegenzeichnung', 'f', 'Ge-gen-zeich-nung'],
['Gegenzelle', 'f', 'Ge-gen-zel-le'],
['Gegenzeuge', 'm', 'Ge-gen-zeu-ge'],
['Gegenzinnenbalken', 'm', 'Ge-gen-zin-nen-bal-ken'],
['Gegenzugkraft', 'f', 'Ge-gen-zug-kraft'],
['Gegenzugrollo', 'n', 'Ge-gen-zu-grol-lo'],
['Gegenzug', 'm', 'Ge-gen-zug'],
['Gegenäußerung', 'f', 'Ge-ge-n-äu-ße-rung'],
['Gegenöffentlichkeit', 'f', 'Ge-gen-öf-fent-lich-keit'],
['gegenüberliegende Seiten', 'pl', 'Sei-ten'],
['gegenüberliegende Seite', 'f', 'Sei-te'],
['gegenüberliegendes Gebäude', 'n', 'Ge-bäu-de'],
['Gegenübernahmeangebot', 'n', 'Ge-gen-über-nah-me-an-ge-bot'],
['Gegenüberstellung', 'f', 'Ge-gen-über-stel-lung'],
['Gegenübertragung', 'f', 'Ge-gen-über-tra-gung'],
['Gegenüberwachung', 'f', 'Ge-gen-über-wa-chung'],
['Gegenüber', 'n', 'Ge-gen-über'],
['gegerbtes Leder', 'n', 'Le-der'],

A short snippet of our noun list

I realised I needed the following:

A list of lots of commonly occurring German nouns

A way to find the gender of all those nouns

A way to split those nouns into syllables

For the noun list, I found a corpus from the Institute for the German Language in Mannheim called “DeReWo – Korpusbasierte Grund-/Wortformenlisten”. This provides a large list of words, complete with information on how common each word is, as well as the part of speech for every word. It was then trivial to single out common nouns and sort by frequency.

The above corpus does not, however, include the genders of those nouns. To get genders, I used the dict.leo.org database. This website is a fantastic reference for German, and they are kind enough to provide a download of their database under certain restrictions. After some quick work parsing the database, which is simply a text file, I looked up the genders of all words from the noun list.

For separating words into syllables, I used the excellent pyphen library.

At the end, I was left with tuples of the form (word, gender, syllables), for example ('Regel', 'f', 're-gel'), as well as a list of nouns ordered by frequency.
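Since the analysis below only needs the last syllable, here is a minimal sketch of how it could be pulled out of such tuples. The helper name `last_syllable` and the decision to lowercase the result are my own assumptions for illustration; the sample tuples come from the noun list snippet above.

```python
def last_syllable(hyphenated: str) -> str:
    """Return the final syllable of a hyphen-separated word, lowercased.

    Lowercasing is an assumption made here so that 'Zug' and 'zug'
    count as the same syllable.
    """
    return hyphenated.rsplit("-", 1)[-1].lower()


# (word, gender, syllables) tuples in the shape described above
samples = [
    ("Gegenzahl", "f", "Ge-gen-zahl"),
    ("Gegenzeichnung", "f", "Ge-gen-zeich-nung"),
    ("Gegenzug", "m", "Ge-gen-zug"),
]

# One (last syllable, gender) pair per noun: the only feature used later
features = [(last_syllable(syllables), gender) for _, gender, syllables in samples]
print(features)
```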

Can a computer guess noun genders? Confirming our theory

=== Running cross-validation
>>> 454853 nouns (samples)
201563 f
158722 m
94566 n
1 sg
1 sg.
>>> Running 10 splits
> Fold 0, accuracy = 88.95%
> Fold 1, accuracy = 87.13%
> Fold 2, accuracy = 89.29%
> Fold 3, accuracy = 87.74%
> Fold 4, accuracy = 87.97%
> Fold 5, accuracy = 88.61%
> Fold 6, accuracy = 88.41%
> Fold 7, accuracy = 88.05%
> Fold 8, accuracy = 88.63%
> Fold 9, accuracy = 89.7%
> Average accuracy: 88.45%
>>>>> Finished run: (1400, 0.7, 50, 10) 88.45%

A sample of our testing output

With all the data in place, I wanted to first figure out if a strong correlation between the last syllable of a noun and its gender exists at all. To do this, I used a decision tree classifier, as implemented in scikit-learn.

Why a decision tree? Well, some classifiers' internal models are easier to understand than others. Since I was less interested in classifying nouns, and more interested in understanding how such a classification is naturally done by humans, I needed a classifier I could look into, to understand how it's doing its job. A decision tree is well-suited for this. As for the data itself, I gave samples a single feature — the last syllable of the word.

To test the accuracy of this classification, I used a simple k-fold cross-validation. So, was the last syllable of a noun enough to guess its gender? *drum roll*

Pretty much! Across all 10 folds, the model guessed the gender correctly between 87% and 90% of the time, with an average accuracy of 88.45%. It looks like we're on the right track here!
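To make the evaluation idea concrete, here is a rough, standard-library-only sketch of k-fold cross-validation. It is not the scikit-learn decision tree used for the numbers above: as a stand-in, it simply predicts the most common gender seen for each last syllable, which is roughly what a classifier can learn from a single categorical feature. The function names, the fallback for unseen syllables, and the toy data are all assumptions.

```python
import random
from collections import Counter, defaultdict


def train(pairs):
    """Learn the most common gender per last syllable, plus a global fallback."""
    by_syllable = defaultdict(Counter)
    overall = Counter()
    for syllable, gender in pairs:
        by_syllable[syllable][gender] += 1
        overall[gender] += 1
    table = {s: c.most_common(1)[0][0] for s, c in by_syllable.items()}
    fallback = overall.most_common(1)[0][0]  # used for unseen syllables
    return table, fallback


def k_fold_accuracy(pairs, k=10, seed=0):
    """Average accuracy over k folds; each fold is held out once for testing.

    Assumes there are at least k samples, so no fold is empty.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    folds = [pairs[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        table, fallback = train(training)
        hits = sum(table.get(s, fallback) == g for s, g in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Toy data (made up): every syllable maps cleanly to one gender
toy = [("nung", "f")] * 20 + [("ler", "m")] * 10 + [("chen", "n")] * 10
print(f"Average accuracy: {k_fold_accuracy(toy, k=5):.2%}")
```

On real data the per-fold scores would differ from fold to fold, as they do in the output above.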

Since I'm interested in better understanding how German works, the next step was to analyse the decision tree and figure out how our program was making these decisions. The tree had confirmed my suspicions, but I quickly gave up on inspecting it: now that I knew the problem was feasible, it was much simpler to compute some quick statistics over the suffixes directly. So that's how I went on.

Choosing the most useful suffixes

| Suffix | Example | Reliability | Nr. words |
|---|---|---|---|
| -on | die Funktion | 95.98% feminine | 10,626 |
| -tät | die Universität | 100% feminine | 2,513 |
| -rer | der Maurer | 100% masculine | 1,622 |
| -men | das Unternehmen | 79.92% neuter | 1,589 |

A few of the suffixes mentioned

Of course, our classifier has created quite a complex model of all of these suffixes. As humans, while we may have just as complex a model in our heads, we can only intentionally learn and remember a few of these rules.

I therefore decided to look for the suffixes that made the most difference in the classification. Simply put:

For each suffix, check how reliable it is. A suffix of -tät indicates with 100% confidence that the noun is feminine. However, -men only indicates the noun is neuter around 80% of the time, which means that 20% of the time the word actually has a different gender.

For each suffix, also check how many words appear with that suffix. 100% of nouns ending in -rer are masculine, but there are only 1,622 of them. On the other hand, the suffix -on can tell us the gender of 10,626 words, with 95.98% accuracy of being feminine.

Weigh these two indicators together such that we get the best compromise between reliability and frequency, to get a “usefulness” score.

Sort all suffixes by this score.
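The steps above can be sketched roughly as follows. The weighting is an assumption on my part: I multiply reliability by word count, which amounts to the number of words a rule would get right. The exact formula behind the "usefulness" score is not stated here, so treat `usefulness` as one plausible choice rather than the actual one. The figures in `quoted` are the ones mentioned above.

```python
from collections import Counter, defaultdict


def suffix_stats(pairs):
    """Map each suffix to (majority gender, reliability, number of words)."""
    counts = defaultdict(Counter)
    for suffix, gender in pairs:
        counts[suffix][gender] += 1
    stats = {}
    for suffix, genders in counts.items():
        gender, hits = genders.most_common(1)[0]
        total = sum(genders.values())
        stats[suffix] = (gender, hits / total, total)
    return stats


def usefulness(reliability, n_words):
    # Assumed weighting (hypothetical): words the rule would classify correctly
    return reliability * n_words


def rank_suffixes(stats):
    """Sort suffixes from most to least useful under the assumed score."""
    return sorted(
        stats,
        key=lambda s: usefulness(stats[s][1], stats[s][2]),
        reverse=True,
    )


# Reliability and word counts quoted in the text for four suffixes
quoted = {
    "on": ("f", 0.9598, 10626),
    "tät": ("f", 1.0, 2513),
    "rer": ("m", 1.0, 1622),
    "men": ("n", 0.7992, 1589),
}
print(rank_suffixes(quoted))
```

A different weighting, for example one that penalises unreliable suffixes more heavily, would produce a different ranking, so the choice of score is worth experimenting with.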

Results

I chose the top 50 most useful suffixes, and made a table out of them. At least when it comes to suffixes, these are the rules that will help German learners guess genders correctly most often.

Without further ado, the results:

| Suffix | Example | Gender | Reliability | Nr. words with suffix |
|---|---|---|---|---|
| -on | die Funktion | Feminine | 95.98% | 10,626 |
| -te | die Ernte | Feminine | 96.39% | 10,292 |
| -se | die Achse | Feminine | 91.96% | 9,671 |
| -rung | die Störung | Feminine | 99.94% | 8,853 |
| -le | die Schule | Feminine | 95.08% | 7,962 |
| -tung | die Leistung | Feminine | 99.92% | 6,352 |
| -ge | die Menge | Feminine | 85.4% | 7,430 |
| -keit | die Höflichkeit | Feminine | 99.92% | 5,210 |
| -lung | die Vorstellung | Feminine | 99.94% | 5,085 |
| -ne | die Birne | Feminine | 96.93% | 5,180 |
| -ler | der Maler | Masculine | 99.54% | 4,343 |
| -chen | das Mädchen | Neuter | 91.5% | 4,716 |
| -de | die Wunde (not das Gebäude) | Feminine | 80.74% | 5,238 |
| -be | die Scheibe | Feminine | 86.62% | 4,640 |
| -rin | die Holländerin | Feminine | 96.04% | 4,167 |
| -ger | der Staubsauger (not das Lager) | Masculine | 91.96% | 3,918 |
| -gung | die Bewilligung | Feminine | 99.92% | 3,533 |
| -nung | die Ahnung | Feminine | 99.9% | 3,132 |
| -dung | die Entzündung | Feminine | 99.71% | 3,100 |
| -re | die Himbeere | Feminine | 94.24% | 3,211 |
| -heit | die Schönheit | Feminine | 99.83% | 2,917 |
| -pe | die Klappe | Feminine | 98.11% | 2,851 |
| -ner | der Türöffner | Masculine | 98.8% | 2,751 |
| -che | die Wäsche | Feminine | 97.18% | 2,588 |
| -tät | die Mobilität | Feminine | 100.0% | 2,513 |
| -cke | die Decke | Feminine | 99.67% | 2,436 |
| -mus | der Kapitalismus | Masculine | 98.89% | 2,428 |
| -gel | der Engel | Masculine | 80.75% | 2,857 |
| -ze | die Kerze | Feminine | 95.83% | 2,397 |
| -fer | der Pfeffer | Masculine | 82.32% | 2,703 |
| -schaft | die Freundschaft | Feminine | 97.38% | 2,252 |
| -me | die Blume | Feminine | 88.46% | 2,469 |
| -tur | die Agentur | Feminine | 99.54% | 2,194 |
| -ling | der Säugling | Masculine | 95.29% | 2,080 |
| -tem | das System | Neuter | 99.01% | 1,919 |
| -der | der Salamander (not das Leder) | Masculine | 79.22% | 2,372 |
| -ment | das Dokument (not der Zement) | Neuter | 94.23% | 1,837 |
| -tik | die Gymnastik | Feminine | 99.0% | 1,695 |
| -um | das Eigentum | Neuter | 99.58% | 1,649 |
| -sung | die Überweisung | Feminine | 100.0% | 1,636 |
| -rer | der Maurer | Masculine | 100.0% | 1,622 |
| -ren | das Verfahren | Neuter | 96.03% | 1,612 |
| -zung | die Verletzung | Feminine | 100.0% | 1,531 |
| -nie | die Linie | Feminine | 98.55% | 1,452 |
| -stand | der Zustand | Masculine | 99.86% | 1,400 |
| -fe | die Reife | Feminine | 84.95% | 1,628 |
| -ber | der Zauber | Masculine | 83.49% | 1,599 |
| -ke | die Wolke | Feminine | 83.12% | 1,540 |
| -men | das Unternehmen | Neuter | 79.92% | 1,589 |
| -nis | das Verständnis | Neuter | 82.74% | 1,466 |

A table of the 50 most useful suffix rules for German noun genders

You can access the table as a Google Sheet as well.

Okay, but how useful is this?

Great, we have a list of really useful suffixes! Does this mean we can guess genders as successfully as the computer can?

Actually, no, not at all. Like I said before, these few rules are much simpler than the model our classifier created above. This means that accuracy will also be much lower. To find out how much lower, I wrote some code to measure how often you would guess right, if you tried to guess genders using only the table above. I then ran it for all 30,215 words in the corpus. The results?

Using only the table above, you can know the noun genders of 31.17% of the most frequent 30,215 German words, with 94.27% accuracy. The other 68.83% of words are not covered by the suffixes above, so you would still have to “guess” them.
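The two numbers measure different things: coverage is the share of nouns whose last syllable appears in the table at all, and accuracy is the share of those covered nouns whose gender the table gets right. A minimal sketch of such a measurement follows. The name `evaluate` and the four-rule excerpt are assumptions for illustration, not the code used for the figures above; the nouns are taken from the snippet near the top of the post.

```python
# Small excerpt of the suffix table (last syllable -> predicted gender)
RULES = {"nung": "f", "chen": "n", "de": "f", "keit": "f"}


def evaluate(nouns, rules):
    """Return (coverage, accuracy) of suffix rules over (word, gender, syllables) tuples."""
    covered = correct = 0
    for _word, gender, syllables in nouns:
        suffix = syllables.rsplit("-", 1)[-1].lower()
        if suffix in rules:
            covered += 1
            correct += rules[suffix] == gender
    coverage = covered / len(nouns) if nouns else 0.0
    accuracy = correct / covered if covered else 0.0
    return coverage, accuracy


nouns = [
    ("Gegenzeichnung", "f", "Ge-gen-zeich-nung"),      # covered, correct
    ("Gegenzeichen", "n", "Ge-gen-zei-chen"),          # covered, correct
    ("gegenüberliegendes Gebäude", "n", "Ge-bäu-de"),  # covered, wrong (-de says f)
    ("Gegenzug", "m", "Ge-gen-zug"),                   # not covered
    ("Gegenzahl", "f", "Ge-gen-zahl"),                 # not covered
]
coverage, accuracy = evaluate(nouns, RULES)
print(f"coverage = {coverage:.0%}, accuracy on covered nouns = {accuracy:.0%}")
```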

Conclusion

Okay, 31.17% coverage might not solve your noun gender troubles for good. Still, German learners are (correctly) taught to learn every noun along with its gender. All of these genders are hard to keep track of, and if we can ease this process for even a quarter of nouns, I think that's already a great learning aid.

There are, of course, other factors to noun gender, but as far as suffixes go, I'm pretty happy with what I've learned from this analysis.

If you'd like to have a poke around the code, it's available on GitHub. However, I haven't been able to include the dictionary data along with it for copyright reasons. In particular, one must specifically request the dict.leo.org data.

Do you have any ideas as to how this approach could be improved? Do you have any experience with analysing German grammar? I'd love to hear about it, so feel free to write to me. I hope you've found some of the information here useful, and for those of you also learning German, good luck going forward!