Esperanto NLP Part 2: Finishing Sentences

Does the Esperanto model understand what it’s saying?

In Part 1, I explained why I focused on Esperanto, and how a Python script plus TensorFlow can generate grammatically correct text.

For this next part, on the path to building NLP tools such as a grammar checker or autocomplete, I have these goals (from smallest to largest):

Complete a sentence in a grammatically correct way (for example, in the sentence “Adamo parolas Esperanton”, the suffix -n is placed on the direct object).

Complete a sentence in a contextually accurate way (for example, suggesting “Esperanton” for the verb “parolas” (speaks), or “pomon” (apple) for the verb “manĝas” (eats)).

Check and address bias in the contextual sentence completion (e.g. completing the sentence “he is __” vs. “she is __” ).

We also discover a serious error in the existing model! Ouch! 🤦

2020 Update

For a modern approach using Transformers, see how HuggingFace trained an Esperanto model:

https://huggingface.co/blog/how-to-train

Working with our Esperanto model

The model (as I understand it) is stored in the checkpoints directory. In the codebase we’re using, rnn_play.py was written to use a completed model to generate a new Shakespeare play. Let’s repurpose that code to complete the simple sentence “Mi estas” (I am).

In rnn_play the model suggests one character at a time, with the next character based on knowledge of the previous ones, so I wasn’t sure how to set a starting point. I tried replacing the initial ‘X:0’ array without success.

Then I came up with a bit of a hack: if the character-suggesting loop count is less than the length of startPhrase, we sneakily substitute the model’s suggested character with the next character in startPhrase.
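A minimal sketch of that hack, with a random stand-in for the real RNN (the actual rnn_play.py samples each character from the model’s learned distribution; suggest_next_char and generate are hypothetical names, not functions from the codebase):

```python
import random

def suggest_next_char(text_so_far):
    """Stand-in for the trained RNN: the real model samples the next
    character from a distribution conditioned on the previous text."""
    return random.choice("abcdefghijklmnopqrstu estasmi")

def generate(start_phrase, length=80):
    text = ""
    for i in range(length):
        suggested = suggest_next_char(text)
        # The hack: while the loop count is still inside the prompt,
        # discard the model's suggestion and force the prompt character.
        # The forced character still becomes part of the context the
        # model sees for subsequent suggestions.
        if i < len(start_phrase):
            text += start_phrase[i]
        else:
            text += suggested
    return text

print(generate("Mi estas ", length=40))
```

The important detail is that the forced characters still flow back into the model’s context, so by the time the loop passes the end of the prompt, the network has “read” the whole starting phrase.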

Here are sentences from the first two runs:

Mi estas produktita de pli malgrandaj por la planto kiu estas la plej multaj plantaj produktantoj de la plej gravaj produktoj de la manao de la malsamaj specialaj sistemoj.

Mi estas la unua konsideranta liberecon de la superreganto de laboro de la sendapora komponao de la substanco.

The first one is grammatically correct but incomprehensible. The second one starts out well enough (“I am the first considering freedom of the dominance of work”) but overall doesn’t make sense.

Testing for contextual accuracy

I changed the starting phrase to “Li parolas ” (he speaks).

Li parolas pri la superforto kaj priskribis la subtenon de la alta mondo.

He speaks about the strength and described the support of the high world.

This is abstract, but from the choice to use the word ‘described’, it seems like some context is there. Now with past tense:

Li parolis pri la substanco kaj la aliaj landoj de la plej alta procezo.

He talked about the substance and the other countries of the most high process.

Trying “scribas” (writes):

Li scribas pri la plej malnovaj sistemoj de la produktado de la superregado de la angla lingvo.

He writes about the oldest systems of the production of the dominance of the English language.

And “adoras” (adores):

Li adoras lingvan signifon.

He adores language meaning.

I attempted to use “manĝas” (eats), but I had overlooked that this script uses only part of ASCII to turn text into numerical categories during training, omitting accented letters. Accented letters are fairly common in Esperanto (the third-person pronouns include ŝi and ĝi). This is a major issue, and I should re-train the model before going any further.

Re-training the model

The current code expects ASCII letters without accents. The 128 ASCII code points include the Latin alphabet (A being 65, Z being 90, a being 97, z being 122) plus digits, punctuation, and other common special characters. In my_txtutils.py, the model reduces this to a subset of only 98 characters.

To include letters such as ĝ we need to check our Unicode tables: the Latin Extended-A block runs up to code point 383. Esperanto uses 6 of these letters (12 when counting uppercase and lowercase), so we can keep our alphabet subset small at 110 characters. I make a lookup dict for these additional letters (though the program mostly uses this secondary dict to look up letters by codepoint).
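A sketch of that lookup, assuming the 98 existing categories stay in place and the new letters are appended after them (the dict and constant names here are mine, not from my_txtutils.py):

```python
# The 12 accented Esperanto letters, all inside Unicode Latin Extended-A
# (code points 256-383): ĉ/Ĉ, ĝ/Ĝ, ĥ/Ĥ, ĵ/Ĵ, ŝ/Ŝ, ŭ/Ŭ.
ESPERANTO_EXTRAS = "ĈĉĜĝĤĥĴĵŜŝŬŭ"

# Append the new letters as categories 98..109, after the
# 98 ASCII-derived categories the model already uses.
extra_index_by_codepoint = {ord(ch): 98 + i
                            for i, ch in enumerate(ESPERANTO_EXTRAS)}

# The reverse dict: recover a letter from its category index,
# which is what decoding the model's output mostly needs.
extra_codepoint_by_index = {v: k
                            for k, v in extra_index_by_codepoint.items()}

ALPHASIZE = 98 + len(ESPERANTO_EXTRAS)  # 110 categories in total
```

Keeping the original 98 categories untouched means the encoding of plain ASCII text is unchanged; only texts containing accented letters map into the new slots.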

I let the new model run for ~12 hours. It is now producing words with accents, and not excessively so. It also has a charmingly alliterative opening with “the process of the progress…”

La procezo de la progreso estas la unua kaj kultivataj en la mondo kaj de la proverbaro de la provinco kaj la plej multe pli ol 15000 koloniaj kontraŭuloj.

Time to try one of the previous sentence completions:

Li adoris la tradukon de kontinento en Ĉinio kiel la plej multaj el tiuj elementoj de la mondo.

He adored the translation of a continent in China, like most of those elements of the world.

It has the -n ending, and is grammatically correct. Now for the context question — what to eat?

Ili manĝis la terminon kiel en malgranda komputilaĵo.

They ate the term as in a small computer device.

No, that’s no good at all! What does eating have to do with small computers? I searched my Wikipedia source text, and although a few dozen articles use “manĝas” or “manĝis”, it’s more common to write some form of the noun “manĝo” (meal). I tried a few other words before getting a few nice answers for “iris” (went).

Li iris al la planedo de lia propra regado.

He went to the planet of his own dominion.

Li iris al Aleksandro.

He went to Alexander.

Prompting with “Ŝi iris al ” (she went to):

Ŝi iris al Ateno, kie li estis priskribita en la jaro 2010.

She went to Athens, where he was described in the year 2010.

But for the majority of runs, the output is not related to the action of going. I tried cutting off the prompt at “Li dir” (he says/he said) to see which verb tense might be preferred by the script:

Li diris “ke la proporcio de tiu signifo estis konsiderata kiel “malfruaj “popoloj en la termino

Li diris, ke la termino “malsama senco” estas la plej granda por la plej grava ekonomia koncerna por la produktado kiel la tutmonda komunuma por la tu

Ŝi diris ke la plantoj estis presitaj.

Ŝi diris “ke “la plej grava parto de la mondo pri la ekonomiaj kontinentaj popolaj lingvoj estis kon ataj, kaj en la tuta loĝantaro de la malvarma mili…

I think we’re seeing an encyclopedic bias: an encyclopedia is unlikely to contain an article about what Thomas Jefferson ate or did day-to-day, but it is very likely to contain “He said…” followed by quotation marks, or “She said that [statement].”

I could include Esperanto text from newsletters and books to try to counteract this.

Looking for bias

I decided the best place to look for gender bias in the model would be “Lia laboro” (his work) versus “Ŝia laboro” (her work).

Ŝia laboro estas pli malgranda por la plej grandaj kaj plenumataj landoj en la marbordo de la mondo.

Ŝia laboro estas la unua kontinento, kie la radikoj de la progreso estas pli multaj antaŭ la malsupea kaj plejparto de la mondo, kie li estas plej oft

Lia laboro en la moderna senco de la reĝo Portegalo kiuj estas precipe en la tutmonda parto de la malsama sistemo de la mondo.

Lia laboro de la malvenko en la arbaro en la jaro 1990.

Lia laboro estas preskaŭiuj elementoj de la plej multaj arboj de la plantoj.

These sentences are meaningless, and though initially I saw a weird number of references to plants for women, the last sentence includes plants and trees for men, too.

The process, if I did want to reduce bias, would be something like:

Notice gendered pronouns, nouns, and maybe names (?), and fork the current sentence from the point just before the gendered word appeared.

For each branch of the sentence, re-run all prior text of the sentence, and substitute in opposite-gender pronouns, nouns, or maybe names.

When it is time to make suggestions for letters, combine probabilities of all branches, OR allow both branches to pick letters until they make a word, then find some way to decide which word was better on some level.
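A rough sketch of the probability-combining variant of that last step, with a stand-in for the model’s softmax output (swap_gender, next_char_probs, and the swap table are all hypothetical; a real version would also need to handle accusative forms, gendered nouns, and names):

```python
import numpy as np

# Minimal pronoun swap table; a real one would cover many more forms.
SWAPS = {"li": "ŝi", "ŝi": "li", "lia": "ŝia", "ŝia": "lia",
         "Li": "Ŝi", "Ŝi": "Li", "Lia": "Ŝia", "Ŝia": "Lia"}

def swap_gender(sentence):
    """Fork the prompt: rewrite it with opposite-gender pronouns."""
    return " ".join(SWAPS.get(w, w) for w in sentence.split())

def next_char_probs(prompt, alphasize=110):
    """Stand-in for the RNN's softmax over the 110-letter alphabet;
    the real model conditions on the prompt. Here: arbitrary but
    normalized, so the combining logic can be demonstrated."""
    rng = np.random.default_rng(len(prompt))
    p = rng.random(alphasize)
    return p / p.sum()

def debiased_next_char_probs(prompt):
    # Run both the original and the gender-swapped branch, then average
    # the two distributions before sampling the next character, so that
    # neither branch's bias dominates the suggestion.
    branches = [prompt, swap_gender(prompt)]
    return np.mean([next_char_probs(b) for b in branches], axis=0)

probs = debiased_next_char_probs("Ŝia laboro estas")
```

The averaged array is still a valid probability distribution (it sums to 1), so it can be fed straight into whatever sampling step the generation loop already uses.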

In the next step I will let a model run for longer, and see if I can get it to suggest words and catch grammar errors!

Continue reading with Part 3: Correcting Grammar