I had the pleasure of spending a couple of months this summer interning at the ADAPT Centre in Dublin City University, working with the GaelTech team of Dr. Teresa Lynn and PhD candidates Abigail Walsh and Meghan Dowling, on resources and tools for Irish language technology.

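Foireann GaelTeic ag an ócáid intéirneachta inniu san @AdaptCentre! Tá sárobair déanta ag Noah maidir leis a chuid taighde ar theicneolaíocht na #Gaeilge pic.twitter.com/DJA2WlvKqF (Meghan Dowling, @ismisemeg, July 18, 2018) [In English: "The GaelTech team at the internship event today at the @AdaptCentre! Noah has done excellent work on his research into Irish language technology."]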

I had a lot of fun while I was there, and learned a lot about research and the postgraduate life. I will finish my undergraduate degree soon, and I have no doubt that this experience will be extremely useful when I choose what to do afterwards!

I got to work on a few very interesting things at ADAPT. Below is a photo Meghan took of the poster I made (after some procrastination), which sums up some of my projects.

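Tá grianghraif agam de, ach seans nach bhfuil sé ró-shoiléir. Is le @iandioch é agus tá @aneihe ag obair ar an stuif seo chomh maith, beidh tuilleadh eolais acu! pic.twitter.com/Tjvymx2l1J (Meghan Dowling, @ismisemeg, July 18, 2018) [In English: "I have photos of it, but it might not be too clear. It's @iandioch's, and @aneihe is working on this stuff as well, they'll have more information!"]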

However, not mentioned on the poster is my final project, which was to do with programmatically answering yes/no questions in Irish. Below, I explain how "yes" and "no" work in the language, show why this causes an issue in machine translation, and then briefly explain how I tried to get a computer to approach the problem.

"Yes" and "No" in Irish

Irish is a Celtic language, spoken (spoiler alert) on the island of Ireland. It has 2 grammatical genders, 4-ish grammatical cases, VSO word order, and no words for "yes" or "no".

The lack of words for "yes" and "no" is surprising to many people I mention it to. If you were to translate the answers to the question "Is she ok?" to Irish, it would look something like this:

English | is she ok? | yes | no
Irish | an bhfuil sí ceart go leor? | tá | níl
Literal translation | is she alright? | is | isn't

As you can see, instead of "yes" or "no", you reply with "is" or "isn't". The person answering the question takes the verb from the question and repeats it for the response.

Another example:

English | will you go to the disco tonight? | yes | no
Irish | an rachaidh tú chuig an dioscó anocht? | rachaidh | ní rachaidh
Literal translation | will you go to the disco tonight? | will go | won't go

Machine translation systems struggle with differences in languages like this one. Google Translate is probably the most popular machine translation system in use right now; here is its attempt at translating something along these lines:

English | Will you go to the disco tonight? Yes? No?
Google Translate's Irish | An dtéann tú ar an dioscó anocht? Tá? Níl?
Literal translation of the "Irish" | Do you go to the disco tonight? Is? Isn't?

This is clearly a bad translation. In Irish, in order to correctly translate a "yes" or "no", you need to reference the question you were asked. You take the main verb, maybe inflect it in some way, and respond with it. You can't just translate the word "yes" in isolation; you need to know what you're saying "yes" to. Google Translate doesn't seem to know how to do this.
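To make the idea concrete, here is a toy Python sketch of answering by echoing the verb. It is only a hand-written lookup covering the two example questions above; the table `ECHO_ANSWERS` and the function `answer` are made up for illustration, and a real system would have to analyse the verb rather than match strings.

```python
# Toy lookup: question opener -> (positive answer, negative answer).
# Hand-written from the two example questions; purely illustrative.
ECHO_ANSWERS = {
    "an bhfuil": ("tá", "níl"),
    "an rachaidh": ("rachaidh", "ní rachaidh"),
}


def answer(question: str, agree: bool) -> str:
    """Return the Irish "yes" or "no" for a question the table knows about."""
    normalised = question.strip().lower()
    for opener, (positive, negative) in ECHO_ANSWERS.items():
        if normalised.startswith(opener + " "):
            return positive if agree else negative
    raise ValueError(f"don't know how to answer: {question!r}")


print(answer("An bhfuil sí ceart go leor?", agree=True))
print(answer("An rachaidh tú chuig an dioscó anocht?", agree=False))
```

Note that the same English word "yes" comes out differently depending on the question, which is exactly what a word-by-word translation misses.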

Languages are weird

The above examples are complicated by how English uses auxiliary verbs "will" and "won't" for its future tense. Also, English word order changes from being SVO ("he is a dog.") in normal phrases to being VSO ("is he a dog?") in questions. English is a weird language.

Here is how the structure of a plain statement, rather than a question, compares in Irish and English:

English | You will go to the disco tonight.
Irish | Rachaidh tú chuig an dioscó anocht.
Literal translation | ["go" in future tense] you to the disco tonight.

Irish doesn't have an auxiliary verb here like English does, and it is consistently VSO in phrases like this.

English is indeed weird in switching word order and using auxiliary verbs, but of course, Irish is weird in its own ways too. For example, try to figure out how the following translation makes sense:

English | do you have a dog? | yes | no
Irish | an bhfuil madra agat? | tá | níl
Literal translation | is a dog at you? | is | isn't

We saw earlier in the "is she ok?" example that "an bhfuil?" means "is?" in Irish. "is <something> <something>?" doesn't seem like it can translate as "do you have a dog?", but let's see.

"madra" is the Irish for "dog". But Irish has no indefinite article, so "madra" can equally mean "dog" or "a dog". So now we know "an bhfuil madra agat?" is "is [a] dog <something>?".

Irish has prepositional pronouns; it combines prepositions (like "ag", which means "at") with pronouns (like "tú", which means "you") to create one word (like "agat", which means "at you"). For some more examples: "agam" means "at me"; "aige" means "at him"; "romham" means "before me".

So how does "is a dog at you?" mean "do you have a dog?"? The language has no verb "to have"; it uses an "existential clause" to indicate possession instead. "I have a dog" is translated as "a dog is at me". Similarly, "you have a cat" would be "tá cat agat", and "Seán had bananas" would be "bhí bananaí ag Seán".

Maybe it makes some sense now that the phrase "do you have a dog?" is "an bhfuil madra agat?" (literally "is a dog at you?") in Irish. I suppose Irish is a weird language too.

Language quirks like those mentioned above further complicate translating a simple "yes" or "no" answer. We saw previously that you need to analyse the verb in the question before you give an answer, but now we see that the question verb might not be the same in the two languages; the main verb of "do you have a dog?" is "have" in English, but "is" in Irish. So even if all you want to do is translate the word "yes", you'll first have to translate whatever question you are trying to answer!

Sidenote: the copula

Irish has this thing called the copula, which is kind of like a verb but isn't actually a verb. Lots of yes/no questions asked are based on this, and are answered in a different way to verbal questions.

The copula is used to express identity or equivalence, but I'll leave you to read the Wikipedia article on it for a better explanation. I'll just give a few examples of its usage:

English | is Máire a doctor? | yes | no
Irish | an dochtúir í Máire? | 'sea | ní hea
Literal translation | is a doctor her Máire? | are they | aren't they

A slightly different question changes the question and answer quite a bit:

English | is Máire the doctor? | yes | no
Irish | an í Máire an dochtúir? | is í | ní hí
Literal translation | is her Máire the doctor? | is her | isn't her

And here's a more complicated usage:

English | would you like a drink? | yes | no
Irish | ar mhaith leat deoch? | ba mhaith | níor mhaith
Literal translation | would be good with-you a drink? | would be good | wouldn't be good

How can a computer do this?

Fluent Irish speakers can do all of the above without thinking about it. But it's difficult to get a computer to do it.

Guided by Dr. Teresa Lynn, I made progress on a tool to do this. These are the steps involved.

First, I compiled a small corpus of yes/no questions and answers in English and Irish. I used this to help develop and to test the tool. The tool itself then works as follows:

1. You give the tool a question in Irish that you would like to answer with a "yes" or "no", e.g. by running the command:

    echo "an bhfuil sí ceart go leor?" | python3 process.py

2. The tool runs a part-of-speech tagger for Irish written by Dr. Elaine Uí Dhonnchadha.

3. The output of the POS tagger is converted into CoNLL-U format.

4. The CoNLL-U data is passed to Maltparser, trained on the Irish dependency treebank. This adds grammatical annotations to the question, showing the relations between words.

5. The tool can now analyse the grammar of the question, and decide if it is a verbal question, or is based on the copula[1].

If the question is verbal, the tool tries to gather information about the lemma and tense/aspect/mood of the verb.

If the question is copular, the tool tries to find the copula form used[2], and its lemma. It also tries to find the associated predicate, and its lemma. It outputs a report based on this information, that can be used to generate a response to the question[3].
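The verbal-versus-copular decision can be sketched in a few lines of Python. This is a minimal illustration of the root check, not the actual code in process.py: the sample CoNLL-U below is written by hand (it is not real tagger or parser output), and the function names, the "copula" type label, and the use of the UPOS tag "VERB" to recognise verbs are all assumptions.

```python
# Hand-written CoNLL-U rows for illustration only, NOT real parser output.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE_ROWS = [
    ("1", "Nach", "nach", "PART", "_", "_", "2", "mark:prt", "_", "_"),
    ("2", "mbuailfidh", "buail", "VERB", "_", "Mood=Ind|Tense=Fut", "0", "root", "_", "_"),
    ("3", "Eilís", "Eilís", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
    ("4", "an", "an", "DET", "_", "_", "5", "det", "_", "_"),
    ("5", "bithiúnach", "bithiúnach", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ("6", "?", "?", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n"


def parse_conllu(text):
    """Turn one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "upos": cols[3], "feats": cols[5], "head": int(cols[6]),
        })
    return tokens


def classify(tokens):
    """Call the question verbal if the root is a verb, otherwise copular."""
    root = next((t for t in tokens if t["head"] == 0), None)
    if root is None:
        raise ValueError("no root token found")
    if root["upos"] == "VERB":
        return {"type": "verb", "lemma": root["lemma"], "surface": root["form"]}
    return {"type": "copula"}


print(classify(parse_conllu(SAMPLE)))
```

Because the decision rests entirely on which word the parser picks as the root, any parsing mistake feeds straight into the classification.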

Here's an example output of the tool:

{
    "type": "verb",
    "error": null,
    "verb": {
        "lemma": "buail",
        "surface": "mbuailfidh",
        "tams": [ "FutInd" ]
    },
    "question": "Nach mbuailfidh Eilís an bithiúnach ?"
}

It tells you that the question, "Nach mbuailfidh Eilís an bithiúnach?" ("Won't Eilís hit the scoundrel?"), is verbal, based on the verb "buail" in the FutInd (future indicative) tense. Your answer should then also be the verb "buail" in the FutInd tense. The tool outputs all this in JSON, so it can be easily parsed by another tool to synthesise the verb response. I really want to put together this second tool soon, so that the full pipeline will be complete.
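As a rough idea of what that second tool might look like, here is a Python sketch that reads the report and picks a response. It does not exist yet in the project: the `RESPONSES` table is a hypothetical, hand-written stand-in for a real verb synthesiser, and it only knows the one verb form from the example report.

```python
import json

# The example report from above, as it might arrive from the first tool.
REPORT = """
{ "type": "verb", "error": null,
  "verb": { "lemma": "buail", "surface": "mbuailfidh", "tams": [ "FutInd" ] },
  "question": "Nach mbuailfidh Eilís an bithiúnach ?" }
"""

# Hypothetical lookup: (lemma, tense/aspect/mood) -> (positive, negative).
# A real tool would generate these forms instead of listing them by hand.
RESPONSES = {
    ("buail", "FutInd"): ("buailfidh", "ní bhuailfidh"),
}


def respond(report_json: str, agree: bool) -> str:
    """Pick the Irish "yes" or "no" for a verbal question report."""
    report = json.loads(report_json)
    if report["error"] is not None:
        raise ValueError(f"report contains an error: {report['error']}")
    if report["type"] != "verb":
        raise NotImplementedError("only verbal questions are handled here")
    verb = report["verb"]
    positive, negative = RESPONSES[(verb["lemma"], verb["tams"][0])]
    return positive if agree else negative


print(respond(REPORT, agree=True))
print(respond(REPORT, agree=False))
```

Keeping the analysis and the synthesis as two tools joined by JSON means either half can be replaced or improved without touching the other.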

I learned so much in doing this project, and just as much again in doing the other projects across my internship. More importantly, I got to work with a language I love with some superb people, and I'm very thankful.

There's still no shortage of other work to do in Irish language technology (and no shortage of improvements to make to just this tool). Hopefully I'll get to do some more soon!

[1] Right now, the tool decides if the question is based on the copula by just looking at the root of the dependency tree, and checking whether the word there is not a verb. Unfortunately, Maltparser sometimes parses the question incorrectly. It may pick a verb that shouldn't be the root to be the root (giving a false negative for the question being copula-based), or choose a non-verb to be the root of the sentence where it should have chosen a verb (giving a false positive). Edit: It was suggested that I change the word "top" here to "root", to maybe make this footnote more easily understood. I also need to note that the inaccuracy mentioned here is by no means the fault of Maltparser, but is due to the fact that the Irish treebank that Maltparser is trained on is very small, at just 1020 trees. Irish is a low-resource language! If/when there is more data, it will be more accurate.

[2] The tool tries to walk down the tree, finding the uppermost item tagged as being the copula. However, the part-of-speech tagger sometimes struggles with the copula ("an" is often marked as being a definite article instead of the copula, and "ar" is sometimes marked as a preposition instead of the copula, for example). To try to work around this, there's an explicit list of likely forms of the copula in a question that are also looked for, regardless of the part of speech reported for these words.
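A minimal Python sketch of this search, for a question already judged to be copular, might look like the following. The tag name "Cop", the token layout, and the contents of `LIKELY_QUESTION_COPULAS` are assumptions made for illustration (the post does not show the tool's real list), and the example tokens are written by hand with "Ar" deliberately mis-tagged as a preposition.

```python
from collections import deque

# Assumed list of copula forms that can open a question; illustrative only.
LIKELY_QUESTION_COPULAS = {"an", "nach", "ar", "nár", "arbh", "nárbh"}


def find_copula(tokens):
    """Breadth-first walk from the root; return the uppermost copula-like token."""
    children = {}
    for tok in tokens:
        children.setdefault(tok["head"], []).append(tok)
    queue = deque(children.get(0, []))  # head 0 marks the root
    while queue:
        tok = queue.popleft()
        tagged = tok["pos"] == "Cop"
        listed = tok["form"].lower() in LIKELY_QUESTION_COPULAS
        if tagged or listed:
            return tok
        queue.extend(children.get(tok["id"], []))
    return None


# "Ar mhaith leat deoch?" with hand-made annotations (not real tool output).
TOKENS = [
    {"id": 1, "form": "Ar", "pos": "Prep", "head": 2},
    {"id": 2, "form": "mhaith", "pos": "Adj", "head": 0},
    {"id": 3, "form": "leat", "pos": "Pron", "head": 2},
    {"id": 4, "form": "deoch", "pos": "Noun", "head": 2},
]

print(find_copula(TOKENS))
```

The breadth-first order is what makes the match "uppermost": tokens nearer the root are checked before anything deeper in the tree. The fallback list can still misfire on an ordinary article or preposition, so it trades some precision for recall.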