Messing With Intents Translation — Part I

Generative Adversarial Networks at intent mapping


I bet you’ve heard the same story told in two different media outlets and wondered whether they were really talking about the same thing. At Lang.ai that happens a lot.

It’s not that we feed content to cater to the whole media spectrum. Quite the opposite: we analyze data with multinational reach, written in many languages, with all their cultural inflections, idioms, and personal styles.

If you happen to launch a video game worldwide, you may be interested in reactions from every corner of the globe. Moreover, you will need a global analysis that evaluates cross-lingual relations among comments.

A common question is “are these people talking about the same thing?” or, similarly, “can I group comments by intent regardless of the language?”

Online automatic translators are the first idea we all come across: parse all the documents in a source language, translate them into the target language, and check whether they match well. However, the more domain-specific the text, the harder it is for translators to do a good job.

We prefer a workaround: first find what the pieces of text are about, and then check how similar they are to each other.

We use intents as semantic blocks that summarize what a piece of text is about. Many intents can be present in a text, so we can group several documents related to a common intent.

Our customers use our technology to inspect unstructured data and discover the main intents underneath. So far, intents are separated by language. Thus, two groups of documents about the same game situation in two languages will lie in two separate intents. We may guess there is some relationship between them, but we do not know its strength and, of course, there is no source-target intent pair to use for supervised learning.

We need an unsupervised approach to score the similarity among intents in source and target languages.

Mikolov’s clue

Let’s imagine for a moment that we try to map words instead of intents across two languages. Mikolov et al. computed word embeddings for the source and target languages separately and realized that there is a linear mapping W, especially for close languages, that matches words in the two languages.

In other words, W x ≈ y for known translation pairs (x, y), where W minimizes the total squared error Σᵢ ‖W xᵢ − yᵢ‖². This entails that you can compute monolingual embeddings in separate spaces and learn such a mapping whenever you have enough known (x, y) pairs.

This mapping can be seen as a linear approximation, but getting good translation results with such a linear mapping is fantastic news. It implies that you just need to rotate the embeddings to match the topology they have in the other language. The closer the languages are, the more accurate the approximation is: mappings between English and Spanish are better than mappings between English and Chinese.
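As a minimal sketch of Mikolov’s supervised setup, we can learn W by ordinary least squares on toy data where the “target language” embeddings are, by construction, a rotation of the source embeddings (all names and numbers here are illustrative, not from the original experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy monolingual "embeddings": the target vectors are a rotation of the
# source vectors plus a little noise, mimicking Mikolov's observation
# that the two spaces are related by a linear map.
d, n_pairs = 50, 500
R, _ = np.linalg.qr(rng.normal(size=(d, d)))        # a random rotation
X = rng.normal(size=(n_pairs, d))                   # source-word embeddings
Y = X @ R.T + 0.01 * rng.normal(size=(n_pairs, d))  # target-word embeddings

# Learn W minimizing sum_i ||W x_i - y_i||^2 over the known (x, y) pairs.
# lstsq solves X @ B = Y, so y ~= B.T @ x and W = B.T.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W = B.T

# With enough pairs, the learned map recovers the rotation almost exactly.
error = np.linalg.norm(W - R) / np.linalg.norm(R)
```

With real embeddings the fit is of course looser, but the principle is the same: a single linear map aligns the two spaces.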

Figure 1. Iterative adjustment process between monolingual word embeddings. Source: Conneau et al.

But Mikolov used a dictionary, i.e. (x, y) pairs, and we are looking for an unsupervised approach. So we just need to get rid of the dictionary. In separate works, Conneau et al. explain how to obtain a mapping W without initial (x, y) pairs, and Artetxe et al. iterate greedily over a small, easy-to-obtain list of pairs.

The method by Conneau et al. (MUSE) is more general since it imposes fewer restrictions, and it gets results quite similar to Mikolov’s supervised approach. It uses a Generative Adversarial Network (GAN) to generate a good mapping W, minimizing an overall distance between mapped and target words.

We tested this approach with words, using Wikipedia embeddings precomputed with FastText but restricted to words from a specific domain. We got the words from a corpus of interest: a semi-aligned Twitter newsfeed in Spanish and English from BBC (@bbcmundo, @BBCWorld), CNN (@CNNEE, @cnni), and elpais.com (@elpais_espana, @elpaisinenglish).
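Restricting pretrained embeddings to a domain vocabulary is mostly a filtering step. A sketch of how one might do it, using the plain-text `.vec` format that FastText ships its pretrained vectors in (the tiny in-memory file below is a stand-in for the real multi-gigabyte files):

```python
import io
import numpy as np

def load_vectors(vec_file, vocab):
    """Read a FastText .vec-style stream, keeping only words in `vocab`."""
    header = vec_file.readline()          # first line: "<n_words> <dim>"
    dim = int(header.split()[1])
    embeddings = {}
    for line in vec_file:
        word, *values = line.rstrip().split(" ")
        if word in vocab:
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings, dim

# Tiny in-memory example standing in for a file like wiki.es.vec.
fake_vec = io.StringIO(
    "3 4\n"
    "banco 0.1 0.2 0.3 0.4\n"
    "cuenta 0.5 0.6 0.7 0.8\n"
    "cebra 0.9 1.0 1.1 1.2\n"
)
domain_vocab = {"banco", "cuenta"}   # words actually seen in the domain corpus
embeddings, dim = load_vectors(fake_vec, domain_vocab)
```

Only in-vocabulary words survive, which keeps the mapping problem focused on the domain and the memory footprint small.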

Table 1. Highest scores for word (ES) to word (EN) translation with the Twitter newsfeed dataset

Intents, however, are made up of two or more words and represent categories. They group documents together and can be characterized by an underlying language model. Therefore, intents are combinations of words with a meaning, very much like sentences.

This raises the need for intent/sentence embeddings: dense vectors that grasp the meaning of a sentence. Word (unigram) embedding datasets typically exceed one billion words, so the amount of data you would need to obtain good sentence (n-gram) embeddings would be prohibitive. Researchers overcome this drawback by combining word embeddings a posteriori in such a way that they represent the meaning of a sentence.

We did not find much literature on intent embeddings, but there are some papers on sentence or paragraph embeddings. Two main trends for representing sentence meaning arise: 1) combine word embeddings into a new vector with equal (average) or different (tf-idf-like) weights; 2) stack word embeddings so the sentence is represented by the subspace spanned by its word embedding vectors (Mu et al., 2017; Arora et al., 2016b). Judging by the scores reported in those papers, results do not differ much between the two approaches, so we use the first one, weighting words according to their importance in the intent model. This approach gives us a bit more freedom to create our own importance weights.

We computed intent embeddings as a linear combination of the embeddings of the most important words in each intent model, where each word’s weight is the pointwise information gain between word and intent, normalized so the weights sum to one.
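A minimal sketch of this weighted combination. The exact information-gain formula is not reproduced in the post, so the PMI-style weight below (log of the ratio between a word’s in-intent and corpus-wide probabilities, clipped at zero) is an assumption, and all the numbers are hypothetical:

```python
import numpy as np

def intent_embedding(words, word_vecs, p_word_given_intent, p_word):
    """Weighted average of word embeddings. The weight is a PMI-style
    'pointwise information gain' (an assumption; the post does not show
    the exact formula), clipped at 0 and normalized to sum to one."""
    gains = np.array([
        max(np.log(p_word_given_intent[w] / p_word[w]), 0.0) for w in words
    ])
    weights = gains / gains.sum()
    vectors = np.stack([word_vecs[w] for w in words])
    return weights @ vectors

# Hypothetical numbers for a two-word intent.
word_vecs = {"refund": np.array([1.0, 0.0]), "order": np.array([0.0, 1.0])}
p_w_given_intent = {"refund": 0.30, "order": 0.10}   # within the intent
p_w = {"refund": 0.01, "order": 0.05}                # corpus-wide

vec = intent_embedding(["refund", "order"], word_vecs, p_w_given_intent, p_w)
```

Words that are much more frequent inside the intent than in the corpus at large dominate the resulting vector, which is the behavior the importance weighting is after.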

In order to test the approach, we tried several mapping setups and parameters:

1. Map words, then find the intent mapping

2. Map intents directly

3. Mix words and intent embeddings, map all of them

Scores were computed with the cross-domain similarity local scaling (CSLS) proposed by Conneau et al.: a cosine similarity corrected by the average similarity of each embedding to its nearest neighbors in the other space, which penalizes “hub” vectors that are close to everything.
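CSLS has a compact closed form: CSLS(x, y) = 2·cos(x, y) − r_tgt(x) − r_src(y), where r is the mean cosine similarity to the K nearest neighbors in the other space. A small sketch on toy data:

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """Cross-domain similarity local scaling (Conneau et al.).
    src, tgt: L2-normalized embeddings, shapes (n, d) and (m, d).
    Returns an (n, m) matrix of 2*cos(x, y) - r_tgt(x) - r_src(y)."""
    cos = src @ tgt.T                    # all pairwise cosine similarities
    k_t = min(k, tgt.shape[0])
    k_s = min(k, src.shape[0])
    # r_tgt(x): mean similarity of x to its k nearest target neighbors.
    r_tgt = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1, keepdims=True)
    # r_src(y): mean similarity of y to its k nearest source neighbors.
    r_src = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0, keepdims=True)
    return 2 * cos - r_tgt - r_src

# Toy check with orthonormal "embeddings": each vector should be its
# own best match, with diagonal score 2*1 - 0.2 - 0.2 = 1.6 for k=5.
src = np.eye(8)
scores = csls_scores(src, src, k=5)
```

The local scaling terms subtract more from vectors that sit in dense regions, so a match only scores well if the two vectors are closer to each other than to their respective neighborhoods.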

The best option for finding a good mapping was mixing word and intent embeddings and learning a map for all of them together. We performed a grid search over several parameters and found that the defaults generally worked fine, except that the hidden dimension of the discriminator needed to exceed 1024 units.

Best-scoring intent matches are in Table 2. In our experience, scores over 0.06 usually indicate a reliably good mapping, and about 15% of intents reached that score on the news dataset.

Table 2. Highest-scoring matching intents between the Spanish and English datasets. Only positive scores are shown, up to 5 intents. Most of the matched intents are correct, except ‘league+champions’, which did not have a counterpart in the dataset.

We also tested the refinement method based on the Procrustes algorithm, but it consistently made our intent mappings worse.
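For reference, the Procrustes refinement step has a closed-form solution: the orthogonal W minimizing ‖XWᵀ − Y‖ over matched pairs is UVᵀ, from the SVD of YᵀX. A sketch on noiseless toy data, where it recovers a known rotation exactly:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: the rotation W minimizing ||X W^T - Y||_F
    given matched rows of X (source) and Y (target). Closed form via the
    SVD of Y^T X; this is the refinement step used in Conneau et al."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy check: recover a known rotation R from matched, noiseless pairs.
rng = np.random.default_rng(2)
d = 6
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(100, d))
Y = X @ R.T            # y_i = R x_i exactly
W = procrustes(X, Y)
```

The catch is that refinement trusts the current best matches as anchors; when many intent matches are wrong or weak, as in our case, iterating on them can drag the mapping away from the good pairs.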

Cross- and intra-domain testing

Alternatively, we tested embedding mappings on other datasets with categories annotated by human-crafted rules. We had Spanish and English Twitter feeds from the banking sector, as well as other domains only in Spanish, so we could perform two types of analysis:

Mono-language cross-domain intent translation: ES to ES

Cross-language mono-domain intent translation: Banking (ES to EN)

First, we selected domains with enough interesting, nominally shared categories. In some cases (banking, telco) they are different domains; in others (banking, spanish_bank) they are domain-subdomain pairs.

We extracted intents from these datasets and tested the mappings between our supervised categories, unsupervised intents, words, and mixes of them. Our results are consistent with the news dataset for the linguistic categories: good precision for higher scores but not such good recall. The same holds for the cross-domain categories telco + banking (Table 3) and telco + shopping_center (Table 4), and for the intra-domain pair banking + spanish_bank (Table 5).

Table 3. Highest scoring matching supervised categories among Spanish and English in banking (left) and telco (right) dataset. Only positive scores are shown.

Table 4. Highest scoring matching supervised categories among Spanish and English in telco (left) and one_shopping_center (right) dataset. Only positive scores are shown.

Table 5. Highest scoring matching supervised categories among Spanish and English in banking (left) and one_spanish_bank (right) dataset. Only positive scores are shown.

Whereas precision is still quite impressive for telco to banking and for the intra-domain banking to spanish_bank pair, results are not as outstanding for the telco to shopping_center domains.

Apparently, the similarity between domains plays a key role in these results, but we have not verified this extensively.

In addition, results for cross-domain matching of unsupervised intents were quite poor in all the cases we tested.

Categories usually have much larger support (the number of documents related to them) than intents, so they are better defined by their underlying models.

Regarding the cross-language mono-domain intent translation problem, the matching is not good in any of the mappings we tested: category to word, category to intent, intent to intent, and intent to word. Looking at the data, we think the topics users talk about in the two datasets are rather different: they discuss specific situations and events with very few points in common.

Insights

Mikolov showed an interesting fact: in close languages, embeddings have a similar structure. Using this structure can be enough to find good word translations.

GANs are an interesting tool for generating word-to-word translation mappings without using parallel data.

However, when analyzing real-world problems, it is often the case that you have some domain-specific data. Usually the amount of data is not enough to train good embeddings of your own, so you need to borrow pre-trained embeddings, typically computed from Wikipedia and other open sources.

Mikolov only showed the relationship between word embeddings, not between sentences or any other aggregation of embeddings. Our experiments were intended to show how it works with intents and categories shaped as combinations of word embeddings.

We found that if you have a good model defining your categories or intents, and they have extensive support, you can achieve results with good precision, although recall is usually poor. Although we do not have a gold-standard dataset to support this claim, a visual inspection of the results pointed in this direction.

In addition, common ways of creating sentence embeddings from word embeddings seem to leave much room for improvement, and any enhancement would directly impact the results we obtained.

It’s hard to say, in an unsupervised way, whether two groups of people in different languages are talking about exactly the same thing when the domain is quite specific (except for a small group of topics).

In the second part of this series, we plan to better examine word embeddings, their translations and how they deal with polysemy.