Achieving State-Of-The-Art Results In Natural Language Processing

Part 1: Fundamental Concepts

Badoo is the largest dating network in the world, with over 400 million registered users worldwide. Users of our site are prompted to write an “About me” section in order to give any prospective love interests a little bit more information about themselves.

We perform Natural Language Processing (NLP) on this text data to derive insights into who our users are, their hobbies/interests and what they are looking for in a partner so that we can help them find a good match.

The Badoo Data Science team are researching ways to enhance our current approaches to such NLP tasks. In this two-part article I will give an overview of our findings. Part one introduces transfer and multi-task learning and then looks at some core concepts. Part two will explain how these concepts have supported the achievement of some state-of-the-art (SOTA) results using architectures such as ELMo, BERT and MT-DNN.

Transfer vs. multi-task learning

Transfer learning is the process of training a model on a large dataset and then using this pre-trained model as the starting point for learning another, target task. One of its key benefits is that models can give very good results with small amounts of target-task data, whereas training from scratch typically requires a huge dataset and long training times. Transfer learning first became popular in computer vision, and in recent years it has driven many advances in NLP.

Multi-task learning is a type of transfer learning in that it transfers knowledge across tasks. The difference between the two is that in multi-task learning the tasks are learned in parallel, whereas in transfer learning they are learned sequentially.

Word embeddings

The concept of word embedding plays a critical role in the realisation of transfer learning for NLP tasks. Word embeddings are essentially fixed-length vectors used to encode and represent words.

The vectors represent words as points in a multidimensional continuous space, where semantically similar words are mapped to nearby points. For example, the vectors for sports like “tennis”, “badminton” and “squash” are mapped very close together.
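
As a rough illustration of what "close together" means, the sketch below compares vectors using cosine similarity, a common (though not the only) closeness measure. The three-dimensional vectors and the extra word "banana" are invented for this example; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Toy 3-dimensional embeddings, invented purely for illustration.
# Real embeddings are learned from a corpus and are much larger.
embeddings = {
    "tennis":    [0.90, 0.80, 0.10],
    "badminton": [0.85, 0.75, 0.20],
    "squash":    [0.80, 0.90, 0.15],
    "banana":    [0.10, 0.20, 0.95],  # hypothetical unrelated word
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_sports = cosine_similarity(embeddings["tennis"], embeddings["badminton"])
sim_other = cosine_similarity(embeddings["tennis"], embeddings["banana"])

print(f"tennis vs badminton: {sim_sports:.3f}")
print(f"tennis vs banana:    {sim_other:.3f}")
```

With these made-up values the two sports score far higher against each other than either does against the unrelated word, which is the behaviour learned embeddings are expected to show.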

A key benefit of representing words as vectors is that mathematical operations can be performed on them, such as:

king - man + woman ≈ queen
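
The analogy above can be sketched with hand-crafted vectors. Everything in this example is an assumption made for illustration: the two axes (loosely "royalty" and "gender"), the values, the extra word "apple" and the helper name `analogy`. With learned embeddings the result is only approximately equal to the vector for "queen", so the usual approach is to return the nearest word in the vocabulary.

```python
import math

# Hand-crafted 2-d vectors: axis 0 is loosely "royalty", axis 1 "gender".
# These values are invented; learned embeddings are not this tidy.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],  # hypothetical distractor word
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vocab[w], target))

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```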

Importance of context

To get a good representation of text data it is crucial for us to be able to capture both the context and semantics. For example, the word “minute” is spelled the same whether it means a unit of time or “very small”, yet the two meanings are very different.

In addition to this, neighbouring words can change the meaning of a word. For example, “good” and “not good” convey two very different sentiments.

By incorporating the context of words we can achieve high performance for downstream real-world tasks such as sentiment analysis, text classification, clustering, summarisation, translation etc.

Traditional approaches to NLP such as one-hot encoding and bag of words models do not capture information about a word’s meaning or context. By contrast, neural network-based language models learn to predict words from their neighbouring words, taking into account the order in which words appear in the corpus.
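
To make the limitation of bag of words concrete, the minimal sketch below builds word-count representations for two sentences with opposite sentiment. The sentences and the helper name `bag_of_words` are made up for this example.

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence by its word counts, ignoring word order."""
    return Counter(sentence.lower().split())

positive = bag_of_words("the food was good not bad")
negative = bag_of_words("the food was bad not good")

# Both sentences contain exactly the same words, so their
# bag-of-words representations are indistinguishable.
same = positive == negative
print(same)  # True
```

Because word order is thrown away, a model that sees only these counts has no way of telling the two sentences apart.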

Context can be incorporated by constructing a co-occurrence matrix. This is computed simply by counting how often two or more words occur together in a given corpus.
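
One possible way to count co-occurrences is sketched below. It assumes that "occur together" means "appear within a fixed window of each other in the same sentence"; the window size, the toy corpus and the function name `cooccurrence_counts` are all choices made for this example.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right, then record both orders
            # so that the resulting matrix is symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

corpus = [
    "i love tennis",
    "i love badminton",
    "i play tennis",
]
counts = cooccurrence_counts(corpus, window=2)

print(counts[("i", "love")])       # 2
print(counts[("love", "tennis")])  # 1
```

Each `(row word, column word)` key corresponds to one cell of the co-occurrence matrix; pairs that never occur together simply have a count of zero.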

Word2Vec & Global Vectors for Word Representations (GloVe)

Word2Vec and GloVe are both methods for producing word embeddings from co-occurrence information. Word2Vec is a shallow neural network that only takes local contexts into account: it learns to predict a word from its neighbours within a fixed window, or the neighbours from the word. By contrast, GloVe is trained on global co-occurrence counts gathered over the whole corpus. It factorises the co-occurrence matrix into a lower-dimensional, dense matrix of words and features, in which each row is the vector representation of a word.
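
For reference, the objective that GloVe minimises is usually written as a weighted least-squares fit to the logarithm of the co-occurrence counts. The notation below follows the GloVe paper rather than this article: X_ij is the number of times word j appears in the context of word i, V is the vocabulary size, w_i and w̃_j are the word and context vectors, and b_i and b̃_j are bias terms.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function f stops very frequent pairs from dominating the fit and gives zero weight to pairs that never co-occur. The paper reports using x_max = 100 and α = 3/4.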

Bidirectional-Recurrent Neural Networks

The context of a word doesn’t simply depend on the words before it, but also on those following it; bi-directionality matters. Bidirectional-RNNs process text both left to right and right to left. Together with character-level RNNs, which improve the embeddings of underrepresented or out-of-vocabulary words, they have led to many state-of-the-art neural NLP breakthroughs.
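
The idea can be sketched with a deliberately tiny RNN that has a single hidden unit and scalar inputs. The weights, the input sequences and the function names are arbitrary choices for illustration; a real bidirectional RNN uses learned weight matrices and vector-valued inputs.

```python
import math

def rnn_states(inputs, w_x, w_h):
    """Run a one-unit RNN over `inputs`; return the hidden state at each step."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = rnn_states(inputs, w_x=0.5, w_h=0.8)
    # Run over the reversed sequence, then flip back so that
    # backward[i] lines up with position i of the original input.
    backward = rnn_states(inputs[::-1], w_x=0.3, w_h=0.6)[::-1]
    return list(zip(forward, backward))

seq_a = [1.0, 0.5, -1.0]
seq_b = [1.0, 0.5, 2.0]  # differs from seq_a only in the last token

states_a = bidirectional_states(seq_a)
states_b = bidirectional_states(seq_b)

# The forward state at position 0 cannot see the last token,
# but the backward state at position 0 can.
print(states_a[0][0] == states_b[0][0])  # True
print(states_a[0][1] == states_b[0][1])  # False
```

The comparison at the end shows why both directions matter: only the backward pass lets the representation of the first token reflect what comes after it.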

Attention mechanisms

Attention mechanisms provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. These mechanisms have played a significant part in recent advances in NLP tasks.
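
A minimal sketch of this weighting is shown below. It assumes dot-product scoring followed by a softmax, which is one common choice among several (others use a small learned network to compute the scores). The vectors and the names `attend` and `query` are illustrative only.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, inputs):
    """Weight each input vector by its dot-product similarity to `query`."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in inputs]
    weights = softmax(scores)
    dim = len(inputs[0])
    # The context vector is the weighted sum of the input vectors.
    context = [
        sum(w * vec[d] for w, vec in zip(weights, inputs))
        for d in range(dim)
    ]
    return weights, context

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input position
query = [2.0, 0.0]                             # stands in for the current output state

weights, context = attend(query, inputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Inputs that align with the query receive larger weights and so contribute more to the context vector used for the prediction.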

Summary

In this article we have provided an overview of some of the core concepts that have helped achieve some SOTA results on NLP tasks. Keep an eye out for part two of this article, which will explain how these concepts have been leveraged in order to achieve these outcomes.

By the way, if cool data science and research projects like this one interest you, we are hiring, so do get in touch and join the Badoo family! 🙂