Deep learning is being used for a wide range of tasks both in business and academia. Companies increasingly implement deep models to enhance their customer service automation, build sophisticated real-time image processing systems, and more, while scientists experiment with neural architectures to achieve new breakthroughs in various areas of scientific research.

Today, we’ll be covering the implications of deep learning in natural language processing. We’ll review, briefly, the most powerful NLP technologies, touch on their pros and cons, and discuss how they’ve progressed over time.

Customer service automation, chatbots, and how it was in the old days

Before deep learning (neural nets) exploded a few years ago, NLP applications, including question answering bots, had hard-coded responses.

This means engineers had to predict users’ every request (and all possible wording variations) and compile a huge list of potential replies, tied to certain keywords, that a machine could use given a specific inquiry. These systems were terrible. They couldn’t hold a general conversation for a second, couldn’t understand the sentiment or preserve context. Instead of engaging a user, they infuriated them with their lack of comprehension and frequent misinterpretation of requests.

Then, machine learning researchers got their hands on big data and vast computational resources, and the deep learning boom happened.

Deep learning in NLP

Machines do not understand words, only numbers. Therefore, the language we communicate with should be converted into integers (or vectors of integers) for the algorithms to be able to process it. This is known as encoding.

The simplest type of encoding would be to turn a sentence like “I love cake” into “1” (for I), “2” (for love), and “3” (for cake). “I love cake” would become “1 2 3”.
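As a minimal sketch (the three-word vocabulary is just the example above, not a realistic setup), such an integer encoding might look like this in Python:

```python
# A toy integer encoding: each word in the vocabulary gets an arbitrary
# integer id, and a sentence becomes a list of ids.
vocabulary = {"i": 1, "love": 2, "cake": 3}

def encode(sentence):
    return [vocabulary[word] for word in sentence.lower().split()]

print(encode("I love cake"))  # [1, 2, 3]
```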

But a more common (and useful) way of representing words is through vectors.

Categorical variables, used mainly for classification, are often represented with one-hot encodings, which turn words into vectors of 0s and 1s.

If we were to use one-hot encoding for the example above, instead of converting the words into 1, 2, and 3 we’d create a three-dimensional vector for each; here’s the representation we would get:

“I” – [1,0,0]

“love” – [0,1,0]

“cake” – [0,0,1].
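Here is a hedged NumPy sketch of the same idea (the vocabulary and its ordering are assumptions carried over from the example):

```python
import numpy as np

# One-hot encoding: each word becomes a vector as long as the vocabulary,
# with a single 1 at the word's index and 0s everywhere else.
vocabulary = ["i", "love", "cake"]

def one_hot(word):
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[vocabulary.index(word)] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))  # i [1 0 0], love [0 1 0], cake [0 0 1]
```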

These simple vectors, as you can see, do not account for any possible semantic similarity between the words, so most deep learning methods in NLP rely on some type of word embeddings – dense vectors – instead.

One of the popular models used to create these embeddings is called Word2Vec. What it does is place words with a similar meaning (“good” and “well”) next to each other spatially.

Word2Vec models are shallow neural nets that come in two types:

Skip-gram – given a target (center) word the model calculates the probability distribution for the context words.

Continuous Bag of Words (CBOW) – given source context words, the model predicts the target word (by summing the vectors of the surrounding words).

We won’t go into details of the training process, nor will we describe Word2Vec’s architecture. We’ll just say that after being fed a large corpus of text, the model will eventually learn that some words have similar meaning within certain contexts and thus assign to them similar numerical embeddings.
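As an illustration only (a sketch assuming the gensim library and a toy corpus, not a realistic training setup), training a small Word2Vec model looks roughly like this:

```python
from gensim.models import Word2Vec

# A toy corpus; in practice Word2Vec is trained on millions of sentences.
corpus = [
    ["the", "cake", "was", "good"],
    ["the", "movie", "was", "well", "made"],
    ["she", "felt", "good", "about", "the", "results"],
]

# sg=1 selects the skip-gram variant; sg=0 would use CBOW instead.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# Each word now has a dense 50-dimensional embedding.
print(model.wv["good"].shape)          # (50,)
print(model.wv.most_similar("good"))   # neighbors ranked by cosine similarity
```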

Some advanced embeddings (such as those generated by BERT, which we’ll discuss later in the post) also help to deal with polysemy and represent a word’s meaning more precisely based on the context it appears in.

So, for example, when the word “bank” is surrounded by “money”, “investments”, “interest rates”, etc. the machine will know that it’s likely referring to a financial institution, whereas if its neighbors are “river”, “trees” and “shore” the meaning is probably land close to a body of water.

These fancy embeddings are effectively representations of words in an n-dimensional vector space. With one-hot encoding, the vector has to be as long as the entire vocabulary, which quickly becomes unwieldy, while with more advanced methods we choose a compact embedding size up front (usually around 300).
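To make the “bank” example concrete, here is a hedged sketch assuming the Hugging Face transformers library and the pre-trained “bert-base-uncased” checkpoint; the sentences are illustrative, not from the original BERT paper:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "I deposited money at the bank to earn interest.",
    "We sat on the bank of the river under the trees.",
]

with torch.no_grad():
    embeddings = []
    for text in sentences:
        enc = tokenizer(text, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
        # Locate the token "bank" and keep its contextual embedding.
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        embeddings.append(hidden[tokens.index("bank")])

# The two "bank" vectors differ because each reflects its surrounding context.
similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity between the two 'bank' embeddings: {similarity:.3f}")
```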

Application of Neural Nets

We’ll try to leave out as much math as possible and stick to intuitive explanations.

First, we must note that standard feed-forward NNs aren’t the preferred method for NLP tasks since they can’t handle sequential input of variable length (text, speech, etc.). They have a very constrained API: they can only take in fixed-size vectors and produce fixed-size outputs, such as class probabilities.

A traditional NN is made up, roughly speaking, of an input layer, hidden layers (where calculations happen), and an output layer. The cells in hidden layers aren’t connected across time and operate independently of one another. They perform well on datasets with a fixed structure, but to process a sequence they’d need to hang on to information from previous inputs while working on a new one, which they cannot do.

One of the most prominent models that can process sequential data is the Recurrent Neural Network (RNN).

RNN’s memory

RNNs add a twist to the conventional NN build by including a looping mechanism that allows information output by one layer to come back as part of the input on the next time step.

This is how it works:

Suppose we’re building a chatbot that tries to predict a user’s intent and consists of an RNN that processes input text and a feed-forward NN that classifies it.

Let’s say a user types: “What is the price of this watch?”

First, the sentence is chopped into a sequence of tokens – What(i1) is(i2) the(i3) price(i4) of(i5) this(i6) watch(i7) – which our network will process one by one. Here’s how it happens:

The embedding (vector) of “What” (i1) goes into the model along with the initial hidden state (h0). After processing these inputs, the RNN outputs an altered hidden state, h1.

The word “is” (i2) enters the model paired with h1 (inside the unit, a vector is created that holds information about both i2 and h1; it is then passed through an activation function). The network therefore takes in the new input while preserving the memory from previous time steps.

i3 with h2, i4 with h3, i5 with h4, i6 with h5, and i7 with h6 all go into the model one by one. The same computations happen at every time step.

The RNN’s final output, which carries information from all the previous inputs, can be passed on to the feed-forward NN to classify the intent behind the sentence.

An RNN’s uniqueness comes from the connections between its hidden states across time steps. It effectively replicates the same hidden layer many times over – the same weights and biases are applied to the input at each time step, hence the name recurrent. The network keeps looping, all the while modifying its hidden state, until it runs out of sequence.
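To make the recurrence concrete, here is a hedged NumPy sketch of a vanilla RNN cell; the dimensions and random weights are illustrative assumptions, and the toy embeddings stand in for the word vectors from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 8, 16

# The same weights are reused at every time step -- hence "recurrent".
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embedding_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # The new hidden state combines the current input with the previous state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# "What is the price of this watch" -> a sequence of 7 toy embeddings.
sequence = rng.normal(size=(7, embedding_dim))
h = np.zeros(hidden_dim)          # h0, the initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h)          # h1, h2, ..., h7

# The final hidden state summarizes the whole sentence and could be passed
# to a feed-forward classifier to predict the user's intent.
print(h.shape)  # (16,)
```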

The drawbacks

Vanilla RNNs, though exciting and more powerful than standard neural nets, still have a major flaw – a short memory. This is due to the vanishing gradient problem, as well as the fact that in an RNN the output of step n is part of the input at step n+1; so, as more inputs are processed, the network has trouble retaining the data from earlier steps in the sequence.

This means that each new input lessens the effect of the previous ones, and the information from the first steps ends up being washed out by the end of the sequence. In our example, “What is the price of this watch?”, the network would be trying to predict the intent based mostly on “this watch?”, which wouldn’t be easy even for humans.

LSTM and GRU

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are both advanced modifications of the RNN architecture; they’re also considerably more computationally intensive. The core idea behind them is to preserve the network’s memory throughout data processing and thus capture long-distance dependencies; a second goal is to allow errors to flow backward with different strengths depending on the input.

LSTM units all have a cell state, through which data can move freely from the start to the end of a sequence, and three gates – small neural nets with sigmoid activation functions – that decide which data is worthy of being included in the cell state and which should be ignored.

LSTM’s gates:

Input gate – determines how much the current vector matters;

Forget gate – determines when to forget the previous state;

Output gate – determines which information to output and which to store “quietly” in the cell state, creating a “filtered” version of the cell state.

We won’t go through every computation that takes place in an LSTM’s cells – that would be out of the scope of this post. Their aim, however, is to drop irrelevant information (in regular RNNs, the hidden state is updated every time, no matter how insignificant a word is) and thus remedy, at least to a degree, the short-memory problem.
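For readers who want a bit more detail than the prose above, here is a hedged NumPy sketch of a single LSTM step; the sizes and random weights are illustrative assumptions, and the equations follow the standard LSTM formulation rather than any particular library’s internals:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of all gates."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate values
    c_t = f * c_prev + i * g                       # updated cell state
    h_t = o * np.tanh(c_t)                         # "filtered" view of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 8, 16
W = rng.normal(scale=0.1, size=(4 * hidden_dim, embedding_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(7, embedding_dim)):    # a toy 7-token sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```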

GRUs are a bit simpler in how they’re built, though they pursue the same goal – to mitigate the vanishing gradient’s impact. Besides the recurrent vector, the architecture first computes two gates – an update gate and a reset gate. These are also continuous vectors of the same length as the hidden state.

The reset gate allows us to drop information from the previous state that’s no longer of value. It has a sigmoid function (as does the update gate), so the value it outputs is always between 0 and 1; if it computes something close to zero, we can ignore the previous hidden state almost completely.

The update gate decides how much of the previous state should matter now – if it outputs a value close to 1, the unit essentially copies the previous hidden state forward unchanged. Because information (and gradients) can pass through this copy operation without being squashed, it helps prevent the vanishing gradient issue.
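A comparable sketch of a single GRU step (again with toy, assumed dimensions, following the common formulation where an update gate near 1 keeps the previous state) might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step. W, U, b hold the stacked parameters of both gates
    and the candidate hidden state."""
    z_r, z_u, z_c = np.split(W @ x_t + b, 3)
    u_r, u_u, u_c = np.split(U @ h_prev, 3)
    r = sigmoid(z_r + u_r)                     # reset gate
    u = sigmoid(z_u + u_u)                     # update gate
    h_tilde = np.tanh(z_c + r * u_c)           # candidate hidden state
    # An update gate close to 1 mostly copies the previous state forward.
    return u * h_prev + (1.0 - u) * h_tilde

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 8, 16
W = rng.normal(scale=0.1, size=(3 * hidden_dim, embedding_dim))
U = rng.normal(scale=0.1, size=(3 * hidden_dim, hidden_dim))
b = np.zeros(3 * hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(7, embedding_dim)):
    h = gru_step(x_t, h, W, U, b)
```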

Both of these are quite elaborate architectures that vastly enhance an RNN’s memory.

Not long ago, RNNs (LSTMs, GRUs) were the hippest architectures in the land of NLP, and sequence-to-sequence models, which stack two RNNs together, were favored by the most prominent researchers in the field due to their exceptional performance on key NLP tasks such as Neural Machine Translation.

Sequence-to-sequence models have an encoder and a decoder network. The encoder processes the input sequence, in the ways discussed above, and produces a hidden state, known as the thought vector, that carries information about the entire input; it then passes it on to the decoder, which, having that context, can generate an appropriate target sequence (be it a translation, a reply in a dialogue system, or something else).

Seq2seq models then became even more powerful when ML experts added attention mechanisms to their architectures.

In conventional seq2seq models, the assumption is that the context of all inputs in a sequence can be crammed into one fixed-size vector, but, in reality, there’s a limit to how much information such a vector can hold. Attention was first used to help RNNs combat this: the mechanism adds an extra input to each decoding time step that comes from the encoding steps, helping the decoder RNN focus only on the relevant parts of the input while predicting the target sequence.
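A hedged NumPy sketch of that idea – simple dot-product attention over a toy set of encoder hidden states, with all shapes being illustrative assumptions – looks like this:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_dim, seq_len = 16, 7

encoder_states = rng.normal(size=(seq_len, hidden_dim))  # one state per input token
decoder_state = rng.normal(size=hidden_dim)              # current decoding step

# Score each encoder state against the decoder state, then normalize.
scores = encoder_states @ decoder_state     # (seq_len,)
weights = softmax(scores)                   # how much to "attend" to each token

# The context vector is a weighted sum of encoder states; it is fed to the
# decoder as an extra input at this time step.
context = weights @ encoder_states          # (hidden_dim,)
```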

Attention gives us not only a lengthened short-term memory (through LSTMs) but a form of long-term memory as well.

Then, the reign of RNNs was suddenly over. In 2017, researchers from Google released the paper “Attention Is All You Need”, in which they introduced the Transformer.

The Transformer isn’t just superior to all types of RNNs in terms of performance; it’s also a much lighter model that’s easier to train. Its clever architecture lets us do away with recurrent (or convolutional) computations and achieve state-of-the-art results on various NLP tasks (especially Neural Machine Translation) by relying solely on attention mechanisms.

Under the hood, it, too, is composed of an encoder and a decoder. Actually, there is a stack of encoders on one side and a stack of decoders (with an equal number of units) on the other.

Each encoder includes a self-attention layer (which helps the unit to inspect other inputs in the sequence when it’s encoding a given input) and a feed-forward layer to which the outputs from the self-attention layer are fed.

The decoder’s architecture repeats this and adds an extra layer – encoder-decoder attention layer – that allows the network to focus on the relevant parts of the input (similarly to what attention does in seq2seq models).

The first encoder receives a word embedding and passes its output to the encoder directly above it. One of the key properties of this architecture is that each word flows through its own path: there are dependencies between positions within the self-attention layer, but not in the feed-forward layer, which enables parallel computation.

In RNNs, the input must be processed one step at a time so that we receive a hidden state to feed into the next step; processing sequences one part at a time makes them slow. Transformers, by contrast, can work on different parts of the input simultaneously and be parallelized across many processors.

What exactly is self-attention?

Look at the following example: The Petersons didn’t take trains in Italy because they were too dirty.

To grasp it, the machine needs to understand what “they” refers to in the sentence. We, as humans, can figure this out easily, but the machine must be given clues, and self-attention does just that.

When an input is being processed by Transformer, self-attention allows the layers to look at other inputs in the sequence to understand how to create a better encoding. So, when the model gets to the embedding of the word “they”, self-attention enables it to create a strong association with the word “Petersons”.
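In code, a self-attention layer boils down to the scaled dot-product computation below – a NumPy sketch with toy, assumed dimensions; the real model uses learned projection matrices and much larger sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 10, 64, 64

X = rng.normal(size=(seq_len, d_model))   # embeddings of the input tokens

# Learned projections turn each embedding into a query, a key, and a value.
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every token scores every other token; this is how "they" can end up
# attending strongly to "Petersons".
scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
attention = softmax(scores, axis=-1)
output = attention @ V                    # context-aware encodings
```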

Each self-attention layer in the Transformer is enhanced by a multi-head attention mechanism, which:

expands the model’s ability to focus on various positions;

allows it to create “representation subspaces” (i.e. to have different weight matrices applied to the same input).

The idea behind multi-head attention, as per the original Transformer paper, is to enable the model “to jointly attend to information from different representation subspaces at different positions”.
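Concretely, multi-head attention runs several attention computations in parallel, each with its own projections, and concatenates the results. The sketch below illustrates this with assumed sizes and eight heads (the head count here is an illustrative choice, not a requirement):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 10, 64, 8
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))

# Each head gets its own projections -- its own "representation subspace".
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X, W_q, W_k, W_v))

# Concatenate the heads and mix them with a final output projection.
W_o = rng.normal(scale=0.1, size=(d_model, d_model))
multi_head_output = np.concatenate(heads, axis=-1) @ W_o   # (seq_len, d_model)
```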

To establish the order of the input sequence, and to keep the Transformer from treating it as just a mishmash of words, another vector – a positional encoding – is added to every input’s embedding.
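One common choice is the sinusoidal scheme from the original paper; a hedged sketch (with assumed sizes) looks like this:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, one vector per position."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(d_model)[None, :]         # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions use cosine
    return pe

# Added element-wise to the word embeddings before the first encoder layer.
print(positional_encoding(seq_len=10, d_model=64).shape)   # (10, 64)
```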

In the end, the top encoder outputs a sequence of attention vectors that are used as input for the encoder-decoder attention layers, which help the decoder focus on the relevant parts of the input when generating the target sequence.

To recap, Transformer’s benefits include:

No assumptions about spatial relationships between data points; this makes it perfect for processing sets of objects;

Computations can be executed in parallel as opposed to one by one as is the case with RNNs;

Data objects, even those located far apart in the input, can still affect one another’s output without having to pass through many recurrent time steps;

Long-range dependencies can be learned easily.

In conclusion

The Transformer architecture, used initially for Neural Machine Translation, has been successfully adapted to various NLP tasks. Now, both Google and OpenAI are also trying to leverage the promising concept of multi-head attention in other important areas of research.

Thanks to the original paper, we now have BERT (Bidirectional Encoder Representations from Transformers) – an incredibly powerful pre-trained language representation model that we can adapt to our specific problem by just adding one fine-tuned output layer; it enables us to create state-of-the-art question answering systems, chatbots, language inference systems, etc. without having to apply substantial task-specific adjustments.
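As a hedged illustration of how little task-specific code that adaptation can require (assuming the Hugging Face transformers library; the checkpoint, label count, and example sentence are placeholders, not a recommended recipe):

```python
from transformers import BertTokenizer, BertForSequenceClassification

# Load a pre-trained BERT and attach a fresh classification head on top.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A single toy example; real fine-tuning would run over a labeled dataset.
inputs = tokenizer("What is the price of this watch?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # (1, 2) -- one score per intent class
```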

Also, we’ve recently learned about GPT-2 (Generative Pre-Trained Transformer) – the unprecedentedly powerful language model which OpenAI decided not to release due to the researchers’ concern about the possibility of it being applied maliciously.

GPT-2 was trained on 40 GB of unlabeled text; it achieves state-of-the-art performance on a range of language modeling benchmarks and tasks such as question answering, machine translation, and text summarization – without having been trained on these specific problems.

It is truly a new day for NLP. The technologies available to us today can reshape customer service automation as we know it and scale companies’ customer support; they can make learning more available for people by summarizing scientific papers in a comprehensible manner, help create new, thrilling entertainment experiences, and much more.

Want to capitalize on the groundbreaking NLP technologies and deep learning?

Contact our expert right now for a free consultation.