A 10,000-Foot Overview of Neural NLP Architectures

In addition to better word vector representations, the advent of neural networks has led to advances in machine learning architectures, which in turn enabled the breakthroughs listed in the previous post.

This section will highlight some of the key developments in neural architecture that enabled the NLP advances seen thus far. It is not meant to be an exhaustive review of deep learning and machine learning architectures for NLP; rather, the goal is to demonstrate the changes that are driving NLP forward.

Deep Feed Forward Networks

The advent of deep feed forward networks, also known as multi-layer perceptrons (MLPs), introduced the potential for non-linear modeling in NLP. This helps because there are cases where the embedding space is non-linear. Take the following example of documents whose embedding space is non-linear, meaning there is no way to linearly divide the two document groups.

No matter how you fit a line, there is no linear way to split the spam and ham documents.

A non-linear MLP provides the ability to properly model such non-linearities.

This development by itself, however, did not bring about a significant revolution in NLP, since MLPs are unable to model word ordering. While MLPs opened the door for marginal improvements in tasks such as language classification, where decisions can be made by modeling independent character frequencies, standalone MLPs fall short on more complex or ambiguous tasks.
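To make the non-linearity point concrete, here is a minimal numpy sketch of a tiny MLP learning an XOR-style dataset, standing in for the spam/ham documents above; all data and hyperparameters are made up for illustration, and no single line can separate these two classes:

```python
import numpy as np

# Toy, non-linearly separable data (XOR pattern): no straight line
# splits class 0 from class 1, just like the spam/ham figure.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output layer

sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    # Forward pass: the tanh hidden layer supplies the non-linearity.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of binary cross-entropy loss.
    dp = p - y
    dW2 = h.T @ dp; db2 = dp.sum(0)
    dh = (dp @ W2.T) * (1 - h ** 2)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.1 * grad

# Final predictions after training.
h = np.tanh(X @ W1 + b1)
p = sigmoid(h @ W2 + b2)
preds = (p > 0.5).astype(int).ravel()
```

A single linear layer (no tanh) trained the same way would plateau at 50% accuracy on this data, which is precisely the limitation the hidden layer removes.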

1D CNNs

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification

Prior to their application in NLP, Convolutional Neural Networks (CNNs) provided groundbreaking results in computer vision with the advent of AlexNet. In NLP, instead of convolving over pixels, convolution filters are applied and pooled sequentially over individual or groups of word vectors.

In NLP, CNNs are able to model local ordering by acting as n-gram feature extractors over embeddings. CNN models have contributed to state-of-the-art results in classification and a variety of other NLP tasks.

More recently, the work of Jacovi, Goldberg, et al. has contributed to a deeper understanding of what convolutional filters learn, demonstrating that filters are able to model rich semantic classes of n-grams through different activation patterns, and that global max-pooling filters less relevant n-grams out of the model's decision process.
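The n-gram-extractor view can be sketched in a few lines of numpy; the embeddings and the filter here are random and untrained, purely to show the mechanics of a width-3 filter plus global max pooling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sentence of 7 words, each a 4-dimensional embedding (made-up values).
sentence = rng.normal(size=(7, 4))

# One convolutional filter of width 3: it scores every 3-word window,
# i.e. it acts as a learned trigram feature detector.
filt = rng.normal(size=(3, 4))

# Slide the filter over the sentence: one activation per trigram window.
windows = [sentence[i:i + 3] for i in range(len(sentence) - 3 + 1)]
feature_map = np.array([np.sum(w * filt) for w in windows])  # shape (5,)

# Global max pooling keeps only the strongest-activating n-gram,
# discarding the less relevant windows from the decision process.
pooled = feature_map.max()
```

A real model stacks many such filters (often of several widths) and feeds the pooled activations to a classifier, as in Kim (2014).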

A good primer on getting started with 1D CNNs can be found in the embedded link below.

RNNs (LSTM/GRU)

Building on the local ordering provided by CNNs, Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory (LSTM) cells and Gated Recurrent Units (GRUs), provide mechanisms for modeling sequential ordering and mid-range dependencies in text, such as the effect of a word at the beginning of a sentence on the end of the sentence.

Additional variations of RNNs, such as bidirectional RNNs, which process text both left to right and right to left, and character-level RNNs, which enhance underrepresented or out-of-vocabulary word embeddings, led to many state-of-the-art neural NLP breakthroughs.

A sample of different RNN architectures coupled with example use cases.
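A bare-bones numpy sketch of the recurrence all of these cells share (a vanilla Elman RNN with made-up inputs and random, untrained weights; LSTMs and GRUs add gating on top of this same loop):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 4, 6, 5

# Made-up word embeddings for a 5-token sentence.
inputs = rng.normal(size=(seq_len, embed_dim))

# Vanilla RNN parameters: input-to-hidden and hidden-to-hidden weights.
W_xh = rng.normal(size=(embed_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
states = []
for x in inputs:
    # Each step folds the current word into the running hidden state, so
    # later states depend on earlier words (sequential ordering).
    h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    states.append(h)
states = np.array(states)  # one hidden state per token, shape (5, 6)
```

A bidirectional RNN simply runs a second copy of this loop over the reversed sentence and concatenates the two state sequences.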

Attention and Copy Mechanisms

While standard RNN architectures have led to incredible breakthroughs in NLP, they suffer from a variety of challenges. Though in theory they can capture long-term dependencies, in practice they tend to struggle to model longer sequences; this is still an open problem.

One cause of sub-optimal performance in standard RNN encoder-decoder models for sequence-to-sequence tasks, such as NER or translation, is that they weight the impact of each input vector evenly across each output vector, when in reality specific words in the input sequence may carry more importance at different time steps.

Attention mechanisms provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. These mechanisms are responsible for much of the current or near-current state of the art in natural language processing.

An example of an attention mechanism applied to the task of neural translation in Microsoft Translator
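A minimal dot-product attention sketch in numpy (the encoder and decoder states here are made-up random vectors; real models learn them, and many systems use learned additive scoring instead of a plain dot product):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Encoder hidden states for a 6-word source sentence (made-up values).
encoder_states = rng.normal(size=(6, 8))
# Decoder hidden state at the current output time step.
decoder_state = rng.normal(size=8)

# Score each input word's relevance to this output step, then normalise:
# one weight per input word, and the weights sum to 1.
scores = encoder_states @ decoder_state
weights = softmax(scores)

# The context vector is a weighted average of the encoder states, so the
# important words contribute more at this step instead of being weighted evenly.
context = weights @ encoder_states  # shape (8,)
```

At the next decoding step the decoder state changes, so the weights, and hence the focus over the source sentence, change with it.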

Additionally, in machine reading comprehension and summarization systems, RNNs often generate results that, while at first glance structurally correct, are in reality hallucinated or incorrect. One mechanism that helps mitigate some of these issues is the copy mechanism.

Copy mechanism from Get To The Point: Summarization with Pointer-Generator Networks, Abigail See et al.

The copy mechanism is an additional layer, applied during decoding, that decides whether it is better to copy the next word from the source sentence or to generate it from the general embedding vocabulary.
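A toy sketch of that decision in the pointer-generator style: the final distribution mixes the vocabulary distribution with the attention distribution over source words, weighted by a generation probability p_gen. All probabilities below are invented for illustration; in the real model the decoder produces them:

```python
import numpy as np

vocab = ["the", "cat", "sat", "<unk>"]
source_words = ["Ms.", "Steinberg", "sat"]

# Hypothetical decoder outputs at one step: a distribution over the
# fixed vocabulary, attention over the source words, and p_gen.
vocab_dist = np.array([0.5, 0.2, 0.25, 0.05])
attn_dist = np.array([0.1, 0.8, 0.1])  # attends mostly to "Steinberg"
p_gen = 0.3                            # low p_gen: favour copying

# Final distribution over the extended vocabulary (fixed vocab + source):
# P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on copies of w.
extended = {w: p_gen * p for w, p in zip(vocab, vocab_dist)}
for w, a in zip(source_words, attn_dist):
    extended[w] = extended.get(w, 0.0) + (1 - p_gen) * a

best = max(extended, key=extended.get)
```

Note that "Steinberg" is out of vocabulary yet still wins: copying lets the decoder emit rare names verbatim instead of hallucinating a stand-in.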

Putting it all together with ELMo and BERT

ELMo is a model that generates embeddings for a word based on the context in which it appears, thus producing slightly different embeddings for each occurrence of the word.

For example, the word “play” can encode multiple meanings, such as the verb to play or a theatre production. In standard word embeddings such as GloVe, fastText, or Word2Vec, each instance of the word play would have the same representation regardless of context.

ELMo enables NLP models to better disambiguate the correct sense of a given word. Upon its release it enabled near-instant state-of-the-art results in many downstream tasks, including tasks such as co-reference resolution that were previously not as viable for practical usage.
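Concretely, ELMo represents each token as a learned, softmax-weighted sum of the biLM's layer activations, scaled by a task-specific scalar. A numpy sketch of that mixing, with made-up activations and weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Made-up biLM activations for one occurrence of "play": the token
# embedding plus two bi-LSTM layers, each a 6-dim vector here.
layers = rng.normal(size=(3, 6))

# Task-specific combination: softmax-normalised layer weights s_j and a
# scalar gamma, both learned by the downstream task.
s = softmax(np.array([0.2, 0.5, 0.3]))
gamma = 1.0

# ELMo vector = gamma * sum_j s_j * h_j
elmo_vector = gamma * (s[:, None] * layers).sum(axis=0)  # shape (6,)
```

Because the layer activations themselves depend on the surrounding sentence, “play” in a sports sentence and “play” in a theatre sentence yield different vectors, which is exactly the disambiguation described above.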

ELMo also has promising implications for performing transfer learning on out-of-domain datasets. Some, such as Sebastian Ruder, have even hailed the arrival of ELMo as the ImageNet moment of NLP. While ELMo is a very promising development with practical real-world applications, and has spawned related techniques such as BERT, which use transformer attention instead of bidirectional RNNs to encode context, we will see in our upcoming post that there are still many obstacles in the world of neural NLP.

Comparison of BERT and ELMo architectures from Devlin et al.

Call To Action: Getting Started

Below are some resources to get started with the different architectures discussed above.

Documentation

Tools

Open Datasets

Now that we have a solid understanding of some of the milestones in neural NLP, as well as the models and representations behind them, the next post will review some of the pitfalls of current state-of-the-art NLP systems.

If you have any questions, comments, or topics you would like me to discuss feel free to follow me on Twitter.

About the Author

Aaron (Ari) Bornstein is an avid AI enthusiast with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli hi-tech community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.