Transfer learning, an approach in which a model developed for one task is reused as the starting point for a model on a second task, is an important technique in machine learning. Prior knowledge from one domain and task is leveraged in a different domain and task. Transfer learning therefore draws inspiration from human beings, who can transfer and apply knowledge learned in the past to a wide variety of new tasks.

In computer vision, great advances have been made using the transfer learning approach, with pre-trained models serving as the starting point. This has sped up training and improved the performance of deep learning models, largely thanks to huge datasets such as ImageNet, which have enabled the development of the state-of-the-art pre-trained models used for transfer learning.

Until recently, the natural language processing community lacked its ImageNet equivalent, but transfer learning techniques in NLP continue to gain traction. These techniques are mainly based on pre-trained language models, which repurpose and reuse deep learning models trained on high-resource languages and domains.

The pre-trained models are then fine-tuned for downstream tasks, often in low-resource settings. The downstream tasks include part-of-speech tagging, text classification, and named-entity recognition, among others.

Contextualized Embeddings

Word embeddings play a critical role in the realization of transfer learning in NLP. The intuition behind word embeddings is that words are represented as low-dimensional vectors that capture both the syntax and semantics of the text corpus. Words with similar meanings tend to occur in similar contexts.

These word representations are learned from vast amounts of text. A popular implementation of word embeddings is the Word2Vec model, which has two training options: Continuous Bag of Words (CBOW) and the Skip-gram model. Word embeddings are often used as the first data-processing layer in a deep learning model.
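To make the two training options concrete, here is a minimal sketch using the gensim library (assuming gensim 4.x, where the sg flag switches between CBOW and Skip-gram); the toy corpus is purely illustrative.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only).
sentences = [
    ["transfer", "learning", "reuses", "pretrained", "models"],
    ["word", "embeddings", "capture", "syntax", "and", "semantics"],
    ["words", "with", "similar", "meanings", "occur", "in", "similar", "contexts"],
]

# sg=0 trains Continuous Bag of Words; sg=1 trains the Skip-gram model.
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Each word is now a dense, low-dimensional vector.
print(skipgram.wv["embeddings"].shape)      # (50,)
print(skipgram.wv.most_similar("similar"))  # nearest neighbours in the vector space
```

In practice the corpus would contain millions of sentences, and the resulting vectors (or publicly released pre-trained ones) would be loaded into the embedding layer of a downstream model.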

One limitation of standard word embedding techniques such as Word2Vec, fastText, and GloVe is that they cannot disambiguate between the different senses of a given word. In other words, every instance of a word ends up with the same representation regardless of the context in which it appears.

Recently, contextual word embeddings such as Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT) have emerged. These techniques generate the embedding for a word based on the context in which it appears, producing slightly different embeddings for each occurrence of the word.
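As an illustration, the sketch below (assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint) extracts the vector for the word "bank" in two unrelated sentences. A static embedding would return the identical vector in both cases, while BERT returns context-dependent ones.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same surface form "bank" in two different contexts (illustrative sentences).
sentences = ["She sat by the river bank.", "He deposited cash at the bank."]

vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]              # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        vectors.append(hidden[tokens.index("bank")])               # vector for "bank"

# The two "bank" vectors differ because each reflects its surrounding context.
similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0).item()
print(f"cosine similarity between the two 'bank' vectors: {similarity:.3f}")
```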

ELMo uses a combination of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. BERT representations, on the other hand, are jointly conditioned on both the left and right context and use the Transformer, a neural network architecture based on a self-attention mechanism. The Transformer has been shown to model long-term dependencies in text better than recurrent neural network architectures.
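The self-attention mechanism at the heart of the Transformer can be sketched in a few lines of NumPy. This is a simplified, single-head version with random weights; the actual architecture adds multiple heads, layer normalization, feed-forward blocks, and positional encodings.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X.

    X:  (seq_len, d_model) input token representations
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every position attends to every other position in a single step,
    # which is why long-range dependencies are easy to capture.
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                              # (seq_len, d_k)

# Toy example: 4 tokens, model dimension 8, attention dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 4)
```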

Integrating contextual word embeddings into neural architectures has led to consistent improvements on important NLP tasks such as sentiment analysis, question answering, reading comprehension, textual entailment, semantic role labeling, coreference resolution, and dependency parsing.

Language model embeddings can be used as features in a target model, or the language model itself can be fine-tuned on target-task data. Training a model on a large-scale dataset and then fine-tuning the pre-trained model for a target task (transfer learning, if you'll recall) can be particularly beneficial for low-resource languages, where labeled data is limited.
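Both strategies can be sketched with the Hugging Face transformers library; the sentences, label count, and hyperparameters below are hypothetical placeholders, and a real project would train over a proper dataset for several epochs.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical binary text-classification task in a low-resource setting.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great service", "terrible experience"]   # placeholder examples
labels = torch.tensor([1, 0])

# Option 1 (feature-based transfer): freeze the pre-trained encoder and
# train only the small classification head on top of it.
for param in model.bert.parameters():
    param.requires_grad = False

# Option 2 (fine-tuning): skip the loop above so all weights stay trainable
# and the whole language model adapts to the target task.

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5
)
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)   # returns the classification loss
outputs.loss.backward()
optimizer.step()
```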