The Transformer was originally shown to significantly outperform recurrent neural networks for machine translation. However, it has since been applied to a range of applications in natural language processing, from question answering, document summarisation and sentiment classification to the modelling of natural language – a task that has seen particularly exciting developments over the past year.

Modelling natural language

Finding machine learning tasks which both drive the development of better memory architectures and push us further towards artificial general intelligence is challenging. Statistical language modelling is one such task that we believe could be valuable for both purposes. Language models work by sequentially predicting the next word in a stream of text. They can be used to model existing texts and also to generate novel texts. As they get better at modelling the past, their predictions become more accurate, and the texts they generate become more realistic.

In his seminal 1948 article “A Mathematical Theory of Communication”, which founded the field of information theory, Claude Shannon discussed primitive language models and illustrated how adding more context improves the quality and realism of generated text. He does this by introducing the simplest model of English text, which has no contextual modelling at all – a character-level model which treats each character independently. By sampling characters according to their relative frequencies (8% of the time for ‘a’, 1.5% for ‘b’ etc.) we arrive at a nonsensical string:

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
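To make the idea concrete, here is a minimal Python sketch of a context-free character model. It is an illustration only, not Shannon's original procedure: the tiny `corpus` string and the function names `fit_char_unigram` and `sample_chars` are our own assumptions, and frequencies estimated from such a short text will differ from those of real English.

```python
import random
from collections import Counter


def fit_char_unigram(text):
    """Estimate the relative frequency of each character, ignoring all context."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def sample_chars(probs, length, seed=0):
    """Draw each character independently according to its relative frequency."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    chars = list(probs)
    weights = [probs[c] for c in chars]
    return "".join(rng.choices(chars, weights=weights, k=length))


# Toy corpus (an assumption for illustration; any large English text would do).
corpus = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG AND THE CAT SLEEPS"
probs = fit_char_unigram(corpus)
print(sample_chars(probs, 40))
```

Because every character is drawn independently, the output has roughly the right letter statistics but no word structure at all.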

However, he remarks on the improvement in sample quality if one instead models the probability of words independently. Now the modelled context is approximately 7X larger (the average number of characters in a word):

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

By modelling the probability of word pairs, a further 2X in context length, even more realistic text emerges:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
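The step from independent words to word pairs can be sketched with a small n-gram model, where `order=1` treats words independently and `order=2` conditions each word on the one before it. This is a rough sketch under our own assumptions: the toy sentence, the helper names `fit_ngram` and `generate`, and the choice to restart from a random context at a dead end are illustrative and not taken from Shannon's article.

```python
import random
from collections import Counter, defaultdict


def fit_ngram(words, order):
    """Count next-word frequencies given the previous (order - 1) words."""
    table = defaultdict(Counter)
    for i in range(order - 1, len(words)):
        context = tuple(words[i - order + 1:i])
        table[context][words[i]] += 1
    return table


def generate(table, order, length, seed=0):
    """Sample words one at a time, each conditioned on the preceding context."""
    rng = random.Random(seed)
    contexts = sorted(table)  # sorted so results are reproducible
    out = list(rng.choice(contexts))
    while len(out) < length:
        context = tuple(out[len(out) - (order - 1):])
        if context not in table:
            # Dead end (context never seen with a successor): restart.
            out.extend(rng.choice(contexts))
            continue
        candidates, weights = zip(*table[context].items())
        out.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(out[:length])


# Toy corpus (an assumption for illustration only).
words = "the cat sat on the mat and the dog sat on the rug".split()
unigram = fit_ngram(words, order=1)  # words modelled independently
bigram = fit_ngram(words, order=2)   # words modelled in pairs
print(generate(unigram, 1, 12))
print(generate(bigram, 2, 12))
```

On a large corpus, raising `order` lengthens the modelled context in the same way as the move from single words to word pairs described above, at the cost of needing far more text to estimate the counts reliably.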

In other words, an increase in the length of context leads to an improvement in the quality of text generated. Shannon remarks on the quality of his produced samples and conjectures that natural text samples may emerge from a sufficiently complex statistical model: “The particular sequence of ten words ‘attack on an English writer that the character of this’ is not at all unreasonable. It appears then that a sufficiently complex stochastic process will give a satisfactory representation of a discrete source.”

One criticism of language modelling as a task for long-range reasoning is that models can capture a large portion of their predictions from the local context. Neural language models have traditionally ignored the wider context, focusing mostly on the short term. For example, in 2017 Daniluk et al. found that their neural language model rarely attends beyond the preceding five words. However, in the past year, large Transformer models have been shown to make use of hundreds of words of context to generate ever-more realistic text with a longer range of coherence. A demo from OpenAI’s GPT-2, a 1.5B parameter Transformer, indicates that the model is able to generate realistic text and retain key entities (e.g. Dr Jorge Pérez and unicorns) across multiple paragraphs:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

Transferring knowledge

Such samples would likely astound Shannon, 70 years on from his early language model experiments. However, the real benefit of powerful neural language models – and their relevance to the goal of AGI – is their ability to transfer knowledge to a suite of tasks. In the process of learning how to model text, neural language models appear to build up a knowledge base of associations, and a plethora of skills.

For instance, researchers at OpenAI showed that GPT-2 can be applied to natural-language processing tasks such as question answering, paraphrasing, or sentiment analysis with surprisingly good performance – especially for a model that has never been explicitly trained to perform such tasks. When large Transformer language models are fine-tuned on particular tasks such as question answering, the resulting performance is significantly better than that of models that were designed and trained solely for question answering. Google’s prominent natural language model, BERT, achieves state-of-the-art performance on a wide array of NLP benchmarks, and is now a part of Google Search. More recently, it was shown that GPT-2 can learn to play rudimentary chess by training it on strings of game moves.

Benchmarking language models

A popular long-range language model benchmark is WikiText-103, which consists of English-language Wikipedia articles and was developed by researchers at Salesforce AI. Articles are around 3,600 words on average, which, at the time of creation, was far beyond the memory window of state-of-the-art models.

However, researchers at Google recently showed that a Transformer variant called the Transformer-XL – which maintains a memory of past network activations and obtained state-of-the-art results on WikiText-103 – can make use of contexts spanning over one thousand words. This raises the question: will models soon saturate these benchmarks? With this in mind, we’ve compiled and released a new, longer-range language model benchmark based on books.

A new dataset for long-term memory research

To support growing interest in long-range sequence models, we are releasing a new language modelling benchmark, PG-19, which is derived from books in the Project Gutenberg online library.

Books provide a rich context for the development of long-range memory models. We selected a subset of approximately 28,000 books from Project Gutenberg published before 1919. Unlike prior language modelling dataset releases, we apply very little pre-processing to the text. For example, we do not limit the vocabulary size of the data or censor numbers, to avoid the filtering of useful information.

PG-19 is over double the size of prior language modelling benchmarks, such as the Billion Word Benchmark, and contains text that is over 10X longer in context than the prior long-range language model benchmark, WikiText-103. We provide a comparative table of existing language modelling benchmarks below: