Neural networks are widely used in complex tasks such as machine translation, image classification, and speech recognition. These networks are data-driven, and as the amount of data increases, so do network size and the computational cost of training and inference. Recently, Facebook AI Research (FAIR) researchers introduced a structured memory layer that can be easily integrated into a neural network to greatly expand network capacity and parameter count without significantly increasing computational cost. The approach is well suited to natural language processing tasks, and the code has been open sourced on GitHub.

The memory is very large by design and therefore significantly increases the capacity of the architecture, by up to a billion parameters with negligible computational overhead. Its design and access pattern are based on product keys, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the overall system strike a better trade-off between prediction accuracy and computational efficiency at both training and test time. This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer into a state-of-the-art transformer-based architecture. In particular, we found that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice as fast at inference time. (Facebook AI Research)
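To make the product-key idea concrete, here is a minimal NumPy sketch of a single memory lookup, under assumed shapes and names (this is illustrative, not the authors' released code). The query is split into two halves, each half is scored against a small table of sub-keys, and the exact top-k over the full Cartesian product of keys is recovered from a small candidate grid, so only 2·√K sub-keys are ever compared instead of all K product keys.

```python
import numpy as np

def product_key_lookup(query, sub_keys_1, sub_keys_2, values, k=4):
    """Illustrative product-key memory lookup (hypothetical helper).

    Because a key's score decomposes as the sum of its two half-scores,
    the exact top-k over |sub_keys_1| x |sub_keys_2| product keys must
    lie inside the k x k grid formed by the top-k sub-keys of each half.
    """
    d = query.shape[0] // 2
    q1, q2 = query[:d], query[d:]

    s1 = sub_keys_1 @ q1            # scores of the first query half
    s2 = sub_keys_2 @ q2            # scores of the second query half

    top1 = np.argsort(s1)[-k:]      # best k sub-keys per half
    top2 = np.argsort(s2)[-k:]

    # scores over the k x k candidate grid (sum of half-scores)
    grid = s1[top1][:, None] + s2[top2][None, :]
    flat = np.argsort(grid, axis=None)[-k:]
    rows, cols = np.unravel_index(flat, grid.shape)

    # flat index of each selected key in the full product key space
    n2 = sub_keys_2.shape[0]
    idx = top1[rows] * n2 + top2[cols]

    # sparse, softmax-weighted sum over the k selected memory slots
    w = np.exp(grid[rows, cols] - grid[rows, cols].max())
    w /= w.sum()
    return w @ values[idx]
```

With 32 sub-keys per half, this sketch addresses 32 × 32 = 1,024 memory slots while scoring only 64 sub-keys, which is the source of the "more parameters at the same computational budget" trade-off the paper describes.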

Synced invited Ming Li, a professor at the University of Waterloo who focuses on deep learning, natural language processing and automated conversation; and Wei Yang, a Machine Learning Engineer at RSVP.ai, to share their thoughts on this FAIR research.

Why does this research matter?

The proposed memory layer with product keys makes it possible to increase the number of parameters while keeping the same computational budget, which lets the overall system strike a better trade-off between prediction accuracy and computational efficiency at both training and test time. Specifically, the paper shows that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice as fast at inference time.

What impact might this work bring to the field?

As large pre-trained language models become more popular, an unpleasant trend is that small research groups cannot afford the computing resources these models require. This paper tackles the problem by matching, and even exceeding, the performance of a larger model with fewer layers and less computation. The code has been released for reproducibility, which will greatly benefit the NLP research community. Moreover, the proposed structured memory is easy to integrate into the popular transformer architecture, which underlies the most successful NLP language models such as BERT, GPT-2, and XLNet. This suggests great potential for applying it in such models and thus benefiting various downstream tasks.

Can you identify any bottlenecks in the research?

It is an interesting approach to plug a k-NN based memory layer into a state-of-the-art transformer-based architecture. However, only the language modeling task is evaluated in the paper. We hope more models (especially state-of-the-art models) and more downstream tasks will be evaluated to demonstrate the generality of the proposed method.

While memory networks have been widely researched since they were introduced in 2015, to their credit the authors built on this prior work and applied it to a state-of-the-art transformer-based architecture in this paper.

Can you predict any potential future developments related to this research?

An obvious area for future work would be to apply this method to pre-trained language models such as BERT on tasks such as question answering, textual similarity, natural language inference, and named entity recognition, to see whether the performance gains still hold after integrating the memory layer with product keys.

The paper Large Memory Layers with Product Keys is on arXiv.