Recurrent Neural Networks (RNNs) have become a major player in sequence modeling over the past few years. Due to their efficiency and flexibility, it is tempting to think that they will constitute the future of sequence learning. However, research in Deep Learning is moving at such a fast pace that no one knows what that future is going to be made of.

In fact, among the recent innovations in Deep Learning, the idea of adding some form of external memory to existing architectures has led to promising results, especially in Natural Language Processing (NLP). For the most part, these memory-augmented networks have been motivated by the need to reproduce the notion of working memory theorized in neuroscience, which is responsible for inductive reasoning and the creation of new concepts.

Here, we will focus on one of the earliest examples of memory-augmented neural networks: the Neural Turing Machines (Graves, Wayne & Danihelka, 2014). We will see in detail how they work, as well as how they can be applied to learning algorithmic tasks.

Along with this blog post, Snips is also open sourcing NTM-Lasagne on GitHub, our Neural Turing Machines library built with Theano, on top of the excellent Lasagne. It includes code, pre-trained models, and all the examples covered in this post.

Memory-augmented Neural Networks

A feature that all neural networks share, whether they are recurrent or feed-forward, is that they map data into an abstract representation. However, adding knowledge from the external environment, like a whole thesaurus in NLP, is very hard.

Rather than artificially increasing the size of the hidden state in the RNN, we would like to arbitrarily increase the amount of knowledge we add to the model while making minimal changes to the model itself. Basically, we can augment the model with an independent memory that acts as a knowledge base from which the network can read and write on demand. You can think of the neural network as the processing unit (CPU), and this new external memory as the RAM.
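To make this analogy concrete, here is a minimal, hypothetical sketch of an external memory pictured as RAM: a bank of fixed-size slots the network can read from and write to on demand. Note that this hard, integer-addressed version is only the analogy; the NTM described below replaces the address with a differentiable attention weighting. The class name and dimensions are illustrative, not part of any library.

```python
import numpy as np

class ExternalMemory:
    """A bank of fixed-size slots acting as a knowledge base (the "RAM").

    Hard addressing by integer index, for illustration only: the actual
    NTM replaces the index with a soft, differentiable weighting.
    """
    def __init__(self, n_slots, slot_size):
        self.slots = np.zeros((n_slots, slot_size))

    def read(self, address):
        # Retrieve the vector stored at one location.
        return self.slots[address]

    def write(self, address, vector):
        # Overwrite the vector stored at one location.
        self.slots[address] = vector

# The "CPU" (the network) would call read/write as needed:
memory = ExternalMemory(n_slots=128, slot_size=20)
memory.write(3, np.ones(20))
r = memory.read(3)
```

Because the memory is independent of the model, its size can grow arbitrarily without changing the network itself.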

The choice of the structure for the memory is crucial to keep the read/write operations fast, no matter the size of the memory. Multiple designs have been proposed by the Deep Learning community. For example, the Neural Turing Machines, as well as the Neural Random Access Machines (Kurach & Andrychowicz et al., 2015) and the Neural GPU (Kaiser et al., 2015), use a tape with read and write heads. In Grefenstette et al., 2015, the authors use continuous versions of structures like stacks or (double-ended) queues.

An interesting side-effect of these memory-augmented networks is the ability to keep track of intermediate computations. For instance, in the case of Question-Answering (QA) problems in NLP, it can be valuable to memorize a story as the model reads it to eventually answer a question. The Memory Networks (Sukhbaatar et al., 2015) and the Dynamic Memory Networks (Kumar et al., 2015), for example, take advantage of the memory to perform well on QA tasks.

The idea of adding extra memory to RNNs is not new. In fact, Long Short-Term Memory networks (LSTMs, Hochreiter & Schmidhuber, 1997) already have a basic memory cell in which information is stored at every time-step. For an introduction to LSTMs, I recommend reading this great post by Chris Olah, which details step-by-step how they work, including the role of this memory cell. In short, the main function of this cell is to make RNNs easier to train, and to maintain long-term dependencies between elements of the input sequence.

So, what is the point of adding a memory with such low-level addressing mechanisms, compared to LSTMs? The answer is structure.

LSTMs typically hold a distributed representation of information in memory, and perform global changes across the whole memory cell at each step. This is not an issue in itself, and experience has shown that they can indeed capture some structure in the data.

Unlike LSTMs, memory-augmented networks encourage (but don’t necessarily ensure) local changes in memory. This helps not only to find structure in the training data, but also to generalize to sequences that are beyond the generalization power of LSTMs, such as longer sequences in algorithmic tasks (we will see some examples below).

You can picture the value of memory-augmented networks over LSTMs through the cocktail party effect: imagine that you are at a party, trying to figure out the name of the host while listening to all the guests at the same time. Some may know his first name, some may know his last name; guests may even know only parts of his first or last name. In the end, just as with an LSTM, you could retrieve this information by combining the signals from all the different guests. But you can imagine that it would be a lot easier if a single guest knew the full name of the host to begin with.

Neural Turing Machines

In their respective domains, Computability Theory and Machine Learning have helped push forward the abilities of computers. The former defined what a computer is capable of computing, whereas the latter allowed computers to perform tasks that are easy for humans but were long thought to be impossible for machines, such as computer vision.

Alan Turing himself studied the strong connections between computing and the prospects of Artificial Intelligence in the 1940s. His Turing Machine is a classical computational model that operates on an infinite memory tape through a head that either reads or writes symbols. Although this abstract computer has a very low-level interface, it is powerful enough to simulate any algorithm. The Neural Turing Machine (NTM) takes inspiration from these two fields to make the best of both worlds.

A Neural Turing Machine unfolded in time. The controller (green), a feed-forward neural network, depends on both the input x and the read vector r. It emits read (blue) and write (red) heads to interact with the memory M.

It is worth noting that there is an interesting connection between Turing Machines and RNNs: the latter are known to be Turing-complete (Siegelmann, 1995). This means that, just like with Turing Machines, any algorithm can be encoded by an RNN with carefully hand-picked parameters (such parameters may not be learnable from data, though).

The NTM can be seen as a differentiable version of the Turing Machine. Similarly to a Turing Machine, it has two main components: a (bounded) memory tape, and a controller that is responsible for making the interface between the external world (i.e. the input sequence and the output representation) and the memory, through read and write heads. This architecture is said to be differentiable in the sense that both the controller and the addressing mechanisms (the heads) are differentiable. The parameters of the model can then be learned using Stochastic Gradient Descent (SGD). Let’s describe these components in more detail.

Controller

The controller is a neural network that provides the internal representation of the input that is used by the read and write heads to interact with the memory. Note that this inner representation is not identical to the one that is eventually stored in memory, though the latter is a function of this representation.

The type of controller is the most significant architectural choice for a Neural Turing Machine. The controller can be either a feed-forward or a recurrent neural network. A feed-forward controller has the advantage of being faster than a recurrent one, and offers more transparency. This comes at the cost of lower expressive power, though, as it limits the type of computation the NTM can perform per time-step.
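As a minimal sketch of what a feed-forward controller does at each time-step, the figure above suggests it maps the current input x and the previous read vector r to a hidden state h. The NumPy version below uses one dense layer; all sizes, names and the tanh nonlinearity are illustrative assumptions, and the parameters would of course be learned with SGD rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, read_size, hidden_size = 8, 20, 100  # illustrative dimensions

# Illustrative random parameters; in practice these are learned with SGD.
W = rng.standard_normal((hidden_size, input_size + read_size)) * 0.1
b = np.zeros(hidden_size)

def controller_step(x, r):
    """Feed-forward controller: hidden state from the input and last read vector.

    The heads' parameters (see below) are then computed as functions of h.
    """
    return np.tanh(W @ np.concatenate([x, r]) + b)

x = rng.standard_normal(input_size)
r = np.zeros(read_size)   # read vector from the previous time-step
h = controller_step(x, r)
```

A recurrent (e.g. LSTM) controller would additionally carry its own hidden state from one time-step to the next, which is what buys the extra expressive power mentioned above.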

Another example of Neural Turing Machine, where the controller is an LSTM.

Read / Write mechanisms

The read and write heads make the Neural Turing Machines particularly interesting. They are the only components to ever interact directly with the memory. Internally, the behavior of each head is controlled by its own weight vector that gets updated at every time-step. Each weight in this vector corresponds to the degree of interaction with each location in memory (the weight vector sums to 1). A weight of 1 focuses all the attention of the NTM only on the corresponding memory location. A weight of 0 discards that memory location.
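Concretely, given a weight vector w over the N memory locations, the read and write operations from the NTM paper can be sketched as follows (a NumPy sketch; dimensions and the example values are illustrative). Reading returns a convex combination of memory rows, and writing first erases then adds content, both modulated by the same weights:

```python
import numpy as np

def read(memory, w):
    """r = sum_i w(i) * M(i): a convex combination of the memory rows."""
    return w @ memory

def write(memory, w, erase, add):
    """Erase then add, both modulated by the weight vector:

    M(i) <- M(i) * (1 - w(i) * e) + w(i) * a
    """
    memory = memory * (1.0 - np.outer(w, erase))
    return memory + np.outer(w, add)

N, M = 8, 4
memory = np.zeros((N, M))
w = np.zeros(N)
w[2] = 1.0                      # full attention on location 2
add = np.ones(M)
memory = write(memory, w, erase=np.zeros(M), add=add)
r = read(memory, w)             # reads back what was just written
```

With a one-hot w, this degenerates to the hard read/write of a classical tape; with a soft w, every location is touched a little, which is what keeps the whole operation differentiable.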

Parameters for the weight updates. These parameters are specific to each head (5 parameters per head). Each box corresponds to a 1-layer neural network with the corresponding activation function.

Moreover, we would like these weights to satisfy two requirements: they should support local changes (read or write) on the memory, while keeping their updates differentiable, since we want to train the NTM end-to-end. To this end, the weight vector is updated through a series of smooth intermediate operations.

There are four operations in total per update: content addressing, interpolation, convolutional shift and sharpening. They all depend on parameters produced by the controller. More precisely, these parameters are functions of the hidden state emitted by the controller.
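Following the formulation of the NTM paper, these four operations can be sketched in NumPy as below. The controller would emit the key k, key strength beta, interpolation gate g, shift distribution s and sharpening exponent gamma; here they are set by hand on a toy memory, and the dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_weights(memory, w_prev, k, beta, g, s, gamma):
    """One head weight update: content addressing -> interpolation
    -> convolutional shift -> sharpening."""
    # 1. Content addressing: cosine similarity with the key k, scaled by beta
    sim = memory @ k / (np.linalg.norm(memory, axis=1) * np.linalg.norm(k) + 1e-8)
    w_c = softmax(beta * sim)
    # 2. Interpolation: blend with the previous weights through the gate g in [0, 1]
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3. Convolutional shift: circular convolution with the shift distribution s
    n = len(w_g)
    w_s = np.array([sum(w_g[j] * s[(i - j) % n] for j in range(n)) for i in range(n)])
    # 4. Sharpening: re-focus the weights with gamma >= 1, then renormalize
    w = w_s ** gamma
    return w / w.sum()

N, M = 5, 3
memory = np.eye(N, M)            # toy memory content
w_prev = np.full(N, 1.0 / N)     # uniform weights from the previous step
k = memory[1]                    # look up the content of row 1
s = np.zeros(N)
s[0] = 1.0                       # "no shift" distribution
w = update_weights(memory, w_prev, k, beta=50.0, g=1.0, s=s, gamma=1.0)
```

Every step is smooth in its inputs, so gradients flow from the final weights back to the controller's hidden state, which is exactly what allows the NTM to be trained end to end with SGD.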