The first time I learned about LSTMs, my eyes glazed over.

Not in a good, jelly donut kind of way.

It turns out LSTMs are a fairly simple extension to neural networks, and they’re behind a lot of the amazing achievements deep learning has made in the past few years. So I’ll try to present them as intuitively as possible – in such a way that you could have discovered them yourself.

But first, a picture:

Aren’t LSTMs beautiful? Let’s go.

(Note: if you’re already familiar with neural networks and LSTMs, skip to the middle – the first half of this post is a tutorial.)

Neural Networks

Imagine we have a sequence of images from a movie, and we want to label each image with an activity (is this a fight?, are the characters talking?, are the characters eating?).

How do we do this?

One way is to ignore the sequential nature of the images, and build a per-image classifier that considers each image in isolation. For example, given enough images and labels:

Our algorithm might first learn to detect low-level patterns like shapes and edges.

With more data, it might learn to combine these patterns into more complex ones, like faces (two circular things atop a triangular thing atop an oval thing) or cats.

And with even more data, it might learn to map these higher-level patterns into activities themselves (scenes with mouths, steaks, and forks are probably about eating).

This, then, is a deep neural network: it takes an image input, returns an activity output, and – just as we might learn to detect patterns in puppy behavior without knowing anything about dogs (after seeing enough corgis, we discover common characteristics like fluffy butts and drumstick legs; next, we learn advanced features like splooting) – in between it learns to represent images through hidden layers of representations.

Mathematically

I assume people are familiar with basic neural networks already, but let’s quickly review them.

A neural network with a single hidden layer takes as input a vector x, which we can think of as a set of neurons.

with a single hidden layer takes as input a vector x, which we can think of as a set of neurons. Each input neuron is connected to a hidden layer of neurons via a set of learned weights.

The jth hidden neuron outputs where is an activation function

where is an activation function The hidden layer is fully connected to an output layer, and the jth output neuron outputs . If we need probabilities, we can transform the output layer via a softmax function.

In matrix notation:

where

x is our input vector

W is a weight matrix connecting the input and hidden layers

V is a weight matrix connecting the hidden and output layers

Common activation functions for are the sigmoid function, , which squashes numbers into the range (0, 1); the hyperbolic tangent, , which squashes numbers into the range (-1, 1), and the rectified linear unit, .

Here’s a pictorial view:

(Note: to make the notation a little cleaner, I assume x and h each contain an extra bias neuron fixed at 1 for learning bias weights.)

Remembering Information with RNNs

Ignoring the sequential aspect of the movie images is pretty ML 101, though. If we see a scene of a beach, we should boost beach activities in future frames: an image of someone in the water should probably be labeled swimming, not bathing, and an image of someone lying with their eyes closed is probably suntanning. If we remember that Bob just arrived at a supermarket, then even without any distinctive supermarket features, an image of Bob holding a slab of bacon should probably be categorized as shopping instead of cooking.

So what we’d like is to let our model track the state of the world:

After seeing each image, the model outputs a label and also updates its knowledge of the world. For example, the model might learn to automatically discover and track information like location (are scenes currently in a house or beach?), time of day (if a scene contains an image of the moon, the model should remember that it’s nighttime), and within-movie progress (is this image the first frame or the 100th?). Importantly, just as a neural network automatically discovers hidden patterns like edges, shapes, and faces without being fed them, our model should automatically discover useful information by itself. When given a new image, the model should incorporate the knowledge it’s gathered to do a better job.

This, then, is a recurrent neural network. Instead of simply taking an image and returning an activity, an RNN also maintains internal memories about the world (weights assigned to different pieces of information) to help perform its classifications.

Mathematically

So let’s add the notion of internal knowledge to our equations, which we can think of as memories or pieces of information that the network maintains over time.

But this is easy: we know that the hidden layers of neural networks already encode useful information about their inputs, so why not use these layers as memory? This gives us our RNN equations:





Note that the hidden state computed at time ( , our internal knowledge) is fed back at the next time step. (Also, I’ll use concepts like hidden state, knowledge, memories, and beliefs to describe interchangeably.)

Longer Memories through LSTMs

Let’s think about how our model updates its knowledge of the world. So far, we’ve placed no constraints on this update, so its knowledge can change pretty chaotically: at one frame it thinks the characters are in the US, at the next frame it sees the characters eating sushi and thinks they’re in Japan, and at the next frame it sees polar bears and thinks they’re on Hydra Island. Or perhaps it has a wealth of information to suggest that Alice is an investment analyst, but decides she’s a professional assassin after seeing her cook.

This chaos means information quickly transforms and vanishes, and it’s difficult for the model to keep a long-term memory. So what we’d like is for the network to learn how to update its beliefs (scenes without Bob shouldn’t change Bob-related information, scenes with Alice should focus on gathering details about her), in a way that its knowledge of the world evolves more gently.

This is how we do it.

Adding a forgetting mechanism. If a scene ends, for example, the model should forget the current scene location, the time of day, and reset any scene-specific information; however, if a character dies in the scene, it should continue remembering that he’s no longer alive. Thus, we want the model to learn a separate forgetting/remembering mechanism: when new inputs come in, it needs to know which beliefs to keep or throw away. Adding a saving mechanism. When the model sees a new image, it needs to learn whether any information about the image is worth using and saving. Maybe your mom sent you an article about the Kardashians, but who cares? So when new a input comes in, the model first forgets any long-term information it decides it no longer needs. Then it learns which parts of the new input are worth using, and saves them into its long-term memory. Focusing long-term memory into working memory. Finally, the model needs to learn which parts of its long-term memory are immediately useful. For example, Bob’s age may be a useful piece of information to keep in the long term (children are more likely to be crawling, adults are more likely to be working), but is probably irrelevant if he’s not in the current scene. So instead of using the full long-term memory all the time, it learns which parts to focus on instead.

This, then, is an long short-term memory network. Whereas an RNN can overwrite its memory at each time step in a fairly uncontrolled fashion, an LSTM transforms its memory in a very precise way: by using specific learning mechanisms for which pieces of information to remember, which to update, and which to pay attention to. This helps it keep track of information over longer periods of time.

Mathematically

Let’s describe the LSTM additions mathematically.

At time , we receive a new input . We also have our long-term and working memories passed on from the previous time step, and both n-length vectors), which we want to update.

We’ll start with our long-term memory. First, we need to know which pieces of long-term memory to continue remembering and which to discard, so we want to use the new input and our working memory to learn a remember gate of n numbers between 0 and 1, each of which determines how much of a long-term memory element to keep. (A 1 means to keep it, a 0 means to forget it entirely.)

Naturally, we can use a small neural network to learn this remember gate:

(Notice the similarity to our previous network equations; this is just a shallow neural network. Also, we use a sigmoid activation because we need numbers between 0 and 1.) Next, we need to compute the information we can learn from , i.e., a candidate addition to our long-term memory:

is an activation function, commonly chosen to be .

Before we add the candidate into our memory, though, we want to learn which parts of it are actually worth using and saving:

(Think of what happens when you read something on the web. While a news article might contain information about Hillary, you should ignore it if the source is Breitbart.)

Let’s now combine all these steps. After forgetting memories we don’t think we’ll ever need again and saving useful pieces of incoming information, we have our updated long-term memory:

where denotes element-wise multiplication.

Next, let’s update our working memory. We want to learn how to focus our long-term memory into information that will be immediately useful. (Put differently, we want to learn what to move from an external hard drive onto our working laptop.) So we learn a focus/attention vector:

Our working memory is then

In other words, we pay full attention to elements where the focus is 1, and ignore elements where the focus is 0.

And we’re done! Hopefully this made it into your long-term memory as well. To summarize, whereas a vanilla RNN uses one equation to update its hidden state/memory:

An LSTM uses several:

where each memory/attention sub-mechanism is just a mini brain of its own:

(Note: the terminology and variable names I’ve been using are different from the usual literature. Here are the standard names, which I’ll use interchangeably from now on:

The long-term memory, , is usually called the cell state , denoted .

, is usually called the , denoted . The working memory, , is usually called the hidden state , denoted . This is analogous to the hidden state in vanilla RNNs.

, is usually called the , denoted . This is analogous to the hidden state in vanilla RNNs. The remember vector, , is usually called the forget gate (despite the fact that a 1 in the forget gate still means to keep the memory and a 0 still means to forget it), denoted .

, is usually called the (despite the fact that a 1 in the forget gate still means to keep the memory and a 0 still means to forget it), denoted . The save vector, , is usually called the input gate (as it determines how much of the input to let into the cell state), denoted .

, is usually called the (as it determines how much of the input to let into the cell state), denoted . The focus vector, , is usually called the output gate, denoted .

Neural Networks

I could have caught a hundred Pidgeys in the time it took me to write this post, so here’s a cartoon.

Recurrent Neural Networks

LSTMs

Learning to Code

Let’s look at a few examples of what an LSTM can do. Following Andrej Karpathy’s terrific post, I’ll use character-level LSTM models that are fed sequences of characters and trained to predict the next character in the sequence.

While this may seem a bit toyish, character-level models can actually be very useful, even on top of word models. For example:

Imagine a code autocompleter smart enough to allow you to program on your phone. An LSTM could (in theory) track the return type of the method you’re currently in, and better suggest which variable to return; it could also know without compiling whether you’ve made a bug by returning the wrong type.

An LSTM could (in theory) track the return type of the method you’re currently in, and better suggest which variable to return; it could also know without compiling whether you’ve made a bug by returning the wrong type. NLP applications like machine translation often have trouble dealing with rare terms. How do you translate a word you’ve never seen before, or convert adjectives to adverbs? Even if you know what a tweet means, how do you generate a new hashtag to capture it? Character models can daydream new terms, so this is another area with interesting applications.

So to start, I spun up an EC2 p2.xlarge spot instance, and trained a 3-layer LSTM on the Apache Commons Lang codebase. Head over to Pastebin to check out the program it generates after a few hours.

While the code certainly isn’t perfect, it’s better than a lot of data scientists I know. And we can see that the LSTM has learned a lot of interesting (and correct!) coding behavior:

It knows how to structure classes: a license up top, followed by packages and imports, followed by comments and a class definition, followed by variables and methods. Similarly, it knows how to create methods: comments follow the correct orders (description, then @param, then @return, etc.), decorators are properly placed, and non-void methods end with appropriate return statements. Crucially, this behavior spans long ranges of code – see how giant the blocks are!

a license up top, followed by packages and imports, followed by comments and a class definition, followed by variables and methods. Similarly, it knows how to create methods: comments follow the correct orders (description, then @param, then @return, etc.), decorators are properly placed, and non-void methods end with appropriate return statements. Crucially, this behavior spans long ranges of code – see how giant the blocks are! It can also track subroutines and nesting levels: indentation is always correct, and if statements and for loops are always closed out.

indentation is always correct, and if statements and for loops are always closed out. It even knows how to create tests.

How does the model do this? Let’s look at a few of the hidden states.

Here’s a neuron that seems to track the code’s outer level of indentation (each character is colored by the neuron’s state when reading the character as input, i.e., when trying to generate the next character; red cells are negative, and blue cells are positive):

And here’s a neuron that counts down the spaces between tabs:

For kicks, here’s the output of a different 3-layer LSTM trained on TensorFlow’s codebase:

"""Tests for softplus layer tests.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function import collections import numpy as np from tensorflow.python.platform import test class InvalidAllOpCost(Experiment): def _runTestToIndForDead(self): return self._divs() def testPad(self): with ops.Graph().as_default(): var = sess.run(bucketized_op) self.assertAllClose( list(variables.global_variables()), status.eval()) def testHttptimenaterRoutingOptimizerSize(self): with self.test_session() as sess: table = lookup_ops.IdTableWithHashBuckets( keys=['id', 'z'], example_id_column='price', num_outputs=6, input_columns=['dummy_range', 'feature', 'dimensions']) with self.assertRaisesRegexp(ValueError, 'Expected dict of rank dimensions'): fc.numeric_column('aaa', indices=[[0, 0], [1, 0]], dtype=dtypes.int64) output = table.lookup(input_string) # all input tensors in SparseColumn has dimensions [end_back_prob, dimension] in the format. with self.assertRaisesRegexp( TypeError, "Shape of values must be specified during training."): fc.bucketized_column(attrs, boundaries=[62, 62])

There are plenty of other fun examples floating around the web, so check them out if you want to see more.

Still with me? If your eyes haven’t glazed over yet, head over to Exploring LSTMs: Part Two to investigate LSTM internals.