It seems like most of our posts on this blog start with “We’re back!”, so… you know the drill. It’s been a while since our last post — just over 5 months — but it certainly doesn’t feel that way. Whether our articles are more spaced out than we’d like them to be, well, we haven’t actually discussed that yet. But I, Rohan, would definitely like to get into a more frequent routine. Since November, we’ve been grinding on school (basically, getting it over and done with), banging out Contra v2, and lazing around more than we should. End of senior year is a fun time.

It’s 2017. We started A Year Of AI in 2016. Last year. Don’t panic, though. If you’ve read our letter, you’ll know that, despite our name and inception date, we’re not going anywhere anytime soon. There’s a good chance we’ll move off Medium, but we’re still both obsessed with AI and writing these posts to hopefully make other people obsessed, as well.

I wrote the first article on this blog just over a year ago, and mentioned that my goal for the year was to be accepted into Stanford University as an undergrad student. A few months ago, I achieved this goal. At Stanford, I’ll probably be studying Symbolic Systems, which is a program that explores both the humanities and STEM to inform an understanding of artificial intelligence and the nature of minds. Needless to say, A Year of AI will continue to document the new things I learn 😀.

Anyways, you can find plenty of articles on recurrent neural networks (RNNs) online. My favorite one, personally, is from Andrej Karpathy’s blog. I read it about 1.5 years ago when I was learning about RNNs. We definitely think there’s space to simplify the topic even more, though. As usual, that’s our aim for the article: to teach you RNNs in a fun, simple manner. Just as importantly, we’re doing this for completeness; we want people to hop onto A Year of AI and be able to work their way up all the way from logistic regression to neural machine translation (don’t worry, you’ll find out what that means soon enough), and thus recurrent neural networks are a vital addition. After this, we want to look at and summarize/simplify a bunch of new super interesting research papers, and for most of them RNNs are a key ingredient. Finally, we think this article contains so much meat and ties together content unlike any other RNN tutorial on the interwebs.

Before we get started, you should try to familiarize yourself with “vanilla” neural networks. If you need a refresher, check out our neural networks and backpropagation mega-post from earlier this year. This is so you know the basics of machine learning, linear algebra, neural network architecture, cost functions, optimization methods, training/test sets, activation functions/what they do, softmax, etc. Reading our article on convolutional neural networks may also make you more comfortable entering this post, especially because we often reference CNNs. Checking out this article I wrote on vanishing gradients will help later on, as well.

Rule of thumb: the more you know, the better!

Table of Contents

I can’t link to each section, but here’s what we cover in this article (save the intro and conclusion):

What can RNNs do? Where we look at… what RNNs can do!

Why? Where we talk about the gap that RNNs fill in machine learning’s suite of algorithms.

Show me. Where we visualize RNNs for the first time.

Formalism. Where we walk through how an RNN mathematically works with proper notation.

An example? Okay! Where we walk through, qualitatively, a simple application of RNNs and how the RNN operates in this application, including techniques we can use.

Training (or, why vanilla RNNs suck.) Where we talk about how to train RNNs, and why vanilla RNNs are bad at learning.

Fixing the problem with LSTMs (Part I). Where we introduce the solution to vanilla RNNs’ inability to learn: LSTMs.

Fixing the problem with LSTMs (Part II). Where we analyze on a close, technical level, the reasons LSTMs don’t suffer from vanishing gradients as much (and why they still do, to an extent). Then we conclude LSTMs with final thoughts on and facts about them.

Yay RNNs! Where you get to see neat little things RNNs have done!

In Practice. Where we look at more technical and important applications and case studies of RNNs, including other variations of RNNs, especially as relevant in hot/recent research papers.

Building a Vanilla Recurrent Neural Network. Where you get to code your very first RNN! Woohoo!

What can RNNs do?

There are a number of very important tasks that ANNs and CNNs cannot solve, that RNNs are used for instead. Tasks like: image captioning, language translation, sentiment classification, predictive typing, video classification, natural language processing, speech recognition, and a lot more interesting things that have been presented in recent research papers (for example… learning to learn by gradient descent by gradient descent!).

Image captioning, taken from CS231n slides: http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf

RNNs are very powerful. Y’know how regular neural networks have been proved to be “universal function approximators”? If you didn’t:

In the mathematical theory of artificial neural networks, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function.

That’s pretty confusing. Basically, what this states is that an artificial neural network can approximate any continuous function as closely as we like. Even if someone gives you an extremely wiggly, complex looking function, it’s guaranteed that there exists a neural network that can extremely closely approximate it. The proof itself is very complex, but this is a brilliant article offering a visual approach as to why it’s true.

So, that’s great. ANNs are universal function approximators. RNNs take it a step further, though; they can compute/describe programs. In fact, some RNNs with proper weights and architecture qualify as Turing Complete:

A Turing Complete system means a system in which a program can be written that will find an answer (although with no guarantees regarding runtime or memory). So, if somebody says “my new thing is Turing Complete” that means in principle (although often not in practice) it could be used to solve any computation problem. — http://stackoverflow.com/a/7320/1260708

That’s cool, isn’t it? Now, this is all theoretical, and in practice means less than you think, so don’t get too hyped. Hopefully, though, this gives some more insight into why RNNs are super important for future developments in machine learning — and why you should read on.

At this point, if you weren’t previously hooked on learning what the heck these things are, you should be now. (If you still aren’t, just bear with me. Things will get spicy soon.) So, let’s dive in.

Why?

We took a bit of a detour to talk about how great RNNs are, but haven’t focused on why ANNs can’t perform well in the tasks that RNNs can.

Why do we need another neural network model? Why do we need recurrent neural networks when we already have the beloved ANNs (and CNNs) in all their glory?

It boils down to a few things:

ANNs can’t deal with sequential or “temporal” data

ANNs lack memory

ANNs have a fixed architecture

RNNs are more “biologically realistic” because of the recurrent connectivity found in the visual cortex of the brain

Let’s address the first three points individually. The first issue refers to the fact that ANNs have a fixed input size and a fixed output size. ANNs have an elaborate list of hyperparameters, and this notably includes the number of neurons in the input layer and output layer. But what if we wanted input data and/or output data of variable size, instead of something that needs to have its size as a preset constant? RNNs allow us to do that. In this aspect, they offer more flexibility than ANNs.

We might choose this architecture for our ANN, with 4 inputs and 1 output. But that’s it — we can’t input a vector with 5 values, for example. https://qph.ec.quoracdn.net/main-qimg-050d12c5da82f4f97fdd942d7777b8e4.

I’ll give you a couple examples of why this matters.

It’s unclear how we could use an ANN by itself to perform a task like image captioning, because the network would need to output a sentence — a list of words in a specific order — which is a sequence. It would be a sequence of vectors, because each word would need to be represented numerically. In machine learning and data science, we represent words numerically as vectors; these are called word embeddings. An ANN can only output a single word/label, like in image classification where we treat the output as the label with the highest value in the final vector that is a softmax probability distribution over all classes. The only way to make sentences work with ANNs would be to have billions of output neurons that each map to a single possible sentence in the permutation of all [sensible] sentences that can be formed by the vocabulary we have. And that doesn’t sound like a good idea.

A reminder of what the output of an ANN looks like — a probability distribution over classes — and how we convert that into a single final result (one-hot encoding): by taking the label with the greatest probability and making it 1, with the rest 0.
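To make that conversion concrete, here is a minimal sketch in plain Python. The scores are made up and the helper names (`softmax`, `one_hot_of_max`) are just for illustration; they don’t come from any particular library.

```python
import math

def softmax(scores):
    """Turn raw final-layer scores into a probability distribution."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_of_max(probs):
    """Put a 1 at the most probable class and 0 everywhere else."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1 if i == best else 0 for i in range(len(probs))]

scores = [1.0, 3.0, 0.5]        # hypothetical scores for 3 classes
probs = softmax(scores)         # probabilities that sum to 1
label = one_hot_of_max(probs)   # the single final result
```

However many classes there are, the result is still exactly one label, which is the limitation discussed here.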

Wow, that was a lot of words. Nevertheless, I hope it’s clear that, with ANNs, there’s no feasible way to output a sequence.

Now, what about inputting a sequence into an ANN? In other words, “temporal” data: data that varies over time, and is thus a sequence. Take the example of sentiment classification where we input a sentence (sequence of words = sequence of vectors = sequence of set of values where each value goes into an individual neuron) and want to output its sentiment: positive or negative. The output part seems easy, because it’s just one neuron that’s either rounded to 1 (positive) or 0 (negative). And, for the input, you might be thinking: couldn’t we input each “set of values” separately? Input the first word, wait for the neural net to fully feed forward and produce an output, then input the next word, etc. etc.

Let’s take the case of this utterly false, and most certainly negative sentence, to evaluate:

This is just an alternative fact, believe me! Lenny is actually a great coder. The best I know of. The best.

We’d input “Lenny” first, then “Khazan”, then “is”, etc. But, at each feedforward iteration, the output would be completely useless. Why? Because the output would be dependent on only that word. We’d be finding the sentiment of a single word, which is useless, because we want the sentiment of the entire sentence. Sentiment analysis only makes sense when all the words come together, dependent on each other, to form a sentence.

Think of it this way — this means you’re essentially running a neural network a bunch of times, just with new data at each separate iteration. Those run-throughs aren’t linked in any way; they’re independent. Once you feedforward and fully run the neural network, it forgets everything it just did. This sentence only makes sense and can only be interpretable because it’s a collection of words put together in a specific order to form meaning. The relevance of each word is dependent on the words that precede it: the context. This is why RNNs are being used heavily in NLP; they retain context by having memory. ANNs have no memory.
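Here is a toy sketch of that difference, using single numbers instead of word vectors. The weights and the “word encodings” are invented for illustration; the only point is that a stateless function gives the same answer for the same word no matter what came before, while a step that carries a hidden value forward does not.

```python
import math

def feedforward(x, w=0.8):
    # Stateless: the output depends only on the current input.
    return math.tanh(w * x)

def recurrent_step(h_prev, x, w_x=0.8, w_h=0.5):
    # Stateful: the output also depends on what came before (h_prev).
    return math.tanh(w_x * x + w_h * h_prev)

def run_recurrent(sentence):
    h = 0.0  # start with an empty memory
    for x in sentence:
        h = recurrent_step(h, x)
    return h

sentence_a = [1.0, -1.0, 0.5]   # hypothetical word encodings
sentence_b = [-1.0, 1.0, 0.5]   # same last word, different context

# Feeding only the last word into the stateless net: context is lost.
ff_a = feedforward(sentence_a[-1])
ff_b = feedforward(sentence_b[-1])

# Running the recurrent step over the whole sentence: context matters.
rnn_a = run_recurrent(sentence_a)
rnn_b = run_recurrent(sentence_b)
```

`ff_a` and `ff_b` come out identical, while `rnn_a` and `rnn_b` differ because the earlier words left a trace in `h`.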

I like this quote from another article on RNNs:

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. — http://colah.github.io/posts/2015–08-Understanding-LSTMs/

(Furthermore, take the case where we had sequential data in both the input and the output. Translating one language to another is a good example of this. Clearly, ANNs aren’t the answer.)

RNNs don’t just need memory; they need long term memory. Let’s take the example of predictive typing. Let’s say we typed the following sentence in an SMS message to 911, and the operating system needs to fill in the blank:

The face of a criminal?

Here, if the RNN wasn’t able to look back much (ie. before “should”), then many different options could arise:

Lenny in the military? Make it into a TV show! I’d watch it.

The word “sent” would indicate to the RNN that a location needs to be outputted. However, if the RNN was able to retain information from all the way back, such as the word “criminal”, then it would be much more confident that:

The probability of outputting “jail” drastically increases when it sees the word “criminal” is present. That’s why context matters, be it predictive typing, image captioning, machine translation, etc. The output or outputs of a recurrent neural network will always be functionally dependent on (meaning, a function of) information from the very beginning, but how much it chooses to “forget” or “retain” (that is, varying degrees of influence from earlier information) depends on the weights that it learns from the training data.

As it turns out, RNNs — especially deep ones — are rarely good at retaining much information, due to an issue called the vanishing gradient problem. That’s where we turn to other variants of RNNs such as LSTMs and GRUs. But, more on that later.

To address the third point, one more constraint with ANNs is that they have a fixed number of computation/processing steps (because, once again, the number of hidden layers is a hyperparameter). With RNNs, we can have much more dynamic processing since we operate over vectors. Each neuron in an RNN is almost like an entire layer in an ANN; this will make more sense as we bring up an illustration for you. Exciting stuff.

Show me.

OK, that’s enough teasing. Three sections into the article, and you’re yet to see what an RNN looks like, or appreciate how it really works. Everything comes in due time, though!

The first thing I’m going to do is show you what a normal ANN diagram looks like:

Each neuron stores a single scalar value. Thus, each layer can be considered a vector.

Now I’m going to show you what this ANN looks like in our RNN visual notation:

The two diagrams above represent the same thing. The latter, obviously, looks more succinct than the former. That’s because, with our RNN visual notation, each neuron (inputs, hidden(s), and outputs) contains a vector of information. The term “cell” is also used, and is interchangeable with neuron. (I’ll use the latter instead of the former.) Red is the input neuron, blue is the hidden neuron, and green is the output neuron. Therefore, an entire ANN layer is encapsulated into one neuron with our RNN illustration. All operations in RNNs, like the mapping from one neuron’s state to another, are over entire vectors, compared to individual scalars that are summed up with ANNs.

Let’s flip it the other way:

This is in fact a type of recurrent neural network — a one to one recurrent net, because it maps one input to one output. A one to one recurrent net is equivalent to an artificial neural net.

We can have a one to many recurrent net, where one input is mapped to multiple outputs. An example of this would be image captioning — the input would be the image in some processed form (usually the result of a CNN analyzing the image), and the output would be a sequence of words. Such an RNN may look like this:

Changed the shades of the green nodes… hope that’s OK!

This may be confusing at first, so I’m going to make sure I walk slowly through it. On the x-axis we have time, and on the y-axis we have depth/layers:

When I refer to “time” on the x-axis, I’m referring to the order at which these operations occur. Time could also be literal for temporal data, where the input is a sequence. When I say “depth” on the y-axis, I’m referring to the mapping from the input layer, to the hidden layer(s), to the output layer, where layer number and thus depth increases.

It may look like we have seven neurons now, but we still have three: one input neuron, one hidden neuron, and one output neuron. The difference is that these neurons now experience multiple “timesteps” where they take on different values, which are, again, vectors. The input neuron in our example above doesn’t, because it’s not representing sequential data (one to many), but for other architectures it could.

The hidden neuron will take on the vector value h_1 first, then h_2, and finally h_3. At each timestep, the hidden neuron’s vector h_t is a function of the vector at the previous timestep h_t-1, except for h_1 which is dependent only on the input x_1. In the diagram above, each hidden vector then gives rise to an output y_t, and this is how we map one input to multiple outputs. You can visualize these functional dependencies with the arrows, which illustrate the flow of information in the network.

As we progress on the x-axis, the current timestep increases. As we progress on the y-axis, the neuron in question changes. Each point on this graph thus represents one neuron — be it input, hidden, or output — at some timestep, being fed information from a neuron (be it itself or another) at the previous timestep.

The RNN would execute like so:

Input x_1

Compute h_1 based on x_1 (the arrow implies functional dependency)

Compute h_2 based on h_1

Compute h_3 based on h_2

Compute y_1 based on h_1

Compute y_2 based on h_2

Compute y_3 based on h_3

You could compute y_t either immediately after h_t has been computed, or, like above, compute all outputs once all hidden states have been computed. I’m not entirely sure which is more common in practice.
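The execution order above can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the weights are hand-picked toy values, the names `W_xh`, `W_hh`, and `W_hy` simply label the three kinds of arrows in the diagram, tanh is used as the squashing function, and biases are left out.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

# Toy weights, picked by hand purely for illustration.
W_xh = [[0.5, -0.2], [0.1, 0.3]]   # input  -> hidden
W_hh = [[0.4, 0.0], [-0.3, 0.2]]   # hidden -> hidden (through time)
W_hy = [[1.0, -1.0]]               # hidden -> output

x_1 = [1.0, 2.0]                   # the single input, e.g. image features

# Hidden states first...
h_1 = tanh_vec(matvec(W_xh, x_1))  # h_1 depends only on x_1
h_2 = tanh_vec(matvec(W_hh, h_1))  # h_2 depends on h_1
h_3 = tanh_vec(matvec(W_hh, h_2))  # h_3 depends on h_2

# ...then one output per hidden state.
y_1, y_2, y_3 = (matvec(W_hy, h) for h in (h_1, h_2, h_3))
```

Since each y_t only reads h_t, computing it right after h_t or after all hidden states gives the same numbers; only the order of the work changes.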

This allows for more complex and interesting networks than ANNs because we can have as many timesteps as we want.

The value of the output neuron at each timestep represents a word in the sentence, in the order the sentence will be constructed. The caption this RNN produces is hence 3 words long. (It’s actually 2, because the RNN would need to output a period or <END> marker at the final timestep, but we’ll get into that later.)

In case you don’t understand yet exactly why RNNs work, I’ll walk through how these functional dependencies come to fruition when you apply it to a one to many scenario such as image captioning.

Lenny and I on student scholarship at WWDC 2013. Good times!

When you combine an RNN and CNN, you get, in practice, an “LCRN” (you’ll more often see this abbreviated as LRCN, for long-term recurrent convolutional network). The architecture of LCRNs is more complex than what I’m going to present in the next paragraph; I’m going to simplify it to convey my point. We’ll actually get fully into how they work later.

Imagine an RNN tries to caption this image. An accurate result might be:

Two people happily posing for a photo inside a building.

The input to the RNN would be the output of a CNN that processes this image. (However, to be pedantic, it would be the output of the CNN without a classification/softmax layer — that is, pulled from the final fully connected layer.) The CNN might pick up on the fact that there are two primary human face-like objects present in the image, which, paired with what the RNN has learned via training, may induce the first hidden state¹ of the recurrent neural network to be one where the most likely candidate word is “two”.

Pro-tip¹: The term “hidden state” refers to the vector of a hidden neuron at a given timestep. “First hidden state” refers to the hidden state at timestep 1.

The first output, which represents the word “two”, was functionally dependent on the first hidden state, which in itself was a function of the input to the RNN. Thus, “two” was ultimately determined from the information that the CNN gave us and the experience/weights of the RNN. Now, the second word, “people”, is functionally dependent on the second hidden state. However, note that the second hidden state is just a function of the first hidden state. This means that the word “people” was the most likely candidate given the hidden state where “two” was likely. In other words, the RNN recognized that, given the word “two”, the word “people” should be next, based on the RNN’s experience from training and the initial image [analysis] we inputted.

The same will occur for every following word; the nth word will be based on the nth hidden state, which, ultimately, is a function of every hidden state before it, and thus could be interpreted purely as an extremely complex and layered function of the input. The weights do the heavy lifting by making sense of all this information and deducing an output from it.

To put it bluntly, you can boil down what the RNN is “thinking” to this:

Based on what I’ve seen from the input, based on the current timestep I’m at, and based on what I know from all my training, I need to output: “x”.

Thus, each outputted word is dependent on the words before it, all the way back to the input image data. However, this relationship is indirect. It’s indirect because the outputs are only dependent on the hidden states, not on each other (i.e. the RNN doesn’t deduce “people” from “two”; it deduces “people”, partly, from the information, the hidden state, that gave rise to “two”). In LCRNs, though, this is explicit instead of implicit; we “sample” the output of one timestep by taking it and literally feeding it back as input into the next timestep. In a sense, LCRNs can hence be interpreted as having a many to many architecture.

The exact quantitative relationships depend on the RNN’s weights. But, generally, this is the concept of memory in play. Creating a coherent sentence as we go along is only really possible if we can recall what we said before. And RNNs are able to do exactly that; they remember what they said before and figure out, based on their image captioning expertise, what from this is useful to continue accurately speaking.

Yep, I went to France for a holiday. And I actually learned to speak some <wait, shit, what was the language again? oh yea, “France”…> French!

Obviously, an RNN needs to be trained and have proper weights for this to all function properly. RNNs aren’t magic; they only work because trained networks identified and learned patterns in data during training time that they now look for during prediction.

Perhaps this was a bit over-explaining on my part, but hopefully I nailed down some important and core ideas about how RNNs function.

So far we’ve looked at one to one and one to many recurrent networks. We can also have many to one:

With many to one (and many to many), the input is in the form of a sequence, and so the hidden states are functionally dependent on both the input at that timestep and the previous hidden state. This is different to one to many, where the hidden state after h_1 is only dependent on the previous hidden state. That’s why, in the image above, the second hidden state has two arrows directed at it.

Only one output exists in many to one architecture. An example application is sentiment classification, where the input is a sentence (sequence of words) and the output is a probability indicating that the inputted sentence was positive.
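As a rough sketch of the many to one idea (not a trained sentiment model), the function below reads a sequence of any length and returns a single probability. The weights are arbitrary toy values, the hidden state is assumed to start at zero, biases are left out, and the name `many_to_one` is made up for this example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def many_to_one(inputs, W_xh, W_hh, w_hy):
    """Read a whole sequence of vectors, return one probability."""
    h = [0.0] * len(W_hh)  # assumed all-zero starting hidden state
    for x_t in inputs:
        # h_t depends on BOTH the current input and the previous hidden state
        pre = [a + b for a, b in zip(matvec(W_xh, x_t), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
    score = sum(w * v for w, v in zip(w_hy, h))  # single output neuron
    return 1.0 / (1.0 + math.exp(-score))        # sigmoid -> probability

# Toy weights, picked by hand purely for illustration.
W_xh = [[0.6, -0.4], [0.2, 0.5]]
W_hh = [[0.3, 0.1], [-0.2, 0.4]]
w_hy = [1.5, -0.5]

short_sentence = [[1.0, 0.0], [0.0, 1.0]]
long_sentence = short_sentence + [[1.0, 1.0], [0.5, -0.5], [0.0, 1.0]]

p_short = many_to_one(short_sentence, W_xh, W_hh, w_hy)  # 2 timesteps
p_long = many_to_one(long_sentence, W_xh, W_hh, w_hy)    # 5 timesteps
p_reversed = many_to_one(short_sentence[::-1], W_xh, W_hh, w_hy)
```

The same weights handle both sequence lengths, and reversing the word order changes the result, which is the kind of order sensitivity a sentence needs.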

The final type of recurrent net is many to many, where both the input and output are sequential:

A use case would be machine translation where a sequence of words in one language needs to be translated to a sequence of words in another.

We can also go deeper and have multiple hidden layers, and/or a greater number of timesteps:

We’re getting deeper and deeper!

Really, this could be considered as multiple RNNs. Technically, you can consider each “hidden layer” as an RNN itself, given each neuron operates on vectors and updates through time; in ANN context, that volume of operations would be considered an entire network. So this is like stacking RNNs on top of each other. However, in this article I’ll refer to it as multiple hidden layers; different papers and lecturers may take different approaches.

When we have many timesteps (usually hundreds) and multiple hidden layers, the architecture of the network becomes much more complex and interesting. One feature of this RNN, in particular, is that all the outputs, including the first, depend on not just the input up to that timestep, but all of the inputs. (You can see this because the green neuron is only introduced after the final input timestep.) If this RNN was to translate English to Chinese, the first word of translated Chinese isn’t just dependent on the first word of the inputted English; it’s dependent on the entire sentence.
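A toy sketch of this “read everything, then write” behaviour is below. It assumes a single hidden layer, hand-picked weights, no biases, and an all-zero vector standing in for “no input” during the output phase; the function name `read_then_write` is invented. It is not how a real translation system is built, just a way to see the dependency.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def step(h_prev, x_t, W_xh, W_hh):
    pre = [a + b for a, b in zip(matvec(W_xh, x_t), matvec(W_hh, h_prev))]
    return [math.tanh(p) for p in pre]

def read_then_write(inputs, n_outputs, W_xh, W_hh, W_hy):
    h = [0.0] * len(W_hh)
    for x_t in inputs:                 # reading phase: no outputs yet
        h = step(h, x_t, W_xh, W_hh)
    no_input = [0.0] * len(W_xh[0])    # writing phase: the inputs have run out
    outputs = []
    for _ in range(n_outputs):
        h = step(h, no_input, W_xh, W_hh)
        outputs.append(matvec(W_hy, h))
    return outputs

# Toy weights, picked by hand purely for illustration.
W_xh = [[1.0, 0.0], [0.0, 1.0]]
W_hh = [[0.5, 0.1], [0.0, 0.5]]
W_hy = [[1.0, 1.0]]

sentence = [[1.0, 0.0], [0.0, 1.0]]
changed_first_word = [[-1.0, 0.0], [0.0, 1.0]]

out_a = read_then_write(sentence, 3, W_xh, W_hh, W_hy)
out_b = read_then_write(changed_first_word, 3, W_xh, W_hh, W_hy)
```

Changing only the first input word changes even the first output, because every output is computed from a hidden state that has already seen the whole input.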

One way to demonstrate why this matters is to use Google Translate:

One of my favorite Green Day lyrics, from the song “Fashion Victim” on WARNING:. Side-note: Based on my experience with Google Translate in Chinese class over the last 8 years, this translation is probably off.

Now I’ll input “He’s a victim” and “of his own time” separately. You’ll notice that when you join the two translated outputs, this won’t be equal to the corresponding phrase in the first translation:

What happens if we break up the English into different parts, translate, and join together the translated Chinese parts?

They’re not equal.

What gives? Well, the way sentences are constructed in languages can differ in varying scenarios. Some words in English may also map to multiple different words in Chinese, depending on how they’re used. It all depends on the context and the entire sentence as a whole: the meaning you’re trying to convey. This is the exact approach a human translator would take.

Another type of many to many architecture exists where each neuron has a state at every timestep, in a “synchronized” fashion. Here, each output is only dependent on the inputs that were fed in during or before it. Because of this, synchronized many to many probably wouldn’t be suitable for translation.

An application for this could be video classification where each frame needs to be mapped to some sort of class or label. Interesting note — an RNN is better at this task than CNNs are because what’s going on in a scene is much easier to understand if you’ve watched the video up to that point and thus can contextualize it. That’s what humans do!
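Here is a small sketch of the synchronized variant, again with made-up weights, a single hidden layer, and no biases. One output is produced per input, so an output can only reflect the inputs seen so far. The name `synchronized` and the “frames” are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def synchronized(inputs, W_xh, W_hh, W_hy):
    """Emit one output per input, at the same timestep."""
    h = [0.0] * len(W_hh)
    outputs = []
    for x_t in inputs:
        pre = [a + b for a, b in zip(matvec(W_xh, x_t), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
        outputs.append(matvec(W_hy, h))  # y_t only sees inputs up to time t
    return outputs

# Toy weights, picked by hand purely for illustration.
W_xh = [[1.0, 0.0], [0.0, 1.0]]
W_hh = [[0.5, 0.1], [0.0, 0.5]]
W_hy = [[1.0, 1.0]]

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed_last_frame = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

out_a = synchronized(frames, W_xh, W_hh, W_hy)
out_b = synchronized(changed_last_frame, W_xh, W_hh, W_hy)
```

Changing the last frame leaves the first two outputs untouched and only alters the third, which is exactly why this layout fits per-frame labelling better than translation.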

Quick note: we can “wrap” the RNN into a much more succinct form, where we collapse the depth and time properties, like so:

This notation demonstrates that RNNs take input, process that input through multiple timesteps and hidden layers, and produce output. The arrow both leaving and entering the RNN conveys that an RNN hidden state is functionally dependent on the hidden state at the preceding timestep; it’s sort of like a loop that feeds itself.

Whenever you read about “unrolling” an RNN that’s drawn in the collapsed format above into a feedforward network, this means we expand it to show all timesteps and hidden layers like we did before.

Another quick note: when somebody or a research paper mentions that they are using “512 RNN units”, this translates to: “1 RNN neuron that outputs a 512-wide vector”; that is, a vector with 512 values. At first, I thought this meant that maybe at each timestep there were 512 separate neurons somehow working in conjunction, but nope, it’s luckily much simpler than that… albeit strangely worded.

Furthermore, one “RNN unit” usually refers to an RNN with one hidden layer; thus, instead of defining RNN as something that is multilayer inherently, we often see people use the phrase like: “stacking RNNs on top of each other”. Each RNN will have its own weights, but connecting them gives rise to an overarching multilayer RNN. In this article, we treat recurrent neural networks as a model that can have variable timesteps t and fixed layers ℓ. Just make sure you understand that this is not always the case. Our formalism, especially for weights, will slightly differ.

Formalism

So, now, let’s walk through the formal mathematical notation involved in RNNs.

If an input or output neuron has a value at timestep t, we denote the vector as:
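Written in LaTeX, and assuming the same subscript-for-timestep convention as the x_1 and y_t used earlier in this article, that would simply be:

```latex
x_t, \qquad y_t
```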

For the hidden neurons it’s a bit different; since we can have multiple hidden layers, we denote the hidden state vector at timestep t and hidden layer ℓ as:
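In LaTeX, assuming the layer goes in the superscript and the timestep in the subscript (the same h^ℓ_t shape this article uses when it describes the weights), that would be:

```latex
h_t^{\ell}
```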

The input is obviously some preset values that we know. The outputs and hidden states are not; they are calculated.

Let’s start with hidden states. First, we’ll revisit the most complex recurrent net we came across earlier — the many to many architecture:

Many to many, non-synchronized.

This RNN has: sequential input, sequential output, multiple timesteps, and multiple hidden layers. The formula we derive for this RNN should generalize for all others.

First, let’s list out the possible functional dependencies for a given hidden state, based on the arrows and flow of information in the diagram:

An input

Hidden state at the previous timestep, same layer

Hidden state at the current timestep, previous layer

A hidden state can have two functional dependencies at max. Just by looking at the diagram, the only impossible combination is to be dependent on both the input and a hidden state at the current timestep but previous layer. This is because the only hidden states that are dependent on input exist in the first hidden layer, where no such previous layer exists.

If this is all difficult to follow, make sure once again to look at and trace back the arrows in the RNN that illustrate flow of information throughout the network.

Because of the impossible combination, we define two separate equations: an equation for the hidden state at hidden layer 1, and for layers after 1.
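Going by the plain-English reading given right after this, the two equations would look roughly like this in LaTeX. Treat it as a reconstruction from the surrounding prose, with f_W standing for the function ƒw:

```latex
\begin{aligned}
h_t^{1} &= f_W\left(h_{t-1}^{1},\; x_t\right) \\
h_t^{\ell} &= f_W\left(h_{t-1}^{\ell},\; h_t^{\ell-1}\right), \quad \ell > 1
\end{aligned}
```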

This probably looks a bit confusing; let me break it down for you. The function ƒw computes the numeric hidden state vector for timestep t and layer ℓ; it contains the “activation function” you’re used to hearing about with ANNs. W are the weights of the recurrent net, and thus ƒ is conditioned on W. We haven’t exactly defined ƒ just yet, but what’s important to note is the two parameters it takes. Once you do, this notation simply states what we have stated before in plain English:

Where ℓ = 1, the hidden state at time t and layer ℓ is a function of the hidden state vector at time t-1 and layer ℓ as well as the input vector at time t. Where ℓ > 1, this hidden state is a function of the hidden state vector at time t-1 and layer ℓ as well as the hidden state vector at time t, layer ℓ-1.

You might notice that we have a couple issues:

When t = 1 — that is, when each neuron is at the initial timestep — then no previous timestep exists. However, we still attempt to pass h_0 as a parameter to ƒw .

If no input exists at time t (thus, x_t does not exist), then we still attempt to pass x_t as a parameter.

Our respective solutions follow:

Define h_0 for any layer as 0

Consider x_t where no input exists at timestep t as 0

If these are 0, then the invalid functional dependency stops existing, and our formal notation still holds up.
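A quick way to convince yourself of this is the sketch below. It assumes a tanh-of-weighted-sum form for the hidden state (an assumption at this point in the article), uses toy weights, and invents the convention that `None` means “doesn’t exist”. The function name `hidden_state` is hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def hidden_state(h_prev, x_t, W_xh, W_hht, b_h):
    """Compute one hidden state; None means "this dependency doesn't exist"."""
    if h_prev is None:                 # t = 1: define h_0 as all zeros
        h_prev = [0.0] * len(W_hht[0])
    if x_t is None:                    # no input at this timestep: use zeros
        x_t = [0.0] * len(W_xh[0])
    pre = [a + b + c for a, b, c in
           zip(matvec(W_xh, x_t), matvec(W_hht, h_prev), b_h)]
    return [math.tanh(p) for p in pre]

# Toy weights: 3 input values -> 2 hidden values.
W_xh = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
W_hht = [[0.2, 0.0], [0.1, 0.3]]
b_h = [0.05, -0.05]

h_1 = hidden_state(None, [1.0, 2.0, 3.0], W_xh, W_hht, b_h)  # no h_0
h_2 = hidden_state(h_1, None, W_xh, W_hht, b_h)              # no x_2
```

A zero vector multiplied by any weight matrix contributes nothing to the sum, so `h_1` ends up depending only on the input and `h_2` only on `h_1`.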

We actually have five different types of weight matrices:

Pro-tip: The indices for each weight matrix tell you what they are used for in the recurrent net. W_xh maps an input vector x to a hidden state vector h. W_hht maps a hidden state vector h to another hidden state vector h along the time axis, ie. from h_t-1 to h_t. On the other hand, W_hhd maps a hidden state vector h to another hidden state vector h along the depth axis, ie. from h^(ℓ-1)_t to h^ℓ_t. W_hy maps a hidden state vector h to an output vector y.

Like with ANNs, we also learn and add a constant bias vector, denoted b_h, that can vertically shift what we pass to the activation function. We can also shift our outputs with b_y. More about bias units here.

For both b_h and W_hht/W_hhd, we actually have multiple weight matrices depending on the value of ℓ, as indicated by the superscript. This is because each hidden layer can have a different set of weights (the network would be extremely uninteresting if this weren’t the case), including the bias vector. However, inside a single hidden layer, all timesteps share the same weight matrix. This is important because the number of timesteps is a variable; we may train on sequences with up to 20 values, but in practice output sequences with up to 30 values, which is 10 extra timesteps. If each timestep had an independent weight to learn, those last 10 timesteps wouldn’t have anything to use. It would also mean that the number of parameters in the neural network grows linearly with the length of the input, leaving us with way too many parameters and a real risk of overfitting.

W_hy is just one matrix because only the final layer gives rise to the outputs denoted y. At the final hidden layer ℓ, we could suggest that W_hhd will not exist because W_hy will be in its place.

Now we’ll define the function ƒw:

The function is very similar to the ANN hidden function you’ve seen before; it applies the correct weights to the corresponding parameters, adds the bias, and passes this weighted sum through an activation or “squashing” function to introduce non-linearities. The key difference, though, is that this is not a weighted sum but rather a weighted sum vector; any W ⋅ h, along with the bias, will have the dimensions of a vector. The tanh function will thus simply output a vector where each value is the tanh of what it was in the inputted vector (sort of like an element-wise tanh). Remember, this contrasts ANNs because RNNs operate over vectors versus scalars.
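If it helps to see this as code, here is a minimal NumPy sketch of one application of ƒw, written from the description above rather than copied from any particular implementation. The function name `hidden_step`, the layer sizes, and the random weight ranges are all assumptions made just for illustration.

```python
import numpy as np

def hidden_step(h_prev, below, W_hht, W_below, b_h):
    """One application of f_w: the hidden state for timestep t, layer l.

    h_prev  -- this layer's hidden state at timestep t-1 (zeros when t = 1)
    below   -- x_t when l = 1, otherwise the layer l-1 hidden state at timestep t
    W_hht   -- hidden-to-hidden weights along the time axis
    W_below -- W_xh when l = 1, otherwise W_hhd
    b_h     -- bias vector for this layer
    """
    # Weighted sum vector plus bias, squashed element-wise by tanh.
    return np.tanh(W_hht @ h_prev + W_below @ below + b_h)

# Hypothetical sizes: 4-dimensional input, 3 hidden units, 2 hidden layers.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_xh = rng.uniform(-1, 1, (hidden_size, input_size))
W_hht = [rng.uniform(-1, 1, (hidden_size, hidden_size)) for _ in range(2)]
W_hhd = rng.uniform(-1, 1, (hidden_size, hidden_size))
b_h = [np.zeros(hidden_size) for _ in range(2)]

x_1 = np.array([1.0, 0.0, 0.0, 0.0])
h_0 = np.zeros(hidden_size)  # h_0 is defined as 0 for every layer

h1_layer1 = hidden_step(h_0, x_1, W_hht[0], W_xh, b_h[0])        # l = 1
h1_layer2 = hidden_step(h_0, h1_layer1, W_hht[1], W_hhd, b_h[1])  # l = 2
print(h1_layer1, h1_layer2)
```

Notice that the same function serves both cases: only the "input from below" and its weight matrix change between the first layer and the deeper ones.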

If you’ve followed our blog so far, you most likely know about two activation functions: sigmoid and ReLU. tanh is another such function. We mostly use the tanh function with RNNs. This is, I think, mostly because of its role in LSTMs (a variant of RNNs that is used more than plain RNNs; more on that later), the fact that it produces gradients with a greater range, and that its second derivative doesn’t die off as quickly.

Similar to sigmoid, the tanh function has two horizontal asymptotes and a smooth S-shape. The main difference is that the tanh function asymptotes at y = -1 instead of y = 0, intercepting the y-axis at y = 0 instead of y = 0.5. Thus, the tanh function has a greater range than the sigmoid.

If interested, the tanh equation follows (though I won’t walk you through it):
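The standard definition is tanh(x) = (e^x - e^-x) / (e^x + e^-x). Here is a quick sketch that writes it out by hand and compares it with sigmoid at a few points; the helper names are my own, and in real code you would just call `np.tanh`.

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-5.0, 0.0, 5.0])
print("tanh:   ", tanh(xs))     # approaches -1 and 1, crosses the y-axis at 0
print("sigmoid:", sigmoid(xs))  # approaches 0 and 1, crosses the y-axis at 0.5
```

The two are closely related: tanh(x) is the same as 2 · sigmoid(2x) - 1, which is why the curves have the same S-shape.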

The final equation is mapping a hidden state to an output.

This is one such possible equation. Depending on the context, we might also remove the bias vector, apply a non-linearity like sigmoid/softmax (for example if the output needs to be a probability distribution), etc.
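As a rough sketch of that output step, assuming the common form y = W_hy · h + b_y with an optional softmax on top (the function names and sizes below are hypothetical, not from the original equation):

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw scores into a probability distribution."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def output_step(h_top, W_hy, b_y, as_distribution=True):
    """Map the final hidden layer's state to an output vector y."""
    y = W_hy @ h_top + b_y
    return softmax(y) if as_distribution else y

# Hypothetical sizes: 3 hidden units, 4 possible output characters.
rng = np.random.default_rng(0)
hidden_size, vocab_size = 3, 4
W_hy = rng.uniform(-1, 1, (vocab_size, hidden_size))
b_y = np.zeros(vocab_size)
h_top = np.tanh(rng.uniform(-1, 1, hidden_size))

y_probs = output_step(h_top, W_hy, b_y)                       # sums to 1
y_raw = output_step(h_top, W_hy, b_y, as_distribution=False)  # raw scores
print(y_probs, y_raw)
```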

And that’s how we express recurrent nets, mathematically!

Quick note: Notation may and will differ between various lectures, research papers, articles, etc. For example, some research papers may start indexing at 0 instead of 1. More drastically, most RNN notation is much more general than mine to promote simplicity, ie. it doesn’t cover edge cases like I did, or it omits certain indices like ℓ on the hidden to hidden weight matrices. So, just keep in mind that specifics don’t always transfer over, and don’t let that confuse you. The reason I was meticulous about notation in this article is that I wanted to ensure you understood exactly how RNNs work, fueled by my frustration with the very same problem ~1.5 years ago.

An example? Okay!

Let’s take a look at a quick example of an RNN in action. I’m going to adapt a super dumbed down one from Andrej Karpathy’s Stanford CS231n RNN lecture, where a one to many “character level language model” single layer recurrent neural network needs to output “hello”. We’ll kick it off by giving the RNN the letter “h”, such that it needs to complete the word by outputting the other four letters.

Sidenote: this model is nicknamed “char-rnn”; remember it for later, when we get to code our own!

The neural network has the vocabulary: h, e, l, o. That is, it only knows these four characters; exactly enough to produce the word “hello”. We will input the first character, “h”, and from there expect the output at the following timesteps to be: “e”, “l”, “l”, and “o” respectively, to form:

hello

We can represent input and output via one hot encoding, where each character is a vector with a 1 at the corresponding character position and otherwise all 0s. For example, since our vocabulary is [h, e, l, o], we can represent characters using a vector with four values, where a 1 in the first, second, third, and fourth position would represent “h”, “e”, “l”, and “o” respectively.

This is called “one-hot encoding”, because only one of the values in the vector is equal to 1 and thus on (or “hot”).
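Here is a tiny sketch of one-hot encoding for our four-character vocabulary. The helper names `one_hot` and `decode` are made up for this example.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a vector with a 1 at the character's position and 0s elsewhere."""
    vec = np.zeros(len(vocab))
    vec[char_to_index[ch]] = 1.0
    return vec

def decode(vec):
    """Map a one-hot (or probability) vector back to its most likely character."""
    return vocab[int(np.argmax(vec))]

print(one_hot("h"))  # [1. 0. 0. 0.]
print(one_hot("l"))  # [0. 0. 1. 0.]
print(decode(one_hot("o")))  # o
```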

This is what we’d expect with a trained RNN:

As you can see, we input the first letter and the word is completed. We don’t know exactly what the hidden states will be — that’s why they’re hidden!

One interesting technique would be to sample the output at each timestep and feed it into the next as input:

When we “sample” from a distribution, we select a random character probabilistically following the distribution. For example, in the diagram above, the character with the highest likeliness is “e” at the first timestep’s output. Let’s say this likeliness is, concretely, 0.9. Now, when we sample into the next timestep’s input, there’s a 90% chance we select “e”; most of the time we will pick the most likely character, but not every time. This adds a level of randomness so you don’t end up in a loop where you keep sampling the same letter or sequence of letters over and over again.
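To make sampling concrete, here is a small sketch using NumPy's random generator. The probabilities are invented for illustration (0.9 on “e”, the rest spread over the other characters), and the seed is arbitrary.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
rng = np.random.default_rng(42)

# Hypothetical output distribution at the first timestep: "e" is most likely.
probs = np.array([0.03, 0.90, 0.04, 0.03])

def sample_char(probs, rng):
    """Pick a character at random, following the given distribution."""
    index = rng.choice(len(vocab), p=probs)
    return vocab[index]

# Always taking the most likely character (no randomness) would be:
greedy = vocab[int(np.argmax(probs))]

# Sampling many times picks "e" most of the time, but not every time.
samples = [sample_char(probs, rng) for _ in range(1000)]
share_e = samples.count("e") / len(samples)
print(greedy, share_e)
```

Over many draws the share of “e” should land close to 0.9, with the other characters showing up occasionally.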

As mentioned earlier, this is used pretty heavily with LCRNs. It’s even more effective than only relying on the memory of the RNN to output the correct letter at the future timesteps. In a sense, this makes the recurrent net many to many. (Though, not really, because we still only have one preset input.)

However, to be clear, this does not mean that the RNN can only rely on these sampled inputs. For example, at timestep 3 the input is “l” and the expected output is also “l”. However, at timestep 4, the input is again “l” but the output is now “o”, to complete the word. Memory is still needed to make a distinction like this.

In numerical form, it would look something like this:

Of course, we won’t get a one-hot vector output during prediction mode; rather, we will get a probability distribution over each letter (so we’d apply softmax to the output), and will sample from this distribution to get a single character output.

Each hidden state would contain a similar sort of vector, though not necessarily something we could interpret like we can for the output.

The RNN is saying: given “h”, “e” is most likely to be the next character. Given “he”, “l” is the next likely character. With “hel”, “l” should be next, and with “hell”, the final character should be “o”.

But, if the neural network wasn’t trained on the word “hello”, and thus didn’t have optimal weights (ie. just randomly initialized weights), then we’d have garble like “hleol” coming out.

One more important thing to note: start and end tokens. They signify when input begins and when output ends. For example, when the final character is outputted (“o”), we can sample this back as input and expect that the “<END>” token (however we choose to represent it; we could also use a period) will be outputted at the next timestep; this is the RNN telling us that it has completed the word and its processing as a whole. The use case isn’t as obvious in this fabricated example, because we know when “hello” has been completed, but consider a real-life scenario where we don’t: image captioning. In image captioning, the caption could be 1, 2, 3, or n words long, given a reasonable upper limit of n. The end token tells us when the caption has been completed, so we can halt the RNN and exit the prediction loop (which would otherwise keep going forever if we were using a while loop, or only stop once the preset upper limit of n is reached).

Start tokens are used more for generating content completely from scratch. For example, imagine an RNN read and learned from a bunch of Shakespeare. (This is an actual funny application of character level language models that Karpathy implemented, and we’ll see it in action in a later section.) Now, based on what the RNN learned, we want it to create a brand new Shakespearean sonnet! Feeding in a “<START>” token enables it to kick this process off and begin writing without us giving the network some arbitrary pre-determined initial word or character.

I’ve also noticed that another potential use case of start tokens is when we have some other sort of initial input, like CNN produced image data with image captioning, that doesn’t “fit” what we’ll normally use for input at timesteps after t=1 (the word outputted at the previous timestep via sampling). As a result, we feed this data directly to the first hidden state and set the input as “<START>” instead.

Now, just to be clear, the RNN doesn’t magically output these end tokens and recognize the start tokens. We have to add them, along with start tokens, to the training data and vocabulary such that they can be outputted by the recurrent net during prediction time.
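Here is a sketch of what the prediction loop with start and end tokens might look like. To keep it self-contained, `fake_trained_rnn` is a made-up stand-in for a trained network: it just returns a distribution that strongly favors the next character of “hello”. The token names, the 0.97 probability, and `MAX_LENGTH` are all assumptions.

```python
import numpy as np

vocab = ["<START>", "h", "e", "l", "o", "<END>"]
MAX_LENGTH = 20  # preset upper limit so the loop can never run forever

def fake_trained_rnn(history):
    """Stand-in for a trained RNN: returns next-token probabilities.

    Puts 0.97 on the token that continues "hello" and spreads the remaining
    0.03 over the other tokens (never predicting "<START>").
    """
    target = ["h", "e", "l", "l", "o", "<END>"]
    step = min(len(history) - 1, len(target) - 1)
    probs = np.full(len(vocab), 0.03 / (len(vocab) - 2))
    probs[vocab.index("<START>")] = 0.0
    probs[vocab.index(target[step])] = 0.97
    return probs

def generate(rng):
    history = ["<START>"]  # kick things off without a hand-picked character
    while len(history) - 1 < MAX_LENGTH:
        probs = fake_trained_rnn(history)
        token = vocab[rng.choice(len(vocab), p=probs)]
        if token == "<END>":  # the network says it is done
            break
        history.append(token)  # the sampled output becomes the next input
    return history[1:]         # drop the start token

tokens = generate(np.random.default_rng(0))
print("".join(tokens))
```

The loop has two exits, matching the text: the end token, or the preset maximum length as a safety net.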

This is how we can get RNNs to “write”! More on some examples of text RNNs have actually generated, Shakespeare most certainly included, in a later section.

Training (or, why vanilla RNNs suck.)

For a recurrent net to be useful, it needs to learn proper weights via training. That’s no surprise.

Recall this snippet from earlier:

But, if the neural network wasn’t trained on the word “hello”, and thus didn’t have optimal weights (ie. just randomly initialized weights), then we’d have garble like “hleol” coming out.

This is, of course, because we initialize the W weights randomly at first, so random stuff will come out.

But, through multiple iterations of training with a first-order optimization algorithm like gradient descent, we perturb the weights such that the probability of each correct character being outputted at their respective timestep increases. The actual output would be “hello” in one-hot encoding form, and we’d compute the discrepancy between this output and what the recurrent net predicts (we’d get the error at each timestep and then add this up) as the total error to then calculate the gradient/update value.
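As a sketch of that "add up the error at each timestep" idea, here is one common choice of discrepancy, cross-entropy. That choice is my assumption for the example, as are the prediction values; the point is just that the total error is a sum of per-timestep errors.

```python
import numpy as np

def sequence_loss(predicted, targets):
    """Total error = sum of the per-timestep cross-entropy errors.

    predicted -- array of shape (T, V); each row is a probability distribution
    targets   -- array of shape (T, V); each row is a one-hot target
    """
    eps = 1e-12  # avoids log(0) if a prediction is exactly zero
    per_step = -np.sum(targets * np.log(predicted + eps), axis=1)
    return per_step.sum(), per_step

# Targets for the vocabulary [h, e, l, o]: "e", "l", "l", "o".
targets = np.eye(4)[[1, 2, 2, 3]]

# A hypothetical well-trained net: 0.91 on the right character each step.
confident = np.where(targets == 1, 0.91, 0.03)
# A hypothetical clueless net: every character equally likely.
uniform = np.full((4, 4), 0.25)

total_confident, _ = sequence_loss(confident, targets)
total_uniform, per_step_uniform = sequence_loss(uniform, targets)
print(total_confident, total_uniform)
```

Training pushes the total down by raising the probability of the correct character at every timestep.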

So, each output contributes to the error somehow. If the error is a sum over the outputs, then, if we had something like Y outputs, we’d need to backpropagate each of them individually and add the resulting gradients up. This is because the derivative of a sum is the sum of the derivatives:

For any arbitrary weight W.

But, you should know that, with artificial neural networks, calculating these gradients isn’t that easy. We have so many weights contributing to the output, and thus need to figure out exactly how much these weights contribute, and by how much we modify them to decrease overall error. To do this, we use the backpropagation algorithm; this algorithm propagates the error between the predicted output of a recurrent net and the actual output in the dataset all the way back to the beginning of the network. Using the chain rule from differential calculus, backprop helps us calculate the gradients of the output error w.r.t. each individual weight (sort of like the error of each individual weight).

Once we have those gradients, we have to use an optimization algorithm to calculate the update values and make the updates. We can use the vanilla gradient descent algorithm to do this, but there are many other possible, better variants as well; learn about them by reading this article, if you want. (I think we’re long overdue for our own mega-post on optimization!)

Backpropagation with RNNs is called “Backpropagation Through Time” (BPTT for short), since it operates on sequences in time. But don’t be fooled: there’s not much difference between normal backprop and BPTT; when it comes down to it, BPTT is just backprop, but on RNNs! Remember that when you “unroll” an RNN, it essentially becomes a feedforward network; not an ANN, but a feedforward network in the sense that we can visualize where all the information is flowing and observe the activations at each neuron and timestep, all the way from the input to the final output. Like ANNs, RNNs have functional dependencies that link the entire network together; it’s just that RNNs operate over vectors instead (yay for matrix calculus?) and extend in depth as well as time. There’s more work to do to compute the gradients, but it’s no surprise that backprop works pretty much the same way for recurrent nets that it would for normal ones. Because of this, I’m not going to walk through all the math and show the derivatives etc. Read our backprop mega-post for all that jazz.

One thing to note is that, since we have multiple timesteps in our RNN, each timestep in a single layer will want to change the weight in a different way and have different gradients. However, remember that each hidden layer uses only one weight matrix because the number of timesteps is a variable. Thus, we just average or sum the weight updates between these timesteps and apply this as an update to the W_hh for that entire layer. Also, a general practice is to train on shorter sequences first and then gradually increase sequence size as we train on more and more data.
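To show the "one shared weight matrix, many timesteps" bookkeeping, here is a small BPTT sketch for a single-layer tanh RNN. To keep it short I made two simplifying assumptions: the error is a squared error placed directly on the hidden states (no output layer), and the per-timestep gradients are summed rather than averaged. All names and sizes are hypothetical.

```python
import numpy as np

def forward(W_xh, W_hh, xs):
    """Unroll the RNN and keep every hidden state (hs[0] is h_0 = 0)."""
    hs = [np.zeros(W_hh.shape[0])]
    for x in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
    return hs

def loss(hs, ys):
    """Toy total error: squared error on the hidden state, summed over time."""
    return sum(0.5 * np.sum((h - y) ** 2) for h, y in zip(hs[1:], ys))

def bptt_grad_W_hh(W_hh, hs, ys):
    """Walk backwards through time, summing each timestep's gradient for W_hh."""
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(W_hh.shape[0])   # error arriving from later timesteps
    for t in range(len(ys), 0, -1):
        dh = (hs[t] - ys[t - 1]) + dh_next   # local error + error from the future
        dz = dh * (1.0 - hs[t] ** 2)         # back through tanh
        dW_hh += np.outer(dz, hs[t - 1])     # this timestep's share for W_hh
        dh_next = W_hh.T @ dz                # pass the error to timestep t-1
    return dW_hh

rng = np.random.default_rng(0)
hidden_size, input_size, T = 3, 4, 5
W_xh = rng.uniform(-1, 1, (hidden_size, input_size))
W_hh = rng.uniform(-1, 1, (hidden_size, hidden_size))
xs = [rng.uniform(-1, 1, input_size) for _ in range(T)]
ys = [rng.uniform(-1, 1, hidden_size) for _ in range(T)]

grad = bptt_grad_W_hh(W_hh, forward(W_xh, W_hh, xs), ys)

# Sanity check one entry against a finite-difference estimate.
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (loss(forward(W_xh, W_plus, xs), ys)
           - loss(forward(W_xh, W_minus, xs), ys)) / (2 * eps)
print(grad[0, 1], numeric)
```

The line `dW_hh += ...` is the whole point: every timestep proposes its own change, and they all land on the same matrix.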

Now, if you haven’t already, make sure to read this article that I wrote on vanishing and exploding gradients before proceeding:

You may be thinking: how does this issue apply to RNNs? Well, RNNs are very deep models; on top of often having multiple hidden layers, each hidden layer in practice can have hundreds of timesteps. That’s like an ANN with hundreds of entire hidden layers! That’s deep. (Well, it’s more long because we’re dealing with the time axis here, but you know what I mean.) Like sigmoid derivatives, tanh derivatives never exceed 1 and shrink toward 0 for large inputs, so the problem of vanishing gradients is even more drastic with RNNs than with ANNs, and training them becomes almost impossible.

Imagine trying to propagate the error to the 1st timestep in an RNN with k timesteps. The derivative would look something like this:

With a tanh activation function, that’s freaking crazy. Then, for getting the derivative of the error with respect to a weight matrix W_hh, we’d add — or, as mentioned before, we could average as well — each of these hidden state error gradients, then multiplied by the derivative of the hidden state with respect to the weight, such that we can backprop from the error to the weight:

Assuming our sequence is of length k.

So we’d be effectively adding together a bunch of terms that have vanished — the exception being very late gradients with a small number of terms — and so dJ/dWhh would only capture gradient signals from the last few timesteps. (Or, for exploding gradients, it would become infinity).
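You can watch this happen numerically. The sketch below multiplies together the per-timestep factors d h_t / d h_(t-1) = diag(1 - h_t²) · W_hh for a tanh RNN and tracks how big the product is. The small weight range (-0.1 to 0.1), the sizes, and the random inputs are assumptions I picked so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, k = 10, 50

# Hypothetical small random weights.
W_hh = rng.uniform(-0.1, 0.1, (hidden_size, hidden_size))
W_xh = rng.uniform(-0.1, 0.1, (hidden_size, hidden_size))

h = np.zeros(hidden_size)
product = np.eye(hidden_size)  # running product of the per-timestep factors
norms = []
for t in range(k):
    x = rng.uniform(-1, 1, hidden_size)
    h = np.tanh(W_hh @ h + W_xh @ x)
    jacobian = np.diag(1.0 - h ** 2) @ W_hh  # d h_t / d h_(t-1)
    product = jacobian @ product             # d h_t / d h_0 so far
    norms.append(np.linalg.norm(product))

print("after 1 step:  ", norms[0])
print("after 10 steps:", norms[9])
print("after 50 steps:", norms[-1])
```

Each extra timestep multiplies in another factor, so the size of the product shrinks roughly exponentially with the distance the error has to travel.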

But, you might be asking, instead of tanh (which is bounded between -1 and 1, and has a similar problem to sigmoid in that its derivative never exceeds 1 and is much smaller than 1 for most inputs), why don’t we just use ReLUs? Don’t ReLUs, or perhaps leaky ReLUs, solve the vanishing gradient problem?

Well, not entirely; it’s not enough to solve the problem. With RNNs, the problem really lies in the architecture. Even though we could use ReLU to ensure many of the values in the gradient computation are not between -1 and 1 such that they vanish (or, vice versa, explode), we do still have a lot of variables other than the activation function derivative in the gradient computation, such as the weights; you can revisit the mega-post on backprop we wrote to confirm this. Since weights are also normally randomly initialized in the range -1 to 1, and RNNs are like super deep ANNs, these weights keep multiplying on top of each other and potentially cause the gradients to vanish.

This is more my suspicion though — I’m yet to confirm this is the case by testing. I was curious so I asked this exact question on Quora:

From this, something interesting I learned is that, since ReLUs are unbounded (they’re not restricted to be between -1 and 1 or 0 and 1) unlike sigmoid/tanh, and RNNs are very deep, the activations, especially later ones, can become too big. This is because hidden states have a multiplicative relationship; one hidden state is a multiple of the previous ones, where that multiple specifically is a weight. If we use ReLU, then the hidden state isn’t limited by any range, and we could have a bunch of numbers bigger than 1 multiplying by each other.
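Here is a deliberately simple sketch of that effect. I assume a recurrent weight matrix of 1.5 times the identity (so each hidden unit just multiplies itself by 1.5 every timestep) and a constant small input; neither is realistic, but they make the multiplication easy to follow.

```python
import numpy as np

hidden_size, steps = 3, 50
W_hh = 1.5 * np.eye(hidden_size)  # hypothetical weights a bit bigger than 1
x = np.full(hidden_size, 0.1)     # the same small input at every timestep

h_relu = np.zeros(hidden_size)
h_tanh = np.zeros(hidden_size)
for t in range(steps):
    h_relu = np.maximum(0.0, W_hh @ h_relu + x)  # unbounded: keeps growing
    h_tanh = np.tanh(W_hh @ h_tanh + x)          # squashed back into (-1, 1)

print("ReLU hidden state:", h_relu)
print("tanh hidden state:", h_tanh)
```

With ReLU, nothing stops the 1.5 from compounding over 50 timesteps; with tanh, the hidden state can never leave the range -1 to 1 no matter how long the sequence gets.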

It ends up being sort of like the exploding gradient problem, but with the values inside the neurons, not gradients. This is also what then causes the gradients to explode: large activations → large gradients → large change in weights → even bigger activations, because updating the weights in the wrong direction ever so slightly can cause the entire network to explode. This makes learning unstable:

This means that the computation within the RNN can potentially blow up to infinity without sensible weights. This makes learning VERY unstable because a slight shift in the weights in the wrong direction during backprop can blow up the activations during the forward pass. So that’s why you see most people using sigmoid/tanh units, despite the vanishing gradient descent problem.

Also well said:

With RNN’s, the problem is that you are repeatedly applying your RNN to itself, which tends to [mostly] cause exponential blowup or [rarely, but sometimes] shrinkage.

Other issues with ReLU functions are discussed in the article I wrote, and they similarly apply to RNNs. Generally speaking, though, they just don’t work that well, especially compared to other options we have. Making RNNs perform well with ReLUs is actually a pretty hot topic of research right now, but until someone figures out something genius, RNNs are a lost cause.

And that’s why vanilla RNNs suck. Seriously. In practice, nobody uses them. Even if you didn’t fully grasp this section on how the vanishing and exploding gradient/activation problem is applicable to them, it doesn’t matter anyways. Because, everything you’ve read up to this point so far… throw it all away. Forget about it.

Just kidding. Don’t do that.

Fixing the problem with LSTMs (Part I)

You shouldn’t do that because RNNs actually aren’t a lost cause. They’re far from it. We just need to make a few… modifications.

Enter the LSTM.

Makes sense, no?

How about this?

OK. Clearly something’s not registering here. But that’s fine; LSTM diagrams are frikin’ difficult for beginners to grasp. I too remember when I first searched up “LSTM” on Google to encounter something similar to the works of art above. I reacted like this:

MRW first Google Image-ing LSTMs.

In this section, I’m going to embark on a mission to design the first simple, comprehensible, and beautiful LSTM diagram. Wish me luck, because I’ll probably fail.

With that being said, let’s dive into Long Short-Term Memory networks. (Yes, that’s what LSTM stands for.)