I am not the first person to write a blog post on LSTMs. It is a well-known technique, widely used to model temporal data. And yet, many of us struggle to get the intuition. That is why I am writing this . . . to simplify the mathematical intuition as much as possible. Twenty-one years ago, Sepp Hochreiter, in Munich, would not have imagined how the idea would resurrect: a 21-year-old concept becoming relevant again with the advances in Deep Learning post 2009.

Let me tell you about my rendezvous with LSTMs. Back in 2015, I used to work in ad targeting, which means showing you ads based on your preferences. We learn these preferences from data stored in your browser cookies by the clients (e.g. an online shopping website). This can be the items in your Amazon cart (just an example). Without going into the details, the model we used for consumer behavior was not temporal in nature. But there is an obvious temporal locality in our browsing patterns. For example, the items you browse on Amazon or Flipkart likely form a sequence, and that sequence can be a reflection of your brain's internal sequence of preferences. Given the nature of the problem, I found it pretty intuitive to use LSTMs. And this, my friend, led to a 7% increase in click-through rate (CTR) for campaigns. At that moment, I felt so proud to see that my habit of binge-reading AI blogs and research papers had finally paid off.

Now let's take a deep breath and roll our eyeballs over to the next section, which talks about the internals of LSTM.

Simple Recurrent Neural Networks were out there much before LSTMs (1986). The idea was to introduce a feedback structure for context modelling, using fixed-weight feedback connections. If you look at it from a deep learning perspective, you are just creating connections between the current feature map (the intermediate outputs, as in a CNN) and its past self. It is like viewing your old photographs and learning from your own younger version.
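To make the feedback idea concrete, here is a minimal sketch of a single simple-RNN step in NumPy. The weight names (Wx, Wh, b) and the toy sizes are my own choices for illustration, not from the original post.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One step of a simple (Elman-style) recurrent net: the new hidden state
    mixes the current input with the network's own previous hidden state."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

# Toy sizes: 4-dim input, 3-dim hidden state.
rng = np.random.default_rng(0)
Wx, Wh, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):      # a short sequence of 5 inputs
    h = rnn_step(x_t, h, Wx, Wh, b)      # the hidden state carries context forward
```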

But problems like exploding and vanishing gradients made it a living nightmare to train such systems on complex time-series problems. The other problem was that the recurrent structures could capture either short-term temporal structure or long-term structure, but not both simultaneously.
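A rough way to see the gradient problem: backpropagating through time multiplies the gradient by the recurrent weight at every step, so it either shrinks towards zero or blows up. The scalar weight below is purely a toy illustration.

```python
# Repeated multiplication by the recurrent weight over 50 timesteps:
# it vanishes for w < 1 and explodes for w > 1.
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(50):
        grad *= w
    print(f"w = {w}: gradient factor after 50 steps ~ {grad:.3e}")
```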


In 1997, Sepp's work addressed these problems of simple recurrent networks by introducing gating and memory cells. An interesting analogy can be drawn between the memory cells in an LSTM and the memory element in your calculator, which stores the numbers you ask it to.

Now, what's a gate? Let us say my hidden unit and memory cell are 3x1 vectors. Then a gate is a bunch of 3 sigmoids. Gates are like those customs officers at airports, controlling the entry and exit of goods, a.k.a. information. By introducing cells into recurrent structures, we now have dedicated elements in the neural network to remember things, with better control over what to remember (forget gate), how to remember (input gate) and what to give out (output gate). Earlier it was just a bunch of weights/weight matrices between the hidden features and their own past.

The brains behind the gate

Each of these sigmoid activations is like a puppet controlled by a master manipulator/puppeteer: the current input and the previous hidden units. All three gates are a weighted combo of these puppeteers.
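Here is a minimal sketch of the three gates, using the 3-dimensional hidden/cell size from the example above; the parameter names (Wf, Uf, bf, ...) are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, p):
    """Each gate is a sigmoid over a weighted combo of the current input and the
    previous hidden state -- the two 'puppeteers'."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    return f, i, o

# Toy sizes matching the example: 4-dim input, 3-dim hidden/cell state.
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(3, 4)) for k in ("Wf", "Wi", "Wo")}
p.update({k: rng.normal(size=(3, 3)) for k in ("Uf", "Ui", "Uo")})
p.update({k: np.zeros(3) for k in ("bf", "bi", "bo")})
f, i, o = lstm_gates(rng.normal(size=4), np.zeros(3), p)
```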

The cell state is updated by forgetting things from the past that are no longer relevant, based on the forget gate. We then apply a tanh activation over a weighted combo of the same puppeteers to get the cell's prospective additions/subtractions.

If you observe the equation, we are using a tanh activation to obtain the candidate updates for the cell. This ensures that the updates can be both positive and negative, and voila, there we have our updates. On the contrary, the sigmoid's range is between 0 and 1.

We then screen these updates using the input gate (i.e. by multiplying the sigmoid with the updates) and add them to the main memory cell.
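A small sketch of this cell update, again with assumed parameter names (Wc, Uc, bc) for the candidate:

```python
import numpy as np

def candidate_and_cell(x_t, h_prev, c_prev, f, i, Wc, Uc, bc):
    """tanh gives a candidate that can be positive or negative (unlike a sigmoid);
    the input gate screens it before it is added to what the forget gate kept."""
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)   # prospective additions/subtractions
    c_t = f * c_prev + i * c_tilde                   # elementwise gating, then add
    return c_t
```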

Now, everything that happens inside the house should not go public, right? That is why we have the output gate, which tells us what part of tanh(c_t) should go out (to h_t).
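In the same sketch, the output step is a one-liner:

```python
import numpy as np

def hidden_output(c_t, o):
    """Only the part of tanh(c_t) that the output gate lets through goes public as h_t."""
    return o * np.tanh(c_t)
```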

Gated Recurrent Unit (GRU)

Now, GRU tries to undo what LSTM did, without losing the benefits. It gets rid of the separate memory cell and keeps the gating mechanism to update the recurrent component. The forget and input gates are combined in a complementary fashion to remove unnecessary complexity from the model (roughly a quarter fewer recurrent parameters: three sets of weights instead of four). Let's say a GRU has 10 hidden units. If you choose to keep 2 of them long term (by forgetting the remaining 8), then you are implicitly choosing to update those remaining units with new information (additions/subtractions) coming from h tilde. This coupled input-and-forget gate is renamed the update gate.
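A minimal sketch of one GRU step, in the same style as above. The parameter names are again my own assumptions, and I use the convention where the update gate z decides how much of the old hidden state to keep.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step. The update gate z plays the coupled forget/input role: whatever
    fraction of a unit is kept from h_prev is, complementarily, not overwritten by
    the new candidate h_tilde."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])              # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate
    return z * h_prev + (1.0 - z) * h_tilde                              # keep vs. update
```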

So that's it. I hope you now have the intuition behind LSTM and GRU.

I highly recommend reading the blog post by Chris Olah on LSTMs, if you haven't already. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

I have also published a video on YouTube explaining different variants of recurrent structures, so check it out :)

https://youtu.be/yM2wqxhOb74?t=219