How to Visualize Your Recurrent Neural Network with Attention in Keras

A technical discussion and tutorial

Neural networks are taking over every part of our lives. In particular — thanks to deep learning — Siri can fetch you a taxi using your voice; and Google can enhance and organize your photos automagically. Here at Datalogue, we use deep learning to structurally and semantically understand data, allowing us to prepare it for use automatically.

Neural networks are massively successful in the domain of computer vision. Specifically, convolutional neural networks (CNNs) take images and extract relevant features from them by using small windows that travel over the image. This understanding can be leveraged to identify objects from your camera (Google Lens) and, in the future, even drive your car (NVIDIA).

The analogous neural network for text data is the recurrent neural network (RNN). This kind of network is designed for sequential data and applies the same function to the words or characters of the text. These models are successful in translation (Google Translate), speech recognition (Cortana) and language generation.

Dr. Yoshua Bengio from Université de Montréal believes that language is going to be the next challenge for neural networks. At Datalogue, we deal with tons of text data, and we are interested in helping the community solve this challenge.

In this tutorial, we will write an RNN in Keras that can translate human dates (“November 5, 2016”, “5th November 2016”) into a standard format (“2016-11-05”). In particular, we want to gain some intuition into how the neural network does this. We will leverage the concept of attention to generate a map (like that shown in Figure 1) that shows which input characters were important in predicting the output characters.

Figure 1: Attention map for the freeform date “5 Jan 2016”. We can see that the neural network used “16” to decide that the year was 2016, “Ja” to decide that the month was 01 and the first bit of the date to decide the day of the month.
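To make the task concrete, here is a minimal sketch of what the training pairs look like. These particular examples are hand-written for illustration and are not the actual generated dataset:

```python
# Hand-written (input, target) pairs illustrating the task: freeform
# dates on the left, the standardized YYYY-MM-DD format on the right.
pairs = [
    ("November 5, 2016", "2016-11-05"),
    ("5th November 2016", "2016-11-05"),
    ("5 Jan 2016", "2016-01-05"),
]
```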

What is in this Tutorial

We start off with some technical background material to get everyone on the same page and then move on to programming this thing! Throughout the tutorial, I will provide links to more advanced content.


What you need to know

To be able to jump into the coding part of this tutorial, it would be best if you have some familiarity with Python and Keras. You should also have some familiarity with linear algebra, in particular the idea that a neural network is just a few weight matrices with some non-linearities applied to them.

The intuition of RNNs and seq2seq models will be explained below.

Recurrent Neural Networks (RNN)

An RNN is a function that applies the same transformation (known as the RNN cell or step) to every element of a sequence. The output of an RNN layer is the output of the RNN cell applied to each element of the sequence. In the case of text, these elements are usually successive words or characters. In addition to this, RNN cells hold an internal memory that summarizes the history of the sequence seen so far.

Figure 2: An example of an RNN Layer. The RNN Cell is applied to each element x_i in the original sequence to get the corresponding element h_i in the output sequence. The RNN cell uses a previous hidden state (propagated between the RNN cells) and the sequence element x_i to do its calculations.
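To make the cell concrete, here is a minimal NumPy sketch of one common parameterization, the simple (Elman) RNN cell. The weights W, U and b are shared across all timesteps, which is exactly what “applying the same transformation” means. This is an illustration, not the gated cell we will actually use:

```python
import numpy as np

def rnn_layer(x_seq, W, U, b, h0):
    """Apply the same RNN cell to every element of a sequence.

    x_seq: array of shape (timesteps, input_dim)
    W, U, b: weights shared across all timesteps
    h0: initial hidden state of shape (hidden_dim,)
    """
    h = h0
    outputs = []
    for x_t in x_seq:
        # The cell combines the current input with the previous hidden
        # state, the "memory" propagated between cells in Figure 2.
        h = np.tanh(W @ x_t + U @ h + b)
        outputs.append(h)
    return np.stack(outputs)  # the output sequence h_1 ... h_T
```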

The output of the RNN layer is an encoded sequence, h, that can be manipulated and passed into another network. RNNs are incredibly flexible in their inputs and outputs:

1. Many-to-One: use the complete input sequence to make a single prediction h.
2. One-to-Many: transform a single input to generate a sequence h.
3. Many-to-Many: transform the entire input sequence into another sequence.
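In Keras, the difference between many-to-many and many-to-one is controlled by the return_sequences flag on a recurrent layer. A minimal sketch (the layer sizes here are arbitrary, and I am assuming the standalone Keras API this tutorial uses):

```python
from keras.layers import Input, SimpleRNN

x = Input(shape=(10, 8))  # a sequence of 10 steps with 8 features each

# Many-to-Many: return the cell output at every timestep -> (batch, 10, 32)
h_seq = SimpleRNN(32, return_sequences=True)(x)

# Many-to-One: return only the final output -> (batch, 32)
h_last = SimpleRNN(32)(x)
```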

In theory, the sequences in our training data need not be of the same length. In practice, we pad or truncate them to a common length to take advantage of the static computational graph in TensorFlow.
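Keras ships a helper for exactly this padding and truncation; a minimal sketch, assuming the inputs have already been encoded as integer IDs (the IDs below are made up):

```python
from keras.preprocessing.sequence import pad_sequences

# Three encoded inputs of different lengths.
encoded = [[4, 8, 15], [16, 23], [42, 4, 8, 15, 16]]

# Pad (or truncate) everything to a common length of 4.
padded = pad_sequences(encoded, maxlen=4, padding='post', truncating='post')
# => [[ 4  8 15  0]
#     [16 23  0  0]
#     [42  4  8 15]]
```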

We will focus on #3, “Many-to-Many”, also known as sequence-to-sequence (seq2seq).

Long sequences can be difficult for RNNs to learn from due to instabilities in gradient calculations during training (the vanishing and exploding gradient problems). To solve this, the simple RNN cell is replaced by a gated cell such as the gated recurrent unit (GRU) or the long short-term memory (LSTM) cell. To learn more about LSTMs and gated units, I highly recommend reading Christopher Olah’s blog (it’s where I first started understanding the RNN cell). From now on, whenever we talk about RNNs, we are talking about gated cells; since Olah’s blog gives an intuitive introduction to LSTMs, that is the cell we will use.
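In Keras, swapping the simple cell for a gated one is a one-line change; a minimal sketch, mirroring the earlier example:

```python
from keras.layers import Input, LSTM, GRU

x = Input(shape=(10, 8))

# Drop-in replacements for SimpleRNN in the earlier sketch:
h_lstm = LSTM(32, return_sequences=True)(x)
h_gru = GRU(32, return_sequences=True)(x)
```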

A General Framework for seq2seq: The Encoder-Decoder Setup

Almost all neural network approaches to solving the seq2seq problem involve:

1. Encoding the input sequence into some abstract representation.
2. Manipulating this encoding.
3. Decoding it into our target sequence.

Our encoders and decoders can be any kind and combination of neural networks. In practice, most people use RNNs for both the encoders and decoders.

Figure 3: Set up of the encoder-decoder architecture. The encoder network processes the input sequence into an encoded sequence which is subsequently used by the decoder network to produce the output.

Figure 3 shows a simple encoder-decoder setup. The encoding step usually produces a sequence of vectors, h, corresponding to the sequence of characters in the input date, x. In an RNN encoder, each vector is generated by integrating information from the past sequence of vectors.
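As a concrete (if simplified) sketch of Figure 3 in Keras: the encoder below produces the encoded sequence h, the “manipulation” is simply compressing it to its final vector and repeating it, and the decoder unrolls that into the output sequence. The sizes Tx, Ty and n_chars are hypothetical, and this plain setup is what we will later improve with attention:

```python
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

Tx, Ty, n_chars = 30, 10, 64  # hypothetical sequence lengths and vocab size

inputs = Input(shape=(Tx, n_chars))  # one-hot encoded input characters

# Encoder: produces the encoded sequence h (one vector per input character)
h = LSTM(128, return_sequences=True)(inputs)

# Manipulation: here, keep only the final encoding and repeat it Ty times
context = RepeatVector(Ty)(LSTM(128)(h))

# Decoder: unroll the repeated context into Ty output characters
decoded = LSTM(128, return_sequences=True)(context)
outputs = TimeDistributed(Dense(n_chars, activation='softmax'))(decoded)

model = Model(inputs, outputs)
```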