Hello!

While working on some more bombastic demos, we decided to do a straight-up comparison between LSTM/GRU based recurrent neural networks and our fast online learning library, EOgmaNeo.

We will compare RNNs and EOgmaNeo on a simple sequence task: Sequence copying.

I know, I know, lame! But, (perhaps selfishly) I think the results are interesting regardless.

The sequence copying problem consists of a sequence of symbols which must be recalled exactly after the sequence ends. This means it requires the network to have memory, as it must remember all the symbols in the string in order to be able to recall them. The task gets harder as the sequence length grows and as the number of symbols grows.

The reason for using this task is that it is simple to implement and compare with, while testing memory capabilities. While it is not indicative of real-world tasks, it provides at least some insight into the advantages of the algorithms.
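To make the setup concrete, here is a minimal sketch of how such problem instances can be generated. `make_copy_instance` is our own illustrative helper, not part of any of the libraries being compared:

```python
import random

def make_copy_instance(num_symbols=4, seq_len=4, rng=random):
    """Generate one copy-task instance: the target is an exact
    copy of the input sequence."""
    seq = [rng.randrange(num_symbols) for _ in range(seq_len)]
    return seq, list(seq)  # (input sequence, expected recall)

inp, target = make_copy_instance(num_symbols=4, seq_len=4)
assert inp == target  # perfect recall is the goal
```

The network sees the input one symbol per timestep, and is then scored on how many symbols of the recall it reproduces exactly.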

A little information on the algorithms:

LSTM – The LSTM network used in these comparisons will be implemented in Chainer. It will consist of one layer of LSTM cells followed by a softmax layer.

GRU – The GRU-based network used in these comparisons will also be implemented in Chainer, and will also consist of one layer of cells (GRU cells).

EOgmaNeo – This is the library implementing the latest version of our sparse predictive hierarchy technology. Unlike the previous techniques, it is not based on backpropagation, but rather on local expectation-maximization algorithms and sparse distributed representations (SDRs/sparse codes).

EOgmaNeo handles recurrence very differently from LSTM and GRU: it uses a concept we call “Exponential Memory”, a very particular kind of clockwork RNN that uses a bidirectional hierarchy to map between different timescales. It gets its name from its memory horizon, which is fixed, but grows exponentially with the number of layers in the hierarchy.

Finally, it is a fully online learning algorithm, meaning it does not require stochastic sampling to decorrelate incoming samples through time. The hierarchy used in this comparison will have a layer count appropriate for the length of the sequences in question, and will have 6.25% sparsity (ratio of active binary units).
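As a rough, back-of-the-envelope illustration of the two numbers above (this is not EOgmaNeo's actual internals, just the scaling intuition): in a clockwork-style hierarchy where each layer summarizes a fixed window of the layer below, the horizon of the top layer grows exponentially with depth, and a one-active-unit-per-chunk code gives the stated sparsity:

```python
def memory_horizon(num_layers, window=2):
    """Effective temporal horizon of a clockwork-style hierarchy:
    each layer summarizes `window` steps of the layer below, so the
    horizon grows exponentially with depth (illustrative only)."""
    return window ** num_layers

# With 4 layers and a window of 2, the top layer spans 16 timesteps,
# comfortably covering a length-8 copy instance.
assert memory_horizon(4, window=2) == 16

# 6.25% sparsity corresponds to 1 active unit per 16-unit chunk.
assert 1 / 16 == 0.0625
```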

For the symbols, we will use digits 0-3 (4 symbols). The reason for this seemingly arbitrary number is that 4 is a square number, and EOgmaNeo has a square number of bits in each input chunk. While it is not necessary to do this (we could use 10 symbols and pad a 4^2 = 16-bit chunk with 6 unused ones), we might as well. Therefore, the variant of the problem using sequences of length 4 and 4 symbols would look like this:

3202 → 3202

1231 → 1231

1122 → 1122

3121 → 3121

…

Each symbol will be encoded as a one-hot vector (all 0’s except at the index of the symbol, which is 1). The network in question must therefore take a symbol as input each timestep, and produce a prediction of the next timestep’s symbol.
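The encoding and its inverse can be sketched in a few lines (these helpers are our own illustration, independent of either library's actual input format):

```python
def one_hot(symbol, num_symbols=4):
    """Encode a symbol index as a one-hot vector."""
    vec = [0.0] * num_symbols
    vec[symbol] = 1.0
    return vec

def decode(vec):
    """Recover a symbol index from a (possibly soft) prediction
    by taking the argmax."""
    return max(range(len(vec)), key=lambda i: vec[i])

assert one_hot(2) == [0.0, 0.0, 1.0, 0.0]
assert decode([0.1, 0.7, 0.1, 0.1]) == 1  # soft outputs decode fine too
```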

To test generalization, we will hold out a small test set of sequences (its size may be limited by the size of the problem) that will not be presented during training. Additionally, for larger sequence lengths, it is not feasible to memorize every combination anyway.

The way the networks receive the data over time is similar, with one key difference: EOgmaNeo sees the dataset as one large string, while LSTM/GRU gets each problem instance on its own. This is because we can help LSTM/GRU a little by resetting memory at the beginning of each problem instance, and use BPTT for only that instance. EOgmaNeo is designed for online learning, and doesn’t have such a mechanism, so it will have to learn to forget previous problem instances on its own.
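The two presentation regimes can be sketched like this (again illustrative, not the exact training code; Chainer's recurrent links do expose a `reset_state()` that makes the per-instance reset easy):

```python
def per_instance_streams(instances):
    """LSTM/GRU presentation: one stream per instance (present the
    sequence, then recall it), with a memory reset and truncated
    BPTT at each instance boundary."""
    return [inst + inst for inst in instances]

def continuous_stream(instances):
    """EOgmaNeo presentation: all instances concatenated into one
    long stream, with no resets in between; the hierarchy must learn
    to forget previous instances on its own."""
    return [s for inst in instances for s in inst + inst]

instances = [[3, 2, 0, 2], [1, 2, 3, 1]]
# Both regimes carry the same symbols, just chunked differently.
assert continuous_stream(instances) == sum(per_instance_streams(instances), [])
```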

All algorithms will be timed for their execution speed. For a fair comparison, all these tests were run on a CPU (i7, 3.4 GHz).

We will run each experiment three times to average out at least some of the randomness.

Task 1:

4 Symbols, Sequence Length 4.

LSTM – 1 layer, 256 cells. Adam optimizer.

GRU – 1 layer, 256 cells. Adam optimizer.

EOgmaNeo – 4 layers, 256 units per layer (multiple layers required for memory).

All trained for 1000 problem instances.

Algorithm/Run     Execution Time   Test Set Accuracy
LSTM Run 1        16s              10/10
GRU Run 1         27s              10/10
EOgmaNeo Run 1    3s               10/10
LSTM Run 2        20s              7/10
GRU Run 2         30s              10/10
EOgmaNeo Run 2    3s               10/10
LSTM Run 3        17s              9/10
GRU Run 3         29s              10/10
EOgmaNeo Run 3    3s               10/10

So far so decent. All algorithms did reasonably well. However, you may notice the execution speed of each differs wildly: EOgmaNeo, due to its reliance on online learning, sparse representations, and multi-timescale learning, executes far faster than the others. Interestingly, GRU was a fair bit slower than LSTM, which I did not expect, as GRUs are generally viewed as a “simpler” recurrent gating mechanism.

So, let’s take it up a notch.

Task 2:

4 Symbols, Sequence Length 8.

LSTM – 1 layer, 256 cells. Adam optimizer.

GRU – 1 layer, 256 cells. Adam optimizer.

EOgmaNeo – 4 layers, 256 units per layer (multiple layers required for memory).

All trained for 6000 problem instances.

Algorithm/Run     Execution Time   Test Set Accuracy
LSTM Run 1        3m12s            1/10
GRU Run 1         5m21s            9/10
EOgmaNeo Run 1    34s              10/10
LSTM Run 2        3m56s            0/10
GRU Run 2         5m11s            0/10
EOgmaNeo Run 2    36s              10/10
LSTM Run 3        3m32s            0/10
GRU Run 3         4m58s            5/10
EOgmaNeo Run 3    36s              10/10

This time, only EOgmaNeo completed the task with a 10/10 score on all runs. LSTM did not perform well: across its best run, it copied only one of the test sequences correctly. GRU did manage 9/10 sequences on one run, though.

Summary

So, it seems there are indeed powerful alternatives to backpropagation. Backpropagation, while a powerful algorithm, is very slow due to the reliance on offline stochastic sampling and dense representations. We alleviate this in EOgmaNeo by taking a more biologically plausible route, and avoiding backpropagation altogether, while embracing non-differentiability and sparsity.

The code used to test the algorithms can be found in our GitHub repository:

Note: If you would like to try this code you will have to update the sequence length, training length, and switch between LSTM/GRU to reflect the task.

If you are interested, you may review the code for the LSTM/GRU – we want to make sure we are not making any mistakes!

Update: Thanks to a Reddit user, we were able to get the LSTM to 95% accuracy! However, it was with PyTorch instead of Chainer. It also still ran extremely slowly, but at least the accuracy gap is a lot smaller now!

We will have some more comparisons and other demos soon!

Until next time!