We now describe two kinds of experiments based (i) on synthetic and (ii) on real-world tasks. The former test the representational power of RUMs vs. LSTMs/GRUs, and the latter test whether RUMs also perform well for real-world NLP problems.

4.1 Synthetic Tasks

Copying memory task (A) is a standard testbed for an RNN's capability for long-term memory (Hochreiter and Schmidhuber, 1997; Arjovsky et al., 2016; Henaff et al., 2016). Here, we follow the experimental set-up in Jing et al. (2017b).

Data. The alphabet of the input consists of symbols {a_i}, i ∈ {0, 1, …, n − 1, n, n + 1}, the first n of which represent data for copying, while the remaining two form the ‘‘blank’’ and the ‘‘marker’’ symbol, respectively. In our experiments, we set n = 8, and the data for copying are the first 10 symbols of the input. The RNN model is expected to output ‘‘blank’’ during the T = 500 delay steps and, after the ‘‘marker’’ appears in the input, to output (copy) sequentially the first 10 input symbols. The train/test split is 50,000/500 examples.
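
As an illustration, here is a minimal sketch of how one such example could be generated; the helper name and the exact symbol layout are our own assumptions about the standard set-up, not the paper's code:

```python
import random

def copying_example(n=8, n_data=10, delay=500):
    """Generate one copying-task example (illustrative layout).

    Symbols 0..n-1 are data, n is ''blank'', n+1 is ''marker''.
    """
    blank, marker = n, n + 1
    data = [random.randrange(n) for _ in range(n_data)]
    # Input: the data, a long delay of blanks, the marker, and then
    # blanks while the model is expected to reproduce the data.
    x = data + [blank] * delay + [marker] + [blank] * (n_data - 1)
    # Target: blanks until the marker has been seen, then the data.
    y = [blank] * (delay + n_data) + data
    return x, y
```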

Models. We test RNNs built from various types of units: LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), uRNN (Wisdom et al., 2016), EURNN (Jing et al., 2017b), GORU (Jing et al., 2017a), and RUM (ours) with λ ∈ {0, 1} and η ∈ {1.0, N/A}. We train with a batch size of 128 and RMSProp with a decay rate of 0.9, and we try learning rates from {0.01, 0.001, 0.0001}. We find that LSTM and GRU fail for all learning rates, that EURNN is unstable for large learning rates, and that RUM is stable for all of them. Thus, we use a learning rate of 0.001 for all units except for EURNN, for which we use 0.0001.

Results. Figure 4 shows the cross-entropy loss for delay time T = 500. Note that LSTM and GRU hit a predictable baseline corresponding to a memoryless strategy, equivalent to random guessing.4 We can see that RUM improves over this baseline and converges to 100% accuracy. Among the explicit unitary models, EURNN and uRNN also solve the problem in just a few steps, and GORU converges slightly faster than RUM.

Next, we study why RUM units can solve the task, whereas LSTM/GRU units cannot. In Figure 4, we also test a RUM model (called RUM′) without a flexible target memory and embedded input, that is, with the weight kernels that produce τ_t and ε̃_t kept constant. We observe that this model does not learn well (it converges extremely slowly). This means that learning to rotate the hidden state, by having control over the angles used for the rotations, is indeed needed.

Controlling the norm of the hidden state is also important. The activations of LSTM and GRU are sigmoid and tanh, respectively, and both are bounded. RUM uses ReLU, which allows for larger hidden states (nevertheless, note that RUM with the bounded tanh also yields 100% accuracy). We observe that, when we remove the time normalization, RUM converges faster than with η = 1.0. However, having no time normalization means larger spikes in the cross-entropy and an increased risk of an exploding loss. EURNN and uRNN are exposed to this risk, whereas RUM can reduce it in a tunable way through time normalization.

We also observe the benefits of tuning the associative rotational memory. Indeed, a RUM with λ = 1 has a smaller hidden size, N_h = 100, but it learns much faster than a RUM with λ = 0. It is possible that the accumulation of phase via λ = 1 enables really long-term memory to develop faster.

Finally, we would like to note that removing the update gate or using tanh and softsign activations does not hurt performance.

Associative recall task (B) is a harder synthetic testbed for long-term memory: the network observes a sequence of key–value pairs and then must return the value associated with a query key.

Data. The sequences for training are random and consist of pairs of letters and digits. We set the query key to always be a letter. We fix the size of the letter set to half the length of the sequence; the digits are from 0 to 9. No letter is repeated. In particular, the RNN is fed a sequence of letter–digit pairs, followed by the separation indicator ‘‘??’’ and a query letter (the key), e.g., ‘‘a1s2d3f4g5??d’’. The RNN is supposed to output the digit that follows the query key (‘‘d’’ in this example): It needs to find the query key in the sequence and then to output the digit that follows it (‘‘3’’ in this example). The train/dev/test split is 100k/10k/20k examples.
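
As an illustration, a minimal sketch of a generator for such examples; the helper name and the letter pool are our own illustrative choices:

```python
import random
import string

def recall_example(n_pairs=5):
    """Generate one associative recall example, e.g. 'a1s2d3f4g5??d' -> '3'.

    n_pairs distinct letters (no repeats) are each paired with a random
    digit from 0-9; the query key is one of the used letters.
    """
    letters = random.sample(string.ascii_lowercase, n_pairs)
    digits = [str(random.randrange(10)) for _ in range(n_pairs)]
    sequence = "".join(l + d for l, d in zip(letters, digits))
    key = random.choice(letters)
    target = digits[letters.index(key)]
    return sequence + "??" + key, target

# Example: recall_example(5) might return ('a1s2d3f4g5??d', '3').
```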

Models. We test LSTM, GRU, GORU, FW-LN (Ba et al., 2016a), WeiNet (Zhang and Zhou, 2017), and RUM (λ = 1, η = 0). All models have the same hidden state size, N_h = 50, for the different input lengths T. We train for 100k training steps with a batch size of 128, RMSProp as an optimizer, and a learning rate of 0.001 (selected using a hyper-parameter search).

Results. Table 1 shows the results. We can see that LSTM and GRU are unable to recall the digit correctly. Even GORU, which learns the copying task, fails to solve this problem. FW-LN, WeiNet, and RUM can learn the task for T = 30. For RUM, it is necessary that λ = 1, as for λ = 0 its performance is similar to that of LSTM and GORU. WeiNet and RUM are the only known models that can learn the task for the challenging length of 50 input characters. Note that RUM achieves 100% accuracy with 40% fewer parameters than WeiNet.

Table 1: Associative recall results. T is the input length. Note that the model in line 8 still learns the task completely for T = 50, but it needs more than 100k training steps. Moreover, varying the activations or removing the update gate does not change the result in the last line.

The benefit of the associative memory is apparent from the temperature map in Figure 5 (a), where we can see that the weight kernel for the target memory has a clear diagonal activation. This suggests that the model learns how to rotate the hidden state in Euclidean space by observing the sequence encoded in the hidden state. Note that none of our baseline models exhibits such a pattern in its weight kernels.

Figure 5 (b) shows the evolution of the rotational behavior over the 53 time steps for a model that does not learn the task. We can see that cos θ is small and biased towards 0.2. Figure 5 (c) shows the evolution for a model with associative memory (λ = 1) that does learn the task. Note that these distributions have a wider range and are more uniform.

Also, there are one or two cos θ instances close to 1.0 per distribution; that is, the angle is close to zero and the hidden state is rotated only marginally. Overall, the distributions in Figure 5 (c) yield more varied representations.
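
Such cos θ statistics are cheap to log during the forward pass; a minimal sketch, assuming access to the two vectors whose relative angle parameterizes the rotation:

```python
import numpy as np

def cos_theta(u, v, eps=1e-8):
    """Cosine of the angle between the two vectors defining the rotation."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))
```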

4.2 Real-world NLP Tasks

Question answering (C) is typically done using neural networks with external memory, but here we use a vanilla RNN with and without attention.

Data. We use the bAbI Question Answering data set (Weston et al., 2016), which consists of 20 subtasks, with 9k/1k/1k examples for train/dev/test per subtask. We train a separate model for each subtask. We tokenize the text (at the word and at the sentence level), and then we concatenate the story and the question.

For the word level, we embed the words into dense vectors, and we feed them into the RNN. Hence, the input sequence can be written as $\{x_1^{(s)}, \ldots, x_n^{(s)}, x_1^{(q)}, \ldots, x_m^{(q)}\}$, where the story has n words and the question has m words.

For the sentence level, we generate sentence embeddings by averaging word embeddings. Thus, the input sequence for a story with n sentences is $\{x_1^{(s)}, \ldots, x_n^{(s)}, x^{(q)}\}$.

Attention mechanism for the sentence level. We use simple dot-product attention (Luong et al., 2015): $\{p_t\}_{0 \le t \le n} := \mathrm{softmax}\big(\{h^{(q)} \cdot h_t^{(s)}\}_{0 \le t \le n}\big)$. The context vector $c := \sum_{t=0}^{n} p_t h_t^{(s)}$ is then passed, together with the query vector, to a dense layer.
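
A minimal NumPy sketch of this attention step; the function name and shapes are our own, and the final dense layer is omitted:

```python
import numpy as np

def dot_product_attention(h_q, h_s):
    """Dot-product attention over sentence states (Luong et al., 2015).

    h_q: (d,) hidden state after reading the question.
    h_s: (n+1, d) hidden states h_t^(s) over the story.
    Returns the attention weights p_t and the context vector c.
    """
    scores = h_s @ h_q                          # h^(q) . h_t^(s) for each t
    scores -= scores.max()                      # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()   # softmax over t
    c = p @ h_s                                 # c = sum_t p_t h_t^(s)
    return p, c

# The context c is then passed, together with h_q, to a dense layer.
```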

Models. We compare uRNN, EURNN, LSTM, GRU, GORU, and RUM (with η = N/A in all experiments). The RNN model outputs the prediction at the end of the question through a softmax layer. We use a batch size of 32 for all 20 subtasks. We train the model using the Adam optimizer with a learning rate of 0.001 (Kingma and Ba, 2015). All embeddings (word and sentence) are 64-dimensional. For each subtask, we train until convergence on the dev set, without any other regularization. For testing, we report the average accuracy over the 20 subtasks.

Results. Table 2 shows the average accuracy over the 20 bAbI subtasks. Without attention, RUM outperforms LSTM/GRU and all unitary baseline models by a sizable margin, both at the word and at the sentence level. Moreover, RUM without attention (line 14) outperforms all models except for attnLSTM. Furthermore, LSTM and GRU benefit the most from adding attention (lines 10–11), whereas the phase-coded models (lines 9, 12–15) obtain only a small boost in performance or even a decrease (e.g., in line 13). Although RUM (line 14) shares the best accuracy with attnLSTM (line 10), we hypothesize that a ‘‘phase-inspired’’ attention might further boost RUM's performance.5

Table 2: Question answering results. Accuracy averaged over the 20 bAbI subtasks. Using tanh is worse than using ReLU (line 13 vs. 15). A RUM with N_h = 150 and λ = 0 but without an update gate drops by 1.7% compared with line 13.

Character-level language modeling (D) is an important testbed for RNNs (Graves, 2013).

Data. The Penn Treebank (PTB) corpus is a collection of articles from The Wall Street Journal (Marcus et al., 1993), with a vocabulary of 10k words (using 50 different characters). We use a train/dev/test split of 5.1M/400k/450k tokens, and we replace rare words with <unk>. We feed 150 tokens at a time, and we use a batch size of 128.

Models. We incorporate RUM into a recent high-level model: the Fast-Slow RNN (FS-RNN) (Mujika et al., 2017). The FS-RNN-k architecture consists of two hierarchical layers: one is a ‘‘fast’’ layer that connects k RNN cells $F_1, \ldots, F_k$ in series; the other is a ‘‘slow’’ layer that consists of a single RNN cell S. The organization is roughly as follows: $F_1$ receives the input from the mini-batch and feeds its state into S, S feeds its state into $F_2$, and so on; finally, the output of $F_k$ is a probability distribution over characters (see the sketch after this paragraph). FS-RUM-2 uses two fast cells (both LSTMs) with a hidden size of 700 and a slow cell (RUM) with a hidden size of 1000, time normalization η = 1.0, and λ = 0. We also tried using associative memory (λ = 1) and omitting time normalization, but in both cases we encountered an exploding loss at early training stages. We optimized all hyper-parameters on the dev set.
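
To make the wiring concrete, here is a schematic sketch of one FS-RNN-2 time step under our reading of the description above; each cell is simplified to a `(state, input) -> state` callable, whereas real LSTM/RUM cells also carry gates and, for the LSTM, a separate cell state:

```python
def fs_rnn2_step(x_t, fast1, fast2, slow, h_fast, h_slow):
    """One FS-RNN-2 time step (schematic).

    fast1, fast2, slow are RNN cells simplified to callables
    (state, input) -> new_state.
    """
    h_fast = fast1(h_fast, x_t)     # F1 reads the input token
    h_slow = slow(h_slow, h_fast)   # the slow cell S reads F1's state
    h_fast = fast2(h_fast, h_slow)  # F2 reads S's state
    return h_fast, h_slow           # F2's state feeds the output softmax
```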

Additionally, we tested FS-EURNN-2, in which the slow cell is an EURNN with a hidden size of 2000, and FS-GORU-2, in which the slow cell is a GORU with a hidden size of 800 (everything else remains as in FS-RUM-2). As the learned phases are periodic, there is no easy regularization for FS-EURNN-2 or FS-GORU-2.

For FS-RNN, we use the hyper-parameter values suggested by Mujika et al. (2017). We further use layer normalization (Ba et al., 2016b) on all states, on the LSTM gates, on the RUM update gate, and on the target memory. We also apply zoneout (Krueger et al., 2017) to the recurrent connections, as well as dropout (Srivastava et al., 2014). We embed each character into a 128-dimensional space (without pre-training).

For training, we use the Adam optimizer with a learning rate of 0.002, we decay the learning rate over the last few training epochs, and we apply gradient clipping with a maximum gradient norm of 1.0. Finally, we pass the output through a softmax layer.

For testing, we report the bits-per-character (BPC) loss on the test set, which is the cross-entropy loss, but computed with a binary logarithm.
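
Concretely, for a test sequence of characters $c_1, \ldots, c_T$ under the model distribution $p$,

$\mathrm{BPC} = -\frac{1}{T} \sum_{t=1}^{T} \log_2 p(c_t \mid c_1, \ldots, c_{t-1})$,

i.e., the average cross-entropy measured in bits rather than in nats.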

Our best FS-RUM-2 uses a decaying learning rate: 180 epochs with a learning rate of 0.002, then 60 epochs with 0.0001, and finally 120 epochs with 0.00001.

We also test a vanilla RUM with η = 1.0 and a two-layer RUM with η = 0.3. The cell zoneout/hidden zoneout/dropout probabilities are 0.5/0.9/0.35 for FS-RUM-2 and 0.5/0.1/0.65 for the vanilla versions. We train for 100 epochs with a learning rate of 0.002. These values were suggested by Mujika et al. (2017), who used LSTM cells.

Results. In Table 3, we report the BPC loss for character-level language modeling on PTB. For the test split, FS-RUM-2 reduces the BPC for Fast-Slow models by 0.001 points absolute. Moreover, we achieved a decrease of 0.002 BPC points for the validation split using an FS-RUM-2 model with a hidden size of 800 for the slow cell (RUM) and a hidden size of 1100 for the fast cells (LSTM). Our results support a conjecture from the conclusions of Mujika et al. (2017), which states that models with long-term memory, when used as the slow cell, may enhance performance.

Table 3: Character-level language modeling results. BPC score on the PTB test split. Using tanh is slightly better than ReLU (lines 2–3). Removing the update gate in line 1 is worse than line 2. Phase-inspired regularization may improve lines 1–3, 6–8, 9–10, and 16.

Text summarization (E) is the task of reducing long pieces of text to short summaries without losing much information. It is one of the most challenging tasks in NLP (Nenkova and McKeown, 2011), with a number of applications ranging from question answering to journalism (Tatalović, 2018). Text summarization can be abstractive (Nallapati et al., 2016), extractive (Nallapati et al., 2017), or hybrid (See et al., 2017). Advances in encoder-decoder/seq2seq models (Cho et al., 2014; Sutskever et al., 2014) established models based on RNNs as powerful tools for text summarization. Having accumulated knowledge from the ablation and the preparation tasks, we test RUM on this hard real-world NLP task.

Data. We follow the set-up of See et al. (2017), and we use the CNN/Daily Mail corpus (Hermann et al., 2015; Nallapati et al., 2016), which consists of news stories with reference summaries. On average, there are 781 tokens per story and 56 tokens per summary. The train/dev/test datasets contain 287,226/13,368/11,490 text–summary pairs.

We further experimented with a new data set, which we crawled from the Science Daily Web site by iterating over certain date/time URL patterns. We successfully extracted 60,900 Web pages, each containing a public story about a recent scientific paper. We extracted the main content, a short summary, and a title from each HTML page using Beautiful Soup. The input story length is 488.42 ± 219.47 tokens, the target summary length is 45.21 ± 18.60 tokens, and the title length is 9.35 ± 2.84 tokens. In our experiments, we set the vocabulary size to 50k.
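
A minimal sketch of such an extraction step; the selectors below are hypothetical placeholders, since the actual Science Daily markup (and the selectors we used) may differ:

```python
from bs4 import BeautifulSoup

def extract_story(html):
    """Pull title, summary, and main content out of one crawled page.

    The tag/class/id selectors are hypothetical placeholders; the real
    page structure must be inspected to choose them.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    summary = soup.find("p", class_="lead").get_text(strip=True)
    content = " ".join(
        p.get_text(strip=True) for p in soup.select("div#story_text p")
    )
    return title, summary, content
```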

We defined four tasks on this data: (i) s2s, story to summary; (ii) sh2s, shuffled story to summary (we put the paragraphs of the story in a random order); (iii) s2t, story to title; and (iv) oods2s, out-of-domain testing for s2s (i.e., training on CNN/Daily Mail and testing on Science Daily).

Models. We use a pointer-generator network (See et al., 2017), which is a combination of a seq2seq model (Nallapati et al., 2016) with attention (Bahdanau et al., 2015) and a pointer network (Vinyals et al., 2015). We believe the pointer-generator network architecture to be a good testbed for experiments with a new RNN unit because it enables both abstractive and extractive summarization.

We adopt the model of See et al. (2017) as our LEAD baseline. This model uses a bi-directional LSTM encoder (400 steps) with an attention distribution and an LSTM decoder (100 steps for training and 120 steps for testing), with all hidden states being 256-dimensional and with 128-dimensional word embeddings trained from scratch. For training, we use the cross-entropy loss for the seq2seq model. For evaluation, we use ROUGE (Lin and Hovy, 2003). We also allow the coverage mechanism proposed in the original paper, which penalizes repetitions and improves the quality of the summaries (marked as ‘‘cov.’’ in Table 4). Following the original paper, we train LEAD for 270k iterations, and we turn on coverage for about 3k iterations at the end to obtain LEAD cov. We use an Adagrad optimizer with a learning rate of 0.15, an accumulator value of 0.1, and a batch size of 16. For decoding, we use a beam of size 4.
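
For reference, the LEAD training set-up can be summarized as a configuration sketch; the dictionary keys are our own naming, not from See et al. (2017):

```python
LEAD_CONFIG = {
    "encoder": {"cell": "bi-LSTM", "steps": 400, "hidden_size": 256},
    "decoder": {"cell": "LSTM", "steps_train": 100, "steps_test": 120,
                "hidden_size": 256},
    "embedding_dim": 128,          # word embeddings trained from scratch
    "optimizer": "Adagrad",
    "learning_rate": 0.15,
    "adagrad_init_accumulator": 0.1,
    "batch_size": 16,
    "beam_size": 4,                # beam search width at decoding time
    "train_iterations": 270_000,
    "coverage_iterations": 3_000,  # coverage fine-tuning at the end
}
```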

Table 4: Text summarization results. Shown are ROUGE F-{1,2,L} scores on the test split for the CNN/Daily Mail and the Science Daily datasets. Some settings differ from ours: lines 8–9 show results when training and testing on an anonymized version of the data set, and lines 12–14 use reinforcement learning. The ROUGE scores have a 95% confidence interval within ±0.25 points absolute. For lines 2 and 7, the maximum number of decoder steps at test time is 100. In lines 15–18, L/dR stands for LEAD/decRUM. Replacing ReLU with tanh or removing the update gate in decRUM (line 17) yields drops in ROUGE F-1/2/L of 0.01/0.09/0.25 and 0.36/0.39/0.42 points absolute, respectively.

The only component of LEAD that our proposed models change is the type of RNN unit in the encoder/decoder. Namely, encRUM is LEAD with a bidirectional RUM encoder (but an LSTM decoder), decRUM is LEAD with a RUM decoder (but a bi-LSTM encoder), and allRUM is LEAD with all LSTM units replaced by RUM units. We train these models as we do LEAD, by minimizing the validation cross-entropy. We found that encRUM and allRUM take about 100k training steps to converge, while decRUM takes about 270k steps. Then, we turn on coverage training, as advised by See et al. (2017), and we train for a few thousand additional steps {2k, 3k, 4k, 5k, 8k}. The best ROUGE on dev was achieved with 2k steps, which is what we ultimately used. We did not use time normalization, as training was stable without it. We used the same hidden sizes for the LSTM, the RUM, and the mixed models. For the size of the hidden units, we tried {256, 360, 400, 512} on the dev set, and we found that 256 worked best overall.

Results. Table 4 shows the ROUGE scores for the CNN/Daily Mail and the Science Daily test splits. We can see that RUM can easily replace LSTM in the pointer-generator network. We found that the best place to use RUM is in the decoder of the seq2seq model, as decRUM is better than encRUM and allRUM. Overall, we obtained the best results with decRUM 256 (lines 2 and 7), and we observed slight improvements on some ROUGE variants over previous work (i.e., with respect to lines 10–11).

We further trained decRUM with coverage for about 2,000 additional steps, which yielded an increase of 0.01 points for ROUGE F-1 (but with reduced ROUGE F-2/L). We can conclude that here, as in the language modeling study (D), a combination of LSTM and RUM is better than an LSTM-only or a RUM-only seq2seq model.

We conjecture that using RUM in the decoder is better because the encoder already has an attention mechanism and thus does not need much long-term memory; it is better off focusing on a more local context (as LSTM does). However, long-term memory is crucial for the decoder, as it has to generate fluent output and the attention mechanism cannot help it (i.e., it is better to use RUM there). This is in line with our attention experiments on question answering. In future work, we plan to investigate combinations of LSTM and RUM units in more detail in order to identify an optimal phase-coded attention.

Incorporating RUM into the seq2seq model yields larger gradients, which remain compatible with stable training. Figure 6 (a) shows the global norm of the gradients for our baseline models. Because of the tanh activation, LSTM's gradients hit the 1.0 baseline even though the gradient-clipping threshold is 2.0. All RUM-based models have a larger global norm. decRUM 360 sustains a slightly higher norm than LEAD, which might be beneficial. Panel 6 (b), a consequence of 6 (a), demonstrates that the RUM decoder sustains hidden states of higher norm throughout training. Panel 6 (c) shows the contribution of the output at each encoder step to the gradient updates of the model. We observe that an LSTM encoder (in LEAD and in decRUM) yields slightly larger gradient updates, which is in line with our conjecture that it is better to use an LSTM encoder. Finally, panel 6 (d) shows the gradient updates at each decoder step. Although the overall performance of LEAD and decRUM is similar, we note that the last few gradient updates from the RUM decoder are zero, whereas they are slightly above zero for LSTM. This happens because the target summaries in a mini-batch are usually shorter than 100 tokens. Here, RUM exhibits an interesting property: It identifies that the target summary has ended, and for the subsequent extra steps our model stops the gradients from updating the weights. An LSTM decoder keeps updating during the extra steps, which might indicate that it does not properly identify the end of the target summary.
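
The global norm in Figure 6 (a) is the standard quantity used in norm-based gradient clipping; a PyTorch-style sketch of how it can be monitored (our own illustration, assuming gradients have already been computed):

```python
import torch

def global_grad_norm(parameters):
    """L2 norm of all gradients concatenated, as used for gradient clipping."""
    total = 0.0
    for p in parameters:
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# Equivalent to the value returned by
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
```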

We also compare our best decRUM 256 model to LEAD on the Science Daily data (lines 15–18). For Table 4, lines 15–17, we retrain the models from scratch. We can see that LEAD has a clear advantage on the easiest task (line 15), which generally requires copying the first few sentences of the Science Daily article.

In line 16, this advantage decreases, as shuffling the paragraphs makes the task harder. We further observe that our RUM-based model demonstrates better performance on ROUGE F-2/L in line 17, where the task is highly abstractive.

Out-of-domain performance. In line 18, decRUM 256 and LEAD are pre-trained on CNN/Daily Mail (the models from lines 1–2), and our RUM-based model shows a clear advantage on all ROUGE metrics. We also observe output summaries that are better than the ones produced by LEAD (see, for example, the story6 in Figure 1). We hypothesize that RUM is better on out-of-domain data due to its associative nature, as can be seen in Equation (2): at inference time, the weight matrix that updates the hidden state depends explicitly on the current input.