Transformer models are an increasingly popular neural network architecture in natural language processing (NLP) research, where large transformers achieve state-of-the-art performance on many tasks. The tradeoff is their heavy compute consumption and cost, especially when training on long sequences.

A recent paper by Google and UC Berkeley researchers, accepted to the prestigious International Conference on Learning Representations (ICLR 2020), proposes a new transformer model called “Reformer” that achieves impressive performance even when running on a single GPU.

To improve transformer efficiency, the researchers replaced dot-product attention with attention based on locality-sensitive hashing (LSH), reducing the complexity from O(L²) to O(L log L), where L is the sequence length. LSH is an algorithmic technique for nearest-neighbor search, commonly used to find similar items in massive datasets.
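To make the bucketing idea concrete, here is a minimal sketch of an angular LSH hash of the kind the Reformer paper describes: a vector is projected with a random matrix R, and its bucket is the argmax over the concatenation [xR; -xR]. The function names, the toy dimensions and the pure-Python representation are assumptions made for illustration, not the authors' implementation.

```python
import random


def make_projections(dim, n_buckets, seed=0):
    """Random Gaussian matrix with shape (dim, n_buckets // 2)."""
    if n_buckets % 2 != 0:
        raise ValueError("n_buckets must be even")
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)]
            for _ in range(dim)]


def lsh_bucket(vec, projections):
    """Bucket id = argmax over [xR; -xR], so it depends only on direction."""
    half = len(projections[0])
    projected = [sum(vec[i] * projections[i][j] for i in range(len(vec)))
                 for j in range(half)]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)


def bucket_positions(vectors, projections):
    """Group sequence positions by bucket; attention is then restricted
    to positions that share a bucket instead of all L x L pairs."""
    buckets = {}
    for pos, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, projections), []).append(pos)
    return buckets
```

Vectors that point in similar directions tend to land in the same bucket, so each position only needs to be compared against the members of its own bucket.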

The researchers also used reversible residual layers instead of standard residuals, so that activations need to be stored only once during training instead of N times (where N is the number of layers). The resulting Reformer model performed on par with the Transformer model, but was more memory-efficient and faster on long sequences.
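The memory saving comes from the fact that a reversible layer's inputs can be recomputed from its outputs, so they do not have to be kept for the backward pass. The sketch below shows the standard reversible residual formulation (y1 = x1 + F(x2), y2 = x2 + G(y1)) on plain Python lists. The functions `f` and `g` are hypothetical stand-ins for the attention and feed-forward sublayers.

```python
import math


def f(x):
    """Stand-in for the attention sublayer (illustrative only)."""
    return [math.tanh(v) for v in x]


def g(x):
    """Stand-in for the feed-forward sublayer (illustrative only)."""
    return [0.5 * v * v for v in x]


def rev_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recover the block's inputs from its outputs by undoing each step."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

During backpropagation, each layer's inputs can be rebuilt this way from the layer above, which is why only one set of activations needs to stay in memory regardless of depth.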

The researchers ran experiments on an image generation task, imagenet64, with sequences of length 12K, and a text task, enwik8, with sequences of length 64K, to compare the conventional Transformer with the proposed reversible Transformer. Both models had the same number of parameters, and their learning curves were almost identical. The results showed that the reversible Transformer saves memory without sacrificing accuracy.

Effect of shared query-key space (left) and reversibility (right) on performance on enwik8 and imagenet64 training. The curves show bits per dim on held-out data.

LSH attention is an approximation of full attention, and its accuracy improves as the number of hashes (hashing rounds) increases. With 8 hashes, LSH attention is almost equivalent to full attention. The computational cost of the model also grows with the number of hashes, so researchers can tune this value to fit their compute budget.
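One way to see why more hashing rounds help: two similar vectors can fall into different buckets in any single round by chance, but if a pair counts as a candidate when it shares a bucket in at least one round, extra rounds can only add candidates. The sketch below illustrates this with a toy angular hash. The names, the bucket count and the brute-force pair loop are assumptions for illustration, since a real implementation would sort by bucket rather than enumerate pairs.

```python
import random


def angular_hash(vec, proj):
    """Bucket id = argmax over [xR; -xR] for a random projection R."""
    half = len(proj[0])
    projected = [sum(vec[i] * proj[i][j] for i in range(len(vec)))
                 for j in range(half)]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)


def candidate_pairs(vectors, n_rounds, n_buckets=4, seed=0):
    """Pairs of positions that share a bucket in at least one round.

    Each round draws a fresh random projection, so the work grows
    linearly with n_rounds while the candidate set can only grow.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    pairs = set()
    for _ in range(n_rounds):
        proj = [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)]
                for _ in range(dim)]
        ids = [angular_hash(v, proj) for v in vectors]
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                if ids[i] == ids[j]:
                    pairs.add((i, j))
    return pairs
```

Because the same seed reproduces the same projections for the early rounds, the candidate set for 8 rounds always contains the set for 1 round, at roughly 8 times the hashing work.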

LSH attention performance as a function of hashing rounds on imagenet64.

The researchers also evaluated LSH attention on enwik8 and measured how the speed of different attention types varies with sequence length while the total number of tokens is held fixed. The results show that conventional attention slows down as the sequence length increases, while LSH attention speed remains steady.
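A back-of-the-envelope calculation shows why this happens. With the total token count fixed, longer sequences mean fewer sequences per batch, so full attention costs about (T / L) · L² = T · L operations while LSH attention costs about (T / L) · L log L = T log L. The sketch below uses these idealized counts. It ignores constant factors such as the number of hashing rounds, and the token counts are hypothetical rather than taken from the paper.

```python
import math


def full_attention_cost(total_tokens, seq_len):
    """Idealized cost of full attention: one L x L score matrix per sequence."""
    n_sequences = total_tokens // seq_len
    return n_sequences * seq_len ** 2


def lsh_attention_cost(total_tokens, seq_len):
    """Idealized cost of LSH attention: about L log L work per sequence."""
    n_sequences = total_tokens // seq_len
    return n_sequences * seq_len * math.log2(seq_len)
```

For example, with 32,768 total tokens, doubling the sequence length from 1,024 to 2,048 doubles the idealized full-attention cost but raises the LSH cost by only about 10 percent.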

(left) LSH attention performance as a function of number of layers on enwik8, and (right) attention evaluation speed as a function of input length for full- and LSH- attention.

The paper was accepted to ICLR 2020, where it received near-perfect review scores of “8, 8, 6”. The study has garnered critical acclaim in the research community and is expected to have a significant impact on the field.

The paper Reformer: The Efficient Transformer is on OpenReview.