I present a new implementation of the stream cipher Trivium designed for cryptanalysts, in particular those interested in applying the “cube attack” to Trivium. It generates 128 simultaneous output streams using SSE2 intrinsics, and achieves under 1 cycle/byte, over four times faster than standard implementations. The entire program is in Python; SSE2 machine instructions are generated and called using the tool CorePy, an approach I am happy to recommend to others with similar needs. The code is under the MIT licence and may be found in this Mercurial repository.

Dinur and Shamir’s “cube attack” caused something of a media storm when it was first presented at a rump session at Crypto 2008. The attack may be widely applicable: it takes an extremely general view of the cipher under attack, and can be applied even where the cipher is treated as a “black box” into which key and IV bits are put in and output bits result, with no consideration of its internal structure. Early reports that this attack would break all of our most trusted primitives are false, however – the attack depends on the degree of the polynomial representing the working of the cipher, and for a cipher like AES the degree is far too high for the attack to work.

The one real cipher against which the attack has been applied is Trivium, a personal favourite of mine with real beauty in the design; astonishingly simple, neatly implementable in both hardware and software, and carefully thought out resistance to cryptanalysis. Trivium has 1152 initialization rounds; the cube attack breaks a weakened variant with 735 initialization rounds in less than 230 operations, and the authors say “We believe that by using more advanced techniques, we can break much stronger Trivium variants, and maybe even the original cipher”.

As referenced above, applying the cube attack to a cipher is to a certain extent an empirical business; one searches for a set of “maxterms” that break the cipher, and this involves running the cipher many, many times. It occurred to me that for this use case, we could improve on the standard implementation of the cipher by going back to the roots of bitslicing.

The block cipher DES is designed entirely around efficient hardware implementation, and is notoriously difficult to implement in software. In particular, several parts of the cipher require complex permutations in bits – in hardware, this is just some wires crossing, but in software a fiddly sequence of shifts, rotates and masks. In 1997, Eli Biham presented A Fast New DES Implementation in Software. This implementation achieved its speed – about five times faster than the fastest previously known – by doing 64 simultaneous block encryptions in parallel, treating a normal 64-bit Alpha computer as a SIMD computer, and using each bit position of a word to describe part of the state of one of the 64 simultaneous encryptions. Starting from a description of a logic gate circuit to implement DES, this implementation would use standard logical operations to mirror the work of the circuit. No shifts or rotates are needed – each column of bits is an independent part of the work, and each logical operation carries 64 times its weight in useful work done. This technique was dubbed “bitslicing” by Matthew Kwan, and the name stuck.

Not long after this, the block cipher Serpent was published by Anderson, Biham, and Knudsen. Serpent takes advantage of bitslicing, not for multiple parallel instances, but to make a single instance of the cipher. The state of the cipher is four 32-bit words, and the nonlinearity in the cipher is a four-bit S-box on bit columsn of these words implemented using bitslicing. These alternate with a linear mixing stage based on shifts, rotates and XORs. This proved a popular technique, and Trivium itself is designed to be implemented using bitslicing in a similar way, advancing the state by up to 64 bits at a time using logical operations on words.

However, this implementation is bitsliced in the old fashioned way – it is 128 independent instances of the Trivium cipher, represented as vectors of 128 bits. This has two advantages over a standard Trivium implementation: twice as many bits can be operated on in a single instruction using SSE2 intrinsics, and no shifts or masks are needed. As a result, it operates at over four times the speed, achieving under 1 cycle/byte – 2.4 GBytes/sec – on my 2.4 GHz Intel Core 2 Duo PC.

One unusual feature of this implementation is worth mentioning – it’s in Python. Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.

The codebase includes our bitslice implementation and a very simple pure-Python Trivium implementation for testing, as well as code to test one against the other, to benchmark the fast implementation, and to use our fast implementation to test the “cube attack”. Please do let me know if you’re playing with this or putting it to use, I look forward to hearing from you.

Download most recent revision