The Problem

As noted in the section on constant-time crypto, integer multiplication opcodes in CPU may or may not execute in constant time; when they do not, implementations that use such operations may exhibit execution time variations that depend on the involved data, thereby potentially leaking secret information.

In current BearSSL (version 0.3, 2017-01-29), potentially affected implementations are:

GHASH (for GCM): ctmul, ctmul32, ctmul64

Poly1305: ctmul, ctmul32

RSA: i15, i31, i32

Elliptic curves: prime_i15, prime_i31, p256_m15, p256_m31, c25519_i15, c25519_i31, c25519_m15, c25519_m31

This page aims at aggregating information on which CPU provide constant-time multiplications, and what can be done when working on CPU that do not.

Generic Workarounds

When a CPU has non-constant-time multiplication opcodes, the execution time depends on the size (as integers) of one or both of the operands. For instance, on the ARM Cortex-M3, the umull opcode takes 3 to 5 cycles to complete, the “short” counts (3 or 4) being taken only if both operands are numerically less than 65536 (so says the manual, but see below). Thus, such a multiplication should be made constant-time by forcing the high bits to 1. In BearSSL, the “i31” big-integer code (used for RSA and elliptic curves) internally uses a macro called MUL31() , that takes as inputs two 31-bit words (represented in uint32_t values) and outputs the 62-bit result (in an uint64_t ). There are two versions for this macro; the normal one is a simple multiplication, while the alternate tries to be constant-time:

#define MUL31(x, y) ((uint64_t)((x) | (uint32_t)0x80000000) \ * (uint64_t)((y) | (uint32_t)0x80000000) \ - ((uint64_t)(x) << 31) - ((uint64_t)(y) << 31) \ - ((uint64_t)1 << 62))

The alternate macro is used if the BR_CT_MUL31 compile-time option is set (either with a compiler invocation switch, or by setting the option in src/config.h ). The resulting code will be both slightly bigger, and noticeably slower, but at least it should make RSA and elliptic curves constant-time on a Cortex-M3.

Or not. Tanja Lange pointed me to this study by Wouter de Groot, who performed actual measures on some Cortex-M3 devices, from which it appeared that a short cycle count could occur not only in the documented case (operands fit in 16 bits) but also when one or both of the operands is zero or a power of 2. In the case of MUL31() , this means that the underlying umull will complete in 5 cycles, except if one or both of the two operands are zero, in which case the umull will complete in 4 cycles. Therefore, on the tested devices (and, presumably, on all devices that use a Cortex-M3 core), MUL31() does not guarantee a fully constant-time implementation. On “normal” data the short cycle count will be quite rare (about one in a billion invocations for full-width operands) but it may happen much more often on the top bits (e.g. when using 1024-bit integers with 31-bit limbs, the top limb contains only 1 bit of data, since 31×33 = 1023). How to leverage it in an actual attack is still an open question at this point; normal caution dictates that the M3 should be avoided for now.

Even within documented behaviour (ignoring cases like the Cortex-M3), the MUL31() is not sufficient for all architectures. On systems where the multiplication opcode yields only the low 32 bits (as is the case for ARM platforms in Thumb mode, that do not support Thumb-2), obtaining a 64-bit result necessarily involves a software subroutine which will invoke the multiplication opcode several times, with operands truncated in some way. The truncation may yield smaller values, hence a short cycle count in some cases. The subroutine may also have conditional branches. Finally, that 32-bit-only multiplication opcode may perform early exits when the top bits are all equal to one (this is the case of the ARM9T cores in Thumb mode). A similar situation arises on some architectures such as PowerPC, where the low and high words are obtained with two separate opcodes, and the one that yields the low word may again use signed early exit.

A more general strategy is to implement big integers with 15-bit words, not 31-bit; thus, a multiplication of two 15-bit words can be implemented as:

#define MUL15(x, y) ((uint32_t)((x) | (uint32_t)0x80000000) \ * (uint32_t)((y) | (uint32_t)0x80000000) \ & (uint32_t)0x3FFFFFFF)

BearSSL now implements that macro, both in the “slow” version shown above, and the “fast” version which is a simple multiplication. As in the case of MUL31 , the slow version must be activated with a specific compile-time macro called BR_CT_MUL15 .

Since the top two bits of each operand would be forced to 10 , the multiplication should be constant-time on most if not all existing CPU (at least all CPU with registers of 32 bits or more). Moreover, using only 32-bit results is beneficial to platforms that do not have an opcode that yields the top 32 bits of a multiplication (in particular the Cortex-M0 and M0+, and the older ARM in Thumb mode, before Thumb-2 was introduced).

It shall be noted that, at least theoretically, a given CPU could implement a non constant-time multiplication even in the case of the slow version of MUL15() , by returning early when one of the operands is a power of two (that is, in the context of MUL15() , when one of the macro parameters is zero). I am not aware of any CPU that currently implements such a shortcut for a 32→32 multiplication, but the Cortex M3 does it for the 32→64 multiplication, so such a possibility cannot be absolutely excluded.

Compilers can be an additional source of trouble. When the operands of a multiplication are larger than what the underlying platform offers, a software routine is necessarily involved, either inlined or as an external function call. This is the case when, for instance, multiplying two 64-bit operands ( int64_t or uint64_t ) on a 32-bit architecture. A typical 32-bit system will have a 32→64 opcode (or possibly a pair of opcodes, one returning the low word, another returning the high word), but for 64-bit operands, a software subroutine may be used. Depending on the compiler, that routine may or may not be constant-time.

In general, compilers will use the direct 32→64 opcode when they can “see” that the two operands are unsigned 32-bit integers (i.e. they are both uint32_t values, freshly cast to uint64_t for the purpose of multiplication). In any case, if you ensure that your operands always fit on the low 32 bits, then any software shortcut when the high words are zero will always be taken, thereby avoiding any leak.

Assembly gives more control on what exactly is executed by the CPU, and may open extra options not available in portable C, e.g. vector instructions. When a vector instruction performs two or more multiplications in parallel, it can be assumed that they are constant-time. Assembly also gives control on operand order, which may matter for some architectures.

The generic drawback of assembly is its inherent lack of portability, and increased need for maintenance. And while using assembly avoids issues with C compilers, it does not solve variance within the CPU itself.

Yet another workaround is acceptance. It has been noted by Käsper and Schwabe that in GCM, the GHASH part is computed over either encrypted or non-confidential data, so it cannot leak information on secret data; moreover, its secret key is derived from the encryption key through what amounts to a one-way function, so even if that key was unveiled, encryption would still be safe. In the ChaCha20+Poly1305 cipher suite, the MAC key is even more transient: a new MAC key is pseudorandomly produced for each record. Timing attacks being statistical in nature, requiring accumulation of measures to yield substantial amounts of data, it is plausible that a non-constant-time Poly1305 implementation would not seriously endanger the integrity of records.

Therefore, a non-constant-time Poly1305, because of non-constant-time multiplication, should not be much of a worry. A non-constant-time GHASH is a bit more problematic, in the context of long-lived sessions with many records being processed with the same GHASH key; but this is still less serious than a non-constant-time AES or RSA implementation. It is still better, if possible at a low cost, to go full constant-time.

Per CPU Information

This section is supposed to be enriched over time with information on whether multiplications on some specific CPU are constant-time or not. All the data below has been gleaned from various sources, mostly technical manuals. Since instruction timings can vary at the whim of the CPU vendor, possibly between any two sub-revisions, and are not necessarily well documented, what I wrote below is incomplete and there probably are a few mistakes.

Contributions are most welcome! Please send up any information you may have on the timing behaviour of multiply instructions.

In the tables below, cores are grouped by architecture, and the timing behaviour of multiplications is provided in the following cases:

32→32: multiplication of two 32-bit unsigned words, only the low 32 bits of the result are kept. If the CPU is constant-time for this kind of multiplication, then br_ghash_ctmul32() , br_poly1305_ctmul32() , and all the “i15” and “m15” code (for RSA and EC) are constant-time.

32→64: multiplication of two 32-bit unsigned words, result in a 64-bit unsigned integer. If the CPU is constant-time for this kind of multiplication, then br_ghash_ctmul() , br_poly1305_ctmul() , RSA “i31” and “i32” implementations, and EC “i31” implementation, are constant-time.

MUL31: multiplication of two 31-bit unsigned words, using the alternate MUL31() macro described above. If the CPU is constant-time for this multiplication, then RSA “i31” and EC “i31” implementations are constant-time, provided that the BR_CT_MUL31 option is set.

64→64: multiplication of two 64-bit unsigned words, keeping the low 64 bits of the result. If the CPU is constant-time for this multiplication, then br_ghash_ctmul64() is constant-time.

For each multiplication category, the status will be one of the following:

Y : constant-time.

N : not constant-time.

S+ : operation not directly provided by the hardware; a software implementation is included by the compiler, and it is probably constant-time.

S : operation not directly provided by the hardware; a software implementation is included by the compiler, and it is possibly constant-time (depending on compiler + library).

S- : operation not directly provided by the hardware; a software implementation is included by the compiler, and it is probably not constant-time.

?: status unknown (more information needed!).

A common assumption is that if the platform offers a 32→64 opcode for unsigned multiplications, then the C compiler will be smart enough to notice that 32-bit unsigned integers, freshly cast to uint64_t , are still really 32-bit values, and thus their product may use the platform opcode directly instead of using a generic 64→64 routine. In the case of GCC, this assumption holds, unless compilation is done without any optimisations at all ( -O0 ).

x86

On the x86 line of CPU, in 32-bit mode, the 32→32 multiplication is typically done with an imul opcode (nominally signed, but this does not matter in that case), while the 32→64 multiplication is done with mul , that writes the unsigned result in the edx (high word) and eax registers.

For the 64→64 multiplication, GCC inlines the relevant routine, and it combines one mul (for the two low words), two imul , and two additions; that routine is constant-time. However, in the same situation, Microsoft Visual C relies on a library routine which is not constant-time, in that it returns early if the top words are all-zeros or all-ones. This has been exploited to extract the private key from a server running Curve25519. Thus, on x86 in 32-bit mode, 64→64 multiplications are deemed “S”, not “S+”: they will be constant-time, or not, depending on the used compiler (a not too comfortable situation).

In 64-bit mode, the imul opcode is always used, operands being extended with zeros if necessary. The mul opcode will be used only when trying to obtain a 128-bit output (with a non-standard type like __m128 ); on recent cores (Intel Haswell or later), a variant called mulx allows choosing the two target registers for the 128-bit output, which reduces contention and allows for better pipelining.

In general, Intel x86 CPU all provide constant-time multiplications since the days of the first Pentium. The same does not necessarily hold for other vendors, in particular the early VIA Nano.

CPU type 32→32 32→64 MUL31 64→64 x86 (32-bit) Intel 80386 N N Y S- Intel 80486 N N Y S- Intel P5 (Pentium, Pentium MMX) Y Y Y S Intel P6 / Pentium M (Pentium Pro to Core) Y Y Y S Intel NetBurst (Pentium IV) Y Y Y S Intel Core 2 Y Y Y S Intel Nehalem (early Core i3/i5/i7) Y Y Y S Intel Sandy Bridge / Ivy Bridge Y Y Y S Intel Haswell Y Y Y S Intel Broadwell Y Y Y S Intel Skylake Y Y Y S Intel Bonnell (Atom) Y Y Y S Intel Silvermont (Atom) Y Y Y S Intel Goldmont (Atom) Y Y Y S AMD K5 ? ? ? ? AMD K6 ? ? ? ? AMD K7 (Athlon) Y Y Y S AMD K8 (Hammer) Y Y Y S AMD K10 Y Y Y S AMD Bobcat Y Y Y S AMD Bulldozer Y Y Y S AMD Jaguar Y Y Y S AMD Zen Y Y Y S VIA C3 ? ? ? ? VIA C7 ? ? ? ? VIA Nano 2000 Series N N Y S- VIA Nano 3000 Series Y Y Y S x86 (64-bit) Intel Core 2 Y Y Y Y Intel Nehalem (early Core i3/i5/i7) Y Y Y Y Intel Sandy Bridge / Ivy Bridge Y Y Y Y Intel Haswell Y Y Y Y Intel Broadwell Y Y Y Y Intel Skylake Y Y Y Y Intel Bonnell (Atom) Y Y Y Y Intel Silvermont (Atom) Y Y Y Y Intel Goldmont (Atom) Y Y Y Y AMD K8 (Hammer) Y Y Y Y AMD K10 Y Y Y Y AMD Bobcat Y Y Y Y AMD Bulldozer Y Y Y Y AMD Jaguar Y Y Y Y AMD Zen Y Y Y Y VIA Nano 2000 Series N N Y N VIA Nano 3000 Series Y Y Y Y

ARM

The ARM line of CPU has four instruction sets:

The original set, called “A32”, with 32-bit opcodes.

The “Thumb” set, with 16-bit opcodes.

The “Thumb-2” set, that extends Thumb set with some 32-bit opcodes.

The “A64” set, with 32-bit opcodes and 64-bit registers.

In recent ARM terminology, the Thumb and Thumb-2 sets are called “T32”.

The A32 and Thumb-2 sets offer a native unsigned 32→64 opcode, called umull . However, the Thumb set has only a 32→32 opcode (called mul ). ARM cores before the ARM11 know only Thumb, not Thumb-2; however, they also support the original A32 set, and switching from one set to the other is efficient (this can be done as part of a function call, so that A32 and Thumb sets may be used together), so software on these older cores may conceivably use an A32 routine to handle a 32→64 or 64→64 multiplication.

In A32 and Thumb-2 modes, GCC uses mul for 32→32, umull for 32→64, and an inline subroutine with one umull and two mul for 64→64. In Thumb mode, GCC uses an external library function called __aeabi_lmul() for both 32→64 and 64→64; the implementation for that function is not constant-time, even in the 32→64 case (it uses a conditional branch for carry propagation, for what corresponds to what umull would have provided).

The low power Cortex M0, M0+ and M1, know Thumb only — however, these offer a constant-time 32→32 opcode. The Cortex M3 has Thumb-2, so it has both the 32→32 opcode (which is constant-time) and the 32→64 opcode (which is not constant-time).

As was described in a previous section, the actual M3 device uses some undocumented “early exit” strategy in some cases, that may prevent the MUL31() macro from guaranteeing constant-time processing in some cases, which may or may not matter in practice. Of course, any other processor with variable-time multiplication opcode may exhibit a similar behaviour (that’s the problem with undocumented behaviour).

Important caveat: in the table below, it is assumed that the code has been compiled to benefit from the abilities of the CPU; e.g. that code compiled for an ARM9E core will use the whole Thumb-2 instruction set. Code running on an ARM9E but compiled for the ARM9T will use a software subroutine for a 32→64 multiplication, since the ARM9T does not support the relevant Thumb-2 opcode ( umull ).

Other vendors have licensed the ARM architectures, and may have used constant-time or non-constant-time implementations.

CPU type 32→32 32→64 MUL31 64→64 ARM7T (A32) N N Y S- ARM7T (Thumb) N S- S- S- ARM9T (A32) N N Y S- ARM9T (Thumb) N S- S- S- ARM9E Y Y Y S+ ARM10E Y Y Y S+ ARM11 Y Y Y S+ Cortex-A (A32) Y Y Y S+ Cortex-A (A64) ? ? ? ? Cortex-R (A32) Y Y Y S+ Cortex-M0/M0+/M1 Y S- S- S- Cortex-M3 Y N N S- Cortex-M4 Y Y Y S+

PowerPC

Processors following the PowerPC architecture have been produced by many vendors, with varying behaviours with regards to multiplications.

In 32-bit mode, a 64-bit result is obtained by using two opcodes, mulhwu to get the high word, mullw for the low word. Both may return “early”; in particular, mullw may treat its operands as signed and may return early if the top bits are all-zeros or all-ones. In particular, this prevents MUL31() from guaranteeing constant-time multiplications.

Recent, “large” CPU offer a 64-bit mode; however, even on systems where the 64-bit mode is supported, many applications are compiled in 32-bit mode, since that reduces memory usage (contrary to the x86 situation, 32-bit PowerPC code has access to as many registers as 64-bit PowerPC, making the use of 64-bit mode comparatively less attractive).

Late incarnations of the POWER architecture (POWER8, POWER9), which can be thought of as PowerPC’s big brother, meant for powerfull desktop systems and servers, have constant-time multiplications.

CPU type 32→32 32→64 MUL31 64→64 PowerPC 603/603e (RAD6000) N N N S- PowerPC 7xx (G3, RAD750) N N N S- PowerPC 74xx (G4) N N N S- PowerPC 970 (G5) (32-bit) ? ? ? ? PowerPC 970 (G5) (64-bit) ? ? ? ? PPE (Cell, Xenon) (32-bit) Y Y Y S+ PPE (Cell, Xenon) (64-bit) Y Y Y Y POWER8 (32-bit) Y Y Y S+ POWER8 (64-bit) Y Y Y Y POWER9 (32-bit) Y Y Y S+ POWER9 (64-bit) Y Y Y Y

Other Architectures

Please contribute any information on CPU types and architectures not listed above!