How a Rust upgrade more than tripled the speed of my code

I’d like to share a quick story about the sheer power of LLVM and the benefits of using higher-level languages over assembly.

I work at Parity Technologies, which maintains the Parity Ethereum client. In this client we have a need for performant 256-bit arithmetic, which we have to emulate in software since no modern hardware supports it natively.

For a long time we’ve maintained parallel implementations of arithmetic, one in Rust for stable builds and one in inline assembly (which is automatically used when you compile with the nightly compiler). We do this because we store these 256-bit numbers as arrays of 64-bit numbers and, until recently, there was no way to multiply two 64-bit numbers to get a more-than-64-bit result in stable Rust (since its integer types only went up to u64). This is despite the fact that x86_64 (our main target platform) natively supports 128-bit results of calculations with 64-bit numbers. So we resorted to splitting each 64-bit number into two 32-bit numbers (because we can multiply two 32-bit numbers to get a 64-bit result).

    impl U256 {
        fn full_mul(self, other: Self) -> U512 {
            let U256(ref me) = self;
            let U256(ref you) = other;
            let mut ret = [0u64; U512_SIZE];

            for i in 0..U256_SIZE {
                let mut carry = 0u64;
                // `split` splits a 64-bit number into upper and lower halves
                let (b_u, b_l) = split(you[i]);

                for j in 0..U256_SIZE {
                    // This process is so slow that it's faster to check for 0 and skip
                    // it if possible.
                    if me[j] != 0 || carry != 0 {
                        let a = split(me[j]);

                        // `mul_u32` multiplies a 64-bit number that's been split into
                        // an `(upper, lower)` pair by a 32-bit number to get a 96-bit
                        // result. Yes, 96-bit (it returns a `(u32, u64)` pair).
                        let (c_l, overflow_l) = mul_u32(a, b_l, ret[i + j]);

                        // Since we have to multiply by a 64-bit number, we have to do
                        // this twice.
                        let (c_u, overflow_u) = mul_u32(a, b_u, c_l >> 32);
                        ret[i + j] = (c_l & 0xffffffff) + (c_u << 32);

                        // Then we have to do this complex logic to set the result. Gross.
                        let res = (c_u >> 32) + (overflow_u << 32);
                        let (res, o1) = res.overflowing_add(overflow_l + carry);
                        let (res, o2) = res.overflowing_add(ret[i + j + 1]);
                        ret[i + j + 1] = res;
                        carry = (o1 | o2) as u64;
                    }
                }
            }

            U512(ret)
        }
    }
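The `split` and `mul_u32` helpers aren't shown in the listing above. To illustrate the underlying idea on its own, here is a minimal sketch of a 64-bit by 64-bit multiply that produces a 128-bit result using only 32-bit halves. The function name `widening_mul_via_halves` and this `split` are hypothetical, written for this illustration; they are not the helpers from the Parity codebase.

```rust
/// Split a 64-bit number into `(upper, lower)` 32-bit halves.
/// (Hypothetical helper for this sketch.)
fn split(x: u64) -> (u64, u64) {
    (x >> 32, x & 0xffff_ffff)
}

/// Multiply two 64-bit numbers into a 128-bit result, returned as
/// `(high, low)`, using only 64-bit arithmetic on 32-bit halves.
fn widening_mul_via_halves(a: u64, b: u64) -> (u64, u64) {
    let (a_hi, a_lo) = split(a);
    let (b_hi, b_lo) = split(b);

    // Four partial products. Each is a product of two 32-bit values,
    // so it always fits in a u64.
    let ll = a_lo * b_lo;
    let lh = a_lo * b_hi;
    let hl = a_hi * b_lo;
    let hh = a_hi * b_hi;

    // Bits 32..64 of the result, plus whatever carries out of them.
    // At most 3 * (2^32 - 1), so this cannot overflow.
    let mid = (ll >> 32) + (lh & 0xffff_ffff) + (hl & 0xffff_ffff);

    // The shift deliberately drops the carry bits of `mid`; they are
    // added to the high word below instead.
    let lo = (ll & 0xffff_ffff) | (mid << 32);
    let hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    (hi, lo)
}
```

Four multiplications, several shifts and masks, and some careful carry handling, all to get what the hardware can do in one instruction.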

You don’t even have to understand all of the code to see how suboptimal this is. Inspecting the output of the compiler confirms it: the generated assembly does much more work than necessary, essentially just to work around limitations in the Rust language. So we wrote an inline assembly version.

The important thing about using inline assembly here is that x86_64 natively supports multiplying two 64-bit values into a 128-bit result. When Rust does a * b where a and b are both u64, the CPU actually multiplies them to create a 128-bit result and then Rust just throws away the upper 64 bits. We want the upper 64 bits in this case though, and the only way to access them efficiently was by using inline assembly.

As you can imagine, our assembly implementation was much faster:

    name            u64.bench ns/iter  inline_asm.bench ns/iter  diff ns/iter  diff %   speedup
    u256_full_mul   243,159            197,396                   -45,763       -18.82%  x 1.23
    u256_mul        268,750            95,843                    -172,907      -64.34%  x 2.80
    u256_mul_small  1,608              789                       -819          -50.93%  x 2.04

u256_full_mul tests the function above, u256_mul multiplies two 256-bit numbers to get a 256-bit result (in Rust, we just create a 512-bit result and then throw away the top half, but in assembly we have a separate implementation), and u256_mul_small multiplies two small 256-bit numbers. As you can see, the assembly implementation takes up to 64% less time (a 2.8x speedup). This is way, way better.

Unfortunately, it only works on nightly, and even then only on x86_64. The truth is that it took a lot of effort and a number of thrown-away implementations to even get the Rust code to “only” half the speed of the assembly, too. There was simply no good way to give the compiler the information necessary.

All that changed with Rust 1.26. Now we can do a as u128 * b as u128 and the compiler will use x86_64’s native u64-to-u128 multiplication (even though you cast both numbers to u128, it knows that they’re “really” just u64 and that you just want a u128 result). That means our code now looks like this:

    impl U256 {
        fn full_mul(self, other: Self) -> U512 {
            let U256(ref me) = self;
            let U256(ref you) = other;
            let mut ret = [0u64; U512_SIZE];

            for i in 0..U256_SIZE {
                let mut carry = 0u64;
                let b = you[i];

                for j in 0..U256_SIZE {
                    let a = me[j];

                    // This compiles down to just use x86's native 128-bit arithmetic
                    let (hi, low) = split_u128(a as u128 * b as u128);

                    let overflow = {
                        let existing_low = &mut ret[i + j];
                        let (low, o) = low.overflowing_add(*existing_low);
                        *existing_low = low;
                        o
                    };

                    carry = {
                        let existing_hi = &mut ret[i + j + 1];
                        let hi = hi + overflow as u64;
                        let (hi, o0) = hi.overflowing_add(carry);
                        let (hi, o1) = hi.overflowing_add(*existing_hi);
                        *existing_hi = hi;

                        (o0 | o1) as u64
                    }
                }
            }

            U512(ret)
        }
    }
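The listing above calls `split_u128` without showing it. A minimal sketch of what such a helper could look like, assuming it returns a `(high, low)` pair as the call site suggests, together with a small wrapper (`widening_mul`, a name made up for this example) around the cast-and-multiply step:

```rust
/// Split a 128-bit number into its `(high, low)` 64-bit halves.
/// Assumed shape, inferred from how it's called in the listing above.
fn split_u128(x: u128) -> (u64, u64) {
    ((x >> 64) as u64, x as u64)
}

/// Multiply two 64-bit numbers into a full 128-bit result.
/// Both operands are widened first, so the product cannot overflow.
fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    split_u128(a as u128 * b as u128)
}
```

For example, `widening_mul(u64::MAX, u64::MAX)` gives `(u64::MAX - 1, 1)`, since (2^64 - 1)^2 = 2^128 - 2^65 + 1.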

Although it’s almost certainly not as fast as using the LLVM-native i256 type, the speed is much, much better. Here it is compared to the original Rust implementation:

    name            u64.bench ns/iter  u128.bench ns/iter  diff ns/iter  diff %   speedup
    u256_full_mul   243,159            73,416              -169,743      -69.81%  x 3.31
    u256_mul        268,750            85,797              -182,953      -68.08%  x 3.13
    u256_mul_small  1,608              558                 -1,050        -65.30%  x 2.88

This is great: we now get a speed boost on stable. Since we only compile the binaries for the Parity client on stable, the only people who could use the assembly before were those who compiled from source, so this is an improvement for a lot of users. But wait, there’s more! The new compiled code actually manages to beat the assembly implementation by a significant margin, even beating the assembly on the benchmark that multiplies two 256-bit numbers to get a 256-bit result. This is despite the fact that the Rust code still produces a 512-bit result first and then discards the upper half, where the assembly implementation does not:

    name            inline_asm.bench ns/iter  u128.bench ns/iter  diff ns/iter  diff %   speedup
    u256_full_mul   197,396                   73,416              -123,980      -62.81%  x 2.69
    u256_mul        95,843                    85,797              -10,046       -10.48%  x 1.12
    u256_mul_small  789                       558                 -231          -29.28%  x 1.41
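As an aside on the u256_mul row: a 256-bit result only needs the partial products that land in the lowest four 64-bit words, so a dedicated implementation can skip the rest. Here's a minimal sketch of that idea using u128, operating on plain `[u64; 4]` arrays (least significant word first). The function `mul_low` is hypothetical and written for this illustration; it is neither our Rust code (which computes all 512 bits and truncates) nor a translation of the assembly.

```rust
/// Multiply two 256-bit numbers, keeping only the low 256 bits.
/// Inputs and output are little-endian arrays of 64-bit words.
fn mul_low(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    let mut ret = [0u64; 4];

    for i in 0..4 {
        let mut carry = 0u64;
        // Only visit word pairs whose product lands below bit 256.
        for j in 0..(4 - i) {
            // (2^64 - 1)^2 + 2 * (2^64 - 1) == 2^128 - 1, so this
            // sum always fits in a u128.
            let t = a[j] as u128 * b[i] as u128
                + ret[i + j] as u128
                + carry as u128;
            ret[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        // Any remaining carry belongs to bit 256 or above and is dropped.
    }

    ret
}
```

This does 10 word multiplications instead of 16, which makes it all the more surprising that the compute-everything-then-truncate Rust version comes out ahead in the table above.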

For the full multiplication that’s an absolutely massive improvement, especially since the original code used highly-optimised assembly incantations from our resident cycle wizard. Here’s where the faint of heart might want to step out for a moment, because I’m about to dive into the generated assembly.

Here’s the hand-written assembly. I’ve presented it without comment because I want to comment on the assembly that is actually emitted by the compiler (since, as you’ll see, the asm! macro hides more than you’d expect):

    impl U256 {
        /// Multiplies two 256-bit integers to produce full 512-bit integer
        /// No overflow possible
        pub fn full_mul(self, other: U256) -> U512 {
            let self_t: &[u64; 4] = &self.0;
            let other_t: &[u64; 4] = &other.0;
            let mut result: [u64; 8] = unsafe { ::core::mem::uninitialized() };
            unsafe {
                asm!("
                    mov $8, %rax
                    mulq $12
                    mov %rax, $0
                    mov %rdx, $1

                    mov $8, %rax
                    mulq $13
                    add %rax, $1
                    adc $$0, %rdx
                    mov %rdx, $2

                    mov $8, %rax
                    mulq $14
                    add %rax, $2
                    adc $$0, %rdx
                    mov %rdx, $3

                    mov $8, %rax
                    mulq $15
                    add %rax, $3
                    adc $$0, %rdx
                    mov %rdx, $4

                    mov $9, %rax
                    mulq $12
                    add %rax, $1
                    adc %rdx, $2
                    adc $$0, $3
                    adc $$0, $4
                    xor $5, $5
                    adc $$0, $5
                    xor $6, $6
                    adc $$0, $6
                    xor $7, $7
                    adc $$0, $7

                    mov $9, %rax
                    mulq $13
                    add %rax, $2
                    adc %rdx, $3
                    adc $$0, $4
                    adc $$0, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $9, %rax
                    mulq $14
                    add %rax, $3
                    adc %rdx, $4
                    adc $$0, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $9, %rax
                    mulq $15
                    add %rax, $4
                    adc %rdx, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $10, %rax
                    mulq $12
                    add %rax, $2
                    adc %rdx, $3
                    adc $$0, $4
                    adc $$0, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $10, %rax
                    mulq $13
                    add %rax, $3
                    adc %rdx, $4
                    adc $$0, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $10, %rax
                    mulq $14
                    add %rax, $4
                    adc %rdx, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $10, %rax
                    mulq $15
                    add %rax, $5
                    adc %rdx, $6
                    adc $$0, $7

                    mov $11, %rax
                    mulq $12
                    add %rax, $3
                    adc %rdx, $4
                    adc $$0, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $11, %rax
                    mulq $13
                    add %rax, $4
                    adc %rdx, $5
                    adc $$0, $6
                    adc $$0, $7

                    mov $11, %rax
                    mulq $14
                    add %rax, $5
                    adc %rdx, $6
                    adc $$0, $7

                    mov $11, %rax
                    mulq $15
                    add %rax, $6
                    adc %rdx, $7
                    "
                : /* $0 */ "={r8}"(result[0]), /* $1 */ "={r9}"(result[1]),
                  /* $2 */ "={r10}"(result[2]), /* $3 */ "={r11}"(result[3]),
                  /* $4 */ "={r12}"(result[4]), /* $5 */ "={r13}"(result[5]),
                  /* $6 */ "={r14}"(result[6]), /* $7 */ "={r15}"(result[7])

                : /* $8 */ "m"(self_t[0]), /* $9 */ "m"(self_t[1]),
                  /* $10 */ "m"(self_t[2]), /* $11 */ "m"(self_t[3]),
                  /* $12 */ "m"(other_t[0]), /* $13 */ "m"(other_t[1]),
                  /* $14 */ "m"(other_t[2]), /* $15 */ "m"(other_t[3])
                : "rax", "rdx"
                :
                );
            }
            U512(result)
        }
    }

And here’s what that generates. I’ve heavily commented it so you can understand what’s going on even if you’ve never touched assembly in your life, but you will need to know basic low-level details like the difference between memory and registers. If you want to get a primer on the structure of a CPU, the Wikipedia article on structure and implementation of CPUs is a good place to start:

    bigint::U256::full_mul:
        ;; Function prelude - this is generated by Rust
        pushq %r15
        pushq %r14
        pushq %r13
        pushq %r12
        subq $0x40, %rsp

        ;; Load the input arrays into registers...
        movq 0x68(%rsp), %rax
        movq 0x70(%rsp), %rcx
        movq 0x78(%rsp), %rdx
        movq 0x80(%rsp), %rsi
        movq 0x88(%rsp), %r8
        movq 0x90(%rsp), %r9
        movq 0x98(%rsp), %r10
        movq 0xa0(%rsp), %r11

        ;; ...and then immediately back into memory
        ;; This is done by the Rust compiler. There is a way to avoid
        ;; this happening but I'll get to that later
        ;; These four are the first input array
        movq %rax, 0x38(%rsp)
        movq %rcx, 0x30(%rsp)
        movq %rdx, 0x28(%rsp)
        movq %rsi, 0x20(%rsp)
        ;; These four are the output array, which is initialised to be
        ;; the same as the second input array.
        movq %r8, 0x18(%rsp)
        movq %r9, 0x10(%rsp)
        movq %r10, 0x8(%rsp)
        movq %r11, (%rsp)

        ;; This is the main loop, you'll see the same code repeated many
        ;; times since it's been unrolled so I won't go over it every time.
        ;; This takes the form of a loop that looks like:
        ;;
        ;; for i in 0..U256_SIZE {
        ;;     for j in 0..U256_SIZE {
        ;;         /* Loop body */
        ;;     }
        ;; }

        ;; Load the `0`th element of the input array into the "%rax"
        ;; register so we can operate on it. The first element is actually
        ;; already in `%rax` at this point but it gets loaded again anyway.
        ;; This is because the `asm!` macro is hiding a lot of details, which
        ;; I'll get to later.
        movq 0x38(%rsp), %rax
        ;; Multiply it with the `0`th element of the output array. This operates
        ;; on memory rather than a register, and so is significantly slower than
        ;; if the same operation had been done on a register. Again, I'll get to
        ;; that soon.
        mulq 0x18(%rsp)
        ;; `mulq` multiplies two 64-bit numbers and stores the low and high
        ;; 64 bits of the result in `%rax` and `%rdx`, respectively. We move
        ;; the low bits into `%r8` (the lowest 64 bits of the 512-bit result)
        ;; and the high bits into `%r9` (the second-lowest 64 bits of the
        ;; result).
        movq %rax, %r8
        movq %rdx, %r9

        ;; We do the same for `i = 0, j = 1`
        movq 0x38(%rsp), %rax
        mulq 0x10(%rsp)

        ;; Whereas above we moved the values into the output registers, this time
        ;; we have to add the results to the output.
        addq %rax, %r9
        ;; Here we add 0 because the CPU will use the "carry bit" (whether or not
        ;; the previous addition overflowed) as an additional input. This is
        ;; essentially the same as adding 1 to `rdx` if the previous addition
        ;; overflowed.
        adcq $0x0, %rdx
        ;; Then we move the upper 64 bits of the multiplication (plus the carry bit
        ;; from the addition) into the third-lowest 64 bits of the output.
        movq %rdx, %r10

        ;; Then we continue for `j = 2` and `j = 3`
        movq 0x38(%rsp), %rax
        mulq 0x8(%rsp)
        addq %rax, %r10
        adcq $0x0, %rdx
        movq %rdx, %r11
        movq 0x38(%rsp), %rax
        mulq (%rsp)
        addq %rax, %r11
        adcq $0x0, %rdx
        movq %rdx, %r12

        ;; Then we do the same for `i = 1`, `i = 2` and `i = 3`
        movq 0x30(%rsp), %rax
        mulq 0x18(%rsp)
        addq %rax, %r9
        adcq %rdx, %r10
        adcq $0x0, %r11
        adcq $0x0, %r12

        ;; This `xor` just ensures that `%r13` is zeroed. Again, this is
        ;; non-optimal (we don't need to zero these registers at all) but
        ;; I'll get to that.
        xorq %r13, %r13
        adcq $0x0, %r13
        xorq %r14, %r14
        adcq $0x0, %r14
        xorq %r15, %r15
        adcq $0x0, %r15
        movq 0x30(%rsp), %rax
        mulq 0x10(%rsp)
        addq %rax, %r10
        adcq %rdx, %r11
        adcq $0x0, %r12
        adcq $0x0, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x30(%rsp), %rax
        mulq 0x8(%rsp)
        addq %rax, %r11
        adcq %rdx, %r12
        adcq $0x0, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x30(%rsp), %rax
        mulq (%rsp)
        addq %rax, %r12
        adcq %rdx, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x28(%rsp), %rax
        mulq 0x18(%rsp)
        addq %rax, %r10
        adcq %rdx, %r11
        adcq $0x0, %r12
        adcq $0x0, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x28(%rsp), %rax
        mulq 0x10(%rsp)
        addq %rax, %r11
        adcq %rdx, %r12
        adcq $0x0, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x28(%rsp), %rax
        mulq 0x8(%rsp)
        addq %rax, %r12
        adcq %rdx, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x28(%rsp), %rax
        mulq (%rsp)
        addq %rax, %r13
        adcq %rdx, %r14
        adcq $0x0, %r15
        movq 0x20(%rsp), %rax
        mulq 0x18(%rsp)
        addq %rax, %r11
        adcq %rdx, %r12
        adcq $0x0, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x20(%rsp), %rax
        mulq 0x10(%rsp)
        addq %rax, %r12
        adcq %rdx, %r13
        adcq $0x0, %r14
        adcq $0x0, %r15
        movq 0x20(%rsp), %rax
        mulq 0x8(%rsp)
        addq %rax, %r13
        adcq %rdx, %r14
        adcq $0x0, %r15
        movq 0x20(%rsp), %rax
        mulq (%rsp)
        addq %rax, %r14
        adcq %rdx, %r15

        ;; Finally, we move everything out of registers so we can
        ;; return it on the stack
        movq %r8, (%rdi)
        movq %r9, 0x8(%rdi)
        movq %r10, 0x10(%rdi)
        movq %r11, 0x18(%rdi)
        movq %r12, 0x20(%rdi)
        movq %r13, 0x28(%rdi)
        movq %r14, 0x30(%rdi)
        movq %r15, 0x38(%rdi)
        movq %rdi, %rax
        addq $0x40, %rsp
        popq %r12
        popq %r13
        popq %r14
        popq %r15
        retq

As you can see from my comments, there are a lot of inefficiencies in this code. We multiply on variables from memory instead of from registers, and we do superfluous stores and loads. The CPU also has to do many stores and loads before even getting to the “real” code (the multiply-add loop). That matters because, although the CPU can do loads and stores in parallel with calculations, the way that this code is written requires it to wait for everything to be loaded before it starts doing calculations.

This is because the asm! macro hides a lot of details. Essentially you’re telling the compiler to put the input data wherever it likes, and then to substitute wherever it put the data into your assembly code with string manipulation. The compiler stores everything into registers, but then we instruct it to put the input arrays in memory (with the "m" before the input parameters), so it writes them back into memory again.

There are ways that you could write this code to remove the inefficiencies in it, but it is clearly very difficult for even a seasoned professional to write the correct code here. This code is bug-prone: if you hadn’t zeroed the output registers with the series of xor instructions then the code would fail sometimes but not always, with seemingly-random values that depended on the calling function’s internal state. It could probably be sped up by replacing "m" with "r" here (I haven’t tested that, because I only realised that this is a problem while investigating why the old assembly was so much slower in the course of writing this article), but that’s not clear from reading the source code of the program, and only someone with quite in-depth knowledge of LLVM’s assembly syntax would realise that when looking at the code.

By comparison, the Rust code that uses u128 is about as say-what-you-mean as you can get. Even if your goal was not optimisation you would probably write something similar to it as the simplest solution to the problem, but the code that LLVM produces is very high-quality. You can see already that it’s not too different to our hand-written code, but it addresses some of the issues (commented below) while also including a couple more optimisations that I wouldn’t have even thought of. I couldn’t find any significant optimisations that it missed.

Here’s the generated assembly:

    bigint::U256::full_mul:
        ;; Function prelude
        pushq %rbp
        movq %rsp, %rbp
        pushq %r15
        pushq %r14
        pushq %r13
        pushq %r12
        pushq %rbx
        subq $0x48, %rsp

        movq 0x10(%rbp), %r11
        movq 0x18(%rbp), %rsi
        movq %rsi, -0x38(%rbp)

        ;; I originally thought that this was a missed optimisation,
        ;; but it actually has to do this (instead of doing
        ;; `movq 0x30(%rbp), %rax`) because the `%rax` register gets
        ;; clobbered by the `mulq` below. This means it can multiply
        ;; the first element of the first array by each of the
        ;; elements of th without having to reload it from memory
        ;; like the hand-written assembly does.
        movq 0x30(%rbp), %rcx
        movq %rcx, %rax

        ;; LLVM multiplies from a register instead of from memory
        mulq %r11

        ;; LLVM moves `%rdx` (the upper bits) into a register, since
        ;; we need to operate on it further. It moves `%rax` (the
        ;; lower bits) directly into memory because we don't need
        ;; to do any further work on it. This is better than moving
        ;; in and out of memory like we do in the previous code.
        movq %rdx, %r9
        movq %rax, -0x70(%rbp)

        movq %rcx, %rax
        mulq %rsi
        movq %rax, %rbx
        movq %rdx, %r8

        movq 0x20(%rbp), %rsi
        movq %rcx, %rax
        mulq %rsi

        ;; LLVM uses `%r13` as an intermediate because it needs this
        ;; value in `%r13` later to operate on it anyway.
        movq %rsi, %r13
        movq %r13, -0x40(%rbp)

        ;; Again, we have to operate on both the low and high bits
        ;; so LLVM moves them both into registers.
        movq %rax, %r10
        movq %rdx, %r14
        movq 0x28(%rbp), %rdx
        movq %rdx, -0x48(%rbp)
        movq %rcx, %rax
        mulq %rdx
        movq %rax, %r12
        movq %rdx, -0x58(%rbp)
        movq 0x38(%rbp), %r15
        movq %r15, %rax
        mulq %r11
        addq %r9, %rbx
        adcq %r8, %r10

        ;; These two instructions store the flags into the `%rcx`
        ;; register.
        pushfq
        popq %rcx
        addq %rax, %rbx
        movq %rbx, -0x68(%rbp)
        adcq %rdx, %r10

        ;; This stores the flags from the previous calculation into
        ;; `%r8`.
        pushfq
        popq %r8

        ;; LLVM takes the flags back out of `%rcx` and then does an
        ;; add including the carry flag. This is smart. It means we
        ;; don't need to do the weird-looking addition of zero since
        ;; we combine the addition of the carry flag and the addition
        ;; of the number's components together into one instruction.
        ;;
        ;; It's possible that the way LLVM does it is faster on modern
        ;; processors, but storing this in `%rcx` is unnecessary,
        ;; because the flags would be at the top of the stack anyway
        ;; (i.e. you could remove the `popq %rcx` above and this
        ;; `pushq %rcx` and it would act the same). If it is slower
        ;; then the difference will be negligible.
        pushq %rcx
        popfq
        adcq %r14, %r12

        pushfq
        popq %rax
        movq %rax, -0x50(%rbp)
        movq %r15, %rax
        movq -0x38(%rbp), %rsi
        mulq %rsi
        movq %rdx, %rbx
        movq %rax, %r9
        addq %r10, %r9
        adcq $0x0, %rbx
        pushq %r8
        popfq
        adcq $0x0, %rbx

        ;; `setb` is used instead of explicitly zeroing registers and
        ;; then adding the carry bit. `setb` just sets the byte at the
        ;; given address to 1 if the carry flag is set (since this is
        ;; basically a `mov` it's faster than zeroing and then adding)
        setb -0x29(%rbp)
        addq %r12, %rbx
        setb %r10b
        movq %r15, %rax
        mulq %r13
        movq %rax, %r12
        movq %rdx, %r8
        movq 0x40(%rbp), %r14
        movq %r14, %rax
        mulq %r11
        movq %rdx, %r13
        movq %rax, %rcx
        movq %r14, %rax
        mulq %rsi
        movq %rdx, %rsi
        addq %r9, %rcx
        movq %rcx, -0x60(%rbp)

        ;; This is essentially a hack to add `%r12` and `%rbx` and store
        ;; the output in `%rcx`. It's one instruction instead of the two
        ;; that would be otherwise required. `leaq` is the take-address-of
        ;; instruction, so this line is essentially the same as if you did
        ;; `&((void*)first)[second]` instead of `first + second` in C. In
        ;; assembly, though, there are no hacks. Every dirty trick is fair
        ;; game.
        leaq (%r12,%rbx), %rcx

        ;; The rest of the code doesn't have any new tricks, just the same
        ;; ones repeated.
        adcq %rcx, %r13
        pushfq
        popq %rcx
        addq %rax, %r13
        adcq $0x0, %rsi
        pushq %rcx
        popfq
        adcq $0x0, %rsi
        setb -0x2a(%rbp)
        orb -0x29(%rbp), %r10b
        addq %r12, %rbx
        movzbl %r10b, %ebx
        adcq %r8, %rbx
        setb %al
        movq -0x50(%rbp), %rcx
        pushq %rcx
        popfq
        adcq -0x58(%rbp), %rbx
        setb %r8b
        orb %al, %r8b
        movq %r15, %rax
        mulq -0x48(%rbp)
        movq %rdx, %r12
        movq %rax, %rcx
        addq %rbx, %rcx
        movzbl %r8b, %eax
        adcq %rax, %r12
        addq %rsi, %rcx
        setb %r10b
        movq %r14, %rax
        mulq -0x40(%rbp)
        movq %rax, %r8
        movq %rdx, %rsi
        movq 0x48(%rbp), %r15
        movq %r15, %rax
        mulq %r11
        movq %rdx, %r9
        movq %rax, %r11
        movq %r15, %rax
        mulq -0x38(%rbp)
        movq %rdx, %rbx
        addq %r13, %r11
        leaq (%r8,%rcx), %rdx
        adcq %rdx, %r9
        pushfq
        popq %rdx
        addq %rax, %r9
        adcq $0x0, %rbx
        pushq %rdx
        popfq
        adcq $0x0, %rbx
        setb %r13b
        orb -0x2a(%rbp), %r10b
        addq %r8, %rcx
        movzbl %r10b, %ecx
        adcq %rsi, %rcx
        setb %al
        addq %r12, %rcx
        setb %r8b
        orb %al, %r8b
        movq %r14, %rax
        movq -0x48(%rbp), %r14
        mulq %r14
        movq %rdx, %r10
        movq %rax, %rsi
        addq %rcx, %rsi
        movzbl %r8b, %eax
        adcq %rax, %r10
        addq %rbx, %rsi
        setb %cl
        orb %r13b, %cl
        movq %r15, %rax
        mulq -0x40(%rbp)
        movq %rdx, %rbx
        movq %rax, %r8
        addq %rsi, %r8
        movzbl %cl, %eax
        adcq %rax, %rbx
        setb %al
        addq %r10, %rbx
        setb %cl
        orb %al, %cl
        movq %r15, %rax
        mulq %r14
        addq %rbx, %rax
        movzbl %cl, %ecx
        adcq %rcx, %rdx
        movq -0x70(%rbp), %rcx
        movq %rcx, (%rdi)
        movq -0x68(%rbp), %rcx
        movq %rcx, 0x8(%rdi)
        movq -0x60(%rbp), %rcx
        movq %rcx, 0x10(%rdi)
        movq %r11, 0x18(%rdi)
        movq %r9, 0x20(%rdi)
        movq %r8, 0x28(%rdi)
        movq %rax, 0x30(%rdi)
        movq %rdx, 0x38(%rdi)
        movq %rdi, %rax
        addq $0x48, %rsp
        popq %rbx
        popq %r12
        popq %r13
        popq %r14
        popq %r15
        popq %rbp
        retq

Although there are a few more instructions in the LLVM-generated version, the slowest type of instruction (loads and stores) is minimised, it (for the most part) avoids redundant work, and it applies many cheeky optimisations on top. The end result is that the code runs significantly faster.

This is not the first time that a carefully-written Rust implementation has outperformed our assembly code - some months ago I rewrote the Rust implementations of addition and subtraction, making them outperform the assembly implementation by 20% and 15%, respectively. Those didn’t require 128-bit arithmetic to beat the assembly (to get the full power of the hardware in Rust you only need u64::checked_add / checked_sub ), although who knows - maybe in a future PR we’ll use 128-bit arithmetic and see the speed improve further still.
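To give a flavour of what carry-propagating addition looks like without 128-bit arithmetic, here is a minimal sketch on plain `[u64; 4]` arrays (least significant word first). It is not the code from that PR: the function name `add_u256` is made up for this example, and it uses `u64::overflowing_add` (which returns both the wrapped sum and an overflow flag) rather than `checked_add`, because that maps most directly onto the `add`/`adc` pattern in the assembly listings above.

```rust
/// Add two 256-bit numbers stored as little-endian arrays of 64-bit
/// words. Returns the wrapped sum and whether the addition overflowed.
fn add_u256(a: [u64; 4], b: [u64; 4]) -> ([u64; 4], bool) {
    let mut out = [0u64; 4];
    let mut carry = false;

    for i in 0..4 {
        // Add the two words, then add the incoming carry. At most one
        // of the two additions can overflow for any given word.
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 | c2;
    }

    (out, carry)
}
```

For example, `add_u256([u64::MAX, 0, 0, 0], [1, 0, 0, 0])` returns `([0, 1, 0, 0], false)`: the low word wraps to zero and the carry moves into the next word.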

You can see the code from this PR here and the code from the addition/subtraction PR here. I should note that although the latter PR shows multiplication already outperforming the assembly implementation, this was actually due to a benchmark that mostly multiplied numbers with 0. Whoops. If there’s something we can learn from that, it’s that there can be no informed optimisation without representative benchmarks.

My point is not that we should take what we’ve learnt from the LLVM-generated code and write a new version of our hand-rolled assembly. The point is that optimising compilers are really good. There are very smart people working on them and computers are really good at this kind of optimisation problem (in the mathematical sense) in a way that humans find quite difficult. It’s the job of language designers to give us the tools we need to inform the optimiser as best we can as to what our true intent is, and larger integer sizes are another step towards that. Rust has done a great job of allowing programmers to write programs that are easily understandable by humans and compilers alike, and it’s just that power that has largely driven its success.