256 Bit Arithmetic

Large integer types are very useful. They are used in cryptography, where the larger the type, the more possible keys, and thus exponentially more security. Another use is in fixed point arithmetic. Floating point numbers have intrinsic inaccuracy; large bit-depth fixed point arithmetic allows expressions to be evaluated without the risk of truncation or rounding errors. Large integers can also be the primitive base used in arbitrary precision arithmetic, where the larger the integer type, the lower the overhead.

gcc has built-in 128 bit integer types on 64 bit machines. These can be accessed via __uint128_t for unsigned and __int128_t for signed 128 bit types. These are useful because their use exposes the 64×64 bit ⇒ 128 bit multiplication assembly instruction, which would otherwise have to be described as a series of 32×32 bit ⇒ 64 bit multiplications. Unfortunately, the gcc optimizer does not handle these types very efficiently. For example, the two simple routines

typedef __uint128_t u128b;

u128b add128b(u128b x, u128b y)
{
    return x + y;
}

u128b mul128b(u128b x, u128b y)
{
    return x * y;
}

compile with maximal optimization under gcc version 4.3 into

add128b:
    mov    %rbx,-0x10(%rsp)
    mov    %rdi,%rax
    mov    %rcx,%rbx
    mov    %rdx,%rcx
    mov    %rsi,%rdx
    mov    %rbp,-0x8(%rsp)
    add    %rcx,%rax
    mov    -0x8(%rsp),%rbp
    adc    %rbx,%rdx
    mov    -0x10(%rsp),%rbx
    retq

mul128b:
    imul   %rdi,%rcx
    mov    %rdx,%rax
    imul   %rdx,%rsi
    mul    %rdi
    add    %rsi,%rcx
    lea    (%rcx,%rdx,1),%rdi
    mov    %rdi,%rdx
    retq

This is absolutely horrible code. The problem here is that the rdx register is defined by the ABI as an input register for part of the y variable. It is also defined to be the high part of the output. This confuses the gcc register allocator, and a large number of extra register-register copies are generated. In addition, this causes stack spills, which make the code even worse. Moving to gcc version 4.4 helps a little:

add128b:
    mov    %rdx,%rax
    mov    %rcx,%rdx
    push   %rbx
    add    %rdi,%rax
    adc    %rsi,%rdx
    pop    %rbx
    retq

mul128b:
    imul   %rdx,%rsi
    mov    %rdx,%rax
    imul   %rdi,%rcx
    mul    %rdi
    add    %rsi,%rcx
    lea    (%rcx,%rdx,1),%rdx
    retq

This code is noticeably better. However, it still contains obvious extra instructions. To get an optimal implementation, we will need to move to assembly language. Normally, this would best be done via inline gcc asm. Unfortunately, in this case that doesn't work as well as we'd like. The problem is that gcc defines the "A" register constraint as the rax and rdx pair. We would like to use this combination as an output. However, we need rdx as an input to match the ABI, and the overlapping register constraint is an error. A possible workaround is to use the "A" constraint as an input/output register. However, this causes the compiler to want to add an extra initialization of rax. Another workaround is to specify rax and rdx separately, and construct the returned 128 bit type from the results. This doesn't work either, with the compiler again creating the extra register-register moves we are trying to avoid. Thus moving to pure assembly is the only option. The result is

add128:
    mov    %rdx,%rax
    mov    %rcx,%rdx
    add    %rdi,%rax
    adc    %rsi,%rdx
    retq

mul128:
    imul   %rdi, %rcx
    mov    %rdx, %rax
    imul   %rdx, %rsi
    add    %rsi, %rcx
    mul    %rdi
    add    %rcx, %rdx
    retq

Implementing other operations remains an exercise.
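The routines above follow the normal calling convention, with x arriving in rdi (low half) and rsi (high half), y in rdx and rcx, and the result returned in rax and rdx, so they can be declared and sanity-checked from ordinary C. The following test harness is a hedged sketch of my own (not from the article), assuming the assembly above is assembled into the same program:

#include <stdio.h>

typedef __uint128_t u128b;

/* The hand-written assembly routines */
u128b add128(u128b x, u128b y);
u128b mul128(u128b x, u128b y);

int main(void)
{
    u128b x = ((u128b) 0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
    u128b y = ~(u128b) 0;    /* 2^128 - 1, exercises the carries */

    /* Compare against the compiler's built-in 128 bit operations */
    if ((add128(x, y) != x + y) || (mul128(x, y) != x * y))
    {
        printf("Mismatch!\n");
        return 1;
    }

    printf("add128 and mul128 agree with the built-ins\n");
    return 0;
}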
So what about the next larger power of two, 256 bit integers? Here we must leave the built-in types behind and develop everything from scratch.

For simplicity we will define the structure

typedef unsigned long long u64b;
typedef __uint128_t u128b;

typedef struct u256b u256b;
struct u256b
{
    u64b lo;
    u64b mid;
    u128b hi;
};

as the data layout for our 256 bit unsigned integer. This matches the endianness of the other integers, and also allows easy access to the component 64 bit and 128 bit parts we will use in the following. Implementing the addition and multiplication operations is relatively easy, using the built-in 128 bit type to handle the carries:

u256b add256b(u256b *x, u256b *y)
{
    u128b lo = (u128b) x->lo + y->lo;
    u128b mid = (u128b) x->mid + y->mid + (lo >> 64);

    u256b result =
    {
        .lo = lo,
        .mid = mid,
        .hi = x->hi + y->hi + (mid >> 64),
    };

    return result;
}

/* Dumb n^2 way */
u256b mul256b(u256b *x, u256b *y)
{
    u128b t1 = (u128b) x->lo * y->lo;
    u128b t2 = (u128b) x->lo * y->mid;
    u128b t3 = x->lo * y->hi;
    u128b t4 = (u128b) x->mid * y->lo;
    u128b t5 = (u128b) x->mid * y->mid;
    u64b t6 = x->mid * y->hi;
    u128b t7 = x->hi * y->lo;
    u64b t8 = x->hi * y->mid;

    u64b lo = t1;
    u128b m1 = (t1 >> 64) + (u64b) t2;
    u64b m2 = m1;
    u128b mid = (u128b) m2 + (u64b) t4;

    u128b hi = (t2 >> 64) + t3 + (t4 >> 64) + t5 + ((u128b) t6 << 64) + t7
             + ((u128b) t8 << 64) + (m1 >> 64) + (mid >> 64);

    u256b result = {lo, mid, hi};

    return result;
}

Here we have used, as the comment describes, a dumb manual n^2 multiplication algorithm. Expanding the 128 bit multiplications, this contains a total of six double-precision and four single-precision multiplications. Can we do better?

There are several multiplication algorithms which are asymptotically better than brute-force n^2 multiplication. The first of these is Karatsuba multiplication. This replaces four multiplications by three via the trick

(x1 + t x2) × (y1 + t y2) = A + t ((x1 + x2)(y1 + y2) - A - B) + B t^2,

where A = x1 y1 and B = x2 y2, and x and y are the two numbers to be multiplied, expanded in terms of a base t. This can be done recursively, halving the base at each step, yielding a total of nine multiplications for our problem. This is one better than the total of ten given above. However, these multiplications all need to be double precision. This slows the algorithm down, and the one fewer multiplication doesn't help.

Another method of multiplying high precision numbers is Toom-Cook multiplication. This works by noticing that we are multiplying polynomials in t, the base. A polynomial can be described by its value at n+1 points, where n is the maximal order, i.e. if you know the value of a quadratic at three points, then you can derive its functional form. This means we can use point-wise multiplication to evaluate the polynomial multiplication. For a multiplication of n×n components (here we have four 64 bit components to make up the 256 bit total) we require 2n-1 points, and thus 2n-1 multiplications. For our problem, this yields 7 multiplications. Unfortunately, this isn't the whole story. To go from the point values back to the polynomial expansion form, which is the result we want, we need to do a few more mathematical operations. Unfortunately, for the case given here, we need to divide by 6 or 3. This is most efficiently done as yet another multiplication, giving a gross total of 8 double-precision multiplications. Again, this is slower than the dumb brute-force method.
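To make the Karatsuba trick itself concrete, here is a single-level sketch at a smaller size, multiplying two 64 bit numbers via their 32 bit halves so that three multiplications replace the four 32×32 bit products of the schoolbook method. This is illustrative code of my own, not from the article, and the u32b typedef is an extra assumption; with 32 bit halves the cross sums fit comfortably, so the 65 bit overflow problem discussed below does not arise.

typedef unsigned int u32b;

/* One level of Karatsuba with base t = 2^32: the three products are
   x1*y1, x2*y2 and (x1+x2)*(y1+y2), and the middle coefficient is
   recovered by subtraction. */
u128b karatsuba64(u64b x, u64b y)
{
    u64b x1 = (u32b) x, x2 = x >> 32;
    u64b y1 = (u32b) y, y2 = y >> 32;

    u64b A = x1 * y1;
    u64b B = x2 * y2;

    /* The cross sums are 33 bits each, so their product needs up to 66 bits */
    u128b mid = (u128b) (x1 + x2) * (y1 + y2) - A - B;

    /* Recombine: B t^2 + mid t + A */
    return ((u128b) B << 64) + (mid << 32) + A;
}

The result should always equal (u128b) x * y, which makes this easy to test against the compiler. The interesting part is what happens when the same idea is scaled up to 256 bit operands, where the cross sums no longer fit in a machine word.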
Why are the advanced methods slower? The reason is that we are doing a 256 bit × 256 bit ⇒ 256 bit multiplication. The full result is 512 bits in length, and the more advanced methods calculate this full result. If the brute force method were used to calculate all 512 bits, it would require 16 double-precision multiplications. Only requiring the half-sized result saves us quite a bit. Can the advanced methods be modified to give only the lower bits we require?

Unfortunately, there is no simple modification of Toom-Cook to do what we want. In the best case, we might save a single multiplication, but this isn't enough to help. The Karatsuba algorithm is another story. Here, it is indeed possible to modify it to lower the number of required multiplications significantly. Expanding the terms as x = x1 + t x2 + t^2 x3 + t^3 x4, y = y1 + t y2 + t^2 y3 + t^3 y4, and z = x × y = z1 + t z2 + t^2 z3 + t^3 z4 (keeping only the terms below t^4 that we need), we can rewrite the multiplication as

z1 = x1 y1
z2 = (x1 + x2)(y1 + y2) - x1 y1 - x2 y2
z3 = (x1 + x3)(y1 + y3) - x1 y1 - x3 y3 + x2 y2
z4 = (x2 + x3)(y2 + y3) - x2 y2 - x3 y3 + x1 y4 + x4 y1

If the values calculated multiple times are reused, we require a total of 8 multiplies, with five of them double precision and three single precision. This is lower than the brute force method, so it may just be faster.

Unfortunately, we cannot use the algorithm shown above as is. The problem is overflow: the additions force us to multiply 65 bit numbers together. This can be fixed by using subtractions instead. Since one of a pair of numbers must be larger than or equal to the other, we can order each subtraction so that the result always fits in 64 bits. For one set of orderings we might have:

z1 = x1 y1
z2 = (x1 - x2)(y2 - y1) + x1 y1 + x2 y2
z3 = (x1 - x3)(y3 - y1) + x1 y1 + x2 y2 + x3 y3
z4 = (x2 - x3)(y3 - y2) + x2 y2 + x3 y3 + x1 y4 + x4 y1

To do the 65 bit multiply, we need to check the orderings of the terms:

u128b mul65(u64b x1, u64b x2, u64b y1, u64b y2, u64b *carry)
{
    *carry = 0;

    if (x1 > x2)
    {
        u64b xx12 = x1 - x2;

        if (y1 > y2)
        {
            u64b yy12 = y1 - y2;
            u128b xy12 = (u128b) xx12 * yy12;

            if (xy12) *carry -= 1;
            return -xy12;
        }
        else
        {
            u64b yy12 = y2 - y1;
            u128b xy12 = (u128b) xx12 * yy12;

            return xy12;
        }
    }
    else
    {
        u64b xx12 = x2 - x1;

        if (y1 > y2)
        {
            u64b yy12 = y1 - y2;
            u128b xy12 = (u128b) xx12 * yy12;

            return xy12;
        }
        else
        {
            u64b yy12 = y2 - y1;
            u128b xy12 = (u128b) xx12 * yy12;

            if (xy12) *carry -= 1;
            return -xy12;
        }
    }
}

Note how we also need to return the "carry", which is actually a borrow. This is required when the result is negative: we need to return a 192 bit two's complement integer to get the result correct. The above isn't quite as efficient as it might be, though. A little optimization yields the simpler:

u128b mul65b(u64b x1, u64b x2, u64b y1, u64b y2, u64b *carry)
{
    u64b c = 0;
    u64b t1 = x1 - x2;
    u64b t2 = y2 - y1;

    u128b result = (u128b) t1 * t2;

    if ((x2 > x1) && t2)
    {
        result -= (u128b) t2 << 64;
        c = ~c;
    }

    if ((y1 > y2) && t1)
    {
        result -= (u128b) t1 << 64;
        c = ~c;
    }

    *carry = c;
    return result;
}

Unfortunately, gcc isn't all that good at compiling the above. To get a better result, assembly is required again:

mul65c:
    mov    %rdi, %rax
    sub    %rsi, %rax
    sbb    %rsi, %rsi
    sub    %rdx, %rcx
    sbb    %rdi, %rdi
    mov    %rsi, %r9
    and    %rcx, %rsi
    xor    %rdi, %r9
    and    %rax, %rdi
    mul    %rcx

    # Correct for overflow
    sub    %rsi, %rdx
    sub    %rdi, %rdx

    # Fix up carry
    mov    %rax, %rdi
    or     %rdx, %rdi
    cmove  %rdi, %r9
    mov    %r9, (%r8)
    retq

This uses the relation between signed and unsigned multiplication, together with the "sbb" trick, to make the function branchless.
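The branchless logic is easy to get wrong, so a quick sanity check is worthwhile. The following is a hedged test sketch of my own (not from the article), comparing the optimized mul65b against the straightforward branchy mul65 over pseudo-random inputs. It assumes both C functions are compiled into the same file:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    srand(12345);

    for (int i = 0; i < 1000000; i++)
    {
        /* Pseudo-random 64 bit test values */
        u64b x1 = ((u64b) rand() << 32) ^ rand();
        u64b x2 = ((u64b) rand() << 32) ^ rand();
        u64b y1 = ((u64b) rand() << 32) ^ rand();
        u64b y2 = ((u64b) rand() << 32) ^ rand();

        u64b c1, c2;
        u128b r1 = mul65(x1, x2, y1, y2, &c1);
        u128b r2 = mul65b(x1, x2, y1, y2, &c2);

        /* Both the 128 bit result and the borrow must match */
        if ((r1 != r2) || (c1 != c2))
        {
            printf("Mismatch at iteration %d\n", i);
            return 1;
        }
    }

    printf("mul65 and mul65b agree\n");
    return 0;
}

Random values rarely hit the x1 == x2 or y1 == y2 corner cases, so a few directed tests of those are worth adding as well.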
The mul65c routine can be further simplified in the case where the carry (borrow) isn't required:

mul65d:
    mov    %rdi, %rax
    sub    %rsi, %rax
    sbb    %rsi, %rsi
    sub    %rdx, %rcx
    sbb    %rdi, %rdi
    and    %rcx, %rsi
    and    %rax, %rdi
    mul    %rcx

    # Correct for overflow
    sub    %rsi, %rdx
    sub    %rdi, %rdx
    retq

Some C code which uses these routines to calculate the 256 bit multiplication via the pseudo-Karatsuba algorithm is:

u256b mul256c(u256b *x, u256b *y)
{
    u256b result;

    u64b x1 = x->lo;
    u64b x2 = x->mid;
    u64b x3 = x->hi;
    u64b x4 = x->hi >> 64;

    u64b y1 = y->lo;
    u64b y2 = y->mid;
    u64b y3 = y->hi;
    u64b y4 = y->hi >> 64;

    u128b xy11 = (u128b) x1 * y1;
    u128b xy22 = (u128b) x2 * y2;
    u128b xy33 = (u128b) x3 * y3;
    u64b xy14 = x1 * y4 + x4 * y1;
    u64b xy23 = (x3 - x2) * (y2 - y3);

    u64b carry;

    u128b t1 = xy11;
    u128b t2 = mul65c(x1, x2, y1, y2, &carry);
    u128b t3 = mul65d(x1, x3, y1, y3) + xy11 + xy22 + xy33 + (t2 >> 64);
    u64b t4 = xy14 + xy23 + xy22 + xy33 + carry;

    t2 = (u64b) t2;
    t2 += t1 >> 64;
    t3 += t2 >> 64;

    t2 = (u64b) t2;
    t2 += xy11;
    t3 += t2 >> 64;

    t2 = (u64b) t2;
    t2 += xy22;
    t3 += t2 >> 64;

    result.lo = t1;
    result.mid = t2;
    result.hi = t3 + ((u128b) t4 << 64);

    return result;
}

The normal method takes 10.9 seconds to evaluate 2^28 multiplications, and the pseudo-Karatsuba algorithm takes 8.8 seconds (using gcc 4.4, which generates the better code). This is approximately the 8/10 speedup predicted. Now how much faster can we make them go if we use full assembly versions? The pseudo-Karatsuba algorithm is rather complex, so it is rather long in machine code:

mul256pseudok:
    mov    0x10(%rdx), %rax
    mov    %rbp, -0x8(%rsp)
    mov    (%rsi), %r11
    mov    (%rdx), %r8
    mov    %rbx, -0x10(%rsp)
    mov    0x18(%rdx), %rcx
    mov    %r12, -0x18(%rsp)
    mov    0x10(%rsi), %rbx
    imul   %r11, %rcx        # xy13
    mov    %r11, %rbp
    mov    %rax, %r10
    sub    %r8, %rax
    sbb    %r12, %r12
    sub    %rbx, %r11
    sbb    %r9, %r9
    and    %r11, %r12
    and    %rax, %r9
    sub    %r12, %rcx
    sub    %r9, %rcx
    mov    0x8(%rdx), %r9
    mul    %r11
    mov    %rax, %r12
    add    %rdx, %rcx        # xy33
    mov    %rbx, %rax
    mul    %r10
    add    %rax, %r12
    adc    %rax, %rcx
    add    %rdx, %rcx        # xy11 + xy14
    mov    %rbp, %rax
    mul    %r8
    mov    %rax, (%rdi)
    mov    %rax, %r11
    add    %rax, %r12
    adc    %rdx, %rcx
    mov    %r8, %rax
    imul   0x18(%rsi), %rax
    add    %rdx, %r11
    adc    %rdx, %r12
    adc    %rax, %rcx        # xy23
    mov    %rbp, %rdx
    mov    0x8(%rsi), %rbp
    mov    %r9, %rax
    sub    %rbp, %rbx
    sub    %r10, %rax
    mov    %rdx, %rsi
    imul   %rax, %rbx
    add    %rbx, %rcx        # xy22
    mov    %rbp, %rax
    mul    %r9
    add    %rax, %r11
    adc    %rax, %r12
    adc    %rax, %rcx
    add    %rdx, %r12
    adc    %rdx, %rcx        # xy12
    mov    %r8, %rax
    sub    %r9, %rax
    sbb    %r8, %r8
    sub    %rsi, %rbp
    sbb    %rsi, %rsi
    mov    %r8, %rbx
    and    %rbp, %r8
    xor    %rsi, %rbx
    and    %rax, %rsi
    mul    %rbp
    mov    -0x8(%rsp), %rbp
    mov    %rax, %r9
    sub    %r8, %rdx
    sub    %rsi, %rdx
    or     %rdx, %r9
    cmove  %r9, %rbx
    add    %rax, %r11
    mov    %r11, 0x8(%rdi)
    adc    %rdx, %r12
    mov    %r12, 0x10(%rdi)
    mov    -0x18(%rsp), %r12
    adc    %rbx, %rcx
    mov    %rcx, 0x18(%rdi)
    mov    -0x10(%rsp), %rbx
    retq

The brute force method simplifies much more:

mul256brute:
    mov    (%rsi), %rax
    mov    (%rdx), %r8
    mov    0x8(%rdx), %r9
    mov    %rbp, -0x8(%rsp)
    mov    %rax, %rbp
    mov    0x18(%rdx), %rcx
    imul   %rbp, %rcx
    mov    0x10(%rdx), %r10
    mul    %r8
    mov    %rax, (%rdi)
    mov    %rdx, %r11
    mov    %rbp, %rax
    mul    %r10
    mov    %r12, -0x10(%rsp)
    mov    %rax, %r12
    add    %rdx, %rcx
    mov    %rbp, %rax
    mul    %r9
    add    %rax, %r11
    mov    0x8(%rsi), %rbp
    adc    %rdx, %r12
    adc    $0, %rcx
    imul   %rbp, %r10
    mov    %rbp, %rax
    mul    %r8
    add    %rax, %r11
    mov    %rbp, %rax
    adc    %rdx, %r12
    adc    %r10, %rcx
    mov    0x10(%rsi), %rbp
    mul    %r9
    mov    %r11, 0x8(%rdi)
    imul   %rbp, %r9
    add    %rax, %r12
    mov    %rbp, %rax
    adc    %rdx, %rcx
    mov    0x18(%rsi), %rbp
    imul   %r8, %rbp
    add    %rbp, %rcx
    mov    -0x8(%rsp), %rbp
    mul    %r8
    add    %rax, %r12
    mov    %r12, 0x10(%rdi)
    mov    -0x10(%rsp), %r12
    adc    %r9, %rcx
    add    %rdx, %rcx
    mov    %rcx, 0x18(%rdi)
    retq
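As an aside, the timings quoted in this article come from a simple loop over dependent multiplications. The article does not show its benchmark harness, so the following is a hedged sketch of my own; feeding the result back into the next iteration keeps the multiplies dependent, and printing part of the result stops the compiler from optimizing the loop away:

#include <stdio.h>
#include <time.h>

int main(void)
{
    u256b x = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL, ~(u128b) 0 };
    u256b y = x;

    clock_t start = clock();

    /* 2^28 dependent 256 bit multiplications */
    for (int i = 0; i < (1 << 28); i++)
    {
        x = mul256b(&x, &y);
    }

    double secs = (double) (clock() - start) / CLOCKS_PER_SEC;

    printf("%.2f seconds (%llx)\n", secs, x.lo);
    return 0;
}

Swapping mul256b for mul256c, or for the assembly versions, gives the corresponding numbers for each implementation.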
Running the multipliers for 2^28 iterations again gives a time of 5.5 seconds for the assembly optimized pseudo-Karatsuba algorithm. However, doing the same for the brute force method yields a time of 4.0 seconds. Thus the dumb brute force method is faster overall. Why is this? The reason is that the more complex algorithm requires many more additions and subtractions. The time taken for these adds up, and overwhelms the small advantage of needing fewer multiplies.

The end result is code which is more than twice as fast as the original C code. The reason we can speed it up so much is that even in version 4.4, gcc does not generate very good code for 128 bit types. Hopefully version 4.5 will be better.

Finally, the optimal 256 bit adder is:

add256:
    mov    (%rdx), %r8
    mov    0x8(%rdx), %r9
    mov    0x10(%rdx), %rcx
    mov    0x18(%rdx), %rax
    add    (%rsi), %r8
    adc    0x8(%rsi), %r9
    adc    0x10(%rsi), %rcx
    adc    0x18(%rsi), %rax
    mov    %r8, (%rdi)
    mov    %r9, 0x8(%rdi)
    mov    %rcx, 0x10(%rdi)
    mov    %rax, 0x18(%rdi)
    retq
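For completeness, here is a hedged sketch of my own showing how the assembly routines can be used from C, assuming they are assembled into the same program and keep the same u256b signature as the C versions. Under the x86-64 ABI a 32 byte struct is returned through a hidden pointer passed in rdi, which is why the listings above write their results to (%rdi) while reading the operands through (%rsi) and (%rdx):

#include <stdio.h>

/* The hand-written assembly routines */
u256b add256(u256b *x, u256b *y);
u256b mul256brute(u256b *x, u256b *y);

int main(void)
{
    u256b a = { 1, 0, 0 };
    u256b b = { ~0ULL, ~0ULL, ~(u128b) 0 };    /* 2^256 - 1 */

    u256b sum = add256(&a, &b);                /* wraps around to zero */
    u256b prod = mul256brute(&a, &b);          /* equals b, since a is one */

    printf("%llx %llx\n", sum.lo, prod.lo);
    return 0;
}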
