Optimizing std::hypot for simd arguments

Do you know about std::hypot(a, b) and std::hypot(a, b, c) ? (The 3-argument overload exists since C++17, sometimes referred to as hypot3.) Why do the C and C++ standard libraries provide a function that is as simple as sqrt(a*a + b*b) ? It doesn’t save enough characters to warrant its existence, right? Have you ever considered what happens if the input values are “very” small or large? Or if an input is an IEEE754 special value such as infinity or NaN? Have you ever considered how precise the calculation is, especially if an exact answer is obvious if one of the inputs is 0?

Pythagoras — it’s complicated

The standard function is specified to avoid overflow and underflow in intermediate calculations. Consider the following example:

float a = 0x1 p70f ; // = 2⁷⁰ (~1.2e21) float b = 0 ; float r = std :: sqrt ( a * a + b * b ); // = inf

a * a thus is 0x1p140f (~1.4e42) which overflows the 8-bit exponent (subnormal, -126 – 127, inf/nan) for single precision. However, mathematically it’s obvious that r = a is the right answer, not r = inf . The naïve sqrt(a*a + b*b) thus doesn’t work for inputs greater than 0x1.fffffep63f (~1.8e19) (if the other input is 0, otherwise the cut-off is even lower) as well as for inputs smaller than (let’s ignore subnormals for now) 0x1p-63f (~1e-19).

Regarding precision, sqrt(a*a + b*b) isn’t really bad since std::sqrt is required by IEEE754 to have .5 ULP precision and can thus potentially reduce the error of a*a + b*b (where each operation has .5 ULP precision). I’d expect a total error within 1 ULP (if anyone can give me a good pointer to a formalism of fp error analysis, please do). The precision issue becomes relevant when you want to support the full range of possible input values without over-/underflow.

How to avoid over-/underflow

The idea to support the full range of input values is a simple transformation where . Consequently and thus won’t overflow. If underflows, we didn’t have to care about its value anyway because the addition with 1 would have discarded all bits anyway.

1. Optimization idea

Instead of factoring out , we can use a value that is of the same magnitude but might reduce the errors introduced through intermediate calculations. The most obvious candidate is to round down to the next power-of-2 value, which I’ll call . This is a trivial bitwise and operation, setting all mantissa bits to 0. The resulting floating point value will only scale the exponent bits when used in a multiplication or division. Our hypot implementation thus calculates . The final multiplication ( ) and both fractions ( and ) are without loss of precision. Thus, the resulting error is equivalent to the error of .

2. Division is slow

Division and square root instructions run on the divider port (pipeline) of the CPU and will therefore dominate the execution time of the hypot implementation. We cannot avoid the square root, but we can avoid the division. The first idea is to turn the two divisions into one division and two multiplications: , with .

The division is a trivial operation, though, because it just needs to flip the sign of the exponent. After all, was constructed to be a power-of-2 value: . If you recall how the exponent is stored in an IEEE754 floating point value, you might realize that flipping all exponent bits almost produces the correct result. It’s just off by one. And if the exponent is off by one, the resulting floating point value is off by a factor of 2.

^ (or in code: b_hat = ((b & infinity) ^ infinity) * .5 ; the xor operator is not defined for float , but you get the idea). Note that using an addition instead of multiply by two is faster for two reasons (An optimizing compiler will turn 2*x into x+x by itself, so feel free to forget about this optimization.):

it doesn’t require loading a constant value ( 0x4000'0000 in this case); addition instructions typically have a lower latency than multiplication instructions.

is another trivial bit operation if we look closely. was constructed to store the exponent of . Thus is with its exponent set to . Using a bitwise-and and bitwise-or operation, we can easily overwrite the exponent of to avoid the multiplication (b & (min - denorm_min)) | 1 . The CPU can therefore schedule earlier and thus also execute the following FMA and square root earlier, reducing the total latency of the hypot function.

Subnormal, NaN, Zero, and Infinity

There are corner cases. A proper implementation must care for the Annex F requirements (C Standard, but imported into C++). As is often the case with floating point, the .01% corner cases lead to significant effort in the implementation, and if done carelessly to unfortunate slowdown. I’ll let you figure it out from the implementation below.

I was talking about std::experimental::simd<float>

You probably didn’t notice other than by the title, but everything I said I didn’t write for plain float but for std::experimental::simd<T, Abi> (last draft). Here’s my implementation, targeting libstdc++:

template < typename T , typename Abi > simd < T , Abi > hypot ( const simd < T , Abi >& x , const simd < T , Abi >& y ) { if constexpr ( simd < T , Abi >:: size () == 1 ) { return std :: hypot ( T ( x [ 0 ]), T ( y [ 0 ])); } else if constexpr ( is_fixed_size_abi_v < Abi > ) { return fixed_size_apply < simd < T , Abi >> ( []( auto a , auto b ) { return hypot ( a , b ); }, x , y ); } else { using namespace __proposed :: float_bitwise_operators ; using Limits = std :: numeric_limits < T > ; using V = simd < T , Abi > ; V absx = abs ( x ); V absy = abs ( y ); V hi = max ( absx , absy ); V lo = min ( absy , absx ); constexpr V inf ( Limits :: infinity ()); // if hi is subnormal, avoid scaling by inf & final mul by 0 (which // yields NaN) by using min() V scale = 1 / Limits :: min (); // invert exponent w/o error and w/o using the slow divider unit: where ( hi > Limits :: min (), scale ) = (( hi & inf ) ^ inf ) * T ( .5 ); // adjust final exponent for subnormal inputs V hi_exp = Limits :: min (); where ( hi > Limits :: min (), hi_exp ) = hi & inf ; constexpr V mant_mask = Limits :: min () - Limits :: denorm_min (); V h1 = ( hi & mant_mask ) | V ( 1 ); where ( hi < Limits :: min (), h1 ) -= V ( 1 ); V l1 = lo * scale ; V r = hi_exp * sqrt ( h1 * h1 + l1 * l1 ); #ifdef STDC_IEC_559__ // fixup for Annex F requirements V fixup = hi ; where ( isunordered ( x , y ), fixup ) = Limits :: quiet_NaN (); where ( isinf ( absx ) || isinf ( absy ), fixup ) = inf ; where ( ! ( lo == 0 || isunordered ( x , y ) || isinf ( absx ) || isinf ( absy )), fixup ) = r ; r = fixup ; #endif return r ; } }

Does it fly?

First of all, I made sure it is conforming and within 1 ULP of a precise implementation.

I tried to measure both latency and throughput. Measurements details:

Intel Xeon W-2123 @ 3.60GHz

GCC 9

-O2 -march=native (i.e. using AVX512VL instructions for xmm and ymm vectors)

(i.e. using AVX512VL instructions for and vectors) Intel pstate set to no_turbo and scaling_governor set to performance

and set to Ubuntu 18.10

Throughput

T hypot(T, T) TSC cycles/call Speedup per value relative to T float 17.1208 simd<float, scalar> 17.218 0.994355 simd<float, __sse> 15.5264 4.41074 simd<float, __avx> 15.9294 8.59831 simd<float, __avx512> 18.5829 14.7411 double 30.0783 simd<double, scalar> 30.2174 0.995398 simd<double, __sse> 15.5236 3.87517 simd<double, __avx> 16.9682 7.09053 simd<double, __avx512> 26.4416 9.1003

Latency

T hypot(T, T) TSC cycles/call Speedup per value relative to T float 37.6397 simd<float, scalar> 38.4039 0.980102 simd<float, __sse> 41.2072 3.6537 simd<float, __avx> 41.4811 7.25916 simd<float, __avx512> 53.6256 11.2304 double 51.806 simd<double, scalar> 51.8265 0.999604 simd<double, __sse> 42.0938 2.46145 simd<double, __avx> 42.6324 4.86072 simd<double, __avx512> 57.4461 7.21455

I can also disable the Annex F requirements via -ffast-math . Let’s see how that changes the results:

Throughput (-ffast-math)

T hypot(T, T) TSC cycles/call Speedup per value relative to T float 12.0874 simd<float, scalar> 11.572 1.04453 simd<float, __sse> 11.0433 4.37815 simd<float, __avx> 11.6043 8.33306 simd<float, __avx512> 13.6978 14.1189 double 27.1595 simd<double, scalar> 26.0718 1.04172 simd<double, __sse> 11.6566 4.65993 simd<double, __avx> 13.2002 8.23001 simd<double, __avx512> 25.5888 8.49107

Latency (-ffast-math)

T hypot(T, T) TSC cycles/call Speedup per value relative to T float 36.8928 simd<float, scalar> 37.1042 0.994301 simd<float, __sse> 41.357 3.56822 simd<float, __avx> 41.7432 7.07042 simd<float, __avx512> 52.2087 11.3062 double 51.3563 simd<double, scalar> 50.9219 1.00853 simd<double, __sse> 42.4579 2.41916 simd<double, __avx> 42.326 4.85341 simd<double, __avx512> 56.3338 7.29314

Discuss on Reddit