EDIT: I never expected this post to garner so much interest. Just to be clear, using the constant is a very bad idea in almost every situation I can think of. Even instructions like RCPSS should only be used by people who are convinced that the tradeoffs are worth it (or don’t have any other choice). Subtracting from the constant above is even worse, and handles even fewer corner cases.

I’ve been working on a specialised discrete optimisation algorithm, and perf tells me that it’s spending around 25% of its time on one (double float) division instruction: the algorithm has to repeatedly compute x-1 for a couple hundred thousand values in a vector. If the denominator was constant, it would be simple to speed that up. Unfortunately, the only shortcut for reciprocals seems to consist of computing an approximation and refining it with a couple Newton-Raphson steps.

Surprisingly, it’s still possible to achieve significant speed-ups when as many as 3 refinement steps are used: a division unit takes a lot of space, so, unlike multiplications and additions, hardware divisions can rarely be executed in parallel. I.e., each division takes a couple cycles (around 7-14, according to Agner Fog’s tables [PDF]), and must wait for the previous division to be fully computed. Since I’m dividing a large number of values one after the other, I’m looking for good throughput more than latency, and it can be useful to perform more operations, as long as they’re executed in parallel or in a pipeline.

The textbook way to compute an approximation of the reciprocal is probably to use hardware support. On x86[-64] , that would be RCPSS , which computes a 12 bit approximation for the reciprocal of a single float. For my use case, I also have to convert to and from doubles ( CVTSD2SD and CVTSS2SD ). My Westmere machine can dispatch one of these instructions per cycle.

But then again, it seems like it would be possible to do something like the classic fast inverse square root. Simply taking the opposite value of the exponent field gives us a decent approximation: double floats on x86 (and on most other machines) are represented as

sign : exponent : significand 1 bit: 11 bits : 52 bits (and implicit leading 1)

The exponent is signed, but represented as unsigned numbers with a positive bias: an exponent of 0 is represented by 1022, and -1 by 1021, 1 by 1023, etc. The maximal biased value for a normal number is 2046 (2047 is used for infinities and NaNs). So, to get the opposite value of the exponent, it suffices to subtract it from 2046. That gives us a single bit of precision which, while sufficient to guarantee convergence of the Newton-Raphson steps, is far from useful.

Instead, it would be interesting to also exploit the significand to get a couple more bits of precision from a subtraction like to-double(a - from-double(x)). The Newton-Raphson steps can diverge when the initial guess is too large, so the constant a should be of the form 2046 ⋅ 252- b, where b [0,252- 1]. I graphed the maximal error on a range of inputs (217 equidistant values between 0 and 3), and it looks unimodal (decreasing until it hits the global minimum, then increasing).

I want to minimise a function that seems unimodal, but isn’t (a priori) differentiable and is fairly expensive to compute. That seemed like a perfect application for the golden section search (to avoid any problem with floating point rounding errors, I used an exhaustive search once the range was down to a couple thousand values). I waited a couple seconds, and my REPL spit out 0x7FDE623822FC16E6 for the best value of a: the max relative error after 3 steps is on the order of 4 ⋅ 10-11 (≈ 35 bit), and the average around 1 ⋅ 10-11. The convergence rate of the Newton-Raphson methods is quadratic, so the magic constant yields slightly more than 4(!!) bits of precision after a single subtraction.