Řrřlog::Improving the fast inverse square root

Remember that famous hack for approximating y = x−1/2 from Quake III? It goes like this:

float inv_sqrt(float x) { union { float f; uint32 u; } y = {x}; y .u = 0x5F3759DFul - (y .u >> 1 ); // Magic! return 0.5f * y .f * ( 3.0f - x * y .f * y .f ); }

How does it work. Aliasing float x as an integer gives a nice approximation of p + q log 2 x (for p ≈ 230−223 and q = 223). In this light, the magic line can be interpreted as

p + q log 2 y = p + q log 2 x−1/2 = p + q (−0.5 log 2 x) = 1.5p − (p + q log 2 x) / 2.

Finally, we refine the result by applying one iteration of the Newton-Raphson method:

y = x−1/2 : find solution to f(y) = y−2 − x = 0

y ← y − f(y) / f'(y)

y ← y − (y−2 − x) / (−2y−3) = (2y + y − xy3) / 2 = 0.5 y (3 − xy2)

Improvements. Chris Lomont studied this hack and found out that the constant 0x5F375A86 minimizes the maximum relative error |1 − inv_sqrt(x) / x−1/2| over all normalized float s. Charles McEniry has a really nice analysis and the optimal constant for double s.

3-D optimization

Can we do any better without making it slower? Let’s forget the explanation for a while and take a look at the raw code. There is not only one, but three constants to optimize!

To find the best values, I’ve implemented an aggressive variant of the DIRECT optimizer:

Assume we’re minimizing f(x) over the range (0,1)D. We keep a pool of boxes with evaluated centers. Every box has a level and its sides can be long or short (long = 3·short = 3−level). “Best” means “having the smallest evaluation”.

Init. Evaluate the first box – it covers the whole parameter range, so its level = 0.

Choose. For the best box from every level < maxlevel:

Sample. For every long dimension d of the box, evaluate f(center 1 , center 2 , …, center d ± 3−level, …, center D ).

Split. Repeatedly split the central box into thirds along the best dimension, adding the sides into the pool with level unchanged. The three boxes in the final split are added into level+1.

Repeat. If the best function value hasn’t improved for a long time, try local search or deepen maxlevel. Go back to Choosing.

Local search. Add Gaussian random numbers to the best parameters and evaluate them. Try several variation scales.

Speedups. Relative errors wrap around and need to be computed only for the interval [1, 4). A good estimate can be obtained by sampling only every 512th floating point value, then taking the top 2% and thoroughly searching their neighborhoods. This estimate hasn’t made a single mistake during the whole search.

Results

Accuracy has improved almost 2.7-fold. I’ve also tried to optimize for the minimum average squared relative error (the least squares criterion all the cool kids are using).

C1 C2 C3 max rel_error avg rel_error2 original 0x5F3759DF 0.5 3.0 1.75233867·10−3 1.24792411·10−6 Lomont 0x5F375A86 0.5 3.0 1.75130156·10−3 1.24936147·10−6 least squares 0x5F1AD0A1 0.755897697 2.27828001 1.14832618·10−3 1.26897912·10−7 minimax 0x5F1FFF77 0.703974056 2.38919526 6.50197782·10−4 2.00005877·10−7

The relative error values are quite noisy – it took hundreds of restarts (with perturbed starting bounds) to get these results. If you want to search for improvements, another good region for minimax is around ( 0x5F601800 , 0.2485, 4.7832). The program is available for download.

Future thoughts. It would be nice to have a version for double s with one and two Newton iterations. A Halley iteration also looks nice, but involves a division: y ← y (xy2 + 3) / (3xy2 + 1).

Similar fast approximations can be used for any x−1/2k. Since the program is already written, it might be interesting looking into it ;-).

Update: I made a looong exhaustive search around the best solution and found a tiny improvement:

C1 C2 C3 max rel_error avg rel_error2 old 0x5F1FFF77 0.703974056 2.38919526 6.50197782·10−4 2.00005877·10−7 new 0x5F1FFFF9 0.703952253 2.38924456 6.50196699·10−4 2.00010826·10−7

The new code:

float inv_sqrt(float x) { union { float f; uint32 u; } y = {x}; y .u = 0x5F1FFFF9ul - (y .u >> 1 ); return 0.703952253f * y .f * ( 2.38924456f - x * y .f * y .f ); }

I’ve also found that Pizer had the exact same idea (but wasn’t so obsessed with finding the optimal solution).