I recently transitioned to Visual Studio 2017 and while the move was relatively painless, the new and improved optimizer uncovered some subtle issues lurking in the code (to be fair, Clang/GCC have been behaving the same way for a long time now). The code in question was actually quite ancient and originated from this Devmaster forum post (gone now, but it can be found using the Wayback Machine): Fast and accurate sine/cosine. To be more precise, it was this version with ‘fast wrapping’:

x = x * (1.0f / PI); // Wrap around
float z = (x + 25165824.0f);
x = x - (z - 25165824.0f);
#if LOW_SINE_PRECISION
return 4.0f * (x - x * abs(x));
#else
[..not relevant...]

(Sidenote: for those who don’t remember, the author (Nick) was also the guy behind SwiftShader, a 100% software/CPU-based implementation of OpenGL/DirectX 9, and someone who knows low-level programming/optimization better than 99.99% of the population.)

Let’s focus on the ‘wrap around’ part. It’s actually a smart trick that relies on knowing the details of the IEEE-754 floating-point format. We only have 23 bits of mantissa (well, 24 if you count the implied leading one). If we add a huge number like 25165824, that pushes us over 24 bits and we lose 1 bit, which means we can only represent even numbers (16777216 would be enough if we only cared about positive numbers). The effect is that we wrap around the [-1, 1] range (so, for example, 1.5 comes out as -0.5, 2.0 becomes 0, 1 is still 1, etc.). It’s a neat trick; if you’re interested in the details, just plug in a bunch of numbers and observe. The “problem” here is that the compiler is smart enough to notice we’re adding and subtracting the same number, and with /fp:fast enabled (which is what you want if you’re writing games), we’re allowing it to treat that as a no-op. Now, interestingly enough, the original version actually took this into account and used volatile to stop the optimizer from removing the store/load, i.e.:

volatile float z = (x + 25165824.0f );

It seems there were more souls who thought volatile wasn’t necessary and were concerned about the performance impact: https://github.com/OpenImageIO/oiio/issues/963

Compiler Explorer example here (try removing/adding volatile to see the effect). Interestingly, CE’s version of Visual Studio 2015/17 doesn’t optimize it out, but Clang/GCC should demonstrate it. With fast math enabled, all we get is:

foo(float): # @foo(float)
        xorps xmm0, xmm0
        ret

What should happen (VS2015, fast math disabled) is this (only the relevant fragment):

movss xmm0, DWORD PTR _x$[esp-4]
cvtps2pd xmm0, xmm0
mulsd xmm0, QWORD PTR __real@3fd45f306dc9c883 // * (1.0 / PI)
cvtpd2ps xmm2, xmm0
movaps xmm1, xmm2
addss xmm1, DWORD PTR __real@4bc00000
subss xmm1, DWORD PTR __real@4bc00000
subss xmm2, xmm1

With fast math and volatile (Clang):

addss xmm0, dword ptr [rip + .LCPI0_1]
movss dword ptr [rsp - 4], xmm0
movss xmm1, dword ptr [rsp - 4] # xmm1 = mem[0],zero,zero,zero
subss xmm0, xmm1

With fast math and volatile (VS2017):

addss xmm1, dword ptr [__real@4bc00000 (07FF7D2672410h)]
movss dword ptr [rsp + 8], xmm1
movss xmm1, dword ptr [z]
subss xmm1, dword ptr [__real@4bc00000 (07FF7D2672410h)]
subss xmm0, xmm1

You might be a bit concerned about the store/load, but I wouldn’t worry about it too much; we don’t really have to wait for it, as store forwarding should save us here. Purely out of curiosity, I still decided to see if I could persuade the compiler to generate slightly tighter code. One thing I tried was moving the volatile:

volatile float BIG_C = 25165824.0f;
[...]
float z = (x + BIG_C);
x = x - (z - BIG_C);

That was enough to fool MSVC:

movaps xmm2, xmm0
addss xmm2, dword ptr [BIG_C (07FF60EB6A008h)]
movss xmm1, dword ptr [BIG_C (07FF60EB6A008h)]
subss xmm2, xmm1
subss xmm0, xmm2

…but Clang saw right through our ruse. That one caused a bit of head scratching: it actually left some of the code in, but removed the bit where we divide by PI (the mulsd in one of the snippets above):

foo(float): # @foo(float)
movss xmm1, dword ptr [rip + BIG_C] # xmm1 = mem[0],zero,zero,zero
movss xmm0, dword ptr [rip + BIG_C] # xmm0 = mem[0],zero,zero,zero
subss xmm0, xmm1

Why? Well, it was smart enough to notice that the expression is something like:

x - ((x + C) - C) = (x + C) - (x + C)
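In other words, with exact arithmetic the whole expression cancels to zero, which is exactly the assumption fast math licenses. For reference, here is a sketch of the version with no volatile at all (the function name is mine). Compiled with strict IEEE semantics it still wraps correctly, because the rounding inside x + C is what does all the work; fast math is what permits the compiler to cancel it:

```cpp
#include <cassert>

// Without volatile, fast math is allowed to treat x - ((x + C) - C)
// as algebraically zero and fold the whole computation away.
// Under strict IEEE semantics the intermediate rounding of (x + C)
// is preserved, so the wrapping still happens.
float wrap_no_volatile(float x)
{
    const float C = 25165824.0f; // 1.5 * 2^24, ulp at this magnitude is 2.0
    float z = x + C;             // z - C is the multiple of 2 nearest to x
    return x - (z - C);          // algebraically zero; numerically the wrap
}
```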

It didn’t really matter all that much (as mentioned, this was purely a fun exercise), but I was still quite impressed.