These two programs — compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor — demonstrate an anomaly that occurs when using single-precision floating point variables:

Program 1

#include "stdio.h" int main (void) { float f1 = 0.1f, f2 = 3.0f, f3; f3 = f1 * f2; if (f3 != f1 * f2) printf("Not equal

"); }

Prints “Not equal”.

Program 2

#include "stdio.h" int main (void) { float f1 = 0.7f, f2 = 10.0f, f3; int i1, i2; f3 = f1 * f2; i1 = (int)f3; i2 = (int)(f1 * f2); if (i1 != i2) printf("Not equal

"); }

Prints “Not equal”.

In each case, f3 and f1 * f2 differ. But why? I’ll explain what’s going on.

(This article was inspired by this question and several related questions on stackoverflow.com.)

Analyzing Program 1

Program 1 compares two floating point values for equality — something conventional wisdom says not to do. But why should that rule apply in this case? We’re just checking that a variable holds the value we just gave it. How could it not?

The root of the “problem” is this: the compiler generates instructions that do floating-point calculations in extended precision. Whereas floats are 4 bytes long and have 24 significant bits of precision, extended precision values — which are stored in registers on the floating point stack — are 10 bytes long and have 64 bits of precision.

Computing f3

Using my function fp2bin, I printed the binary values of the three floating point variables in the computation:

f1 = 0.000110011001100110011001101

f2 = 11

f3 = 0.010011001100110011001101

The value of f3 is only an approximation to f1 * f2. To understand why, let’s calculate the true value of f1 * f2; that is, let’s calculate it by hand, using binary multiplication:

0.000110011001100110011001101 x 11 ----------------------------- 110011001100110011001101 110011001100110011001101 ------------------------------- 0.010011001100110011001100111

(You can also compute this with my binary calculator.)

So f1 * f2 = 0.010011001100110011001100111. It fits comfortably within extended precision. But to assign it to f3 — a float — it must be rounded. f1 * f2 has 26 significant bits, but a float holds only 24. Rounding it to the nearest 24 bit value makes it 0.010011001100110011001101.

The assembler code generated by the compiler confirms what we’re seeing:

f3 = f1 * f2; 0041130B fld dword ptr [f1] 0041130E fmul dword ptr [f2] 00411311 fstp dword ptr [f3]

f1 * f2 is computed in extended precision — enough bits to hold its true value — but that extra precision is lost when stored in f3.

Comparing f3 and f1 * f2

The answer is in the assembler code, so let’s get right to it:

if (f3 != f1 * f2) 00411314 fld dword ptr [f1] 00411317 fmul dword ptr [f2] 0041131A fld dword ptr [f3] 0041131D fucompp

Again, f1 * f2 is computed in extended precision, but this time its true value is retained — it is left on the stack. f3 is then loaded onto the stack (of course it still has only 24 bits of precision, even though it’s been “promoted” to extended precision). The two values on top of the stack are then compared. Clearly, they differ.

Analyzing Program 2

Program 2 “fails” for the same reason as program 1, except that the “error” is magnified by the conversion of f3 and f1 * f2 to integers: the integer part of f3 is 7, and the integer part of f1 * f2 is 6.

Computing i1

The binary values of the three floating point variables in the computation are:

f1 = 0.101100110011001100110011

f2 = 1010

f3 = 111

The value of f3 is an integer. To understand why, let’s calculate the true value of f1 * f2:

0.101100110011001100110011 x 1010 -------------------------- 0 101100110011001100110011 0 101100110011001100110011 ---------------------------- 110.11111111111111111111111

So f1 * f2 = 110.11111111111111111111111. To assign it to f3 it must be rounded. f1 * f2 has 26 significant bits, and rounding it to the nearest 24 bit value makes it 111, or 7 decimal. Clearly, this means i1 will be 7 as well.

Here’s the assembler code:

f3 = f1 * f2; 0041130B fld dword ptr [f1] 0041130E fmul dword ptr [f2] 00411311 fstp dword ptr [f3] i1 = (int)f3; 00411314 fld dword ptr [f3] 00411317 call @ILT+155(__ftol2_sse) (4110A0h) || || \/ --- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm 00411600 cmp dword ptr [___sse2_available (416554h)],0 00411607 je _ftol2 (411636h) 00411609 push ebp 0041160A mov ebp,esp 0041160C sub esp,8 0041160F and esp,0FFFFFFF8h 00411612 fstp qword ptr [esp] 00411615 cvttsd2si eax,mmword ptr [esp] 0041161A leave 0041161B ret

Computing i2

Here’s the assembler code for computing i2:

i2 = (int)(f1 * f2); 0041131F fld dword ptr [f1] 00411322 fmul dword ptr [f2] 00411325 call @ILT+155(__ftol2_sse) (4110A0h) || || \/ --- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm 00411600 cmp dword ptr [___sse2_available (416554h)],0 00411607 je _ftol2 (411636h) 00411609 push ebp 0041160A mov ebp,esp 0041160C sub esp,8 0041160F and esp,0FFFFFFF8h 00411612 fstp qword ptr [esp] 00411615 cvttsd2si eax,mmword ptr [esp] 0041161A leave 0041161B ret

The interesting thing in the calculation of i2 is that f1 * f2 is stored in a double before being cast to an integer (see the highlighted fstp instruction). f1 * f2 in double precision is 110.11111111111111111111111; that is, its true 26 bit value. Casting this to an integer results in 110, or decimal 6. If the compiler had stored f1 * f2 in a float and used the cvttss2si instruction instead, the answer would have been 7 — just like in the first calculation.

Discussion

In program 2, it would appear that single-precision is more accurate than extended precision. After all, forcing the value into a float gives the expected answer. This is just a happy coincidence. Two losses of precision — the conversion of 0.7 to floating-point and the rounding up of the product — have effectively canceled each other out.

Behavior Depends on the Processor and How It Is Used

This anomaly may not occur on your machine. It depends on your processor, and in particular, the instructions used and the mode that it’s in. My example programs were compiled into Intel x87 FPU instructions, and ran with the x87 precision control field set to 53-bits. (I said above that my programs were using extended precision; technically they weren’t, but double precision is sufficient to make them “fail”. The true values of f1 * f2 are less than 53 bits.)

Intel processors can also do floating point using SSE (Streaming SIMD Extensions) instructions. I recompiled my programs using the Visual C++ compiler options /arch:SSE (single-precision floating-point) and /arch:SSE2 (double-precision floating-point). For the SSE option, there was no change; the compiler, at its discretion, still decided to generate x87 instructions. (See the comment below about the Mac’s use of SSE instructions making the anomaly disappear.)

Recompiling with /arch:SSE2 I got different assembler code, but the same output; here’s the assembler code for program 1:

f3 = f1 * f2; 00411313 cvtss2sd xmm0,dword ptr [f1] 00411318 cvtss2sd xmm1,dword ptr [f2] 0041131D mulsd xmm0,xmm1 00411321 cvtsd2ss xmm0,xmm0 00411325 movss dword ptr [f3],xmm0 if (f3 != f1 * f2) 0041132A cvtss2sd xmm0,dword ptr [f1] 0041132F cvtss2sd xmm1,dword ptr [f2] 00411334 mulsd xmm0,xmm1 00411338 cvtss2sd xmm1,dword ptr [f3] 0041133D ucomisd xmm1,xmm0

The SSE double-precision instructions are used. f3 — single precision — is compared to the double-precision intermediate result f1 * f2, resulting in a mismatch.

Please Try it Out

If you have access to a different compiler or processor, please try these programs out. Let me know what you find!

The Message

You can’t count on floats being handled as single precision values; they can be processed in double or extended precision, as dictated by your compiler and CPU.

(I have written a companion article called “When Doubles Don’t Behave Like Doubles”.)