What does this C function do?

double f(double a) { double b, c; b = 10*a - 10; c = a - 0.1*b; return (c); }

Based solely on reading the code, you'll conclude that it always returns 1: c = a - 0.1*(10*a - 10) = a - (a - 1) = 1. But if you execute the code, you'll find that it may or may not return 1, depending on the input. If you know anything about binary floating-point arithmetic, that won't surprise you; what might surprise you is how far from 1 the answer can be — as far away as a large negative number!

Examples of Incorrect Results

I tested my function in Visual C++; here’s a sampling of calls that gave an answer other than 1:

Examples For Which the Answer is Not One

Function Call              Result (to 17 Digits)
f(5.1)                     0.99999999999999911
f(91.3)                    0.99999999999998579
f(451.9)                   0.99999999999994316
f(7341.4)                  0.99999999999909051
f(51367.7)                 0.99999999999272404
f(897451.7)                0.99999999988358468
f(1923556.4)               0.99999999976716936
f(59567771.9)              0.9999999925494194
f(176498634.7)             0.99999997019767761
f(2399851001.7)            0.9999995231628418
f(60098761442.7)           0.99999237060546875
f(772555211114.1)          0.9998779296875
f(1209983553611.9)         0.999755859375
f(59871426773404.9)        0.9921875
f(190776306245331.2)       0.96875
f(2987154209766221.6)      0.5
f(19843566622234755.9)     0
f(719525522284533115.3)    -128
f(8399871243556765103.9)   0
f(39847765103525225199.1)  0
f(553774558711019983333.9) -65536

(I got similar, but not identical, results when using Linux gcc and MinGW gcc.)

You can see that, as the input number gets bigger, the result gets farther from 1. When the input number gets big enough, though, things get really wacky: zero and negative integers are returned!

Why the Calculations Go Wrong

The incorrect answers are due to the limits of finite floating-point precision: decimal input numbers, as well as the results of binary calculations done on them, are approximated with nearby floating-point numbers. Let’s see how this plays out in three of the example function calls:

f(5.1)

f(19843566622234755.9)

f(553774558711019983333.9)

For each example, I’ll show the double-precision value of each variable and intermediate result, as printed by David Gay’s dtoa() function.

f(5.1)

a = 5.0999999999999996447286321199499070644378662109375
b = 51 - 10 = 41
c =   5.0999999999999996447286321199499070644378662109375
    - 4.10000000000000053290705182007513940334320068359375
    ----------------------------------------------------
      0.99999999999999911182158029987476766109466552734375

The computed answer is close to 1, accurate to 15 decimal places.

f(19843566622234755.9)

a = 19843566622234756
b = 198435666222347552 - 10 = 198435666222347552
c = 19843566622234756 - 19843566622234756 = 0

The computed answer is 0.

The thing to notice is that the input has such a large integer part — it would require 55 bits to represent — that it exceeds the limit of double-precision, which is 53 significant bits. This means that a's integer part is represented inexactly, and its fractional part is not represented at all! This inaccuracy carries over into the calculations, which result in large, inaccurate integers — and a 0 result caused by catastrophic cancellation.

f(553774558711019983333.9)

a = 553774558711019995136
b = 5537745587110200475648 - 10 = 5537745587110200475648
c = 553774558711019995136 - 553774558711020060672 = -65536

The computed answer is -65536.

As in the prior example, the large integer part is responsible for the inaccurate calculations. This time, however, the floating-point representation of 0.1*b is greater than the floating-point representation of a, causing a negative integer result.

Is Getting an Answer of One Just Luck?

In some sense, getting an answer of 1 with floating-point is just luck; the inaccuracies have to cancel out. Let’s look at a function call that gives the correct answer, f(2.1):

a = 2.100000000000000088817841970012523233890533447265625
b = 21 - 10 = 11
c =   2.100000000000000088817841970012523233890533447265625
    - 1.100000000000000088817841970012523233890533447265625
    -----------------------------------------------------
      1

a and 0.1*b have the same fractional part, which cancels out in the subtraction.

Discussion

The function I analyzed is admittedly contrived, but it illustrates the inherent limits of floating-point calculations (even simple ones).