Last time I challenged you to find a value which does not round correctly using the algorithm

Math.Floor(value + 0.5)

The value which does not round correctly is the double 0.49999999999999994 , which is the largest double that is smaller than 0.5 . With the given algorithm this rounds up to 1.0 , even though clearly 0.49999999999999994 is less than one half, and therefore should round down.

What the heck is going on here?

As you probably know, doubles are represented internally as binary fractions. The original double in binary[1. Obviously the mantissa is in binary and the exponent is in decimal.] is

1.1111111111111111111111111111111111111111111111111111 x 2-2

Or

0.011111111111111111111111111111111111111111111111111111

which in decimal is

0.499999999999999944488848768742172978818416595458984375

As you can see it is extremely close to the stated value. Again, just so we are clear: when in source code you say 0.49999999999999994 , that decimal fraction is rounded to the nearest binary fraction that has no more than 52 bits after the decimal place.

Already we have one rounding, and as you can see, we rounded up — but we’re still less than half.

Now suppose we took this value and added exactly 0.5 to it. We’d get

0.999999999999999944488848768742172978818416595458984375

which is still smaller than one. Therefore we’d expect this to go to zero when passed to Math.Floor , right?

Wrong; that fraction isn’t a legal double because it would require 53 bits of precision to represent exactly. A double only has 52 bits of precision, and therefore this must be rounded off when it becomes a double.

But rounded to what? Clearly it is extremely close to 1.0 . What is the largest double value that is smaller than 1.0 ? In binary that would be

1.1111111111111111111111111111111111111111111111111111 x 2-1

Or

0.11111111111111111111111111111111111111111111111111111

which is

0.99999999999999988897769753748434595763683319091796875

in decimal.

So we now have three relevant values:

0.999999999999999888977697537484345957636833190917968750 (largest double smaller than 1.0) 0.999999999999999944488848768742172978818416595458984375 (exact) 1.000000000000000000000000000000000000000000000000000000 (1.0)

Which one should we round the middle one to? Well, which is closer?

It’s OK, I’ll wait while you do the math by hand.

.

.

.

Neither is closer. The value is exactly in the middle of the two possibilities. The difference is

0.000000000000000055511151231257827021181583404541015625

in both directions.

Of course you could have worked that out a lot more easily in binary than in decimal! The three numbers are in binary:

0.111111111111111111111111111111111111111111111111111110 0.111111111111111111111111111111111111111111111111111111 1.000000000000000000000000000000000000000000000000000000

From which you can read off that difference between both pairs is 2-54.

When faced with this situation the floating point chip has got to make a decision and it decides to once again round up; it has no reason to believe that you really want it to round down. So the sum rounds to the double 1.0 , which is then “floored” to 1.0 .

Once again we see that floating point math does not behave at all like pen-and-paper math. Just because an algorithm would always work in exact arithmetic does not mean that it always works in inexact floating point. Multiple roundings of extremely small values can add up to a large difference.

I’ve posted the code I used to make all these calculations below; long-time readers will recognize it from my previous article on how to decompose a double into its constituent parts.

using System; using System.Collections.Generic; using System.Text; using Microsoft.SolverFoundation.Common; class Program { static void Main() { MyDouble closeToHalf_Double = 0.49999999999999994; Rational closeToHalf_Rational = DoubleToRational(closeToHalf_Double); Console.WriteLine("0.49999999999999994"); Console.WriteLine("Actual value of double in binary:"); Console.WriteLine("1.{0} x 2 to the {1}", closeToHalf_Double.MantissaBits.Join(), closeToHalf_Double.Exponent - 1023); Console.WriteLine("Actual value of double as fraction:"); Console.WriteLine(closeToHalf_Rational.ToString()); Console.WriteLine("Actual value of double in decimal:"); Console.WriteLine(closeToHalf_Rational.ToDecimalString()); Console.WriteLine(); MyDouble closeToOne_Double = 0.99999999999999994; Rational closeToOne_Rational = DoubleToRational(closeToOne_Double); Console.WriteLine("0.99999999999999994"); Console.WriteLine("Actual value of double in binary:"); Console.WriteLine("1.{0} x 2 to the {1}", closeToOne_Double.MantissaBits.Join(), closeToOne_Double.Exponent - 1023); Console.WriteLine("Actual value of double as fraction:"); Console.WriteLine(closeToOne_Rational.ToString()); Console.WriteLine("Actual value of double in decimal:"); Console.WriteLine(closeToOne_Rational.ToDecimalString()); Console.WriteLine(); // Now let's do the arithmetic in "infinite precision": Rational sum = closeToHalf_Rational + (Rational)0.5; Console.WriteLine("Sum in exact arithmetic as fraction:"); Console.WriteLine(closeToOne_Rational.ToString()); Console.WriteLine("Sum in exact arithmetic in decimal:"); Console.WriteLine(closeToOne_Rational.ToDecimalString()); // But that has more precision than will fit into a double, // so we have to round off. We have two possible values: Rational d1 = Rational.One - sum; Rational d2 = sum - closeToOne_Rational; Console.WriteLine("Difference from one:"); Console.WriteLine(d1.ToDecimalString()); Console.WriteLine("Difference from closeToOne:"); Console.WriteLine(d2.ToDecimalString()); } static Rational DoubleToRational(MyDouble d) { bool subnormal = d.Exponent == 0; var two = (Rational)2; var fraction = subnormal ? Rational.Zero : Rational.One; var adjust = subnormal ? 1 : 0; for (int bit = 51; bit >= 0; --bit) fraction += d.Mantissa.Bit(bit) * two.Exp(bit - 52 + adjust); fraction = fraction * two.Exp(d.Exponent - 1023); if (d.Sign == 1) fraction = -fraction; return fraction; } } struct MyDouble { private ulong bits; public MyDouble(double d) { this.bits = (ulong)BitConverter.DoubleToInt64Bits(d); } public int Sign { get { return this.bits.Bit(63); } } public int Exponent { get { return (int)this.bits.Bits(62, 52); } } public IEnumerable<int> ExponentBits { get { return this.bits.BitSeq(62, 52); } } public ulong Mantissa { get { return this.bits.Bits(51, 0); } } public IEnumerable MantissaBits { get { return this.bits.BitSeq(51, 0); } } public static implicit operator MyDouble(double d) { return new MyDouble(d); } } static class Extensions { public static int Bit(this ulong x, int bit) { return (int)((x >> bit) & 0x01); } public static ulong Bits(this ulong x, int high, int low) { x <<= (63 - high); x >>= (low + 63 - high); return x; } public static IEnumerable<int> BitSeq(this ulong x, int high, int low) { for (int bit = high; bit >= low; --bit) yield return x.Bit(bit); } public static Rational Exp(this Rational x, int y) { Rational result; Rational.Power(x, y, out result); return result; } public static string ToDecimalString(this Rational x) { var sb = new StringBuilder(); x.AppendDecimalString(sb, 50000); return sb.ToString(); } public static string Join<T>(this IEnumerable seq) { return string.Concat(seq); } }