This little C program converts a decimal value — represented as a string — into a double-precision floating-point number:

#include <string.h> int main (void) { double intPart = 0, fracPart = 0, conversion; unsigned int i; char decimal[] = "3.14159"; i = 0; /* Left to right */ while (decimal[i] != '.') { intPart = intPart*10 + (decimal[i] - '0'); i++; } i = strlen(decimal)-1; /* Right to left */ while (decimal[i] != '.') { fracPart = (fracPart + (decimal[i] - '0'))/10; i--; } conversion = intPart + fracPart; }

The conversion is done using the elegant Horner’s method, summing each digit according to its decimal place value. So why do I call it “quick and dirty?” Because the binary floating-point value it produces is not necessarily the closest approximation to the input decimal value — the so-called correctly rounded result. (Remember that most real numbers cannot be represented exactly in floating-point.) Most of the time it will produce the correctly rounded result, but sometimes it won’t — the result will be off in its least significant bit(s). There’s just not enough precision in floating-point to guarantee the result is correct every time.

I will demonstrate this program with different input values, some of which convert correctly, and some of which don’t. In the end, you’ll appreciate one reason why library functions like strtod() exist — to perform efficient, correctly rounded conversion.

Testing the Conversions

How do you test whether a quick and dirty conversion is correct? First you need to know what the correctly rounded value is, and then you need to compare it to the quick and dirty value.

Here are two ways to find the correctly rounded value:

Assign the decimal constant to a double and let your compiler compute it for you.

Enter the decimal constant in my arbitrary-precision decimal/binary converter and then round it by hand to 53 significant bits (the number of significant bits in a double).

I did both, because I know that sometimes even compilers do the conversions incorrectly (see my articles on Visual C++ and gcc/glibc). In any case, I verified that all the compiler conversions in the examples below are correct.

Next we need to compare the quick and dirty value to the correctly rounded value — but how? Comparing floating-point values for equality is not reliable. That means we should display the values and compare them manually. But we can’t depend on printf(), because in many implementations it won’t print all the significant digits, either before the decimal point or after the decimal point (this is true of Visual C++, the compiler I am using). This leaves several options:

Print the raw IEEE representation of the double, using my functions print_raw_double_hex() or print_raw_double_binary().

Print the value of the double as a hexadecimal floating-point constant.

Print the value of the double in binary scientific notation, using my function print_double_binsci().

Print the value of the double in binary, using my function fp2bin().

I think seeing the binary representation is the most enlightening; of the three binary display options, I chose fp2bin().

The Test Code

I modified the code in the introduction of this article to allow for testing, as described; here it is

#include <string.h> #include <stdio.h> #include "fp2bin.h" int main (void) { char binString[FP2BIN_STRING_MAX]; double intPart = 0, fracPart = 0, conversion; unsigned int i; char decimal[] = "3.14159"; /* Converted manually */ double compiler_conversion = 3.14159; /* Converted by compiler */ i = 0; /* Left to right */ while (decimal[i] != '.') { intPart = intPart*10 + (decimal[i] - '0'); i++; } i = strlen(decimal)-1; /* Right to left */ while (decimal[i] != '.') { fracPart = (fracPart + (decimal[i] - '0'))/10; i--; } conversion = intPart + fracPart; /* Print the converted values, in binary */ fp2bin(compiler_conversion,binString); printf("Correct = %s

",binString); fp2bin(conversion,binString); printf("Q and D = %s

",binString); }

When run (I ran on an Intel Core Duo processor), the output shows that the compiler and quick and dirty conversions of 3.14159 are both correct:

Correct = 11.00100100001111110011111000000011011100001100110111 Q and D = 11.00100100001111110011111000000011011100001100110111

Their value in decimal, obtained from my binary/decimal converter, is 3.14158999999999988261834005243144929409027099609375.

Incorrect Conversions

Incorrect conversions are due to the limitations of floating-point arithmetic; specifically, these three things affect the conversion calculations:

Large integers require more bits than can fit in a floating-point variable . Multiplying a binary integer by ten is exact, but once the integer exceeds the largest integer that a double can represent exactly, rounding errors can occur.

. Division by ten in binary arithmetic is inexact . Each time the fractional value is divided by ten, there is a chance of roundoff error.

. Addition of integers and fractions can incur rounding error. When an integer and fraction are added, some of the bits of the fraction can be lost.

(Also, values near DBL_MAX can convert incorrectly to “infinity,” although I’m not considering overflow here.)

The example values below — which I discovered by both random testing and trial and error — are all rounded incorrectly by my quick and dirty program.

The Integer 18014398509481993

The number 18,014,398,509,481,993 (254 + 9) is not converted correctly by the quick and dirty program; this is its output (the lines I changed are char decimal[] = “18014398509481993.0”; and double compiler_conversion = 18014398509481993.0;):

Correct = 1000000000000000000000000000000000000000000000000001000.0 Q and D = 1000000000000000000000000000000000000000000000000001 100 .0

Their values, in decimal, are

Correct = 18014398509481992 = 2 54 + 8

+ Q and D = 18014398509481996 = 254 + 12

These values differ by 4, which is one unit in the last place (the last place for a double is the 53rd significant bit).

The Fraction 0.9199

The number 0.9199 is not converted correctly by the quick and dirty program; this is its output (the lines I changed are char decimal[] = “0.9199”; and double compiler_conversion = 0.9199;):

Correct = 0.11101011011111101001000011111111100101110010010001111 Q and D = 0.1110101101111110100100001111111110010111001001000111

Their values, in decimal, are

Correct = 0.91990000000000005098144129078718833625316619873046875 = 8285722594436239/2 53

Q and D = 0.9198999999999999399591388282715342938899993896484375 = 4142861297218119/252 = 8285722594436238/253

These values differ by 2-53, which is one unit in the last place.

The Combined Integer and Fraction 1.89

The number 1.89 is not converted correctly by the quick and dirty program; this is its output (the lines I changed are char decimal[] = “1.89”; and double compiler_conversion = 1.89;):

Correct = 1.1110001111010111000010100011110101110000101000111101 Q and D = 1.11100011110101110000101000111101011100001010001111 1

Their values, in decimal, are

Correct = 1.8899999999999999023003738329862244427204132080078125 = 1 + 4008203668359741/2 52

Q and D = 1.890000000000000124344978758017532527446746826171875 = 1 + 2004101834179871/251 = 1 + 4008203668359742/252

These values differ by 2-52, which is one unit in the last place.

The interesting thing is that 0.89 by itself converts correctly; it’s only when it’s added to 1 that the final result becomes incorrect.

The Fraction 3.50582559e-71

The number 3.50582559e-71 is not converted correctly by the quick and dirty program; this is its output (in the program source, I wrote 3.50582559e-71 as 0.0000000000000000000000000000000000000000000000000000000000000000000000350582559):

Correct = 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011110111110001010011001011001000010111111010110010001 Q and D = 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000111101111100010100110010110010000101111110101100 00011

Their values, in decimal, are

Correct = 0.00000000000000000000000000000000000000000000000000000000000000000000003505825590000000127666442084156194149833722363535945086623257683235283315238231093157098293845387261005915287514287812242780435016915881911641225954878731368200574173594143601202788307347191221197135746479034423828125 = 8717642643535249/2 287

Q and D = 0.00000000000000000000000000000000000000000000000000000000000000000000003505825589999994497525328603499564935354782628876135671208090707866566080104779573609823758603905804048854042779018384586485878398467052729618730030977116116317972699506139481161193227620742618455551564693450927734375 = 8717642643535235/2287

These values differ by 14/2287, which is 14 units in the last place. (This is the largest error I found — in fact, the only example of an error this large — in the over 100 million random conversions I tried).

Discussion

In practice, you’ll depend on your compiler or interpreter or a library call for correctly rounded conversion. The de facto standard code used by many implementations is David Gay’s strtod() function in dtoa.c. Take a look at it to get a sense of how involved correctly rounded conversion is. (dtoa.c handles scientific notation, and also includes code for correct conversion in the other direction — floating-point to decimal).

For a little more detailed discussion of Horner’s method, see my PHP functions bin2dec_i() and bin2dec_f(). Those two functions implement the same algorithms as the quick and dirty program, only using decimal arithmetic and powers of two (there’s nothing “dirty” about them though — they’re binary to decimal and arbitrary precision).

Conversion to Single-Precision

Conversion to floats works similarly, also producing correctly and incorrectly rounded results. There’s one thing I observed, which you might find surprising: a given decimal value may convert incorrectly to a double but correctly to a float!