Experiments do not always work as planned. Sometimes you may invest a lot of time into a (sub)project only to get no, or only moderately interesting results. Such a (moderately) failed experiment is the topic of this week’s blog post.

Some time ago I wrote a CSV exporter for an application I was writing and, amongst the fields I needed to export, were floating point values. The application was developed under Visual Studio 2005 and I really didn’t like how VS2005’s printf function handled the formats for floats. To export values losslessly, that is, so that you could read back exactly what you wrote to file, I decided to use the "%e" format specifier for printf. It turned out that it was neither lossless nor minimal!

If you use "%f" as a format specifier, printf uses a readable default format that is friendly only to moderate values, that is, neither too small nor too big. If the value printed is too big or too small, you lose precision or you get an exaggeratedly long representation. Consider:

printf("%f\n", 1e-20f);
printf("%f\n", 3.14e20f);

you get the output:

0.000000
314000008403897810944.000000

In the first case, you lose all information; in the second, you get an extra long representation. However, if you use the scientific notation specifier "%e" (here with 8 digits after the point, "%.8e"), you get the more sensible (but not always shorter) output:

9.99999968e-021
3.14000008e+020

With GCC, you get e+20 instead of e+020, but you still get non-significant + signs and zeroes and, as a bonus, weird rounding artifacts. Indeed, 3.14e20 is a valid and sufficient representation for the same number. We could reduce the printf format to "%e", and while that would be visually more pleasing, rescanning the output values will not always yield the original machine-precision float. In some cases it does, but not always, and that's bad.

So if "%e" isn’t sufficient (it defaults to 6 digits after the point), how many digits do we need? A float in IEEE 754 representation has one sign bit, 8 bits for the exponent, and 23 bits for the mantissa. Since it also has a virtual most significant bit (which is always 1), the mantissa can be seen as 24 bits, with a leading 1. This leaves us with 24 × log10(2) ≈ 7.2 digits, which makes ".7e" a bit short, as quick testing shows, and makes the ".8e" format necessary.

So the first goal here is to save floats to text losslessly, but we’d also like them to be as short as possible. Not so much for data compaction as for human readability; 3.14e20 is still better than 3.1400000e+020. So, naively, I set out to write “clean up” code:

#include <stdio.h>
#include <string.h>

////////////////////////////////////////
//
// Rules:
//
// floats are of the form:
//
//  x.yyyyyyyyye±zzz
//
// Removes sign after e if it's a +
// Removes all non significant zeroes.
// Packs the resulting representation
//
static void fd_strip(char s[])
 {
  // find e
  char * e = strchr(s,'e');

  if (e)
   {
    char * d = e+1;

    // remove (redundant) unary +
    if (*d=='+') *d='_';
    d++;

    // strip leading zeroes in exponent
    while (*d && (*d=='0')) *d++='_';

    if (*d==0)   // the exponent is all zeroes?
     *e='_';     // remove e as well

    // rewind and remove non-significant zeroes
    e--;
    while ((e>s) && (*e=='0')) *e--='_';

    // go forward and pack over _
    for (d=e; *e; e++)
     if (*e!='_')
      *d++=*e;
    *d=0;
   }
  else
   {
    // some kind of nan or denormal,
    // don't touch! (like ±inf)
   }
 }

////////////////////////////////////////
//
// simplistic itoa-style conversion for
// floats (buffer should be at least 16
// char long) (0.12345678e±12 + \0)
//
void ftoa(float f, char buffer[])
 {
  snprintf(buffer,16,"%.8e",f); // redundant but safer
  fd_strip(buffer);
 }

As I have said in a previous post, if your function’s domain is small enough, you should use exhaustive testing rather than manual testing based on whatever a priori knowledge you have. Floats are a pest to enumerate in strict order (because, for example, the machine-specific epsilon works only for numbers close to 1, and does nothing for 1e20), so I built a (machine-specific) bit-field:

typedef union
 {
  float value;            // the float's value
  unsigned int int_value;
  struct
   {
    unsigned int mantissa:23;
    unsigned int exponent:8;
    unsigned int sign:1;
   } sFloatBits;          // internal float structure
 } tFloatStruct;

that allows me to control every part of a float. Looping through signs, exponents, and mantissas will allow us to generate all possible floats, including infinities, denormals, and NaNs.

The main loop looks like:

tFloatStruct fx;
unsigned int sign, exponent, mantissa;
long how_many = 0, ftoa_length = 0, sprintf_length = 0;

for (sign=0; sign<2; sign++)
 {
  fx.sFloatBits.sign = sign;
  for (exponent=0; exponent<256; exponent++)
   {
    fx.sFloatBits.exponent = exponent;
    for (mantissa=0; mantissa < (1u<<23); mantissa++) // 23-bit mantissa
     {
      float new_fx, diff;
      char ftoa_buffer[40];    // some room for system-specific behavior
      char sprintf_buffer[40];

      fx.sFloatBits.mantissa = mantissa;

      if (isnan(fx.value))
       {
        // we don't really care for NaNs,
        // but we should check that they
        // decode all right?
        //
        // but nan==nan is always false!
       }
      else
       {
        // display progress once in a while
        if ((mantissa & 0xffffu)==0)
         {
          printf("\r%1x:%02x:%06x %-.8e", sign, exponent, mantissa, fx.value);
          fflush(stdout); // prevents display bugs
         }

        how_many++;

        ftoa(fx.value, ftoa_buffer);
        sprintf(sprintf_buffer, "%.8e", fx.value);

        // gather stats on length
        //
        ftoa_length += strlen(ftoa_buffer);
        sprintf_length += strlen(sprintf_buffer);

        // check if invertible
        //
        new_fx = (float)atof(ftoa_buffer);

        if (new_fx != fx.value)
         {
          diff = new_fx - fx.value;
          printf("\n%e %s %e %e\n", fx.value, ftoa_buffer, new_fx, diff);
         }
       }
     } // for mantissa
   } // for exponent
 } // for sign

printf("\n\n");
printf(" average length for %%-.8e = %f\n", sprintf_length/(float)how_many);
printf(" average length for ftoa  = %f\n", ftoa_length/(float)how_many);
printf(" saved: %f\n", (sprintf_length-ftoa_length)/(float)how_many);

So, we run this and verify that 1) all floats are transcoded losslessly and 2) ftoa’s output is much shorter than printf’s. Or is it?

* * *

After a few hours of running the program (it takes a little more than 6 hours on my computer at home), the results are somewhat disappointing. First, it does transcode all floats correctly. But it doesn’t shorten the representation significantly.

Using GCC (and ICC), the average representation length out of ".8e" without tweaking is 14.5 characters (including signs and exponents). Using ftoa (and fd_strip), the representation shrinks to 13.53 characters on average, an average saving of 0.96, which is far from fantastic.

With Visual Studio, the savings are a bit better, but clearly not supercalifragilisticexpialidocious either: from an average of 15.5 characters, it reduces to 13.6, an average saving of 1.9 characters.

With doubles, the results are quite similar. On GCC (and ICC) you start with an average length of 23.2, which drops to 22.5 after “simplification”. For double, you have to use ".16e" to get losslessness.

* * *

So, it turns out that it was a lot of work for not much. On the bright side, we figured out that 7 digits aren’t always enough to save floats as text and get them back losslessly, even though the documentation says seven should suffice. Maybe it’s a platform-specific artifact, maybe not; in any case, it’s much better to save numbers losslessly than to save them so that they are merely pretty to look at.

Acknowledgements

I would like to thank my friend Mathieu for running the experiments (and helping debug) on Visual Studio Express 2008. The help is greatly appreciated, as I didn’t have a Windows + Visual Studio machine at hand at the time of writing.
