aliasing in c

I've seen a horrible but informative article on reddit about the brokenness of floating-point handling in mysql (or mariadb or whatever) . Unfortunately among the many ugly mistakes and bugs revealed by the article one part seems to got lost (at least nobody mentioned it in the reddit comments), I mean the following section (quote with mysql code):

static void set_param_double(Item_param *param, uchar **pos, ulong len) { double data; #ifndef EMBEDDED_LIBRARY if (len < 8) return; float8get(data,*pos); #else doubleget(data, *pos); #endif param->set_double((double) data); *pos+= 8; } Seems pretty straightforward stuff. the float8get macro was something like: config-win.h #define float8get(V,M) doubleget((V),(M)) #define doubleget(V,M) do { *((long *) &V) = *((long*) M); \ *(((long *) &V)+1) = *(((long*) M)+1); } while(0) Now this is precisely the reason I've always hated C style languages. These endless chains of pointers to pointers to YET MORE POINTERS. And don't you just love how two statements are enclosed in a ficticious while loop just to make them one statement? Anyway, doubleget doesn't seem a big deal : just moves around two 4-byte longs from a buffer to somewhere else (in our case, to the "data" variable).

(end of quote)

One should not hate C for that, because this is incredibly broken and highly unidiomatic code invoking dangerous undefined behaviour that no serious project should ever have.

It is obvious that this deserialization method is bad code: it uses pointer casts on unsigned char array, potentionally invoking undefined behaviour with unaligned access (C11 6.3.2.3p7). But the macro is in a platform specific header file and the platform might allow unaligned access so such sloppyness can be forgiven. The deserialization is completely broken if this is not guaranteed to be on the same host as the serialization was done (eg doing IPC), but I'll come back to network transparency later.

It is alarming that len is only checked in one code path (when EMBEDDED_LIBRARY is not defined) and the error cannot be reported back at all, making the caller believe everything went fine, but we will ignore this for now. (Let's hope param is properly initialized or len is always >= 8)

The way the doubleget function-like macro evaluates its arguments several times and without at least parethesizing them to avoid operator precedence problems (see &V) is ugly and can easily lead to bugs later, but it's not the most problematic part either. (Let's pretend that EMBEDDED_LIBRARY is never defined so float8get is used)

The casting of double data to double is obviously silly and completely useless. (The only reason why such a cast could be necessary is to drop excess precision (C11 6.5.4p6), but if that's the intention then

the macro could do that (and do it better as for example gcc used not to follow the ISO C99 (and C++03) standard and failed to do the rounding in some casts and assignments, and still does unless -fexcess-precision=standard or -std=c99 is specified, other compilers are not better either)

the argument passing to set_double function call does it as well (C11 6.5.2.2p4, C11 6.5.16.1p2) (unless its argument type is long double, which would be highly surprising given the name).

if you get excess precision during 8 byte to double deserialization then you do something wrong.

The real problem is the way the double data is set

*((long *) &V) = *((long*) M); *(((long *) &V)+1) = *(((long*) M)+1);

A correct solution that is more network transparent and portable is

double deserialize_ieee_float64(unsigned char *p) { union {uint64_t i; double f;} u = { (uint64_t)p[0] << 56 | (uint64_t)p[1] << 48 | (uint64_t)p[2] << 40 | (uint64_t)p[3] << 32 | (uint64_t)p[4] << 24 | (uint64_t)p[5] << 16 | (uint64_t)p[6] << 8 | p[7] }; return u.f; }

Here I used a big-endian serialization format, if efficiency is a concern then choose the byte order so that it matches the one on the host where the performance matters, the compiler should be able to figure out the correct optimizations if unaligned access etc. is allowed on the given platform (one can use static inline so the compiler has better chance doing the right thing). Note that aliasing is not a concern now, objects declared as unions are allowed to be accessed this way and the value is determined through type punning when a different member is used for access than for storage (C11 6.5p7, C11 6.5.2.3p3 see note 95).

Correction: (2013-05-27) according to some comments in the reddit discussion of this article the above claim is not supported by the citations. I admit that it's not very clear, but my interpretation is this: During the initialization of the union object u, the bytes corresponding to its member i are initialized (to implementation defined values). Then refering to the second citation (see the note) u.f gives a reinterpreted value based on the bytes of u according to the type representation rules. (double has unspecified representation, so here we need an assumption that is explained below). I recall seeing gcc developers interpreting the standard differently and considering union based type punning non-conformant, but they guarantee that the above code works as expected as an extension anyway. The first citation seems to agree with them: the lvalue expression u.f has type double and it accesses the stored value of an object with effective type uint64_t which violates the rules stated there, however if that interpretation were correct then setting u.i then u.f could be reordered as well which is against the intention and general language of the standard. The closest committee interpretation to this problem I could find is the committee response to DR 283 (so in C89 the type punning through union was explicitly allowed and this is the intention in C99 as well, but it's not clear how the aliasing rule applies). (If this explanation is not adequate then a guaranteed safe solution is using memcpy or going through a char array to copy the bytes of the uint64_t to a double. However some implementation might not be able to optimize the copy away.)

Another important note is that the same 64 bit double format is assumed on all hosts, which is reasonable: every implementation uses IEEE 754 binary64 double nowadays (the optional C99 Annex F extension even requires this (C11 F.2p1)) although there are still some machines in operation that use different floating-point format. A braver assumption is that the bit representation of double with said floating-point format always maps to the bit representation of uint64_t the same way (in practice this means that double and uint64_t have the same byte order), this is not necessarily true, eg. on ARM with the silly and deprecated floating-point accelerator (FPA) double had mixed endian representation unlike uint64_t, but with sane implementations this assumption should hold too. If these assumptions are a concern then a completely conformant network transparent serilization can use the "%a" format specifier of C99 sprintf and sscanf. (That one will be less efficient in space and time though, it also does not carry the NaN representation through on IEEE platforms) (Other means of serialization of the sign, significand and exponent can be used as well eg. by relying on the signbit macro and the frexp function, but some care should be taken around +-Inf and NaN values, which might not be possible to represent on some platforms).

2013-05-25,2013-05-27 - nsz