The C++ Source

Backyard Hotrodding C++

by Walter Bright

May 23, 2006





Summary: Ever feel the need for speed? Real speed? If you're careful, you can get it without making a mess of things.

Attend a seminar or read any book or article on C++, and there will be presented more or less the "correct" way to write C++ programs (or at least the way the author feels is correct). Certain practices are deprecated as "buggy", "sloppy", "non-portable", "undefined", "illegal" or just plain "bad." We are exhorted not to do such things, at the risk of bugs, being fired, or the ridicule of our peers.

But to take an analogy from drag racing, sometimes what matters is getting across the finish line the fastest. How you did it does not matter. Certainly, everything and anything has been tried by backyard hotrodders.

What are some things that can be done with C++, that a good backyard hotrodder with greasy overalls and a pack of cigs rolled in his shirtsleeve would do, to get a program speeding over the finish line first?

The rules of this game are simple—to be a backyard C++ hotrodding modification, it's got to make the program faster than the "correct" way.

Warning: use of these techniques is rumored to be illegal in at least 5 states. They may cause you to be fired. C++ gurus will heap contempt upon your head. You'll get an "F" from your professor. Your warranty will be voided.

So let's pop the hood and slip that bottle of nitrous in.

Pointer Hydration

Let's say we have a large, in-memory data structure. We'd like to save this data structure into a disk file, and in some other invocation of the program, read it back in and reconstitute the data structure.

The proper way to do this is to serialize the data structure, object by object, into some disk friendly format. The format is then read, parsed, and object by object, it gets recreated back in memory. Each object needs a serialize function, where one must figure out how to represent each member on disk, and a constructor that can build a replica of that object from the serialized data. When the data structure is non-trivial, with cycles, pointers, shared objects, etc., this is not a simple task. Furthermore, it is inherently slow. Each object must be broken down into its serial representation, and when reconstituting, each object must be allocated, constructed, and put back together.
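As a point of comparison, here is a minimal sketch of that conventional approach, with a hypothetical Point class (the names are illustrative, not from any particular library) that writes its members to a stream and rebuilds itself from one:

```cpp
#include <cassert>
#include <sstream>

// Conventional serialization: each object knows how to write its
// members out, and how to reconstruct itself member by member.
struct Point
{
    int x, y;

    void serialize(std::ostream &out) const { out << x << ' ' << y << ' '; }

    static Point deserialize(std::istream &in)
    {
        Point p;
        in >> p.x >> p.y;
        return p;
    }
};

int demo()
{
    Point p;
    p.x = 3;
    p.y = 4;

    std::ostringstream out;
    p.serialize(out);                   // broken down into "3 4 "

    std::istringstream in(out.str());
    Point q = Point::deserialize(in);   // rebuilt, member by member
    return q.x * 10 + q.y;
}
```

Even for a trivial struct like this, every member must be formatted out and parsed back in; for a real graph of objects with cycles and shared pointers, the bookkeeping (and the cost) grows quickly.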

But the data structure is already in memory, in a sequence of bytes. Couldn't that just be copied to disk, and read back in again? Yes, if we're willing to step outside of defined behavior.

The first thing to do is organize the data structure so it is entirely contained within a known block or sequence of blocks of memory. This means it may not contain any pointers or references to objects outside of itself. That also means no references to string literals, which are stored in the static program data, or any other static data.

Next, create a custom allocator/deallocator for the objects that go into the data structure, so that they are allocated within a known block or sequence of blocks, and not mixed up with other allocations that are not to be saved.
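Such an allocator can be as simple as bumping a pointer through a caller-supplied block. Here is a minimal sketch (the class and member names are hypothetical); note that any alignment of 2 or more keeps the low pointer bit free, which the dehydration trick relies on:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bump allocator: everything it hands out lives inside
// one contiguous, caller-supplied block that can be written to disk
// wholesale.
class BlockAllocator
{
    char *base;
    std::size_t used;
    std::size_t cap;

  public:
    BlockAllocator(char *block, std::size_t size)
        : base(block), used(0), cap(size) {}

    void *alloc(std::size_t n)
    {
        used = (used + 7) & ~std::size_t(7);  // 8-byte alignment
        if (used + n > cap)
            return 0;                          // block exhausted
        void *p = base + used;
        used += n;
        return p;
    }

    // True if p was allocated from this block.
    bool contains(void *p) const
    {
        char *q = (char *)p;
        return q >= base && q < base + cap;
    }
};

int demo()
{
    alignas(8) static char block[256];
    BlockAllocator a(block, sizeof(block));

    int *x = (int *)a.alloc(sizeof(int));
    int *y = (int *)a.alloc(sizeof(int));
    *x = 7;
    *y = 35;

    return (a.contains(x) && a.contains(y)) ? *x + *y : -1;
}
```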

So, those blocks can then just be rolled out directly into disk files, and rolled back in. Right? Nope, because there are pointers. When it's rolled back in, it's probably not going to wind up at the same address, and the pointers will all be pointing to the wrong place.

We need a way to dehydrate the pointers when writing the blocks to disk, and hydrate them (just add water!) when reading a block back in, so they become real pointers again. This can be done by walking the data structure and, for each pointer in each object, converting that pointer from an address to an offset from the start of the block. Each class gets a dehydrate function, which handles all the pointers in its objects.

But there's a problem. If the data structure is cyclic, or has multiple pointers to the same object, we need a way to determine if a pointer has been dehydrated already or not. Just comparing the pointer value to see what range it lies in is not enough, as an offset into the block(s) may overlap the physical addresses the block is mapped into.

The solution lies in noticing that allocators typically allocate data that is aligned on 2 or 4 byte boundaries (and since we're writing a custom allocator for this, we can ensure this property). That means, for a valid pointer, the least significant bit will always be 0. We can then define a dehydrated pointer as being odd, and a hydrated pointer as being even. Dehydrating a pointer then becomes:

void ptr_dehydrate(void **p)
{
    if (*p)
        *(long *)p |= 1;
}

and hydrating it becomes:

int isdehydrated(void *p)
{
    return ((long)p & 1) != 0;
}

void ptr_hydrate(void **p)
{
    if (isdehydrated(*p))
        *(char **)p -= ptr_adjust;
}

and for a class:

class Foo
{
    void *p;
    Bar *o;

    virtual void dehydrate()
    {
        ptr_dehydrate(&p);
        if (o && !isdehydrated(o))
        {
            o->dehydrate();
            ptr_dehydrate((void **)&o);
        }
    }

    virtual void hydrate()
    {
        ptr_hydrate(&p);
        if (isdehydrated(o))
        {
            ptr_hydrate((void **)&o);  // hydrate the pointer before following it
            o->hydrate();
        }
    }
};

In practice, this works out really fast. But since it is "bad" C++ code, there are some problem areas that must be avoided or accounted for:

Pointers into the middle of strings may already be odd, so this technique won't work.

Ensure there are no references to static data or data outside of the block(s).

Any objects with virtual functions, or derived from a class with virtual functions, are going to have a vptr member which points to a static vtbl array. These are all part of the implementation-defined areas of any C++ class implementation. It's likely that the vptrs will not be pointing to the right vtbl when the object is reconstituted. The ways to deal with this are to 1) avoid using such classes or 2) when the object is being hydrated, jam in the right vptr value.

The code may have to be adjusted for the target machine, so that the bit being tested and set really is the least significant bit of the pointer.
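Putting the pieces together, here is a self-contained sketch of an offset-based variant of the trick (the names are illustrative, and these functions take an explicit base, unlike the ptr_adjust version above): pointers inside a block are dehydrated to odd offsets, the block is copied to a new address to simulate a save and reload, and the pointers are hydrated against the new base:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Node
{
    int value;
    Node *next;   // points only within the node's own block
};

// Dehydrate: replace an in-block address with (offset from base) | 1.
void ptr_dehydrate(void **p, char *base)
{
    if (*p)
        *p = (void *)((std::uintptr_t)((char *)*p - base) | 1);
}

int isdehydrated(void *p)
{
    return ((std::uintptr_t)p & 1) != 0;
}

// Hydrate: strip the low bit and rebase against wherever the block is now.
void ptr_hydrate(void **p, char *base)
{
    if (isdehydrated(*p))
        *p = base + ((std::uintptr_t)*p & ~(std::uintptr_t)1);
}

int demo()
{
    alignas(8) static char block[2 * sizeof(Node)];
    Node *a = (Node *)block;
    Node *b = (Node *)(block + sizeof(Node));
    a->value = 1;
    a->next = b;
    b->value = 2;
    b->next = 0;

    ptr_dehydrate((void **)&a->next, block);   // now an odd offset, not an address

    // Simulate writing the block out and reading it back at a new address.
    alignas(8) static char reloaded[sizeof(block)];
    std::memcpy(reloaded, block, sizeof(block));

    Node *a2 = (Node *)reloaded;
    ptr_hydrate((void **)&a2->next, reloaded);  // rebase against the new block
    return a2->value * 10 + a2->next->value;
}
```

Like everything in this article, the casts here step outside defined behavior; they happen to work on mainstream compilers.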

Vptr Jamming

Vptrs are one of the typical under-the-hood implementation details of a class. Virtual functions are often implemented by adding a hidden member called a vptr, which points to a table of functions called the vtbl[]. A virtual function call is performed by using the vptr to find the vtbl[], and then calling the function at a specific index into that vtbl[]. Therefore, the polymorphic behavior of a class object is controlled by where the hidden vptr member is pointing. The vptr member is set when the object is constructed, by some hidden code added by the compiler to every constructor for that object's class.

So, by manipulating the vptr ourselves, we can control the behavior of an object, even change its type, without calling a constructor. This technique is called vptr jamming.

For example, consider a collection class which needs to be very fast. Most of the time, it will be used in a single-threaded manner, but sometimes, in a multithreaded manner. The program can go back and forth between single and multithreaded more than once during execution. It's got to run as fast as possible, so in single-threaded mode the time spent to do locks is unaffordable. So, our class might look like:

struct Collection
{
    ... members of the collection ...
    virtual void foo() = 0;
};

struct SingleThreadedCollection : Collection
{
    void foo() { ... optimized for single threaded ... }
};

struct MultiThreadedCollection : Collection
{
    void foo() { ... synced for multithreaded ... }
};

The desire is to switch back and forth between single and multithreaded operation without having to destruct and reconstruct the object; we want to dynamically change the behavior by switching (i.e. jamming) the vptr. Here's how that would look:

struct Collection
{
    ... members of the collection ...
    virtual void foo() = 0;
    void toSingle();
    void toMulti();
};

struct SingleThreadedCollection : Collection
{
    static SingleThreadedCollection tmp;
    void foo() { ... optimized for single threaded ... }
};

struct MultiThreadedCollection : Collection
{
    static MultiThreadedCollection tmp;
    void foo() { ... synced for multithreaded ... }
};

SingleThreadedCollection SingleThreadedCollection::tmp;
MultiThreadedCollection MultiThreadedCollection::tmp;

void Collection::toSingle()
{
    *(void **)this = *(void **)&SingleThreadedCollection::tmp;
}

void Collection::toMulti()
{
    *(void **)this = *(void **)&MultiThreadedCollection::tmp;
}

The assignment in the toXxxx() functions gets the value of the right vptr from the static temporary tmp created for just that purpose, and jams it into the vptr location in *this. For most compilers, the vptr is at offset 0 of the struct. For the rest, this code will have to be tweaked to account for it.

Naturally, there are problems that must be avoided or accounted for here as well:

All the class types which are to be jammed must be of the same size, or at least, the allocated size must be the size of the largest.

Different compilers, and even different versions of the same compiler, will store the vptr at different offsets within the object instance. A bit of simple testing and looking at the generated assembler will quickly find it.

Vptr jamming assumes that, for each of the class types to be jammed, the vptr will be at the same offset. This is true of all the compilers I tried it on, but there is certainly no guarantee that this is true.
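With those caveats in mind, here is a compilable sketch of the jamming trick. The integer return values merely stand in for the single- and multithreaded code paths, and the volatile pointer just keeps the compiler from devirtualizing the calls; the whole thing leans on the vptr-at-offset-0 implementation detail:

```cpp
#include <cassert>

struct Collection
{
    virtual int foo() = 0;
    void toSingle();
    void toMulti();
};

struct SingleThreadedCollection : Collection
{
    static SingleThreadedCollection tmp;
    int foo() { return 1; }   // stands in for the lock-free path
};

struct MultiThreadedCollection : Collection
{
    static MultiThreadedCollection tmp;
    int foo() { return 2; }   // stands in for the locking path
};

SingleThreadedCollection SingleThreadedCollection::tmp;
MultiThreadedCollection MultiThreadedCollection::tmp;

// Jam the vptr from the static temporaries (assumes vptr at offset 0).
void Collection::toSingle() { *(void **)this = *(void **)&SingleThreadedCollection::tmp; }
void Collection::toMulti()  { *(void **)this = *(void **)&MultiThreadedCollection::tmp; }

int demo()
{
    SingleThreadedCollection c;
    Collection * volatile p = &c;   // volatile keeps the calls genuinely virtual
    int before = p->foo();          // dispatches through the single-threaded vtbl
    p->toMulti();                   // jam in the multithreaded vptr
    int after = p->foo();           // now dispatches through the multithreaded vtbl
    return before * 10 + after;
}
```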

RTTI Sniping

As observed with vptr jamming, the polymorphic behavior of an object instance is controlled by what its vptr is pointing to. It's not a big leap from that to realize that by testing the value in the vptr, the type of the object can be determined.

The usual way to determine the derived type of an object is by doing a dynamic_cast. If the dynamic_cast to a derived class succeeds, it returns a pointer to the derived class. If it fails, it returns NULL.

dynamic_cast is slooooww. For example:

struct A { virtual int foo(); };
struct B : A { int foo(); };

int test(A *a)
{
    return dynamic_cast<B*>(a) != 0;
}

Here's the generated assembly code for the test to see if a is really an instance of B:

mov EAX,4[ESP]
test EAX,EAX
je L24
push 0
push offset FLAT:___ti?AUB@@
push offset FLAT:___ti?AUA@@
push EAX
mov ECX,[EAX]
push dword ptr -4[ECX]
call near ptr ?__rtti_cast@@YAPAXPAX0PBD1H@Z
add ESP,014h
jmp short L26
L24: xor EAX,EAX
L26: neg EAX
sbb EAX,EAX
neg EAX
ret

There are lots of instructions being executed, and a function call. It also relies on RTTI being generated for the class, which is bbllooaatt.

If only we could snipe the RTTI and figure out the type directly. If we've got the need for speed, we can do the following:

B tmp;

int test(A *a)
{
    return *(void**)a == *(void**)&tmp;
}

All this does is compare the vptr in a with the vptr in tmp. Most compilers put the vptr as the first member in a class most of the time, so this will work. When it doesn't, adjust the offsets into a and &tmp to match.

The generated assembler code looks like:

mov EAX,4[ESP]
mov ECX,[EAX]
cmp ECX,?tmp@@3UB@@A
mov EAX,1
je L15
xor EAX,EAX
L15: ret

Holy hotrod, Batman! That brought the test for the type down to two instructions. We can even do slightly better. The Digital Mars C++ compiler has special support for RTTI sniping with the __istype pseudo member function:

int test(A *a)
{
    return a->__istype(B) != 0;
}

and we're down to one instruction:

mov EAX,4[ESP]
cmp dword ptr [EAX],offset FLAT:??_QB@@6B@[4]
mov EAX,1
je L13
xor EAX,EAX
L13: ret

The obvious question is, why doesn't dynamic_cast produce the short, fast code? The answer is that RTTI sniping only works if the class type being tested for is the most derived class in the class hierarchy (because that determines the vtbl[]), whereas dynamic_cast needs to work for any derived class.

Once again, there are problems with RTTI sniping:

Some implementations are able to share vtbl[]s between classes, so the vptrs for class B and class A, where B is derived from A, point to the same value. This clever compiler optimization must be defeated; sometimes it can be via a switch to "turn on RTTI", or by some other switch. Worst case, avoid using RTTI sniping between classes derived from one another.

Some (generally older) compilers can generate multiple vtbl[]s for the same class, so that two vptrs can hold different addresses but still represent the same type. Fortunately, such compilers are rarely used these days. But the problem can still crop up if one DLL generates one instance while another DLL generates another. The moral is to have all the constructors for a particular object implemented in one source file, and have that file be in only one DLL, not many.
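Here is a self-contained sketch of the sniping test (hypothetical class names): we check whether an object is exactly a B by comparing its vptr against that of a reference instance, again relying on the vptr living at offset 0:

```cpp
#include <cassert>

struct A { virtual int foo() { return 0; } };
struct B : A { int foo() { return 1; } };
struct C : A { int foo() { return 2; } };

B tmp_b;   // reference instance; its vptr identifies "exactly a B"

// The sniping test: compare vptrs (assumes the vptr is at offset 0).
bool is_exactly_B(A *a)
{
    return *(void **)a == *(void **)&tmp_b;
}

int demo()
{
    B b;
    C c;
    A *pb = &b;
    A *pc = &c;
    // b matches tmp_b's vptr; c, despite sharing the base A, does not.
    return (is_exactly_B(pb) ? 1 : 0) * 10 + (is_exactly_B(pc) ? 1 : 0);
}
```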

The Counterfeit this

The two conventional methods for hiding the implementation of a class are:

1. Declare the implementation as private:

#include "implementation.h"

class Foo
{
  private:
    ... the implementation ...

    // the interface
  public:
    void bar() { ... manipulate the implementation ... }
};

The trouble with this, of course, is that the implementation is still there with its bare face hanging out, and in order to compile it, every irrelevant thing that the implementation needs has to be in scope, too.

2. Use the PIMPL idiom as described by Herb Sutter, where the class contains a pointer to the implementation of the class:

// User sees this class definition
class Implementation;  // stub definition

class Foo
{
  private:
    Implementation *pimpl;

    // the interface
  public:
    Foo();
    void bar();
};

// Separate, hidden version of Foo
#include "implementation.h"

Foo::Foo() : pimpl(new Implementation()) { }

void Foo::bar() { pimpl->bar(); }

This succeeds in hiding the implementation details, at the cost of another layer of allocation and an extra object instance.

But there's a way to hide the implementation completely without having an extra object. The idea is to counterfeit the this pointer, so that the user thinks it is one type, but the implementation knows it is another:

// User sees this class definition
class Foo
{
    // the interface
  public:
    static Foo *factory();  // create and initialize an instance
    void bar();
};

// Separate, hidden version of Foo
#include "implementation.h"

Foo *Foo::factory()
{
    return reinterpret_cast<Foo *>(new Implementation());
}

void Foo::bar()
{
    reinterpret_cast<Implementation *>(this)->bar();
}

The reinterpret_cast is doing the dirty work of counterfeiting the type of the object from Implementation to Foo and back again.

Caveats:

The class Foo cannot have any data members, even hidden ones like a vptr. Therefore, it cannot have any virtual functions.

Foo cannot have any constructors, because we aren't constructing a real Foo, only a counterfeit one.
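Here is a compilable sketch of the whole counterfeit-this pattern. The Implementation class is a stand-in, and a destroy() member is added here (an assumption beyond the article's listing) so the object is deleted as its real type rather than through the counterfeit one:

```cpp
#include <cassert>

// Stand-in for the hidden implementation the user never sees.
struct Implementation
{
    int state;
    Implementation() : state(41) {}
    int bar() { return ++state; }
};

// User-visible handle: no data members, no virtuals, no constructors.
class Foo
{
  public:
    static Foo *factory()
    {
        return reinterpret_cast<Foo *>(new Implementation());
    }
    int bar()
    {
        // Counterfeit this: we know what the object really is.
        return reinterpret_cast<Implementation *>(this)->bar();
    }
    void destroy()   // delete as the real type, not the counterfeit one
    {
        delete reinterpret_cast<Implementation *>(this);
    }
};

int demo()
{
    Foo *f = Foo::factory();
    int r = f->bar();    // forwards to Implementation::bar
    f->destroy();
    return r;
}
```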

Conclusion

C++ provides great tools for under-the-hood optimization, but their use is uniformly and actively discouraged. If you're willing to accept that you're going outside all recommended practice, there are some neat things you can do to hotrod your C++ application.

These techniques are also applicable to the D programming language[1].

Sometimes, you just feel the need for speed.


Notes and References

[1] The D programming language.

About the Author

Walter Bright graduated from Caltech in 1979 with a degree in mechanical engineering. He worked for Boeing for 3 years on the development of the 757 stabilizer trim system. He then switched to writing software, in particular compilers, and has been writing them ever since.