The "Empty Member" C++ Optimization

by Nathan C. Myers

The Standard C++ library is filled with useful templates, including extended versions of those found in the award-winning STL (See DDJ Mar95). These templates offer great flexibility, yet they are optimized for best performance in the common case. As useful as they are in programs, they are equally useful as examples of effective design, and as a source of inspiration for ways to make your own components as efficient and flexible.

Some of the ways that they offer flexibility involve "empty" classes: classes which have no data members. Ideally, these empty classes should consume no space at all. They typically declare typedefs or member functions, and you can replace them with your own classes (which might not be empty) to handle special needs. The default empty class is by far the most commonly used, however, so this case must be optimal so that we do not all pay for the occasional special needs of a few.

Due to an unfortunate detail of the language definition (explained fully later), instances of empty classes usually occupy storage. In members of other classes, this space overhead can make otherwise-small objects large enough to be unusable in some applications. If this overhead couldn't be avoided in the construction of the Standard Library, the cost of the library's flexibility would drive away many of its intended users. Optimization techniques used in the Standard library are equally useful in your own code.

Empty Member Bloat

Here's an example. In the Standard C++ library each STL Container constructor takes, and copies, an allocator argument. Whenever the container needs storage, it asks its allocator member. In this way a user with some special memory to allocate (such as a shared memory region) can define an allocator for it, and pass that to a container constructor, so that container elements will be placed there.
In the common case, however, the Standard default allocator is used, which just delegates to the global operator new. It is an empty object.

Here's simplified code for a possible implementation of these components. In Listing 1, the Standard default allocator, allocator<>, only has function members.

Listing 1: The Standard Default Allocator

    template <class T>
    class allocator {   // an empty class
      . . .
      static T* allocate(size_t n)
        { return (T*) ::operator new(n * sizeof(T)); }
      . . .
    };

In Listing 2, the generic list container template keeps a private allocator member, copied from its constructor argument.

Listing 2: The Standard Container list<>

    template <class T, class Alloc = allocator<T> >
    class list {
      . . .
      Alloc alloc_;
      struct Node { . . . };
      Node* head_;

     public:
      explicit list(Alloc const& a = Alloc())
        : alloc_(a) { . . . }
      . . .
    };

Notes:

The list<> constructor is declared "explicit" so that the compiler will not use it as an automatic conversion. (This is [was once] a recent language feature.)

How a list<> gets storage for Node objects, from an allocator intended to supply storage for T objects, is a subject for an upcoming article. [The article never came up. It suffices to note that allocator<U> has a constructor from allocator<T>, for any U, T.]

The member list<>::alloc_ occupies, usually, four bytes in the object, even in the default case where Alloc is the empty class allocator<T>. A few extra bytes for the list object doesn't seem so bad until you imagine a big vector of these lists, as in a hash table. Any extra junk in each list is multiplied, and reduces the number of list headers that fit in a virtual memory page, and in cache. Wasted space makes slower programs, even today; memory latency is much larger, relative to CPU clocks, than ever.

Empty Objects

How can this overhead be avoided? To see, you need to understand why the overhead is there.
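Before looking at the cause, the overhead itself is easy to observe. The following is a minimal sketch, not library code: EmptyAlloc, ListHeader and member_overhead are hypothetical names standing in for allocator<T> and the list<> of Listing 2, and static_assert assumes a C++11 or later compiler. The exact number of wasted bytes depends on the platform; it is typically the size of a pointer.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for the empty default allocator of Listing 1.
struct EmptyAlloc {
    static void* allocate(std::size_t n) { return ::operator new(n); }
};

struct Node;  // only pointers to Node are needed for this size check

// Hypothetical stand-in for the list<> header of Listing 2.
struct ListHeader {
    EmptyAlloc alloc_;  // no data members, yet it must occupy storage
    Node*      head_;
};

// Bytes each header carries beyond the one pointer it actually needs.
constexpr std::size_t member_overhead = sizeof(ListHeader) - sizeof(Node*);

static_assert(sizeof(EmptyAlloc) >= 1, "an empty class still has nonzero size");
static_assert(member_overhead > 0, "the empty member enlarges the header");
```

In a vector of such headers, member_overhead is paid once per element, which is the multiplication effect described above.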
The Standard C++ language definition says:

    A class with an empty sequence of members and base class objects is an empty class. Complete objects and member subobjects of an empty class type shall have nonzero size.

Why must objects with no member data occupy storage? Consider:

    struct Bar { };
    struct Foo {
      struct Bar a[2];
      struct Bar b;
    };
    Foo f;

What are the addresses of f.b and the elements of f.a[]? If sizeof(Bar) were zero, they might all have the same address. If you were keeping track of separate objects by their addresses, f.b and f.a[0] would appear to be the same object. The Standard committee chose to finesse these issues by forbidding zero-sized addressable objects.

Still, why does an empty member take up so much space (four bytes, in our example)? On all the compilers I know of, sizeof(Bar) is 1. However, on most architectures objects must be aligned according to their type. For example, if you declare

    struct Baz {
      Bar b;
      int* p;
    };

on most architectures today, sizeof(Baz) is 8. This is because the compiler adds "padding" so that member Baz::p doesn't cross a memory word boundary. (See Figure 1a.)

Figure 1a: struct Baz

    +-----------------------------------+
    | +-------+-------+-------+-------+ |
    | | Bar b | XXXXX | XXXXX | XXXXX | |
    | +-------+-------+-------+-------+ |
    | +-------------------------------+ |
    | |            int* p             | |
    | +-------------------------------+ |
    +-----------------------------------+

Figure 1b: struct Baz2

    +-----------------------------------+
    | +-------------------------------+ |
    | |            int* p             | |
    | +-------------------------------+ |
    +-----------------------------------+

How can you avoid this overhead? The Draft hints, in a footnote:

    A base class subobject of an empty class type may have zero size.

In other words, if you declared Baz2 this way,

    struct Baz2 : Bar {
      int* p;
    };

then a compiler is allowed to reserve zero bytes for the empty base class Bar; hence, sizeof(Baz2) can be just 4 on most architectures. (See Figure 1b.)
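The size rules above can be checked at compile time. This is a minimal sketch that reuses the Bar, Foo, Baz and Baz2 declarations from the text; the names empty_base_optimized and foo_addresses_distinct are hypothetical, and static_assert and constexpr assume a C++11 or later compiler. The sizes are expressed relative to sizeof(int*) rather than as 8 and 4, because a platform with eight-byte pointers typically gives 16 and 8 instead.

```cpp
struct Bar { };                      // the empty class from the text

struct Foo { Bar a[2]; Bar b; };     // three distinct empty subobjects

struct Baz  { Bar b; int* p; };      // Bar as a member: takes storage
struct Baz2 : Bar { int* p; };       // Bar as a base: may take none

// Follows from the rule quoted above: no zero-sized complete objects.
static_assert(sizeof(Bar) >= 1, "an empty class still has nonzero size");
static_assert(sizeof(Foo) >= 3 * sizeof(Bar), "a[0], a[1] and b each need storage");
static_assert(sizeof(Baz) > sizeof(int*), "member b and its padding enlarge Baz");

// Permitted but not required by the footnote: this is true only when the
// compiler applies the empty-base optimization.
constexpr bool empty_base_optimized = (sizeof(Baz2) == sizeof(int*));

// The three empty subobjects of a Foo never share an address.
inline bool foo_addresses_distinct() {
    Foo f = Foo();
    return &f.a[0] != &f.a[1] && &f.a[1] != &f.b && &f.a[0] != &f.b;
}
```

Keeping empty_base_optimized as a value rather than an assertion lets a build report whether a given compiler performs the optimization without failing on one that does not.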
Compiler implementers are not _required_ to do this optimization, and many don't, yet. However, you can expect that most Standard-conforming compilers will, because the efficiency of so many components of the Standard library (not only the containers) depends on it.

Eliminating Bloat

You have found the principle you need to eliminate the space overhead. How can you apply it? Let's consider how you might fix the implementation of the example, template list<>. You could just derive from the allocator, as in Listing 3,

Listing 3: A Naïve Way to Eliminate Bloat

    template <class T, class Alloc = allocator<T> >
    class list : private Alloc {
      . . .
      struct Node { . . . };
      Node* head_;

     public:
      explicit list(Alloc const& a = Alloc())
        : Alloc(a) { . . . }
      . . .
    };

and it would work, mostly. Code in the list<> member functions would get storage by calling "this->allocate()" instead of "alloc_.allocate()". However, the Alloc type supplied by the user is allowed to have virtual members, and these could conflict with a derived list<> member. (To see this, imagine a private member void list<>::init(), and a virtual member bool Alloc::init().)

A much better approach is to package the allocator with a list<> data member, such as the pointer to the first list node (as in Listing 4), so that the allocator's interface cannot leak out.

Listing 4: A Better Way to Eliminate Bloat

    template <class T, class Alloc = allocator<T> >
    class list {
      . . .
      struct Node { . . . };
      struct P : public Alloc {
        P(Alloc const& a) : Alloc(a), p(0) { }
        Node* p;
      };
      P head_;

     public:
      explicit list(Alloc const& a = Alloc())
        : head_(a) { . . . }
      . . .
    };

Now, list<> members get storage by saying "head_.allocate()", and mention the first list element by saying "head_.p". This works perfectly; there's no unnecessary overhead of any kind, and users of list<> can't tell the difference. Like any good optimization, it makes the implementation a bit messier, but doesn't affect the interface.
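To make the Listing 4 layout concrete, here is a small self-contained sketch of it under stated assumptions. RawAlloc, mini_list and mini_list_demo are hypothetical names, not part of any library. RawAlloc hands out raw bytes rather than following the Standard allocator interface, which sidesteps the Node-versus-T question the notes above leave aside. Copying is disabled and exception safety is ignored to keep the sketch short.

```cpp
#include <cstddef>
#include <new>

// Hypothetical empty allocator that hands out raw bytes.
struct RawAlloc {
    void* allocate(std::size_t bytes) { return ::operator new(bytes); }
    void  deallocate(void* p)         { ::operator delete(p); }
};

// Hypothetical cut-down list showing only the Listing 4 layout trick.
template <class T, class Alloc = RawAlloc>
class mini_list {
    struct Node {
        T     value;
        Node* next;
        Node(T const& v, Node* n) : value(v), next(n) { }
    };

    // The allocator is a base of the head pointer's wrapper, so its
    // members stay out of mini_list's own scope.
    struct P : public Alloc {
        P(Alloc const& a) : Alloc(a), p(0) { }
        Node* p;
    };
    P head_;

    mini_list(mini_list const&);             // copying omitted in this sketch
    mini_list& operator=(mini_list const&);

public:
    explicit mini_list(Alloc const& a = Alloc()) : head_(a) { }

    ~mini_list() {
        while (head_.p) {
            Node* n = head_.p;
            head_.p = n->next;
            n->~Node();              // destroy the node, then release its bytes
            head_.deallocate(n);
        }
    }

    void push_front(T const& v) {
        void* raw = head_.allocate(sizeof(Node));   // storage via head_
        head_.p = new (raw) Node(v, head_.p);       // first element via head_.p
    }

    T const& front() const { return head_.p->value; }
    bool empty() const     { return head_.p == 0; }
};

// Usage: push two values and read back the most recent one.
inline bool mini_list_demo() {
    mini_list<int> l;
    l.push_front(1);
    l.push_front(2);
    return !l.empty() && l.front() == 2;
}
```

On a compiler that performs the empty-base optimization, sizeof(mini_list<int>) is expected to equal the size of one pointer; on one that does not, the code still works and is merely larger.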
A Packaged Solution

There is still room for improvement. As usual, the improvement involves a template. In Listing 5 we've packaged the technique so that it is easy and clean to use.

Listing 5: Packaging the Technique

    template <class Base, class Member>
    struct BaseOpt : Base {
      Member m;
      BaseOpt(Base const& b, Member const& mem)
        : Base(b), m(mem) { }
    };

Using this template, our list<> declaration appears as in Listing 6.

Listing 6: The Best Way to Eliminate Bloat

    template <class T, class Alloc = allocator<T> >
    class list {
      . . .
      struct Node { . . . };
      BaseOpt<Alloc,Node*> head_;

     public:
      explicit list(Alloc const& a = Alloc())
        : head_(a,0) { . . . }
      . . .
    };

This declaration of list<> is no bigger or messier than the unoptimized version we started with. Any other library component (Standard or otherwise) can use BaseOpt<> just as easily. The member code is only slightly messier; while it's not immediately obvious what's going on, the declaration of BaseOpt<> provides a natural place to document the technique and the reasons for it.

It is tempting to add members to BaseOpt<>, but that would not improve it: they could conflict with members inherited from the Base parameter, just as in Listing 3.

Finally

This technique can be used now, under any compiler with sturdy template support. Not all C++ compilers support the empty-base optimization yet, though the Sun, HP, IBM, and Microsoft compilers do, but the technique costs nothing extra even on those that don't. When you get a Standard-conforming compiler, it probably will do the optimization, and if your code uses this technique, it will become more efficient automatically.

Fergus Henderson contributed an essential refinement to this technique.

Update: the latest Borland compiler is able to do the optimization, controlled by a mode switch; it defaults off in the current version. Watcom's and Symantec's compilers do the optimization. The Metaware compiler does it when running under OS/2. Gcc-3 does it.
Apple's MrCpp does it. Metrowerks's 4.0 compiler does it, but (like IBM's) is a bit too aggressive and can put two subobjects of the same type at the same address; I don't know about later ones. (Other compilers that do the optimization will be listed here, as I get notified.)

A Watcom engineer reports that STL benchmarks ran 30% faster after they implemented the empty-base optimization.

Update again: A whole family of related "empty subobject" optimizations is possible, subject to the ABI specifications a compiler must observe. (Jason Merrill pointed some of these out to me, years back.) For example, consider three struct members of (empty) types A, B, and C, and a fourth non-empty. They may, conformingly, all occupy the same address, as long as they don't have any bases in common with one another or with the containing class.

A common gotcha in practice is to have the first (or only) member of a class derived from the same empty base as the class. The compiler has to insert padding so that the two subobjects have different addresses. This actually occurs in iterator adapters that have an iterator member, both derived from std::iterator. An incautiously-implemented standard std::reverse_iterator might exhibit this problem.

Nathan has worked on the ISO/ANSI C++ Standard since 1993. He designed most of what is in Chapter 22 of the ISO C++ Standard. His interests include library interface design, low level code efficiency, and Free software.