One of the key performance optimization techniques involves reducing the number of dynamic memory allocation that a program does. The reasons are:

Generic memory allocations are relatively slow

Heap-allocated objects reduce the effectiveness of CPU caches (misses are more probable)

There are numerous techniques to alleviate allocations, but not allocating memory from the heap in the first place would be ideal. This is not possible in most cases when the data we’ll manipulate is truly dynamic, but we can identify common cases where 99% of the time an allocation can be avoided – the size is often the same.

C++ strings are a prime example of objects that would allocate too often. A naive implementation would always have the data of the string on the heap, which will require always at least 1 allocation and possibly cache misses when code accesses that content. Developers noticed that and now effectively all std::string implementations implement the so called “small string optimization”. Each string object contains a small inline buffer that will hold the data, if it fits. It’ll allocate only if the data can’t fit that buffer. As most strings in practice are small – they’ll save an allocation and improve cache locality very often.

Small Vectors

The same technique can be implemented for the more generic vector type. It’s also widely implemented in many libraries. A very inspirational lecture on the successes of using “small vector optimization” is given by Chandler Carruth in his talk “High Performance Code 201: Hybrid Data Structures” where he discusses its application in LLVM.

Instead of always allocating the buffer for the objects in the vector on the heap – we have a small inline piece of memory that can hold N objects. The N constant depends on the usage and should be carefully selected depending on the context to trade between allocations and the potential memory waste if we have less than N objects.

I decided to apply the optimization in certain areas of our Hummingbird software and measure the gains. In our “standard” library I implemented a std::vector substitute creatively named “dyn_array” from dynamic array. It has the same public interface as std::vector, so they are easily interchangeable.

It’s declared like this:

template<typename T, typename Allocator, size_t InlineCapacity = 0> class dyn_array

The interesting part is the InlineCapacity param, which defines how many objects will be inline in the structure itself. When InlineCapacity is 0, dyn_array works like a standard vector – everything is allocated through the Allocator – usually from some heap.

To save memory within the dyn_array object, the inline buffer is used also as the pointer to the heap memory when necessary. The declaration is this beauty:

using DataStorage = char [Max_eval<sizeof(T) * InlineCapacity, sizeof(T*)>::value]; alignas(Max_eval<alignof(T), alignof(T*)>::value) DataStorage m_DataStorage = {0};

Edit: @lectem suggested that the same effect is achievable with std::aligned_storage

Max_eval is a meta-function, which returns the maximum of two compile-time values. The DataStorage should be big enough to hold at least a pointer, while the field itself has to be aligned properly for the inline objects.

I use the current capacity and the InlineCapacity constants to decide how to interpret the m_DataStorage. If the CurrentCapacity <= InlineCapacity, then the memory is “inline” and it’s interpreted as an array, otherwise the bytes are interpreted as a T*.

When looking at the potential usages in our code, the first target I had in mind were the CSS declarations (Hummingbird is an HTML rendering engine – like a web browser) for animations. Users can have as many animation properties within a declaration as they want – so the values have to be dynamic, but they very rarely exceed 1-4. Some declarations are also enums, which we limit to 1 byte, this means that on 32-bit targets we can squeeze 4 values within a pointer and on 64-bit 8 – more than enough in 90+% of the cases.

I wrote a small template meta-function that simplifies setting the inline size when you want it to fit exactly in one pointer. When the objects you are going to have in the dyn_array are smaller than a pointer, they are essentially free memory-wise.

The meta-function looks like this:

template<typename T> struct FitInPointer { static const auto value = sizeof(T*) / sizeof(T); };

and is used like this:

dyn_array<Value, CSSAllocator, csl::FitInPointer<Value>::value> Values;

So far I’ve applied the dyn_array in all animation CSS properties and the result have been more than encouraging – reduction of dynamic allocation by 20% in tests and a page load-time reduction of 10%.

I’m now looking at other places where the optimization will have good effects.

Future direction

The next step will be to implement a debugging mechanism that detects vectors that are often re-allocated in test scenarios. They will be analyzed for opportunities of inline sizes. The LLVM folks implemented this to great effect. It’ll be a piece of cake for us, as adding the required debug code in dyn_array is far easier than modifying a STL vectors implementation.