This is an indepth post about advanced memory topics in C++ such as placement new and alignment. The topics covered here will be used in a future post to create a memory pool.

Memory

Generally, in C++ there are three places you can store your data:

On the stack (local variables) On the heap (new / delete) In the static data section (static variables)

Normally, when using the heap, you would use the following command:

Foo* foo_ptr = new Foo();

This would invoke the following actions behind the scenes:

Make an operating system call to allocate a chunk of memory of size sizeof(Foo) Manage that memory with the heap Call the constructor of Foo to build an object at that memory location Initialize foo_ptr with the address of the object

Later on, you would call:

delete foo_ptr;

This would:

Call the destructor of Foo Have the heap give the memory back to the operating system

This is how new and delete work, and most C++ programmers should be familiar with them.

New and Malloc

You should never delete memory allocated with malloc , and you should never free memory allocated with new . malloc and free do raw memory allocations, while new and delete are also responsible for calling constructors and destructors.

Placement New

C++ offers a different “flavor” of new called “placement new”. Placement new gives the user finer control about where the object gets constructed by allowing the object to be “placed” at a specified memory address; in other words, the heap allocation step is bypassed.

Placement new has a few specialized uses. One is memory mapped I/O for embedded systems. This is used when an object must exist at a specific memory address in order for its members to be mapped to external sensors or control pins. Another use case is for memory pools, when the application is responsible for managing its memory instead of relying on the heap.

The syntax for placement new is:

unsigned char memory[sizeof(Foo)]; Foo* foo_ptr = new (memory) Foo();

In this example, Foo was constructed into the memory on the stack, instead of on the heap. It is important to understand that using placement new requires the developer to do its own memory management. memory must be large enough to contain a Foo , and you should never call delete on the pointer returned by placement new.

Delete

Remember that delete calls the destructor, and has the heap give memory back to the operating system. However, the memory here isn’t on the heap! Calling delete on foo_ptr will cause a crash.

If you think of placement new as simply a constructor call, then it follows that a “placement delete” would simply be a destructor call. And an explicit destructor call is exactly how to clean up after a placement new:

foo_ptr->~Foo();

You must not forget this step; it isn’t a memory leak, but it would be a resource leak.

Here is a more detailed (runnable) example:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 #include <new> #include <iostream> #include <cstdio> struct Foo { Foo ( void ) : value ( 0 ) { std :: cout << "Constructor" << std :: endl ; } ~ Foo ( void ) { std :: cout << "Destructor" << std :: endl ; } int value ; }; int main ( int argc , char * argv []) { // Increase scope { // Allocate memory on the stack unsigned char memory [ sizeof ( Foo )]; // Construct a Foo inside that memory Foo * foo_ptr = new ( memory ) Foo (); // Show that the memory addresses are the same printf ( "0x%016lX

" , memory ); printf ( "0x%016lX

" , foo_ptr ); // Call the destructor explicitly foo_ptr ->~ Foo (); } // Memory has been deallocated from the stack }

This outputs:

Constructor 0x00007FFFFFFFE280 0x00007FFFFFFFE280 Destructor

Note that the memory isn’t required to be allocated on the stack; you can just as easily allocate the memory using malloc, and later deallocate the memory with free. In fact, you can think of the regular new call as performing the following actions:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 #include <new> #include <iostream> #include <cstdio> struct Foo { Foo ( void ) : value ( 0 ) { std :: cout << "Constructor" << std :: endl ; } ~ Foo ( void ) { std :: cout << "Destructor" << std :: endl ; } int value ; }; int main ( int argc , char * argv []) { // Allocate memory on the heap unsigned char * memory = static_cast < unsigned char *> ( malloc ( sizeof ( Foo ))); // Construct a Foo inside that memory Foo * foo_ptr = new ( memory ) Foo (); // Show that the memory addresses are the same printf ( "0x%016lX

" , memory ); printf ( "0x%016lX

" , foo_ptr ); // Call the destructor explicitly foo_ptr ->~ Foo (); // Deallocate the memory from the heap free memory ; }

This outputs:

Constructor 0x0000000000664370 0x0000000000664370 Destructor

Dumping Memory

As I said earlier, placement new requires the developer to manage their own memory even more than usual. When working with memory on a low level like this, it becomes very useful to see a hex dump similar to the one created by GDB. With some inspiration1, I created a function that would pretty-print memory to an output stream:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 void dump_memory ( void * ptr , std :: size_t size , std :: ostream & os = std :: cout ) { typedef unsigned char byte ; typedef unsigned long uint ; // Allow direct arithmetic on the pointer uint iptr = reinterpret_cast < uint > ( ptr ); os << "-----------------------------------------------------------------------

" ; os << boost :: format ( "%d bytes" ) % size ; // Get number of digits uint indent = std :: log10 ( size ) + 1 ; // Write the address offsets along the top row // Account for the indent of "X bytes" os << std :: string ( 13 - indent , ' ' ); for ( std :: size_t i = 0 ; i < 16 ; ++ i ) { if ( i % 4 == 0 ){ os << " " ;} // Spaces between every 4 bytes os << boost :: format ( " %2hhX" ) % i ; // Write the address offset } // If the object is not aligned if ( iptr % 16 != 0 ) { // Print the first address os << boost :: format ( "

0x%016lX:" ) % ( iptr & ~ 15 ); // Indent to the offset for ( std :: size_t i = 0 ; i < iptr % 16 ; ++ i ) { os << " " ; if ( i % 4 == 0 ){ os << " " ;} } } // Dump the memory for ( std :: size_t i = 0 ; i < size ; ++ i , ++ iptr ) { // New line and address every 16 bytes, spaces every 4 bytes if ( iptr % 16 == 0 ){ os << boost :: format ( "

0x%016lX:" ) % iptr ;} if ( iptr % 4 == 0 ){ os << " " ;} // Write the address contents os << boost :: format ( " %02hhX" ) % static_cast < uint > ( * reinterpret_cast < byte *> ( iptr )); } os << "

-----------------------------------------------------------------------" << std :: endl ; }

Here is an example of how to use it:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 struct Test { char a ; // 1 byte int b ; // 4 bytes short c ; // 2 bytes long d ; // 8 bytes Test () : a ( 0x11 ), b ( 0x22222222 ), c ( 0x3333 ), d ( 0x4444444444444444 ) { } }; int main ( int argc , char * argv []) { unsigned char memory [ sizeof ( Test )]; Test * ptr = new ( memory ) Test (); dump_memory ( ptr , sizeof ( Test )); ptr ->~ Test (); }

And here is the output:

----------------------------------------------------------------------- 24 bytes 0 1 2 3 4 5 6 7 8 9 A B C D E F 0x00007FFFFFFFE270: 11 FF FF FF 22 22 22 22 33 33 00 00 00 00 00 00 0x00007FFFFFFFE280: 44 44 44 44 44 44 44 44 -----------------------------------------------------------------------

Padding

If you are surprised by the output above, you may not be aware of padding. The compiler will add padding bytes between members in a structure to ensure that each member starts on a proper byte boundary. This means structures may take up more space than if they were packed tightly. Primitive types in a structure will be padded such that they are aligned on byte boundaries that match their size2: A char (one byte) will be 1-byte aligned.

A short (two bytes) will be 2-byte aligned.

An int (four bytes) will be 4-byte aligned.

A long (eight bytes) will be 8-byte aligned.

A float (four bytes) will be 4-byte aligned.

A double (eight bytes) will be 8-byte aligned.

When working with memory pools, the allocated memory will typically be larger than the object itself. It becomes useful to “mark” the memory with special byte patterns for ease of debugging. I like to use:

0xCC for “Clear”

0xAA for “Allocated”

0xDD for “Deallocated”

For example, lets allocate more memory than we need, and construct the object in the middle of the array. The markers will indicate what is happening:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 int main ( int argc , char * argv []) { // Get the size of our structure unsigned int size = sizeof ( Test ); // Reserve more memory than we need unsigned char memory [ size * 2 ]; // Fill that memory up with "C" for "Cleared" memset ( memory , 0xCC , sizeof ( memory )); // Determine where to construct the object unsigned int offset = size / 2 ; // Mark that area as "Allocated" memset ( memory + offset , 0xAA , size ); // Construct an object offset into memory Test * ptr = new ( memory + offset ) Test (); // Lets see what it looks like dump_memory ( ptr , sizeof ( Test )); // Destroy the object ptr ->~ Test (); }

Outputs:

----------------------------------------------------------------------- 24 bytes 0 1 2 3 4 5 6 7 8 9 A B C D E F 0x00007FFFFFFFE220: 11 AA AA AA 0x00007FFFFFFFE230: 22 22 22 22 33 33 AA AA AA AA AA AA 44 44 44 44 0x00007FFFFFFFE240: 44 44 44 44 -----------------------------------------------------------------------

A couple things to notice here:

Only the memory for the object is printed, that is why we don’t see any 0xCC

The object was offset 12 bytes into the memory

The padding bytes are filled with 0xAA as expected

It would be kind of nice to see the memory around the object, to give us some context. One solution is to dump the memory variable instead of the ptr variable, but with large memory pools this can be too much to look at. Instead, lets just print some additional local context:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 void dump_memory_with_context ( void * ptr , std :: size_t size , std :: ostream & os = std :: cout ) { // Allow direct arithmetic on the pointer unsigned long sptr = reinterpret_cast < unsigned long > ( ptr ); // Start pointer unsigned long eptr = sptr + size ; // End pointer sptr &= ~ 15 ; // Round down to the last multiple of 16 sptr -= 16 ; // Step back one line for context eptr &= ~ 15 ; // Round down to the last multiple of 16 eptr += 32 ; // Step forward one line for context // Dump memory dump_memory ( reinterpret_cast < void *> ( sptr ), eptr - sptr , os ); } int main ( int argc , char * argv []) { // Get the size of our structure unsigned int size = sizeof ( Test ); // Reserve more memory than we need unsigned char memory [ size * 2 ]; // Fill that memory up with "C" for "Cleared" memset ( memory , 0xCC , sizeof ( memory )); // Determine where to construct the object unsigned int offset = size / 2 ; // Mark that area as "Allocated" memset ( memory + offset , 0xAA , size ); // Construct an object offset into memory Test * ptr = new ( memory + offset ) Test (); // Lets see what it looks like dump_memory_with_context ( ptr , sizeof ( Test )); // Destroy the object ptr ->~ Test (); }

Now we get:

----------------------------------------------------------------------- 80 bytes 0 1 2 3 4 5 6 7 8 9 A B C D E F 0x00007FFFFFFFE210: A0 E2 FF FF FF 7F 00 00 6C 6B 40 00 00 00 00 00 0x00007FFFFFFFE220: CC CC CC CC CC CC CC CC CC CC CC CC 11 AA AA AA 0x00007FFFFFFFE230: 22 22 22 22 33 33 AA AA AA AA AA AA 44 44 44 44 0x00007FFFFFFFE240: 44 44 44 44 CC CC CC CC CC CC CC CC CC CC CC CC 0x00007FFFFFFFE250: 88 E3 FF FF FF 7F 00 00 75 94 42 00 01 00 00 00 -----------------------------------------------------------------------

This shows a local context memory dump that includes our object and the memory around it. Our markers can be clearly seen, as well as the object itself. In addition, we see some other garbage from the stack.

Alignment

In the last section, we constructed an object offset into our memory area, and the result was misaligned in the stack. This situation is not ideal.

Memory alignment is important because while programmers think in terms of bytes, CPUs think in terms of words. On a 64-bit system, this means the CPU will load 8 bytes at a time from memory. If you have an unaligned member that straddles two words, the CPU needs to make twice as many memory accesses as it would have if the memory was properly aligned.3

GCC will typically handle memory alignment for you when allocating from the heap, but in the case of a memory pool it is up to the developer to handle it properly. There are two rules to follow when aligning memory addresses:

The alignment boundary must be greater or equal to the size of a pointer

The alignment boundary must be a power of 2

One fast and generic method for aligning memory is to allocate a little bit more than needed in order to get to the next boundary, then store the original pointer in the space that was skipped over. It is important to keep the original pointer around because it will be needed when it is time to free that memory.

Below is a modification of an alignment algorithm found in the Eigen4 library. The following enhancements were made:

Allow for user specified alignment

Add compile time checks for alignment rules

Provide a method for accessing the unaligned pointer

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 #include <boost/mpl/assert.hpp> #include <boost/mpl/int.hpp> #include <boost/mpl/comparison.hpp> #include <boost/mpl/bitwise.hpp> #include <boost/mpl/arithmetic.hpp> template < std :: size_t bytes = 0 > struct boundary { // If an alignment is specified, it must be greater than or equal to // the size of a pointer. BOOST_MPL_ASSERT (( boost :: mpl :: greater_equal < boost :: mpl :: int_ < bytes > , boost :: mpl :: int_ < sizeof ( void * ) > > )); // The alignment bytes must be a power of two // (n & (n-1)) == 0 BOOST_MPL_ASSERT (( boost :: mpl :: equal_to < boost :: mpl :: bitand_ < boost :: mpl :: int_ < bytes > , boost :: mpl :: minus < boost :: mpl :: int_ < bytes > , boost :: mpl :: int_ < 1 > > > , boost :: mpl :: int_ < 0 > > )); // In order for this to work, must allocate additional bytes enum { value = bytes }; // Get the next aligned pointer static void * next ( void * ptr ) { // Round down to the previous multiple of X, // then move to the next multiple of X. return reinterpret_cast < void *> ( ( reinterpret_cast < std :: size_t > ( ptr ) & ~ ( std :: size_t ( value - 1 ))) + value ); } // Return an aligned pointer static void * align ( void * ptr ) { // Get the next aligned pointer void * aligned_ptr = next ( ptr ); // Save the original pointer in the space we skipped over * ( reinterpret_cast < void **> ( aligned_ptr ) - 1 ) = ptr ; // Return the aligned pointer return aligned_ptr ; } // Retrieve the original pointer static void * unalign ( void * ptr ) { return * ( reinterpret_cast < void **> ( ptr ) - 1 ); } }; // Specialize to not attempt alignment template <> struct boundary < 0 > { enum { value = 0 }; static void * next ( void * ptr ){ return ptr ;} static void * align ( void * ptr ){ return ptr ;} static void * unalign ( void * ptr ){ return ptr ;} };

Using it is simple: Any pointer can be aligned on a boundary, and any aligned pointer can be unaligned. However, it is important to remember that extra space must be allocated before attempting to align the pointer. The following example will show how to align a structure on a 16 byte boundary:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 int main ( int argc , char * argv []) { // Get the size of our structure unsigned int size = sizeof ( Test ); // Purposely offset the object by 8 bytes unsigned int offset = 8 ; // Reserve size for our structure // plus alignment overhead // plus our test offset unsigned char * memory = static_cast < unsigned char *> ( malloc ( size + boundary < 16 >:: value + offset )); // Print out the address of the malloc'd memory pointer printf ( "Memory: 0x%016lX

" , memory ); // Fill that memory up with "C" for "Cleared" memset ( memory , 0xCC , sizeof ( memory )); // Mark the object area as "Allocated" memset ( memory + offset , 0xAA , size + boundary < 16 >:: value ); // Print out the address of where we will construct the object printf ( "Memory + Offset: 0x%016lX

" , memory + offset ); // Construct an object offset into memory, // but use pointer alignment to realign it Test * ptr = new ( boundary < 16 >:: align ( memory + offset )) Test (); // Print out the address of the aligned pointer printf ( "Ptr: 0x%016lX

" , static_cast < void *> ( ptr )); // Print out the address of the unaligned pointer printf ( "Unaligned Ptr: 0x%016lX

" , boundary < 16 >:: unalign ( ptr )); // Lets see what it looks like dump_memory_with_context ( ptr , sizeof ( Test )); // Destroy the object ptr ->~ Test (); // Unalign the aligned pointer to get the original pointer back // then remove the offset to properly free free ( boundary < 16 >:: unalign ( ptr ) - offset ); }

This outputs:

Memory: 0x00000000006648F0 Memory + Offset: 0x00000000006648F8 Ptr: 0x0000000000664900 Unaligned Ptr: 0x00000000006648F8 ----------------------------------------------------------------------- 64 bytes 0 1 2 3 4 5 6 7 8 9 A B C D E F 0x00000000006648F0: CC CC CC CC CC CC CC CC F8 48 66 00 00 00 00 00 0x0000000000664900: 11 AA AA AA 22 22 22 22 33 33 AA AA AA AA AA AA 0x0000000000664910: 44 44 44 44 44 44 44 44 AA AA AA AA AA AA AA AA 0x0000000000664920: 4B 00 00 00 00 00 00 00 31 00 00 00 00 00 00 00 -----------------------------------------------------------------------

So what happened here? Here is the output color-coded:

Memory Dump

Red: The debug markers

Orange: The 16-byte aligned pointer address

Yellow: The test object

Green: The original unaligned / offset pointer address

Blue: The beginning of the memory

Now, remember that we offset the object 8 bytes into the beginning of the memory, which would be 0x006648F8 . The object was then aligned to the next 16-byte boundary, starting at 0x00664900 (16-byte aligned addresses always end in 0). In order to do this, an extra 16 bytes needed to be allocated, which can be seen with the red 0xAA markers. Finally, the original pointer of 0x006648F8 was saved immediately before the aligned pointer in little endian (which in this case happens to have the same address as itself).

Endianness

Endianness is the ordering of bytes in a word (On a 64-bit machine, there are 8 bytes in a word)5. The bytes can be ordered in one of two directions: Big Endian: The most significant byte is stored at the smallest memory address (the “Big End” first)

Little Endian: The least significant byte is stored at the smallest memory address (the “Little End” first) Big Endian is the more intuitive method, as it matches reading left to right. It is chosen as the standard for transmitting words over a network, and is also known as “Network Byte Order”. Little Endian is more difficult for a human to read, but it has performance advantages and desirable properties. For example, a 32-bit memory location with content 4A 00 00 00 can be read at the same address as either 8-bit (value = 0x4A ), 16-bit ( 0x004A ), or 32-bit ( 0x0000004A ), all of which retain the same numeric value. Intel’s x86/x64 architecture is little endian. It is important to be aware that endianness doesn’t affect byte arrays or strings.

Here is what the memory would look like if we did the alignment without the forced 8 byte offset:

Wasted Space

As you can see, even though the memory was already aligned, 16 bytes were wasted in order to get to the next aligned address. You can also see the original pointer address stored immediately before the aligned address. If we were doing 8 byte alignment, the extra 8 bytes that are wasted to align on the next boundary would contain the original address. Attempting to perform an alignment less than 8 means that there wouldn’t be enough space for the original pointer; however, one of the rules of alignment states that the alignment must be greater than or equal to the size of a pointer, so we don’t need to worry about that case.

What about larger alignments? Here is what it would look like if we were doing 32-byte alignment with an 8 byte offset:

Large Alignment

Generally, 32-byte alignment is unnecessary. There are cases where 16-byte alignment is needed though, such as when dealing with the SSE instructions on the CPU (for example, matrix multiplication with the Eigen library).