It is often said that object allocation in .NET is “cheap”. I fully agree with this sentence, because the most important part is its continuation: allocation is cheap, but allocating a lot of objects will hit you back sooner or later, when the garbage collector kicks in and starts messing around. Thus, the fewer allocations, the better.

However, I would like to add a few words about the “allocation is cheap” part itself. It is true to some extent, because the typical object allocation path is indeed very fast. Most often the so-called bump-a-pointer technique is used. It consists of the following simple steps:

it uses the so-called allocation pointer as the address of the newly created object,

it increases the allocation pointer by the requested size (so the next object will be created there).

For example, when running in Workstation GC mode, the following assembly code will be executed (in Server GC mode it is almost the same, but INLINE_GETTHREAD is additionally used to get the current thread’s Thread Local Storage):

```
; IN:  rcx: MethodTable*
; OUT: rax: new object
LEAF_ENTRY JIT_TrialAllocSFastSP, _TEXT
        mov     r8d, [rcx + OFFSET__MethodTable__m_BaseSize] ;; r8 = size
        ; m_BaseSize is guaranteed to be a multiple of 8.
        inc     [M_GCLOCK]
        jnz     JIT_NEW
        mov     rax, [generation_table + 0]         ; alloc_ptr
        mov     r10, [generation_table + 8]         ; limit_ptr
        add     r8, rax                             ;; r8 = alloc_ptr + size
        cmp     r8, r10                             ;; alloc_ptr + size < limit_ptr ?
        ja      AllocFailed
        mov     qword ptr [generation_table + 0], r8 ; update the alloc ptr
        mov     [rax], rcx                          ; *alloc_ptr = MT
        mov     [M_GCLOCK], -1
        ret
    AllocFailed:
        mov     [M_GCLOCK], -1
        jmp     JIT_NEW
LEAF_END JIT_TrialAllocSFastSP, _TEXT
```

Besides the allocation pointer, .NET also manages an allocation limit – the boundary of the allocation context whose memory has already been zeroed. This allows such memory to be used instantly, without the overhead of zeroing it while an object is being created. Zeroing it in advance has an additional advantage: it warms up the CPU cache before the memory is accessed by our application code.

The bump-a-pointer technique checks whether the requested size fits within the allocation limit. If it does not, the allocator falls back to the slower allocation path, which we can see as the jump to JIT_NEW in the code above. It is the necessity of abandoning this fast path that makes the “allocation is cheap” phrase not always true. The slow path is implemented as a quite complex state machine that tries to find a place of the required size. We can see it in the gc_heap::allocate_small and gc_heap::allocate_large methods, used for the Small Object Heap and the Large Object Heap respectively.

How complex is the slow path? Below I’ve illustrated the state machine of the allocate_small method. It starts in the a_state_start state when the fast allocation described above fails. This state unconditionally changes into a_state_try_fit, which calls the gc_heap::soh_try_fit() method. And so the whole story begins…

The names of the states, conditions and called methods are quite self-descriptive. The soh_try_fit method is itself non-trivial: it first tries to use the free-item list to find an appropriately sized free space. Then it tries to consume already committed segment memory. If that fails, it commits more of the reserved memory:

Describing all the logic here is not my point. What I would like to point out is this: memory allocation is ALMOST ALWAYS cheap, but it may SOMETIMES be quite complicated. Executing the illustrated state machine has its own cost. Obviously, in the worst case it will trigger a garbage collection, which is why it is best not to allocate at all in the first place. But even without a GC, we can spend a while inside the allocator. Thus, after digging into those methods, making code less allocatey makes even more sense to me than before.

PS. There are several “allocation helpers” like the JIT_TrialAllocSFastSP listed above. The JIT chooses one of them based on the allocated object’s properties and the runtime environment. If you are interested, here’s a summary: