In this final episode of our IL2CPP micro-optimization miniseries, we’ll explore the high cost of something called “boxing”, and we’ll see how IL2CPP can avoid it when it is done unnecessarily.

Heap allocations are slow

Like many programming languages, C# allows the memory for objects to be allocated on the stack (a small, “fast”, scope-specific block of memory) or the heap (a large, “slow”, global block of memory). Allocating space for an object on the heap is usually much more expensive than allocating space on the stack. It also involves tracking that allocated memory in the garbage collector, which has an additional cost. So we try to avoid heap allocations where possible.

C# lets us do this by separating types into value types (which can be allocated on the stack) and reference types (which must be allocated on the heap). Types like int and float are value types; string and object are reference types. User-defined value types use the struct keyword, and user-defined reference types use the class keyword. Note that a value type can never hold the value null. In C#, the null value can only be assigned to reference types (nullable types such as int? are a separate wrapper, Nullable<T>, and are not considered here). Keep this distinction in mind as we continue.
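To make the distinction concrete, here is a minimal sketch. The PointValue and PointReference types and the ValueVsReferenceDemo class are hypothetical names made up for this illustration. Assigning a struct copies its data, while assigning a class instance copies only the reference.

```csharp
using System;

// Hypothetical types, used only for this illustration.
struct PointValue { public int X; }      // value type
class PointReference { public int X; }   // reference type

static class ValueVsReferenceDemo
{
    public static int CopyStructThenMutate()
    {
        var a = new PointValue { X = 1 };
        var b = a;      // copies the data: b is an independent value
        b.X = 2;
        return a.X;     // a is unaffected
    }

    public static int CopyReferenceThenMutate()
    {
        var a = new PointReference { X = 1 };
        var b = a;      // copies the reference: a and b name the same heap object
        b.X = 2;
        return a.X;     // a sees the change made through b
    }

    static void Main()
    {
        Console.WriteLine(CopyStructThenMutate());     // prints 1
        Console.WriteLine(CopyReferenceThenMutate());  // prints 2

        // PointValue p = null;   // does not compile: a struct cannot be null
        PointReference r = null;  // fine: references can be null
        Console.WriteLine(r == null);                  // prints True
    }
}
```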

Being good performance citizens, we try to avoid heap allocations unless they are necessary. But sometimes we need to convert a value type on the stack into a reference type on the heap. This process is called boxing. Boxing:

- Allocates space on the heap
- Informs the garbage collector about the new object
- Copies the data from the value type object into the new reference type object

Ugh, let’s add boxing to our list of things to avoid!
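The steps above can be observed from plain C#. In this small sketch (the BoxingDemo class and its method names are assumptions made up for the example), converting an int to object boxes it. Each conversion produces a separate heap object, and that object holds a copy of the value, not a link back to the original variable.

```csharp
using System;

static class BoxingDemo
{
    public static bool BoxesAreDistinctObjects()
    {
        int years = 7;
        object first = years;   // boxing: a new heap object holding 7
        object second = years;  // boxing again: another new heap object
        return !object.ReferenceEquals(first, second);
    }

    public static int BoxHoldsACopy()
    {
        int years = 7;
        object boxed = years;   // the value 7 is copied into the box
        years = 8;              // changing the original does not touch the box
        return (int)boxed;      // unboxing copies the value back out
    }

    static void Main()
    {
        Console.WriteLine(BoxesAreDistinctObjects());  // prints True
        Console.WriteLine(BoxHoldsACopy());            // prints 7
    }
}
```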

That pesky compiler

Suppose we are happily writing code, avoiding unnecessary heap allocations and boxing. Maybe we have some trees for our world, and each has a size which scales with its age:

    interface HasSize {
       int CalculateSize();
    }

    struct Tree : HasSize {
       private int years;
       public Tree(int age) {
           years = age;
       }

       public int CalculateSize() {
           return years*3;
       }
    }

Elsewhere in our code, we have this convenient method to sum up the size of many things (including possibly Tree objects):

    public static int TotalSize<T>(params T[] things) where T : HasSize {
       var total = 0;
       for (var i = 0; i < things.Length; ++i)
           if (things[i] != null)
               total += things[i].CalculateSize();
       return total;
    }

This looks safe enough, but let’s peer into a little bit of the Intermediate Language (IL) code that the C# compiler generates:

    // This is the start of the for loop

    // Load the array
    IL_0009: ldarg.0
    // Load the current index
    IL_000a: ldloc.1
    // Load element at the current index
    IL_000b: ldelem.any !!T
    // What is this box call doing in here?!?
    // (Hint: see the null check in the C# code)
    IL_0010: box !!T
    IL_0015: brfalse IL_002f

    // Set up the arguments for the method and call it
    IL_001a: ldloc.0
    IL_001b: ldarg.0
    IL_001c: ldloc.1
    IL_001d: ldelema !!T
    IL_0022: constrained. !!T
    IL_0028: callvirt instance int32 Unity.IL2CPP.IntegrationTests.Tests.ValueTypeTests.ValueTypeTests/IHasSize::CalculateSize()

    IL_002f:
    // Do the next loop iteration...

The C# compiler has implemented the if (things[i] != null) check using boxing! If the type T is already a reference type, then the box opcode is pretty cheap: it just returns the existing reference stored in the array element. But if type T is a value type (like our Tree type), then that box opcode is very costly. Of course, value types can never be null, so why do we need to implement the check in the first place? And what if we need to compute the size of one hundred Tree objects, or maybe one thousand Tree objects? In the IL above, every element is boxed once per loop iteration, so the cost of that unnecessary boxing grows with the size of the array.
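For comparison, here is a sketch of what the source could look like if we only ever passed value types. TotalSizeOfValues and SizeHelpers are hypothetical names, not part of the original code. Adding the struct constraint lets us drop the null check, so the C# compiler has no null comparison to compile and therefore no reason to emit a box for it. This is only an illustration of where the box comes from, not a change you are expected to make.

```csharp
using System;

interface HasSize { int CalculateSize(); }

struct Tree : HasSize
{
    private int years;
    public Tree(int age) { years = age; }
    public int CalculateSize() { return years * 3; }
}

static class SizeHelpers
{
    // Hypothetical variant of TotalSize restricted to value types.
    // T can never be null here, so the loop body contains no null check.
    public static int TotalSizeOfValues<T>(params T[] things) where T : struct, HasSize
    {
        var total = 0;
        for (var i = 0; i < things.Length; ++i)
            total += things[i].CalculateSize();
        return total;
    }

    static void Main()
    {
        // Ages 1, 2 and 3 give sizes 3, 6 and 9.
        Console.WriteLine(TotalSizeOfValues(new Tree(1), new Tree(2), new Tree(3)));  // prints 18
    }
}
```

The trade-off is that this variant no longer accepts reference types, which is exactly the generality the original TotalSize<T> was written to provide.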

The fastest code is anything you don’t execute

The C# compiler needs to provide a general implementation that works for any possible type T, so it is stuck with this slower code. But a compiler like IL2CPP, which knows the concrete type T when it generates code, can be a bit more aggressive: it generates the code that will be executed and leaves out the code that won’t!

IL2CPP will create an implementation of the TotalSize<T> method specifically for the case where T is a Tree. The IL code above looks like this in the generated C++ code:

    IL_0009:
    // Load the array
    TreeU5BU5D_t4162282477* L_0 = ___things0;
    // Load the current index
    int32_t L_1 = V_1;
    NullCheck(L_0);
    IL2CPP_ARRAY_BOUNDS_CHECK(L_0, L_1);
    int32_t L_2 = L_1;
    // Load the element at the current index
    Tree_t1533456772 L_3 = (L_0)->GetAt(static_cast<il2cpp_array_size_t>(L_2));

    // Look Ma, no box and no branch!

    // Set up the arguments for the method and call it
    int32_t L_4 = V_0;
    TreeU5BU5D_t4162282477* L_5 = ___things0;
    int32_t L_6 = V_1;
    NullCheck(L_5);
    IL2CPP_ARRAY_BOUNDS_CHECK(L_5, L_6);
    int32_t L_7 = Tree_CalculateSize_m1657788316((Tree_t1533456772 *)(
        (L_5)->GetAddressAt(static_cast<il2cpp_array_size_t>(L_6))), /*hidden argument*/NULL);

    // Do the next loop iteration...

IL2CPP recognized that the box opcode is unnecessary for a value type, because we can prove ahead of time that a value type object will never be null. In a tight loop, this removal of an unnecessary allocation and copy of data can have a significant positive impact on performance.

Wrapping up

As with the other micro-optimizations discussed in this series, this one is a common optimization for .NET code generators. All of the scripting backends used by Unity currently perform this optimization for you, so you can get back to writing your code.

We hope you have enjoyed this miniseries about micro-optimizations. As we continue to improve the code generators and runtimes used by Unity, we’ll offer more insight into the micro-optimizations that go on behind the scenes.