That’s fine, but can we do even better?

If we do not want to accept any compromises, there is definitely room for more optimizations. Let’s look at the possible bottlenecks in our code:

- Multiple invocations: one call for each variable getter.

- Boxing: if a getter is retrieving the value of a field of a value type, right now it is also boxing it, as the return type for our functions is object.

- Pinning: allocating GCHandle instances is expensive, and pinning objects should generally be avoided, if possible.

- Repeated calculations: our GetData method is doing a number of operations every time it is invoked, like iterating over the list of captured fields, checking each field to see if it is a value type, calculating the right byte offset to write the value type fields into the byte array, etc.

What we want is a single method that takes our object instance and retrieves the values of all the captured fields one after the other. We can do this just like we did with the loop unrolling to traverse the hierarchy of each captured field. To solve the issue of value types being boxed, we can avoid returning the values of our captured fields entirely: we can instead generate a method that takes our object and the two target arrays, and writes the field values directly into them. We will actually go one step further, and have our method take ref parameters to the start of our arrays. This way we will be able to assign values directly to each memory address without the JIT compiler adding safety bounds checks, which it does whenever we use the T[int] indexer for array types.

We will first need to write a few extension methods for the ILGenerator class, so that the IL generation for this dynamic method will be easier to manage. We are going to perform a much higher number of operations in IL this time around, so we will definitely need them.

Let’s write a method to store a local value. As mentioned above, local variables in IL are accessed by their index. The instruction we need is stloc.
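The extension could be sketched along these lines (the EmitStoreLocal name and exact structure are our own, so the version in the actual code may differ):

```csharp
using System.Reflection.Emit;

// Illustrative extension methods for the ILGenerator class
public static class ILGeneratorExtensions
{
    // Stores the value on top of the execution stack into the local
    // variable at the given index, using the shortest available opcode
    public static void EmitStoreLocal(this ILGenerator il, int index)
    {
        // stloc.0 through stloc.3 embed the index in the opcode itself
        if (index == 0) il.Emit(OpCodes.Stloc_0);
        else if (index == 1) il.Emit(OpCodes.Stloc_1);
        else if (index == 2) il.Emit(OpCodes.Stloc_2);
        else if (index == 3) il.Emit(OpCodes.Stloc_3);

        // stloc.s takes a single-byte operand
        else if (index <= byte.MaxValue) il.Emit(OpCodes.Stloc_S, (byte)index);

        // Fallback: the full stloc opcode, with a 16-bit operand
        else il.Emit(OpCodes.Stloc, (short)index);
    }
}
```

An EmitLoadLocal counterpart would look identical, just with ldloc.0 through ldloc.3, ldloc.s and ldloc in place of the stloc variants.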

Here we first note one peculiar feature of some IL instructions: in some cases we have at our disposal “multiple versions” of a given opcode that we can use to perform some operations faster. For instance, the stloc.0 opcode stores the value on top of the execution stack into the local variable with index 0, and it is faster than the standard stloc variant as it does not need to also load the target index as a parameter: the index is embedded into the instruction itself. This method makes sure to always use the fastest opcode for the index we need. We also need an EmitLoadLocal method, which will be structured just like this one, with the only difference being that it will use the ldloc opcode and its variants instead of stloc.

We then need an extension to replace Unsafe.Add<T>(ref T, int) calls. Our delegate is receiving a pair of ref parameters pointing to the first element in each array, and we want to be able to move those references ahead by a given offset to access other elements of those arrays.
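A possible sketch of this extension, assuming the offsets are expressed in bytes (again, EmitAddOffset is just our name for it):

```csharp
using System.Reflection.Emit;

public static class ILGeneratorExtensions
{
    // Shifts the reference on top of the execution stack ahead by the
    // given offset in bytes, much like Unsafe.Add<T>(ref T, int) does
    public static void EmitAddOffset(this ILGenerator il, int offset)
    {
        // Push the int offset, using the shortest available ldc.i4 variant
        if (offset >= 0 && offset <= 8)
        {
            il.Emit(offset switch
            {
                0 => OpCodes.Ldc_I4_0,
                1 => OpCodes.Ldc_I4_1,
                2 => OpCodes.Ldc_I4_2,
                3 => OpCodes.Ldc_I4_3,
                4 => OpCodes.Ldc_I4_4,
                5 => OpCodes.Ldc_I4_5,
                6 => OpCodes.Ldc_I4_6,
                7 => OpCodes.Ldc_I4_7,
                _ => OpCodes.Ldc_I4_8
            });
        }
        else if (offset >= sbyte.MinValue && offset <= sbyte.MaxValue)
        {
            il.Emit(OpCodes.Ldc_I4_S, (sbyte)offset);
        }
        else
        {
            il.Emit(OpCodes.Ldc_I4, offset);
        }

        il.Emit(OpCodes.Conv_I); // int -> native int (a memory offset)
        il.Emit(OpCodes.Add);    // shift the reference on the stack
    }
}
```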

Here we are first using the ldc.i4 instruction, which pushes an int value to the top of the stack. As before, we are also making sure to always use the fastest possible opcodes, which means using the explicit versions ldc.i4.0 through ldc.i4.8 for values in the [0, 8] range, and the ldc.i4.s opcode for values that fit in an sbyte parameter. Then we use conv.i to convert our loaded offset (which is of type int) to a native int, a type that simply represents a memory address, and whose size depends on the CPU architecture the code is running on. Now we can use add, and we end up with our shifted reference on top of the execution stack.

The last piece of the puzzle is a method to write a value at the location pointed by a given reference. This method will assume that the execution stack will have a reference and then a value of a specified type at the top of the execution stack, and will make sure to write the value to the target location using the appropriate instruction:
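A sketch of what this method could look like (EmitStoreToAddress is our own name, and the list of dedicated opcodes below is not exhaustive):

```csharp
using System;
using System.Reflection.Emit;

public static class ILGeneratorExtensions
{
    // Expects the stack to hold a target reference and then a value of the
    // given type, and writes the value to the target memory location
    public static void EmitStoreToAddress(this ILGenerator il, Type type)
    {
        if (type.IsValueType)
        {
            // Use a dedicated stind opcode, when one exists for this type
            if (type == typeof(bool) || type == typeof(byte) || type == typeof(sbyte)) il.Emit(OpCodes.Stind_I1);
            else if (type == typeof(char) || type == typeof(short) || type == typeof(ushort)) il.Emit(OpCodes.Stind_I2);
            else if (type == typeof(int) || type == typeof(uint)) il.Emit(OpCodes.Stind_I4);
            else if (type == typeof(long) || type == typeof(ulong)) il.Emit(OpCodes.Stind_I8);
            else if (type == typeof(float)) il.Emit(OpCodes.Stind_R4);
            else if (type == typeof(double)) il.Emit(OpCodes.Stind_R8);

            // Fall back to stobj, which copies an arbitrary value type to the
            // target address, and needs a type token to do so
            else il.Emit(OpCodes.Stobj, type);
        }
        else
        {
            // Reference types just need stind.ref
            il.Emit(OpCodes.Stind_Ref);
        }
    }
}
```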

In this method, we first need to check whether the value we are trying to set is a value type. If it is not, then we just need the stind.ref opcode, which simply stores an object reference at a given memory location. If it is, we need to check whether a dedicated opcode is available for our current data type: stind.i4 for int values, stind.r4 for float values, etc. If that is not the case, we can fall back to the stobj opcode, which despite its confusing name is actually used to copy value types to a supplied memory address. Lastly, we need to check whether the selected opcode is stobj, in which case we will also need to pass a type token to indicate the type of the value being assigned, otherwise the runtime will not be able to properly execute that particular instruction.

We now have all the building blocks we need, and we just need to put them together. We also need to define a custom delegate that will wrap our dynamic methods, to be able to specify two of the input parameters as ref parameters, and then we can finally build our updated IL method:
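As an illustration, the delegate type and the initial DynamicMethod setup might look something like this (all names here are our own, and repeated so that the sketch is self-contained):

```csharp
using System;
using System.Reflection.Emit;

// The custom delegate type: the ref parameters will point to the first
// element of the target byte[] and object[] arrays, respectively
public delegate void DataLoader(object instance, ref byte r0, ref object r1);

public static class DynamicMethodFactory
{
    // Creates the (not yet emitted) dynamic method with the right signature
    public static DynamicMethod CreateGetDataMethod(Type closureType)
    {
        return new DynamicMethod(
            "<GetData>", // the name is arbitrary
            typeof(void),
            new[] { typeof(object), typeof(byte).MakeByRefType(), typeof(object).MakeByRefType() },
            closureType.Module,
            skipVisibility: true); // needed to access private closure fields
    }
}
```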

As before, the first step is creating a DynamicMethod, making sure to properly specify its signature, including the ref parameters. After that, we need to create a mapping for the root closure type and all the nested closure types, so that we can assign a unique index to each of them, which will correspond to their location in our list of local variables in IL. Then we use a HashSet<T> to keep track of the indices of the local variables that have already been initialized. Once we have our mappings, we define a local method that we will use to load a given field. This method will check whether the parent instance of that field has already been loaded, and if it has not, it will go back to the most deeply nested instance that has been loaded and resume the traversal from there, until the needed instance is assigned to its corresponding local variable and loaded on the execution stack.

We start the construction of the IL method by declaring all the local variables we need, and then we load the input object instance, cast it to the right type and store it in the first local variable. After that, we can iterate over the captured fields and prepare the target memory reference to write to. If the field is a value type, we load our ref byte parameter and shift it ahead by the current offset into the byte[] array, and then we advance that offset by the size of the field type. If it is a reference type, we instead load the ref object parameter. To advance the offset into the object[] array, we use the size of the object type, which we can retrieve with the Unsafe.SizeOf<T>() method. This works for every other reference type as well: a memory address has a fixed size that only depends on the CPU architecture. Once this is done, we can call our LoadField(ClosureField) method defined above, followed by our extension to write the loaded field value to the target memory location.
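To make the structure of this loop concrete, here is a heavily simplified, self-contained sketch for a flat (non-nested) closure type. BuildLoader and DataLoader are our own names, the nested-type traversal is omitted, and the inline opcodes stand in for the extension methods discussed earlier:

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Repeated here so that the sketch is self-contained
public delegate void DataLoader(object instance, ref byte r0, ref object r1);

public static class LoaderBuilder
{
    public static DataLoader BuildLoader(Type closureType, FieldInfo[] fields)
    {
        var method = new DynamicMethod(
            "<GetData>",
            typeof(void),
            new[] { typeof(object), typeof(byte).MakeByRefType(), typeof(object).MakeByRefType() },
            closureType.Module,
            true);
        ILGenerator il = method.GetILGenerator();

        // Local 0 holds the closure instance, cast to its actual type
        il.DeclareLocal(closureType);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Castclass, closureType);
        il.Emit(OpCodes.Stloc_0);

        int byteOffset = 0, objectOffset = 0;

        foreach (FieldInfo field in fields)
        {
            if (field.FieldType.IsValueType)
            {
                // Target address: the ref byte parameter, shifted ahead
                il.Emit(OpCodes.Ldarg_1);
                il.Emit(OpCodes.Ldc_I4, byteOffset);
                il.Emit(OpCodes.Conv_I);
                il.Emit(OpCodes.Add);

                // Note: the marshalled size is a close-enough stand-in here
                byteOffset += Marshal.SizeOf(field.FieldType);
            }
            else
            {
                // Target address: the ref object parameter; every reference
                // has the same size, regardless of the actual type
                il.Emit(OpCodes.Ldarg_2);
                il.Emit(OpCodes.Ldc_I4, objectOffset);
                il.Emit(OpCodes.Conv_I);
                il.Emit(OpCodes.Add);

                objectOffset += Unsafe.SizeOf<object>();
            }

            // Load the field value (here from local 0 directly; the real
            // method resumes the traversal for nested closure types)
            il.Emit(OpCodes.Ldloc_0);
            il.Emit(OpCodes.Ldfld, field);

            // Write the value to the target address
            if (field.FieldType.IsValueType) il.Emit(OpCodes.Stobj, field.FieldType);
            else il.Emit(OpCodes.Stind_Ref);
        }

        il.Emit(OpCodes.Ret);

        return (DataLoader)method.CreateDelegate(typeof(DataLoader));
    }
}
```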

We are almost done now, we just need to change our GetData(Delegate) method so that it can use the delegate we’ve just created:
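A sketch of what the updated method could look like (the DataLoader delegate, names and tuple return type are assumptions, and the code needs to be compiled with unsafe support for the null references):

```csharp
using System;
using System.Runtime.CompilerServices;

// Repeated here so that the sketch is self-contained
public delegate void DataLoader(object instance, ref byte r0, ref object r1);

public static class ClosureDataLoader
{
    // refCount and byteSize are assumed to be precomputed once per closure
    // type, instead of being recalculated on every invocation
    public static unsafe (object[] RefValues, byte[] ValueTypeData) GetData(
        DataLoader loader, object instance, int refCount, int byteSize)
    {
        object[] refValues = new object[refCount];
        byte[] valueTypeData = new byte[byteSize];

        // Get a reference to the first item of each array, or a null
        // reference if a given array is empty, so it is never accessed
        ref object r1 = ref refValues.Length > 0
            ? ref refValues[0]
            : ref Unsafe.AsRef<object>((void*)null);
        ref byte r0 = ref valueTypeData.Length > 0
            ? ref valueTypeData[0]
            : ref Unsafe.AsRef<byte>((void*)null);

        // The dynamic IL method takes care of all the rest
        loader(instance, ref r0, ref r1);

        return (refValues, valueTypeData);
    }
}
```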

The first thing you might notice here is that our method is also taking the number of captured reference and value type variables as parameters, instead of calculating them. This is another optimization we can add, since those values only need to be calculated once instead of every time we want to extract data from a given closure instance. We can apply this same optimization to the previous variants as well, to prevent this version from having an unfair advantage over the others. Looking at the rest of the method body, we are creating our object[] and byte[] arrays as usual, and then getting references to the first item of each array. If either of the arrays is empty, we do not access it and just use a null reference instead, obtained with Unsafe.AsRef<T>(null). Then we call our latest dynamic IL method, which will take care of all the rest.

And that’s it! We have gone all the way from a purely reflection-based approach to solve this problem, to an optimized version that not only ditches reflection entirely, but memory allocations too.