Background

As it is the case in most scenarios the source of latency is usually due to the .NET Garbage Collection, so we always try very hard to minimize object allocations specially in our performance-critical components.

We have a messaging API which is used by many of our micro services. Some of those services use the API to publish thousands of messages per second to the market so it is important to keep the latency as low as possible.

In my quest to reduce latency and CPU consumption of our components, I came across a very interesting and probably not so well known performance gotcha!

Problem

The API has a method which accepts an IDictionary<string, string> and publishes them to a message bus like so:

public void Publish(IDictionary<string, string> data) { foreach (var item in data) { _publisher.Publish(item); } }

Nothing really complicated right? WRONG!

What's going on?

Let's do a quick and dirty profiling of an even simpler example:

void Main() { Dictionary<int, int> dic = new Dictionary<int, int>(); var sw = Stopwatch.StartNew(); for (var i = 0; i < 100000000; i++) { foreach (var item in dic) { ; } } sw.Stop(); // compose the result var result = new StringBuilder(); result.AppendFormat("[Time]\t{0}", sw.Elapsed); result.AppendLine(); result.AppendFormat("[Gen-0]\t{0}", GC.CollectionCount(0)); result.AppendLine(); result.AppendFormat("[Gen-1]\t{0}", GC.CollectionCount(1)); result.AppendLine(); result.AppendFormat("[Gen-2]\t{0}", GC.CollectionCount(2)); Console.WriteLine(result.ToString()); }

What we have here is an empty dictionary where both the keys and values are of type int . We are then simply enumerating the dictionary a hundred million times. Note we are not doing anything in the body of the foreach so that we can focus only on the enumeration.

Finally, we record the total time it took for the enumeration as well as the number of times GC performed collections of our GEN-0 to GEN-2 . Running the above on a machine with a Quad-Core 3.4GHz and 8GB of RAM prints:

[Time] 00:00:01.8128806 [Gen-0] 0 [Gen-1] 0 [Gen-2] 0

So it took almost 2 seconds and resulted in no Garbage Collection, so far so good. Now let us try again but this time change our dic pointer to an IDictionary<int, int> instead:

... same as above IDictionary<int, int> dic = new Dictionary<int, int>(); ... same as above

Which prints:

[Time] 00:00:04.3902405 [Gen-0] 765 [Gen-1] 2 [Gen-2] 1

This time it took more than twice than enumerating a Dictionary<int, int> and resulted in 765 collections of GEN-0 , 2 collections of GEN-1 and 1 collection of GEN-2 . This is MADNESS!

And here is why

Let us look at the IL generated in the case when we are using a Dictionary<int, int> :

... not important IL_0013: ldloc.0 // dic IL_0014: callvirt System.Collections.Generic.Dictionary<System.Int32,System.Int32>.GetEnumerator IL_0019: stloc.s 04 IL_001B: br.s IL_0029 IL_001D: ldloca.s 04 IL_001F: call System.Collections.Generic.Dictionary<System.Int32,System.Int32>+Enumerator.get_Current IL_0024: stloc.s 05 // item ... IL_0029: ldloca.s 04 IL_002B: call System.Collections.Generic.Dictionary<System.Int32,System.Int32>+Enumerator.MoveNext IL_0030: brtrue.s IL_001D IL_0032: leave.s IL_0043 IL_0034: ldloca.s 04 IL_0036: constrained. System.Collections.Generic.Dictionary<,>.Enumerator ... not important

Here we can see a call to System.Collections.Generic.Dictionary<System.Int32,System.Int32>+Enumerator.get_Current to get the current Enumerator and on every iteration we have a call to System.Collections.Generic.Dictionary<System.Int32,System.Int32>+Enumerator.MoveNext to move the Enumerator forward.

Let us compare that with the IL for the case of an IDictionary<int, int> :

... not important IL_0013: ldloc.0 // dic IL_0014: callvirt System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<System.Int32,System.Int32>>.GetEnumerator IL_0019: stloc.s 04 IL_001B: br.s IL_0029 IL_001D: ldloc.s 04 IL_001F: callvirt System.Collections.Generic.IEnumerator<System.Collections.Generic.KeyValuePair<System.Int32,System.Int32>>.get_Current IL_0024: stloc.s 05 // item ... IL_0029: ldloc.s 04 IL_002B: callvirt System.Collections.IEnumerator.MoveNext IL_0030: brtrue.s IL_001D IL_0032: leave.s IL_0041 ... not important

Very interesting! This time we have a callvirt to System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<System.Int32,System.Int32>>.GetEnumerator followed by more callvirt to System.Collections.Generic.IEnumerator .

Aha! Allow me to explain in case it is not that obvious.

Virtual Call / Interface Dispatch

The first issue is the Interface Dispatch (the parts that do a non-inlineable callvirt to System.Collections.Generic.IEnumerator ) which is in contrast to the case of Dictionary<int, int> where we are doing fast, non-virtual and inlineable calls.

For more info on the cost of Virtual calls and Interface Dispatch, have a look at Writing Faster Managed Code: Know What Things Cost.

Boxing

The second issue and by far the most significant performance hit is due to boxing. Instead of returning a Dictionary<int, int>.Enumerator , we are boxing the struct and return an IEnumerator<KeyValuePair<int, int>> which explains the high number of Collections across our heap.

How about IList<T> , ICollection<T> & IEnumerable<T> vs List<T> ?

IList<T> implements ICollection<T> which in turn implements IEnumerable<T> and therefore using any of those interfaces for the enumeration incurs the similar performance cost we covered above.

This is even more clear if you look at the source of List<T> :

public Enumerator GetEnumerator() { return new Enumerator(this); } IEnumerator<T> IEnumerable<T>.GetEnumerator() { return new Enumerator(this); } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return new Enumerator(this); }

You can see in the last two GetEnumerator() s the Enumerator struct is being boxed to a generic an a non generic IEnumerator .

Summary