Optimizing memory usage Monday, December 4, 2017

In this post I’ll discuss some memory optimization work I’ve done recently on the LLBLGen Pro runtime framework, v5.3.2. This is a popular (commercial) .NET ORM. LLBLGen Pro is on the market since 2003 and has seen a lot of refactoring work internally over the years, among them performance optimizations and memory usage optimizations. With new features come new challenges to make the overall framework perform the same or even faster and still perform the new features as well, so optimizing the core engine is a recurring effort.

The optimizations discussed here form just a small subset of the massive amount of things you can do to optimize memory usage, but it could give some insights in how to get started and what to look for in your own code. As with all optimization work, it depends on the situation at hand and why that particular piece of code is less optimal so you can fix it properly.

Before we can do any optimization however, we first have to define a couple of things so we don’t do any work unnecessary or even harm the code we want to make more optimal. I won’t go into fine details about the differences between the various heaps, as that’s besides the scope of the article.

A word of warning though: pre-mature optimization is bad. The things discussed here are solutions to problems I recognized. They weren’t applied because ‘they are faster, regardless’; they’re not rules you should live by. Sure, if you have the choice between two equal solutions, just pick the one which is more efficient. Performance optimization isn’t free: there’s always another thing you can optimize, and all it takes is a compromise elsewhere. Is that price worth paying? Up front you don’t know that, hence optimizing right from the start isn’t going to be beneficial in many cases: write clean, simple code that’s easy to maintain and understand. If there’s a slightly faster variant thinkable that’s way more complex and harder to understand, chances are it’s not worth it, and you only know after the fact, never before the fact: the code might be called just once, so optimizing it away is useless.

Why optimize memory usage and when?

.NET is a garbage-collector (GC) based platform and that brings great joys because you don’t have to care about memory allocation and management: the GC cleans up after you’ve trashed the room and in theory you never have to worry about any consequences. While trashing rooms as a wanna-be rock star might feel exhilarating, there’s no free lunch: not in hotels, nor inside your computer beneath the .NET CLR and its digital vacuum cleaner; something has to pick up the tab, and in the context of this article, it’s the GC.

Allocating memory has a cost, albeit in general a small one: it takes a bit of time to allocate some memory, initialize it and make it look like that fancy StringBuilder you just new-ed up. When you add some strings to that StringBuilder and its internal buffer is too small, it has to allocate a new block of memory, copy over the original contents and free the old block. Those actions take time. Not ‘copy this large file to a usb-stick’-time, but doing it a lot of times will be noticeable.

Besides the work the GC has to do for cleaning up the old buffer, the whole allocating/copying/freeing dance has another side effect: memory fragmentation. The GC/CLR compacts memory when it gets the chance. The main reason is that if your program needs to have a large block of continuous memory, chances are it’s available, and not fragmented over a million already deallocated buffers. In the latter case you’d get an out-of-memory exception as the block you’re requesting isn’t there.

So in short: all that memory allocating work can be a serious burden for the overall performance of your software. The less memory your application allocates, the better: what isn’t allocated doesn’t need to be cleaned up / copied / packed and taken care of. The problem of course is: with a system where memory management is taken care of behind your back, you’re not focused on it when writing code and reading a piece of code won’t immediately reveal where things are less optimal.

Profile, Analyze, Fix

You can’t say anything about whether software is slow or less optimal unless you’ve measured it first. This is done using a profiler. A profiler is a tool which measures performance, memory usage and the like of software and gives you fancy reports about what code was executed, how slow each statement was and what memory was used. There are several profilers for .NET, e.g. Jetbrains’ dotMemory and dotTrace, Red Gate’s Ants and e.g. the visual studio performance analyzer. Each has their strong points and weaknesses, and some offer some features others lack, it really depends on what you’re after. For this work I’ve used the visual studio performance analyzer, but in general what matters is: at least use a profiler. Make it a part of your every day tool-chain. If you’ve never used a profiler before, make it your main task tomorrow and run one over your code. Just to see what it digs up. Changes are you’ll be surprised what’s really going on inside your software.

With the profiler ready to be used, you have to define a set of tests which you want to measure. For this article I’ll be looking at some fetch benchmarks I’ve performed recently, which you can see in my previous post “.NET (Micro)ORM fetch benchmark results and the fine details”. It matters to have a context for your tests: if your measured code does things in N seconds and consumes M megabytes of memory, how can you say that’s an OK result or a very poor result? If you have comparable systems / pieces of code available, these can give context: in the light of my benchmarks: the hand-written, hand-optimized code is a good example how optimal things can be.

When examining the results a profiler gives you, you have to compare results with what’s realistically optimal. Take the hand-optimized benchmark: it’s very fast, but offers no other features as well. For instance there’s no way that code does authorization. If your code is slower or consumes more memory than that fastest variant, is that because you simply have more features and those take some performance, or is it because you made some sloppy mistakes and can do better? In short: the code you wrote, is it acceptable it’s that slow and eats that much memory? If comparable systems all do, maybe even more than your code, it might very well be your code is near optimal for what it does: you might optimize this code just for fun, but it’s likely you won’t find much to work with.

After you’ve found areas which are less optimal, you have to analyze what the real problem is: do you do unnecessary work, do you allocate a lot of objects you don’t need, or e.g. did you use code that was slammed together without any thought about performance and memory usage? Just three examples out of many which all need their own unique approach how to fix them. It might also be there’s no real fix for the code you have or it would require a complete rewrite of a lot of subsystems. That’s OK: at least you measured it, you analyzed it, and you made an effort to make it better but the costs are too high right now.

When to optimize?

People who just start using a profiler will often try to optimize everything, but that’s a mistake. Doing things with a computer takes time, it does use memory, it will not be ready in an instant. In general I use the rule that optimizing everything that consumes less than 10% of the overall time or less than 10% of the overall memory used by the test is likely a waste of time: even if you optimize it completely away, you gain at most 10%. Chances are if you are lucky, you’ll get from 10% to 7% or maybe even 5%, so you gained 5%. That’s not a big win. Instead first look at areas which take 40%-50% or even more and see what takes all that.

Visual Studio Memory Profiler

For the particular work we’ll be doing here, it’s key to have the right tooling. I’ve used the visual studio .NET memory allocation profiler for that. It offers a call chain with per method the bytes allocated. Most other profilers don’t offer this particular view, as they’re mainly focused on what memory is currently allocated at time T and what code allocated those objects, but it’s crucial for the measurement of which code is allocation more memory than it should. To learn more about this profiler, please see this excellent article from Stephen Taub: .NET Memory Allocation Profiling with Visual Studio 2012. Although it’s a few years old, it’s applicable to the current visual studio versions as well. For this work I’ve used the Call graph view and included the percentage columns (not visible by default) so you get a good overview what to ignore and what to focus on.

With that knowledge under our belt, let’s check out some optimizations I did and why.

Optimizing the Linq provider

If you look at the ‘Individual fetches, non-change tracking’ graph in the benchmark article, you’ll see that the Linq provider of LLBLGen Pro does consume more memory than all the others (the red line). Processing Linq expression trees is a complex affair, and doing things in the most optimal way is often not the first concern when writing a full Linq provider. Still, doing things inefficiently is of course something we should avoid.

Using the .NET memory allocation profiling feature of the performance analyzer revealed a couple of things. I’ve described them below.

Avoid params arrays

The Linq provider calls in every method at the start and end a method which traces the method start and method end. This method ends up in this simple tracer helper method:

1: public static void WriteLineIf( bool condition, string messageInFormat, string category, params object [] argumentsForMessage) 2: { 3: if (condition) 4: { 5: Trace.WriteLine( string .Format(messageInFormat, argumentsForMessage), category); 6: } 7: }

The problem here is that with every call the argumentsForMessage array is allocated (and destroyed right after the call by the GC). Calling this method many times (and in a Linq provider, the visitors’ methods are called a lot) will make the CLR allocate a lot of object arrays which are then destroyed right after, causing potentially memory fragmentation. We can avoid these by simply making sure we hit an overload which doesn’t allocate the object arrays but accepts a fixed number of arguments. For this particular hot path this was ideal, as the method start / end trace messages were effectively simple strings and didn’t need string formatting.

Avoid StringBuilder.AppendFormat()

Inside the Linq provider it’s sometimes necessary to define a ‘key’ for a subtree or expression: You can then e.g. assign an alias on that subtree or expression. This is important as expression trees get transformed into different trees along the way, and when doing so you lose the object references you had as the expressions in an expression tree are immutable. So I calculate keys based on strings: I use visitors traversing the sub tree which along the way build a string from what they encounter, based on types and values. I use code like this:

1: internal static void AppendExpressionFragment(StringBuilder toAppendTo, LinqExpression expression) 2: { 3: if (expression == null ) 4: { 5: return ; 6: } 7: toAppendTo.AppendFormat( "NT{0}|T{1}|" , ( int )expression.NodeType, expression.Type.GetHashCode()); 8: }

This uses an AppendFormat, but it’s actually unnecessary. More efficient is the following:

1: internal static void AppendExpressionFragment(StringBuilder toAppendTo, LinqExpression expression) 2: { 3: if (expression == null ) 4: { 5: return ; 6: } 7: toAppendTo.Append( "NT" ); 8: toAppendTo.Append(( int )expression.NodeType); 9: toAppendTo.Append( "|T" ); 10: toAppendTo.Append(expression.Type.GetHashCode()); 11: toAppendTo.Append( "|" ); 12: }

This is a simple change, but has a big benefit. All allocations needed for the format handling are avoided.

Pre-size data-structures if you add data immediately afterwards

Consider the following visitor handler:

1: protected virtual IEnumerable<ElementInit> HandleElementInitializerList(ReadOnlyCollection<ElementInit> listToHandle) 2: { 3: TraceScopeStart( "GenericExpressionHandler::ElementInitializersList" ); 4: List<ElementInit> newList = new List<ElementInit>(); 5: bool changedElementInitializerSeen = false ; 6: for ( int i = 0; i < listToHandle.Count; i++) 7: { 8: ElementInit currentInitializer = HandleElementInitializer(listToHandle[i]); 9: newList.Add(currentInitializer); 10: changedElementInitializerSeen |= (currentInitializer != listToHandle[i]); 11: } 12: TraceScopeEnd( "GenericExpressionHandler::ElementInitializersList" ); 13: return changedElementInitializerSeen ? (IEnumerable<ElementInit>)newList : listToHandle; 14: }

A new List<T> is initialized with length 0. This means it doesn’t allocate any memory for its buffer. If you add a single element, it has to allocate a buffer. But it doesn’t allocate a large buffer initially. When adding the initial element, it will allocate an array of 4 items. If you add 5 elements, it thus has to throw away the array of 4 items, allocate one which can contain at least 5 (it uses a scale for that, see the List<T> source code) and copy over the 4 items already in the old array.

In the method above, we already know how many items we’ll add, namely listToHandle.Count. We can initialize the List<T> at line 4 with this number and make sure List<T> won’t re-allocate memory during the for loop. The careful reader can spot another potential optimization that’s not implemented here, as it would make the code more complex, do you see it (hint it relates to the next topic) ?

Initialize members lazily

In a Linq provider you track a lot of things along the way, e.g. which elements were aliased, which scopes are currently active and many more things. These are all stored in datastructures like dictionaries, lists and stacks. I use a class for that called MappingTracker, which tracks the collected data in its internal structures. Every visitor receives the general MappingTracker instance or if it’s a temporary visitor, a new instance. Being a good sport, I simply created new instances of these internal structures in the constructor, to avoid null checks and creations all over the place. However the price for this turned out to be higher than expected: every time the Linq provider creates an expression key (as shown above) it created a new tracker and a new set of these internal structures.

To fix this, simply initialize the structures lazily. I’ll give two examples. The first is the way which allocates memory for the datastructure regardless if it’s used or not. The second initializes it lazily.

Old:

1: public class MappingTracker 2: { 3: private readonly HashSet<SetAlias> _createdAliases; 4: //... 5: 6: public MappingTracker() 7: { 8: _createdAliases = new HashSet<SetAlias>(); 9: //... 10: } 11: 12: public SetAlias CreateNewAlias() 13: { 14: _aliasCounter++; 15: SetAlias toReturn = new SetAlias( "LPLA_" + _aliasCounter); 16: _createdAliases.Add(toReturn); 17: return toReturn; 18: } 19: 20: //... 21:

New:

1: public class MappingTracker 2: { 3: private HashSet<SetAlias> _createdAliases; 4: //... 5: 6: public MappingTracker() 7: { 8: } 9: 10: public SetAlias CreateNewAlias() 11: { 12: _aliasCounter++; 13: SetAlias toReturn = new SetAlias( "LPLA_" + _aliasCounter); 14: this .CreatedAliases.Add(toReturn); 15: return toReturn; 16: } 17: 18: //... 19: 20: private HashSet<SetAlias> CreatedAliases 21: { 22: get { return _createdAliases ?? (_createdAliases = new HashSet<SetAlias>()); } 23: }

The access to the member is now through a private property which allocates when needed (only the first time). If no call is made to this particular method, no HashSet<SetAlias> is created. With a good refactoring tool like Resharper, this change is easy and painless.

Avoid Linq-to-Objects if necessary

It turns out using .Any() or .FirstOrDefault() on e.g. a HashSet<T> isn’t free from allocations. If this method is on a hot path as well, chances are the allocations are seriously noticeable. Consider the following code:

1: HashSet<SetAlias> aliases= new HashSet<SetAlias>(); 2: if (! string .IsNullOrEmpty(expressionKey)) 3: { 4: SetAlias keyAlias = null ; 5: if (_aliasPerExpressionKey.TryGetValue(expressionKey, out keyAlias)) 6: { 7: aliases.Add(keyAlias); 8: } 9: } 10: if (!aliases.Any()) 11: { 12: if (!_aliasesPerMemberInfo.TryGetValue(info, out aliases)) 13: { 14: aliases = new HashSet<SetAlias>(); 15: } 16: } 17: toReturn = aliases.FirstOrDefault(a => CheckIfAliasIsInScope(a));

Refactoring it to the following solves allocations and is faster. As it’s a hot path method I went for this refactoring, as it paid off, but in general you might think this is a bridge too far because it’s more code and potentially more complex.

1: HashSet<SetAlias> aliases = null ; 2: if (! string .IsNullOrEmpty(expressionKey)) 3: { 4: SetAlias keyAlias = null ; 5: if ( this .AliasPerExpressionKey.TryGetValue(expressionKey, out keyAlias)) 6: { 7: aliases = new HashSet<SetAlias> { keyAlias }; 8: } 9: } 10: if (aliases== null || aliases.Count <=0) 11: { 12: this .AliasesPerMemberInfo.TryGetValue(info, out aliases); 13: } 14: if (aliases != null ) 15: { 16: toReturn = null ; 17: foreach (var alias in aliases) 18: { 19: if (CheckIfAliasIsInScope(alias)) 20: { 21: toReturn = alias; 22: break ; 23: } 24: } 25: } 26: if (aliases!= null && toReturn== null )

When arriving here, I cut down the 550KB per query to 320KB or so. Still not the result I’d have hoped, but it was a lot less than where I started with.

There was more to find of course, namely in the SQL generator.

Optimizing the SQL generator

In an ORM, generating SQL strings is one of things it is meant to do, and if you’re not careful it will generate a lot of strings. Not only the final SQL query string, but also name fragments, concatenated parameter names and other things. I’ve optimized this system over the years as it’s one of the main time sinks of a query pipeline and a major memory waster. Some time ago I refactored the query pipeline to use a list of fragments, which of themselves could be list of fragments. This was abstracted away into a QueryFragments class and some minor classes for e.g. a delimited list of fragments. Internally these classes simply collected fragments like a name of a field, and by building these fragment structures during the SQL generation you could build the query decentralized and without concatenating strings all over the place, that was simply done at the end: by flattening everything to a string.

This looks like this

1: var fragments = new QueryFragments(); 2: fragments.AddFragment( "SELECT" ); 3: StringPlaceHolder distinctPlaceholder = fragments.AddPlaceHolder(); 4: StringPlaceHolder topPlaceholder = fragments.AddPlaceHolder(); 5: QueryFragments projection = fragments.AddCommaDelimitedQueryFragments( false , selectList?.Length ?? 0); 6: QueryFragments fromClause = fragments.AddQueryFragments();

After this code, methods then could add fragments to the ‘fromClause’ or ‘projection’ without worrying it would insert strings into another string, neither does the code have to be in a given order to construct the string, it can be in any order: the code just has to make sure it adds its fragments to the right fragment container.

You can’t always avoid creating temporary strings along the way. For these situations I used StringBuilders with an estimated size. In v5.3 I optimized this further with a StringBuilderCache, which was a borrowed class from CoreFX (from System.IO). The idea is simple: you ask the cache for its cached StringBuilder, which already has a pre-allocated size. If it’s available you return it and un-cache it (it has just 1 cached instance). If it’s not available you simply allocate a new one. When you’re done you add it back to the cache. If there’s already one cached but it’s smaller, you toss the old one away and cache the one with the bigger internal buffer.

Profiling this code using the memory profiler I still saw a lot of StringBuilders being created and more importantly: being resized. So I rewrote the StringBuilderCache to keep 4 around instead of 1. The code can be found in this gist. If now during the flattening of a hierarchy of QueryFragments multiple StringBuilder instances are at play, it will avoid reallocation of buffers and allocating more StringBuilders. So how to use this cache? The below snippet shows how:

1: StringBuilder sb = StringBuilderCache.Acquire(originalSqlString.Length + this .QueryTag.Length + 8); 2: sb.Append( "/* " ); 3: sb.Append( this .QueryTag); 4: sb.Append( " */ " ); 5: sb.Append(originalSqlString); 6: this .Command.CommandText = StringBuilderCache.GetStringAndRelease(sb);

Here it first acquires a new StringBuilder, or if one is already present with the requested size, it gets one from the cache. It then appends some strings, and on the last line it gets the string from the string builder and releases it back to the cache.

Using multiple instances in the cache also means in recursive scenario’s like in this situation it will still have potentially a cached builder around. For my situation 4 turned out to be more than enough.

End results

After a week working on this, I got the single element fetches which use Linq down from 1.54ms to 0.95ms and from 553KB per element to 280KB per element. I think that’s a lot of progress Additionally, the optimizations in the SQL engine also benefited the LLBLGen Pro QuerySpec query pipeline (QuerySpec is LLBLGen Pro’s fluent DSL for building queries in C#/VB.NET), where memory usage went down from 121KB to 80KB per element and performance went from 0.65ms to 0.59ms.

All in all very happy with the results of a relative small amount of changes. These changes are available in LLBLGen Pro v5.3.2, currently available as hotfix build.

Happy optimizing!