Recently Marc blogged about some performance optimisations we implemented at Stack Overflow to work around some Garbage Collection (GC) issues.

This post evoked many emotions and opinions both on his blog and on Hacker News. In this post I would like to cover some of the history involved with this particular problem, explain how we found it and how we eventually fixed it.

Jeff has a stopwatch in his brain.

When Stack Exchange developers look at our sites, we get a very different view. In particular, on every page we visit there is a little red box that shows us how long the page took to render, among other goodies. We run MiniProfiler in production; performance is a feature.

One day Jeff started noticing something weird:

Once in a while I visit a page and it takes a whole second to render, but MiniProfiler is telling me it only took 30ms. What is going on?

If the problem was not measurable locally, we deduced, it must be a network or HAProxy issue. We were wrong.

I responded: “try editing your hosts file, perhaps there is a DNS issue”. This fixed nothing.

In hindsight we realised this only happened 1 out of 50 or so times he visited the page, which correlates with my stopwatch theory.

But, I am jumping a bit ahead of myself here, let me tell you a bit of history first.

The tag engine of doom

I love SQL. I am pretty good at it. I know how to think in sets and pride myself in some of the creative solutions I come up with.

One day, I discovered that we had a problem that simply can not be solved efficiently in SQL. Square peg, round hole.

We allow people to cut through our list of 2.2 million questions in 6 different sort orders. That is a problem that is pretty trivial to solve in SQL. We also present users with an aggregation of related tags on each of our sorts, something that is also reasonably simple to denormalize and cache in SQL.

However, stuff gets tricky when you start allowing rich filtering. For example if you are interested in SQL performance questions you can easily filter down the list and maintain all your sort orders.

Historically we found that the most efficient way for SQL Server to deal with these queries was using the full text search (FTS). We tried many different techniques, but nothing came close to the full text engine.

We would run queries such as this to get lists of tags to build a related tag list:

SELECT Tags
FROM Posts p
WHERE p.PostTypeId = 1
  AND p.DeletionDate IS NULL
  AND p.ClosedDate IS NULL
  AND p.LockedDate IS NULL
  AND CONTAINS(Tags, 'ésqlà AND éperformanceà')

A similar query was used to grab question ids.

There were a series of problems with this approach:

It did not scale out; it placed a significant load on the DB.

It caused us to ferry a large amount of information from the DB to the app servers on a very regular basis.

SQL Server FTS does not play nice with existing indexes. The FTS portion of a query always needs to grab the whole result set it works on, prior to filtering or ordering based on other non-FTS indexes. So a query asking for the top 10 c# .net questions was forced to work through 50k results just to grab 10.

To solve these issues we created the so-called tag engine. The concept is pretty simple. Each web server maintains a shell of every question on the site, with information about its tags, creation date and other fields required for sorting. This list is updated every minute. To avoid needing to scan through the whole list we maintain various pre-sorted indexes in memory. We also pre-calculate lists of questions per tag.

We layer on top of this a cache and a bunch of smart algorithms to avoid huge scans in memory and ended up with a very fast method for grabbing question lists.
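The overall shape can be sketched in a few lines. This is a heavily simplified, hypothetical version (the data and names are invented; the real engine adds caching and smarter scan avoidance on top):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A stripped-down question shell: only the fields needed for sorting/filtering.
var questions = new (int Id, DateTime CreationDate, string[] Tags)[]
{
    (1, new DateTime(2011, 1, 1), new[] { "sql" }),
    (2, new DateTime(2011, 3, 1), new[] { "sql", "performance" }),
    (3, new DateTime(2011, 2, 1), new[] { "c#" }),
};

// Pre-sorted index: positions into the questions array, newest first.
int[] sortedByCreationDate = Enumerable.Range(0, questions.Length).ToArray();
Array.Sort(sortedByCreationDate,
    (a, b) => questions[b].CreationDate.CompareTo(questions[a].CreationDate));

// Pre-calculated question lists per tag.
var questionsByTag = new Dictionary<string, HashSet<int>>();
for (int i = 0; i < questions.Length; i++)
{
    foreach (var tag in questions[i].Tags)
    {
        if (!questionsByTag.TryGetValue(tag, out var set))
            questionsByTag[tag] = set = new HashSet<int>();
        set.Add(i);
    }
}

// Query: newest "sql" questions, walking the pre-sorted index so no
// sort happens at query time.
var newestSql = sortedByCreationDate
    .Where(i => questionsByTag["sql"].Contains(i))
    .Select(i => questions[i].Id)
    .ToList(); // [2, 1]
```

Filtering walks an already-sorted index, so serving any of the six sort orders costs nothing extra at query time.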

And by fast, I mean fast. The tag engine typically serves out queries at about 1ms per query. Our questions list page runs at an average of about 35ms per page render, over hundreds of thousands of runs a day.

Prior to the introduction of the tag engine the question listing pages on Stack Overflow were the single biggest bottleneck on the database. This was getting worse over the months and started impacting general unrelated queries.

There was another twist: the tag engine was always very close to “lock free”. When updating the various structures we created copies of the engine in memory and swapped to the updated copy, discarding the out-of-date one.
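In code, the copy-and-swap pattern is roughly this (a sketch; the `State` shape and `BuildState` are stand-ins for the real refresh logic):

```csharp
using System;

var engine = new TagEngine();
engine.Refresh();
Console.WriteLine(engine.QuestionCount);

// Readers always dereference one complete, immutable snapshot. The refresh
// thread builds a fresh copy off to the side and publishes it with a single
// reference assignment, which is atomic in .NET.
class TagEngine
{
    volatile State current = BuildState();

    public int QuestionCount => current.SortedByCreationDate.Length;

    public void Refresh()
    {
        var fresh = BuildState(); // expensive work, no locks held
        current = fresh;          // swap; readers mid-request keep the old copy
        // the abandoned copy becomes garbage -- which is exactly the kind of
        // medium-lived allocation the GC handles worst
    }

    static State BuildState() => new State(new int[2_200_000]);
}

class State
{
    public readonly int[] SortedByCreationDate;
    public State(int[] sorted) { SortedByCreationDate = sorted; }
}
```

No reader ever blocks on a refresh; the cost is that the discarded copy must eventually be collected.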

The worst possible abuse of the garbage collector

The .NET garbage collector, like Java's, is a generational garbage collector. It is incredibly efficient at dealing with objects that have a short life span (provided they are less than 85,000 bytes in size). During an object's life it is promoted up the managed heaps, from generation 0 to 1 and finally to generation 2.

The .NET garbage collector runs generation 0 sweeps most frequently, generation 1 sweeps less frequently and generation 2 sweeps least frequently. When the Server GC in .NET 4.0 runs a generation 2 GC (which is a full GC) it has to suspend all managed threads. I strongly recommend reading Rico's excellent Garbage Collector Basics and Performance Hints. .NET 4.5 will improve this situation; however, even with the background GC in .NET 4.5, in certain cases the GC may suspend threads, a fact that is confirmed by Maoni, the chief GC architect, in her Channel 9 talk.

If you happen to have a data structure full of internal references that allocates a huge number of objects which live just long enough to reach generation 2 before you abandon them, you are in a pickle.

Your blocking GC is now forced to do lots of work finding the roots of the objects. Then it needs to deallocate them, and possibly defragment the heap.

Our initial tag engine design was a GC nightmare, the worst possible type of memory allocation: lots of objects with lots of references that lived just long enough to reach generation 2 (which was collected every couple of minutes on our web servers).

Interestingly, the worst possible scenario for the GC is quite common in our environment. We quite often cache information for 1 to 5 minutes.

Diagnosing the problem

This issue was first detected by Kyle in July. We performed a few optimisations assuming it was CPU related and heavily reduced CPU usage on the tag engine; as a side effect we also reduced memory churn, which lessened the impact a bit. At the time we did not suspect a GC issue; instead we assumed a rogue thread was slowing everything down.

Fast forward about 3 weeks: I enabled some basic profiling on our web servers that I had baked in a while ago:

This report is based on a simple Stopwatch that starts at the beginning of a request and finishes at the end. It measures the web server's request duration and puts it into neat time buckets, partitioned by controller action.
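The measurement is nothing fancier than something like this (the bucket boundaries here are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Upper bounds (ms) of each bucket; the final slot catches everything slower.
var bounds = new long[] { 10, 20, 50, 100, 500, 1000 };
var buckets = new Dictionary<string, int[]>();

void Record(string action, long elapsedMs)
{
    if (!buckets.TryGetValue(action, out var counts))
        buckets[action] = counts = new int[bounds.Length + 1];
    int i = 0;
    while (i < bounds.Length && elapsedMs > bounds[i]) i++;
    counts[i]++; // e.g. a 35ms request lands in the 20-50ms bucket
}

// Per request, partitioned by controller action:
var watch = Stopwatch.StartNew();
// ... the MVC pipeline handles the request ...
watch.Stop();
Record("Questions/Show", watch.ElapsedMilliseconds);
```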

This report showed me something rather odd. The time the web server was reporting was wildly different from the time we were measuring on our load balancer. It was about 60ms off on average.

I knew this because of Jarrod's awesome work. We store all our HAProxy web logs in a SQL Server instance. This has been a godsend when it comes to data analysis. I ran a simple average on the time HAProxy observed and noticed the discrepancy right away.

When I graphed the response time for any route on our site, using the free and open source Data Explorer, we noticed spikes.

The above graph demonstrates how a limited number of requests would “spike” in response time many of which took longer than 1 second.

The frequency of the spikes correlated to the Gen 2 Collections performance counter.

Equipped with this feedback we set out on a mission to decrease the spikes which today look like this (for the exact same number of requests).

These spikes had a very interesting characteristic. We now store the time measured in ASP.NET in our logs too, so we can graph that as well.

ASP.NET did not even notice that the spikes happened. This may indicate that ASP.NET is aware a GC is about to happen and is careful to serve out all its existing requests prior to running a Gen 2 GC. Regardless, the fact that MiniProfiler could not catch the spikes correlated with our empirical evidence.

Note the blue “spikes” in the graph above are caused by a process that rebuilds related question lists. In general this is done in a background thread, however in rare conditions it may happen on first render. The GC spikes are the areas in the graph where there is a pile of yellow dots on top of a blue line. Clearly, it is not that problematic in this graph, it was created on data collected after we heavily improved the situation.

.NET Memory Profiler

I have always been a huge fan of .NET Memory Profiler. It is an awesome tool. Over the years Andreas has kept up with all the latest features. For example, the latest version allows you to attach to a .NET 4 application using the new ICorProfiler attach API.

During the process of optimisation, after every round of changes, I would pull a server off the load balancer and analyze the managed heaps:

For example: the profile above was taken after we refactored the tag engine. I discovered another spot with 500k objects that were alive for only a couple of minutes.

.NET Memory Profiler allows you to gather and compare snapshots of memory, so you can properly measure churn in your managed heaps. Churn in your generation 2 managed heap or large object heap is a major source of GC “freezes”.

Techniques for mitigating GC cost

When I fully understood this issue I was quick to propose a solution.

Why should the tag engine cause requests that do not even consume its services to slow down? That is crazy talk. The tag engine should be isolated in its own process; then we could use a communication channel to query it, and even distribute it amongst several machines, issuing simultaneous queries from our web servers to mitigate GC freezes. The first tag engine to respond would be served to the user. Problem solved.

When I pitched this to the team there was violent objection: now we were going to need to maintain an API, deploys were going to be more complex, and the list goes on. Services, SLAs and the rest of Yegge's recommended cures for the API battle were not going to fly here.

Arguments aside, I did realise that by isolating the tag engine you only limit the impact of the issue; you do not solve the problem. Also, there is a cost to all the new complexity this would introduce.

We also considered Microsoft’s recommendation:

If you do have a lot of latency due to heavy pause times during full garbage collections, there is a feature that was introduced in 3.5 SP1 that allows you to be notified when a full GC is about to occur. You can then redirect to another server in a cluster for example while the GC occurs.

I was not particularly keen on this suggestion:

We did not have a channel built that would allow us to coordinate this activity.

There was a non-trivial risk that multiple machines would be forced to run full GC collections at the same time, leading to complex coordination efforts (we were seeing a full GC every couple of minutes).

It does not solve the underlying problem.
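For reference, the 3.5 SP1 notification API in question looks roughly like this. This is a sketch, not something we deployed; note that the API only works when concurrent GC is disabled, which is another reason it did not appeal:

```csharp
using System;

bool probed = false;
try
{
    // Thresholds (1-99): how early to be warned as gen 2 / LOH pressure builds.
    GC.RegisterForFullGCNotification(10, 10);

    // A monitoring thread would loop on this; here we poll once with a timeout.
    GCNotificationStatus status = GC.WaitForFullGCApproach(100);
    if (status == GCNotificationStatus.Succeeded)
    {
        // Pull this server out of the load balancer, then force the collection:
        GC.Collect();
        GC.WaitForFullGCComplete(100);
        // ...and put the server back in rotation.
    }

    GC.CancelFullGCNotification();
}
catch (InvalidOperationException)
{
    // Concurrent GC is enabled, so the notification API is unavailable.
}
probed = true;
```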

So Marc and I set off on a 3 week adventure to resolve the memory pressure.

The first step was rewriting the tag engine.

Index arrays as opposed to object arrays

In the tag engine we have to keep a global index of all our questions in one of six different sort orders. Trivially we used to use:

Question[] sortedByCreationDate;

The issue with this general pattern is that it is GC unfriendly. Having this array around means that each question now has one extra reference the GC is going to have to check.

Our tag engine is essentially immutable after construction or refresh. So we could simply replace the 2.2 million question list with:

int[] sortedByCreationDate;

An array containing indexes into the main immutable question list.

This is a pattern many applications use internally, including Lucene.

For large objects sets that are immutable, this is a great solution. Nothing particularly controversial here.

Classes to Structs where needed

The above optimisation got us a fair distance; however, we were left with a 2.2 million question list full of class pointers. The GC still needed to work really hard crawling each Question object to deallocate it when we swapped in the new question list. This was partially mitigated by reusing large chunks of our data during the refresh; however, it was not solved.

The cheapest solution from the GC's perspective was to have a simple struct array that it could allocate and deallocate in a single operation.

Structs, however, come with big warnings from Microsoft:

Do not define a structure unless the type has all of the following characteristics:

It logically represents a single value, similar to primitive types (integer, double, and so on).

It has an instance size smaller than 16 bytes.

It is immutable.

It will not have to be boxed frequently.

Migrating these millions of objects to structs was going to violate all of these rules. The classes were bigger than 16 bytes, mutable during refresh, and represented multiple values.

The unspoken reality is that Microsoft violates these rules in data structures we use daily, like Dictionary<TKey, TValue>.

Working efficiently with structs means you need a deep understanding of the C# semantics, for example, understanding this code should be second nature if you are to use structs.

var stuff = new MyStruct[10];

var thing = stuff[0];
thing.A = 1;
Console.WriteLine(stuff[0].A); // returns 0

stuff[0].A = 1;
Console.WriteLine(stuff[0].A); // returns 1

DoSomething(stuff[0]);
Console.WriteLine(stuff[0].A); // returns 1 (DoSomething can not do anything, it got a copy)

DoSomethingElse(ref stuff[0]);
Console.WriteLine(stuff[0].A); // returns 2

You have to be real careful not to operate on copies of data as opposed to the real data. Assignments copy. With this in mind, when dealing with huge collections of entities you can use structs.

As a general rule: I would only consider porting to structs for cases where there is a large number (half a million or more) of medium lived objects. By medium lived, I mean, objects that are released shortly after they reach generation 2. If they live longer, you may be able to stretch it a bit. However, at some point, the sheer number of objects can be a problem, even if they are not deallocated. The days of 64 bit computing are upon us, many enterprise .NET apps consume 10s of gigs in the managed heaps. Regular 2 second freezes can wreak havoc on many applications.

Even with all the upcoming exciting changes with the GC, the reality is that doing a ton less work is always going to be cheaper than doing a lot of work. Even if the work is done in the background and does not block as much.

Huge medium to long term object stores in a generational GC require careful planning, so the same principles apply to Java (even if its GC is less blocking).

.NET and other generational garbage collected frameworks are incredibly efficient at dealing with short lived objects. I would not even consider converting any of our short lived classes to structs. It would not make one iota of a difference.

Avoiding large amounts of int arrays

On our Question object we have a tags array.

struct Question { int[] Tags { get; set; } }

Even after we got rid of these 2.2 million refs we were still stuck with a large number of pointers to arrays that live in the struct.

To eliminate this we used a custom struct that pre-allocates 10 spots for tags and provides similar array semantics. Our implementation uses fixed int data[MaxLength]; as the internal backing store. However, you could implement something safe that is identical by having N int fields on the internal embedded struct.

The fixed size structure is a trade-off, we potentially take up more space, however avoid a lot of small memory allocations and deallocations.
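The safe variant of the idea looks something like this (the name, the ten-slot capacity and the Add/indexer surface are illustrative, not our exact code):

```csharp
using System;

var tags = new TagList();
tags.Add(31);   // hypothetical tag ids
tags.Add(417);
Console.WriteLine(tags[1]); // prints 417

// Fixed-capacity tag list: ten embedded int slots instead of a pointer to a
// heap-allocated int[]. Embedded in a Question struct array, this means the
// 2.2 million element list drags along zero extra objects for the GC to trace.
struct TagList
{
    public const int MaxLength = 10;
    int t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
    public int Count { get; private set; }

    public void Add(int tagId)
    {
        if (Count == MaxLength) throw new InvalidOperationException("TagList is full");
        Set(Count, tagId);
        Count++;
    }

    public int this[int index]
    {
        get
        {
            if ((uint)index >= (uint)Count) throw new IndexOutOfRangeException();
            switch (index)
            {
                case 0: return t0; case 1: return t1; case 2: return t2;
                case 3: return t3; case 4: return t4; case 5: return t5;
                case 6: return t6; case 7: return t7; case 8: return t8;
                default: return t9;
            }
        }
    }

    void Set(int index, int value)
    {
        switch (index)
        {
            case 0: t0 = value; break; case 1: t1 = value; break;
            case 2: t2 = value; break; case 3: t3 = value; break;
            case 4: t4 = value; break; case 5: t5 = value; break;
            case 6: t6 = value; break; case 7: t7 = value; break;
            case 8: t8 = value; break; default: t9 = value; break;
        }
    }
}
```

The switch-based indexer is clunkier than `fixed int data[MaxLength]`, but it needs no unsafe context and the JIT handles it well.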

I also experimented with Marshal.AllocHGlobal but managing lifetime for an embedded structure was way too complicated.

The above techniques are the primitives we use to implement our system, all of this is wrapped up in cleaner object oriented APIs.

In this spirit, if I were to implement a managed version of say memcached in C# I would use huge byte[] to store data for multiple entries. Even though performance of a managed memcached will never match a c++ implementation, you could design a managed store in C# that does not stall.
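A sketch of that design (a hypothetical API with no eviction or slab growth, just enough to show the allocation pattern):

```csharp
using System;
using System.Collections.Generic;

var store = new SlabStore(64 * 1024);
store.Set("greeting", new byte[] { 104, 105 }); // "hi"
Console.WriteLine(store.Get("greeting").Length); // prints 2

// Values for many entries packed into one large byte[]: the GC sees a single
// long-lived object instead of millions of small, medium-lived ones.
class SlabStore
{
    readonly byte[] slab;
    readonly Dictionary<string, (int Offset, int Length)> index =
        new Dictionary<string, (int, int)>();
    int used;

    public SlabStore(int capacity) { slab = new byte[capacity]; }

    public void Set(string key, byte[] value)
    {
        if (used + value.Length > slab.Length)
            throw new InvalidOperationException("slab full");
        Buffer.BlockCopy(value, 0, slab, used, value.Length);
        index[key] = (used, value.Length);
        used += value.Length;
    }

    public byte[] Get(string key)
    {
        var (offset, length) = index[key];
        var result = new byte[length]; // short-lived copy, cheap for a generational GC
        Buffer.BlockCopy(slab, offset, result, 0, length);
        return result;
    }
}
```

The per-Get copy is deliberately short-lived, which is the allocation profile a generational GC is best at.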

Conclusion

Some of our techniques may seem radical on first read. Others a lot less so. I can not stress how important it is to have queryable web logs. Without having proper visibility and the ability to analyze progress, progress can not be made.

If you face similar problems in a high scale .NET application, you may choose the large block allocation solution, rewrite components in c++, or isolate problem services.

Here is a graph plotting our progress over the months, it shows the percentage of times Question/Show takes longer than 100ms, 500ms or 1000ms.

The remaining delays these days are mostly GC related, however they are way shorter than they used to be. We practically erased all the 1 second delays. Our load these days is much higher than it was months ago.

I take full credit for any mistakes in this article and approach, but praise should go to Marc for implementing many of the fixes and to Kyle and Jeff for communicating the urgency and scale of the problem and to Jarrod for making our web logs queryable.

I am glad Jeff has a stopwatch in his brain.