In part one, Leonid Ganeline, of IT Adapter, introduced the concept of big memory and discussed why it is so hard to deal with in a .NET environment. In part two, Dmitriy Khmaladze, of IT Adapter, describes their solution, NFX Pile, a hybrid memory manager written in C# with 100% managed code.

I could not let the problem of garbage collection pauses go. Something was vexing me; the task I had in one of our projects was a huge graph of users and addresses along with social messages - all of that stuff had to be traversed. A friend of mine told me LinkedIn used to have a C++ app where the whole blob of a social net was taken into RAM and kept there for months… hm... I just could not let it go. We started doing many experiments.

After pondering for a few days, we laid out some different techniques outlined in the prior article. None of them looked good to us.

DTOs suck. They create code you don’t need, chaff. I already have my objects nice and clean: Customer, Address, Link, and Message. Why would I duplicate them into structs? And then there are the references between Customer, Address, and Messages to deal with.

Pools would not have helped; I need to store hundreds of millions of entries. GC would just start pausing for five, sometimes ten, seconds every time. This was unacceptable.

So we went with the serialize-into-byte[] approach. Yes, I know, slow. But is it?

This was a true “jump off the cliff” experiment. Who could estimate the practical benefit without creating a true memory manager first? So we did. We wrote a memory manager a la the ones in C or C++, with free blocks and chunk headers. 100% managed code. No IntPtr style unmanaged pointers, just an Int32 handle we call a PilePointer. We used our Slim serializer, which is akin to Microsoft BinaryFormatter, only 5 times faster with 10x smaller payloads.

The end result was really good. We first attained around 50,000 inserts/sec on a single thread while reading around 100,000 ops/sec on a single thread. Then we spent time working on multi-threading synchronization with better allocation strategies and everything started to perform really well.

The key piece, of course, is the Slim Serializer. It is a non-versioning, CLR-only, packing native serializer based on dynamic expression tree generation. The performance is close to protobuf-net; however, Slim does not impose any limits on the payload as long as it is serializable (ISerializable). The OnSer/Deser attribute family is supported as well.

We called the memory manager the “Pile” to differentiate from “heap” but retain a similar meaning. So, “The Pile” is a memory manager that brings an unmanaged memory flavor to the managed memory paradigm. As tautological as it may sound, the thing turned out to have a very cool set of practical benefits.

IPile Interface

IPile Implementation

The Pile is 100% managed code. It works as follows:

A user needs to save a CLR object graph: var ptr = pile.Put(myGraph). The graph may contain cycles and complex objects like Dictionary<string, ConcurrentDictionary<...>>… you get the picture. Anything [Serializable] per the .NET definition should be Pile-able! On a 6 core machine with 3 GHz/64 GB we were able to put 1.5 million simple objects (with 8 fields) every second!

Pile serializes the payload using an intricate type-compression scheme, so if a CLR object takes 50 bytes, the Pile serializer produces 20 bytes.

Then it finds a segment. Segments are byte[] chunks of at least 64 MB and up to 2 GB in size. Pile manages memory within each such chunk with free lists and scans when needed.

It fits the serialized data into the segment at Address (which is just an index into the byte[]). This way we get a PilePointer = struct{int Segment, int Address}. It is a struct, not a class. This is VERY important.

You can “dereference” the PilePointer with var myData = pile.Get(ptr). The pointer returned by Put() is a form of “object ID”.

If the pointer is invalid you get a “Pile access violation”: either the segment is not found, or the address is out of segment bounds, OR the byte position does not start with the proper chunk prefix, so there are some protection mechanisms. These are not 100% perfect, as theoretically one could dereference a freed pointer, BUT this is exactly what we wanted… the unmanaged flavor! I can dereference 1 to 2 million objects per second this way.

You can delete an object to free its pile chunk: pile.Remove(ptr). The memory is released so it can be reused.

When segments become empty, they get truncated - the whole byte[] gets “forgotten”. The GC can deallocate this in less than 10 ms. YES!!!!

While Pile works, even under the heaviest load (having 60 GB allocated), a complete GC is usually under 30 ms. Most of the time it is closer to 10-15 ms. In spite of the fact that we are churning through hundreds of thousands of objects a second (put/get), the GC is still very fast, because those CLR objects are all in Gen0: they live for <1 ms and are forgotten. The resident data copy is now stored in a pile not touched by GC. Instead of seeing 500,000,000 resident objects, GC sees 200 resident objects (segments) and 1,000,000 Gen0 objects that die instantly!
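To make the moving parts concrete, here is a minimal C# sketch of the idea. It is not the NFX implementation: ToyPile, its 5-byte chunk header, the flag values, and the small default segment size are all assumptions made for illustration. It stores raw byte[] payloads instead of running a serializer, and it uses simple bump allocation, so freed chunks are only marked and never reused (the real Pile uses free lists).

```csharp
using System;
using System.Collections.Generic;

// A value-type handle: segment index plus byte offset inside that segment.
public readonly struct PilePointer
{
    public readonly int Segment;
    public readonly int Address;
    public PilePointer(int segment, int address) { Segment = segment; Address = address; }
}

// Hypothetical toy, NOT NFX code: a few large byte[] segments hold all payloads.
public sealed class ToyPile
{
    private const byte UsedFlag = 0xA5;   // assumed chunk prefix for a live chunk
    private const byte FreeFlag = 0x5A;   // assumed chunk prefix for a freed chunk
    private const int HeaderSize = 5;     // 1 flag byte + 4 bytes of payload length

    private readonly int _segmentSize;
    private readonly List<byte[]> _segments = new List<byte[]>();
    private readonly List<int> _tops = new List<int>(); // next free offset per segment

    public ToyPile(int segmentSize = 1024) { _segmentSize = segmentSize; }

    public int SegmentCount { get { return _segments.Count; } }

    public PilePointer Put(byte[] payload)
    {
        int need = HeaderSize + payload.Length;
        if (need > _segmentSize) throw new ArgumentException("Payload does not fit in a segment");

        // Use the last segment, or allocate a new one when it is full.
        int seg = _segments.Count - 1;
        if (seg < 0 || _tops[seg] + need > _segmentSize)
        {
            _segments.Add(new byte[_segmentSize]);
            _tops.Add(0);
            seg = _segments.Count - 1;
        }

        byte[] buf = _segments[seg];
        int addr = _tops[seg];
        buf[addr] = UsedFlag;
        BitConverter.GetBytes(payload.Length).CopyTo(buf, addr + 1);
        Buffer.BlockCopy(payload, 0, buf, addr + HeaderSize, payload.Length);
        _tops[seg] = addr + need;
        return new PilePointer(seg, addr);
    }

    // Every Get returns a fresh copy, so callers never share mutable state.
    public byte[] Get(PilePointer ptr)
    {
        byte[] buf = Locate(ptr);
        int length = BitConverter.ToInt32(buf, ptr.Address + 1);
        var copy = new byte[length];
        Buffer.BlockCopy(buf, ptr.Address + HeaderSize, copy, 0, length);
        return copy;
    }

    // Marks the chunk as free; this toy does not recycle the space.
    public void Remove(PilePointer ptr)
    {
        Locate(ptr)[ptr.Address] = FreeFlag;
    }

    // The three checks described above: segment, bounds, chunk prefix.
    private byte[] Locate(PilePointer ptr)
    {
        if (ptr.Segment < 0 || ptr.Segment >= _segments.Count)
            throw new InvalidOperationException("Pile access violation: segment not found");
        byte[] buf = _segments[ptr.Segment];
        if (ptr.Address < 0 || ptr.Address + HeaderSize > buf.Length)
            throw new InvalidOperationException("Pile access violation: address out of bounds");
        if (buf[ptr.Address] != UsedFlag)
            throw new InvalidOperationException("Pile access violation: bad chunk prefix");
        return buf;
    }
}
```

However many payloads are stored, the GC only has to track the handful of byte[] segments, and a PilePointer kept in an array or a struct field adds no object of its own for the GC to trace.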

This is thread-safe of course and also has a side benefit for caching: since Pile makes a new instance on every Get(), you can safely hand a copy of the object to each requesting thread, even if they compete for the same object (same PilePointer). This is like the “actor model” in functional languages, where a worker takes the order, mutates it into something new, then passes it along the chain - no need to lock().

Internally, Pile is 100% thread safe for Put()/Get()/Remove() operations. The implementation is based on some clever ref swaps and custom-designed spin waits instead of lock(), as the latter is too slow for performance sensitive operations.
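The article does not show how those ref swaps and spin waits are built, so the following is only a generic sketch of the technique, using the standard Interlocked.CompareExchange and SpinWait types from System.Threading. SpinGate and SpinGateDemo are hypothetical names, not NFX classes.

```csharp
using System.Threading;

// Hypothetical spin-based gate: an atomic compare-and-swap on an int flag,
// with SpinWait backing off between failed attempts.
public sealed class SpinGate
{
    private int _held; // 0 = free, 1 = held

    public void Enter()
    {
        var spinner = new SpinWait();
        // Try to flip 0 -> 1 atomically; spin while another thread holds the gate.
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            spinner.SpinOnce();
    }

    public void Exit()
    {
        Volatile.Write(ref _held, 0);
    }
}

public static class SpinGateDemo
{
    // Increments a shared counter from several threads under the gate
    // and returns the final value.
    public static int Run(int threadCount, int incrementsPerThread)
    {
        var gate = new SpinGate();
        int counter = 0;
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < incrementsPerThread; j++)
                {
                    gate.Enter();
                    try { counter++; }
                    finally { gate.Exit(); }
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return counter;
    }
}
```

A gate like this only pays off when the protected section is very short, as with updating a few allocation fields; for longer critical sections spinning wastes CPU.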

Test Results

We have constructed a test app, and tried to do similar work in both “native object format” and Pile.

An important note to keep in mind: When you compare the test results below against, say, Redis or Memcached, ask yourself a question: am I comparing apples to apples? Does your API give you back a .NET object or a string you need to parse somehow to access “fields”? If it gives you a string or some other form of “raw” data, please do not forget to account for the serialization time needed to convert the raw data into a usable .NET object. In other words, a string ‘{“name”: ”Frank Drebin”, “salary”: “1000000”}’ does not allow you to access “salary” without parsing. Pile returns you a real CLR object - nothing to parse, so the ser/deser time is already accounted for in the numbers below. When Pile stores byte[] or strings, its throughput is at least five, if not ten, times higher.

Test machine: one Intel Core 3x I7 Sandy Bridge, 3 GHz, 64 GB RAM.

Test object: 7 fields

Native .NET+GC on Windows, GC is in the “Server” mode, 1 thread:

Object size: around 144 bytes

Stable operation with <10 million objects in RAM before slowdowns start

Speed deteriorates (SpeeDet) after: 25 million objects added

Allocated memory at SpeeDet: 10 GB

Writes: average 0.8 million objects / sec (starts at millions+/sec then slows down significantly after SpeeDet), interrupted by GC pauses

Reads while writing: >1 million objects / sec interrupted by GC pauses

Garbage Collection stop-all pauses near SpeeDet: 1,000 - 2,500 ms (2,000 ms average)

Full Garbage Collection near SpeeDet: 1,800 - 5,900 ms

Pile on Windows, 1 thread:

Here is source code for the test application.

Object size: around 75 bytes

Stable operation with >1,000 million serialized objects in RAM (yes, ONE BILLION objects)

Slow down after: 600 million objects (as they stop fitting in 64 GB physical RAM)

Pile memory: 75 GB

Allocated memory: 84 GB

Writes: 0.5 million objects / sec without interruptions

Reads while writing: 0.7 million objects / sec without interruptions

Garbage Collector stop-all pauses at 600M objects: none

Full Garbage Collection: stable time of <30 ms (15 ms average)

How is This Possible?

At first it may seem strange that our Pile solution is as fast as or faster than the native GC, especially when you factor in the cost of serialization.

But when you think about it, it really isn’t that unexpected. The native GC approach with native objects and no serialization is thousands of times faster in a simple benchmark. But once you overload the GC with millions of resident objects and keep allocating, the process starts to pause. That is why the seemingly impossible test result is possible in a practical scenario. GC just can’t handle tens of millions of objects if they stick around.

As I have described above, there are several technical achievements in the NFX framework that move this solution from Sci-Fi to reality:

Very fast serializer (NFX.Slim). It is optimized for bulk serialization-deserialization. It does not slow down on a large number of objects and is very efficient in packing serialized data.

The serializer can work with .NET classes without any additional class treatment. We don’t have to describe classes in IDL. To send any object to Pile we don’t have to do additional boilerplate tasks; we send .NET serializable objects directly. This is cheap in terms of developer labor with business objects.

The serializer can work with any .NET classes and with polymorphic classes of any complexity. As far as we know, only Microsoft BinaryFormatter can do such things. Unfortunately BinaryFormatter is very slow and inefficient in packing serialized data. NFX.Slim also does stateful type registry compression - it remembers the types written/read. Instead of emitting the type name into the stream every time, it writes the number from the pool.
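The type registry idea can be sketched in a few lines of C#. This is an illustration of the general technique only, not NFX.Slim's wire format: TypeRegistryWriter, the one-byte marker, and the fixed 4-byte id are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of stateful type-name compression: the first time a type
// is seen its full name goes into the stream; afterwards only a small id does.
public sealed class TypeRegistryWriter
{
    private readonly Dictionary<Type, int> _ids = new Dictionary<Type, int>();
    private readonly BinaryWriter _writer;

    public TypeRegistryWriter(Stream output)
    {
        _writer = new BinaryWriter(output);
    }

    // Writes a type reference and returns how many bytes it took.
    public long WriteType(Type type)
    {
        long before = _writer.BaseStream.Position;
        int id;
        if (_ids.TryGetValue(type, out id))
        {
            _writer.Write(true);   // marker: "known type, id follows"
            _writer.Write(id);     // 4 bytes instead of the whole name
        }
        else
        {
            _ids.Add(type, _ids.Count);
            _writer.Write(false);  // marker: "new type, name follows"
            _writer.Write(type.AssemblyQualifiedName ?? type.Name);
        }
        _writer.Flush();
        return _writer.BaseStream.Position - before;
    }
}
```

Writing typeof(string) twice through one TypeRegistryWriter costs the full assembly-qualified name the first time and 5 bytes the second time. A matching reader would keep the same table in the same order, which is why the compression is stateful: both sides must see the stream from the start.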

Cache

After we achieved some very cool performance results with Pile, it was time for a cache. After all, a Pile as-is is not enough to “look up” data by key.

A Big Memory Pile Cache is abstracted into an interface that supports priority, maximum age, absolute expiration timestamps and memory limits. When you approach a limit, objects start to get overwritten if their priorities allow.

The cache evicts old data (auto-delete) and expires objects at a certain timestamp if it was set. It also clones detailed table settings from a cache-wide setup where everything is configurable (index size, LWM, HWM, grow/shrink percentages etc.).
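As a rough illustration of how priority, maximum age, and absolute expiration can interact, here is a small C# sketch. It is not the NFX cache interface: ToyCache, its entry-count capacity (standing in for a memory limit), and the rule that a newcomer may overwrite the lowest-priority entry only when its own priority is at least as high are all assumptions. The caller passes the current time explicitly to keep the behavior deterministic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy cache, NOT the NFX API.
public sealed class ToyCache<TValue>
{
    private struct Entry
    {
        public TValue Value;
        public int Priority;
        public DateTime ExpiresUtc;
    }

    private readonly Dictionary<string, Entry> _items = new Dictionary<string, Entry>();
    private readonly int _capacity;

    public ToyCache(int capacity) { _capacity = capacity; }

    // Returns false when the cache is full and no entry may be overwritten.
    public bool Put(string key, TValue value, int priority, DateTime nowUtc,
                    TimeSpan maxAge, DateTime? absoluteExpirationUtc = null)
    {
        // The entry dies at whichever comes first: max age or absolute expiration.
        DateTime expires = nowUtc + maxAge;
        if (absoluteExpirationUtc.HasValue && absoluteExpirationUtc.Value < expires)
            expires = absoluteExpirationUtc.Value;

        if (!_items.ContainsKey(key) && _items.Count >= _capacity)
        {
            // Pick a victim: any expired entry, otherwise the lowest priority one.
            string victim = null;
            int lowest = 0;
            bool victimExpired = false;
            foreach (var kv in _items)
            {
                if (kv.Value.ExpiresUtc <= nowUtc)
                {
                    victim = kv.Key;
                    victimExpired = true;
                    break;
                }
                if (victim == null || kv.Value.Priority < lowest)
                {
                    victim = kv.Key;
                    lowest = kv.Value.Priority;
                }
            }
            if (victim == null) return false;
            if (!victimExpired && lowest > priority) return false;
            _items.Remove(victim);
        }

        _items[key] = new Entry { Value = value, Priority = priority, ExpiresUtc = expires };
        return true;
    }

    public bool TryGet(string key, DateTime nowUtc, out TValue value)
    {
        Entry entry;
        if (_items.TryGetValue(key, out entry))
        {
            if (entry.ExpiresUtc > nowUtc)
            {
                value = entry.Value;
                return true;
            }
            _items.Remove(key); // lazily drop an expired entry
        }
        value = default(TValue);
        return false;
    }
}
```

In a Pile-backed cache the stored value would be a PilePointer rather than the object itself, so the index stays small while the payload lives in the byte[] segments.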

The cache is described in the linked video.

Test Screens

Here you have it. 1+ billion objects are allocated on a Pile. The write throughput is now around a paltry 300,000 inserts a second from 10 threads. This is because we have also allocated 80+ GB on a 64 GB machine. Actually, I was surprised swapping did not kill it altogether. It still works, AND full-scan GC is 11 ms! This machine is a 3 GHz 6 core i7 with 64 GB physical RAM running Windows 7 64-bit.

And here we are on Ubuntu Linux under Mono, having 300M objects taking around 27 GB of RAM. The machine is a 4 core i7, 3 GHz, with 32 GB physical RAM. The 8 threads are still inserting at 500K/sec.

After I purge the Pile notice the GC deallocating 28 GB in 72 ms (shown in window title):

And this is how it looks on a Windows box:

Links

NFX Source

Big Memory Pile

Big Memory Cache

Serializer Benchmark Suite (NFX.Slim serializer among others):

Source

Test results

Test result table: Typical Person

Test result charts: Typical Person

NFX Pile Video - How to store 300,000,000 objects in CLR/.NET

NFX Pile - .NET Managed memory Manager Video

NFX Pile Cache Video

About the Authors

Dmitriy Khmaladze has over 20 years of IT experience in the US. Startups and Fortune 500 clients; Galaxy Hosted, Pioneered SaaS for medical industry in 1998; 15+years research: language and compiler design, distributed architecture; System programming and architecture, C/C++,.NET, Java, Android, IOS, Web design, HTML5, CSS, JavaScript, RDBMSs and NoSQL/NewSQL.

Leonid Ganeline has 15 years of experience as an Integration Developer and holds Microsoft MVP Awards in Integration. He is a blogger and enjoys everything about software, traveling, reading, and running.