Key Takeaways

Web servers often have far more memory than the .NET GC can efficiently handle under normal circumstances.

The performance benefits of a caching server are often lost due to increased network costs.

Memory Mapped Files are often the fastest way to populate a cache after a restart.

The goal of server-side tuning is to reach the point where your outbound network connection is saturated. You get there by minimizing CPU, disk, and internal network usage.

By keeping object graphs in memory, you can obtain the performance benefits of a graph database without the complexity.

Continuing the Big Memory topic on the .NET platform (part1, part2), this article describes the benefits of working with large data sets in-process in managed CLR server environments using Agnicore’s Big Memory Pile.

Overview

RAM is very fast and affordable these days, yet it is ephemeral. Every time the process restarts, memory is cleared out and everything has to be reloaded from scratch. To address this we have recently added Memory Mapped Files support to our solution, NFX Pile. With memory mapped files, the data can be quickly fetched from disk after a restart.

Overall, the Big Memory approach is beneficial for developers and businesses as it shifts the paradigm of high-performance computing on the .NET platform. Traditionally Big Memory systems were built in C/C++ style languages where you primarily dealt with strings and byte arrays. But it is hard to solve any real world business problems while focusing on low level data structures. So instead we are going to concentrate on CLR objects. Memory Pile allows developers to think in terms of object instances, and work with hundreds of millions of those instances that have properties, code, inheritance and other CLR-native functionality.

This is different from the language-agnostic object models proposed by some vendors (e.g. ones that interoperate between Java and .NET), which introduce extra transformations, and from out-of-process solutions, which require extra traffic, context switching, and serialization. Instead, we’re going to discuss in-process local heaps, or rather “Piles” of objects, which exist in managed code in large byte arrays. Individually, these objects are invisible to the GC.

Use Cases

Why would anyone use dozens or hundreds of gigabytes of RAM in the first place? Here are a few tested use cases of the Big Memory Pile technology.

The first thing that comes to mind is cache. In an E-Commerce backend we store hundreds of thousands of products ready to be displayed as detailed catalog listings. Each may have dozens of variations. When you build a catalog view listing 30+ products on a single screen, you’d better get those objects quickly, even for a single user scrolling a page with progressive loading. Why not use Redis or Memcached? Because we do the same thing in-process, saving on network traffic and serialization. Transforming data into network packets and back into objects can be a surprisingly expensive operation. Wouldn’t you use a Dictionary<id, Product> (or IMemoryCache) if it were possible to hold all several hundred thousand products and their variations? Caching data alone provided enough motivation for using RAM, but there is much more...

In another cache use case, a REST API server, we were able to pre-serialize around 50 million rarely changing JSON vectors as UTF8-encoded byte arrays. Each byte[], around 1024 bytes in size, could then be written directly into the HTTP stream, making the network the bottleneck at around 80,000 req/sec.
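The pre-serialization idea can be sketched with nothing but the base class library. This is a minimal illustration under stated assumptions, not the code from that server: the PreSerializedCache class and its members are hypothetical names, and a plain Dictionary stands in for the Pile-backed cache (at 50 million entries a Dictionary of byte[] would itself burden the GC, which is the problem Pile addresses).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical illustration; none of these names come from NFX.
public sealed class PreSerializedCache
{
    // UTF-8 without a byte order mark, so a response starts with the JSON itself.
    private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

    // Stand-in for a Pile-backed cache: id -> ready-to-send bytes.
    private readonly Dictionary<long, byte[]> m_Payloads = new Dictionary<long, byte[]>();

    // Pay the encoding cost once, when the rarely changing document is stored.
    public void Store(long id, string json)
    {
        m_Payloads[id] = Utf8NoBom.GetBytes(json);
    }

    // Serving a hit is a single buffer write into the response stream.
    public bool TryWrite(long id, Stream response)
    {
        byte[] payload;
        if (!m_Payloads.TryGetValue(id, out payload)) return false;
        response.Write(payload, 0, payload.Length);
        return true;
    }
}
```

On the request path no JSON serializer and no string encoding runs at all; the only per-request work is the lookup and the write.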

Working with complex object graphs is another perfect case for Pile. In a social app, we needed to traverse the conversation threads on Twitter. When tracing who said what and when on a social media site, the ability to hold hundreds of millions of small vectors in memory is invaluable. We could have used a graph DB; however, in our case we are the graph DB, right in the same process (it is a component hosted by our web MVC app). We’re now handling 100K+ REST API calls/sec, which is the limit of our network connection, while keeping the CPU usage low.

In this, and other use cases, background workers asynchronously update the social graph as changes come in. In many cases, such as the product catalog we talked about earlier, this can be done preemptively. You couldn’t do that with a normal cache that only holds a subset of the interesting data.

How it Works

Big Memory Pile solves the GC problems by transparently serializing CLR object graphs into large byte arrays, effectively “hiding” the objects from the GC’s reach. Not all object types need to be fully serialized, though: string and byte[] objects are written into Pile as raw buffers, bypassing all serialization mechanisms and yielding over 6 M inserts/second for a 64 char string on a 6 core box.
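To make the "hiding" idea concrete, here is a toy single-segment store for strings. It is a sketch under loud assumptions: ToySegment, its length-prefixed layout, and the use of a plain int offset as the "pointer" are all invented for illustration and say nothing about how NFX actually lays out its segments. It is also not thread-safe and never frees space.

```csharp
using System;
using System.Text;

// Toy single-segment store; hypothetical names, not the NFX implementation.
public sealed class ToySegment
{
    private readonly byte[] m_Buffer; // one large array: a single object for the GC to track
    private int m_Next;               // offset of the first free byte

    public ToySegment(int capacity) { m_Buffer = new byte[capacity]; }

    // Writes [4-byte length][UTF-8 bytes] and returns the start offset as a "pointer".
    public int Put(string value)
    {
        int byteCount = Encoding.UTF8.GetByteCount(value);
        if ((long)m_Next + 4 + byteCount > m_Buffer.Length)
            throw new InvalidOperationException("Segment is full");

        int pointer = m_Next;
        WriteInt32(pointer, byteCount);
        Encoding.UTF8.GetBytes(value, 0, value.Length, m_Buffer, pointer + 4);
        m_Next = pointer + 4 + byteCount;
        return pointer;
    }

    // Materializes a fresh string from the bytes stored at the given offset.
    public string Get(int pointer)
    {
        int byteCount = ReadInt32(pointer);
        return Encoding.UTF8.GetString(m_Buffer, pointer + 4, byteCount);
    }

    // Little-endian length prefix, written and read by hand to stay platform independent.
    private void WriteInt32(int offset, int value)
    {
        m_Buffer[offset] = (byte)value;
        m_Buffer[offset + 1] = (byte)(value >> 8);
        m_Buffer[offset + 2] = (byte)(value >> 16);
        m_Buffer[offset + 3] = (byte)(value >> 24);
    }

    private int ReadInt32(int offset)
    {
        return m_Buffer[offset]
             | (m_Buffer[offset + 1] << 8)
             | (m_Buffer[offset + 2] << 16)
             | (m_Buffer[offset + 3] << 24);
    }
}
```

However many strings are stored, the GC sees one byte[] and no per-item references, so a full collection has almost nothing to walk. The price is that every Get allocates a new string.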

The key benefit of this approach is its practicality. Real-life cases have shown phenomenal overall performance while using the native CLR object model. This saves development time because you don't need to create special-purpose DTOs, and it runs faster because no extra copies need to be made in between.

Overall, Pile has turned much of our I/O-bound code into CPU-bound code. What would normally have been a typical case for an async (I/O-bound) implementation became 100% synchronous linear code, which is simpler and performs better, as Tasks and other async/await goodies have a hidden cost when doing several hundred thousand ops/sec on a single server.

Big Memory Mapped Files

In-memory processing is fast and easy to implement, however when the process restarts you lose the dataset, which is large by definition (tens to hundreds of gigabytes). Pulling all of that data from its original source can be very time consuming, time that you can’t afford just after a restart.

To solve this we added Memory Mapped File (MMF) support using standard .NET classes: MemoryMappedFile and MemoryMappedViewAccessor. Now, instead of using byte[] as a backing store for memory segments, we use a MemoryMappedViewAccessor instance and some low-level tricks to access data by pointers directly. All of this is still done in standard C#; no C++ is involved, as we want to keep everything simple, especially the build chain.

Writing to memory via MemoryMappedViewAccessor (the MMFMemory class) modifies virtual memory pages in the OS layer directly. The OS tries to fit those pages in physical RAM; if it can’t, it swaps them out to disk. A nice feature of writing Pile into an MMF is that you don’t need to re-read everything from disk when the process restarts soon after shutdown. The OS keeps the pages that have been mapped into the process address space around even after the process terminates. Upon start, the MMFPile can access the pages already in RAM much more quickly than by reading them from disk anew.
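For readers who have not used the two standard classes mentioned above, the following sketch shows data surviving the disposal of a mapping, which is the property MMFPile relies on across restarts. MmfSketch and its methods are hypothetical names, and the sketch uses only the safe accessor API rather than the pointer-based access the article describes.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical helper built only on standard .NET classes; not the MMFMemory class.
public static class MmfSketch
{
    // Maps the file (creating it at the given size if needed) and writes one value.
    public static void WriteValue(string path, long capacity, long offset, long value)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, capacity))
        using (var view = mmf.CreateViewAccessor(0, capacity))
        {
            view.Write(offset, value); // modifies the mapped page, not a managed array
            view.Flush();              // ask the OS to write the view's dirty pages back to the file
        }
    }

    // Maps the existing file again, as a restarted process would, and reads the value back.
    public static long ReadValue(string path, long offset)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor())
        {
            return view.ReadInt64(offset);
        }
    }
}
```

A typical round trip would call WriteValue(path, 4096, 8, 42) and later ReadValue(path, 8). If the OS still holds the file's pages in RAM, the second mapping is served from memory rather than from disk.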

Do note that MMFPile yields slower performance than DefaultPile (based on byte[]) due to the unmanaged code context switch done in the MMFMemory class.

Here are some test results:

Benchmark insert 200,000,000 string[32] 12 threads:

(Machine: Intel Core i7 3.2 GHz, 6 cores, Win 7 64-bit, VS2017, .NET 4.5)

DefaultPile

24 sec @ 8.3 M insert/sec = 8.5 GB memory; Full GC < 8 ms

MMFPile

41 sec @ 4.9 M insert/sec = 8.5 GB memory + disk; Full GC < 10 ms

Flush all data to disk on Stop(): 10 sec

Read all data back to RAM: 48 sec = ~177 MB/sec

As you can see, the MMF solution does have an extra cost: throughput is lower due to the unmanaged MMF transition, and once you mount the Pile back from disk, warming up the RAM with data from disk takes time proportional to the amount of memory allocated. However, you do not need to wait for the whole working set to load, as the MMFPile is available for reads and writes immediately after Pile.Start(). The full load of all data does take time: in the example above, the 8.5 GB dataset takes 48 sec to warm up in RAM on a mid-grade SSD.

Benchmark insert 200,000,000 Person (class with 7 fields) objects 12 threads:

DefaultPile

85 sec @ 2.4 M insert/sec = 14.5 GB memory; Full GC < 10 ms

MMFPile

101 sec @ 1.9 M insert/sec = 14.5 GB memory + disk; Full GC < 10 ms

Flush all data to disk on Stop(): 30 sec

Read all data back to RAM: 50 sec = ~290 MB/sec
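The read-back figures follow from dividing the dataset size by the elapsed time, using decimal units (1 GB = 1000 MB). The small helper below reproduces that arithmetic so you can estimate warm-up time for your own dataset. WarmUpEstimate is a hypothetical name, and the estimate assumes warm-up stays proportional to size at a steady sequential read rate, which will vary with hardware.

```csharp
using System;

// Hypothetical helper that only restates the arithmetic behind the figures above.
public static class WarmUpEstimate
{
    // Read rate in MB/sec implied by a measured full read (decimal units).
    public static double ThroughputMbPerSec(double datasetGb, double seconds)
    {
        return datasetGb * 1000.0 / seconds;
    }

    // Seconds needed to page a dataset back into RAM at a given read rate.
    public static double WarmUpSeconds(double datasetGb, double mbPerSec)
    {
        return datasetGb * 1000.0 / mbPerSec;
    }
}
```

ThroughputMbPerSec(8.5, 48) gives about 177 and ThroughputMbPerSec(14.5, 50) gives 290, matching the two results. At the faster rate, a 64 GB pile would need roughly 220 seconds to warm up fully.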

Other Improvements

Since our previous post on InfoQ we have made a number of improvements to NFX.Pile:

Raw Allocator / Layered Design

The Pile implementation is now better layered, allowing us to treat string and byte[] as directly writeable/readable from the large contiguous blocks of RAM. The whole serialization mechanism is bypassed for byte[] completely, making it possible to use Pile as just a raw byte[] allocator.

    var ptr = pile.Put("abcdef"); //this will bypass all serializers
                                  //and use UTF8Encoding instead
    var original = pile.Get(ptr) as string;

Performance Boost

The segment allocation logic has been revised and yields 50%+ better performance during inserts from multiple threads, due to the introduction of a sliding window optimization that tries to avoid multi-threading contention. Also, strings and byte[] now bypass the serializer completely, yielding 5M+ inserts/sec in most cases (a 200%+ improvement).

Enumeration

It is now possible to enumerate the contents of the whole pile, as it implements the IEnumerable<PileEntry> interface. Each PileEntry struct exposes the entry's Pointer and Size:

    foreach(var entry in pile)
    {
      Console.WriteLine("{0} points to {1} bytes".Args(entry.Pointer, entry.Size));
      var data = pile.Get(entry.Pointer);
      …
    }

Durable Cache

For performance reasons, the default mode for the cache is “Speculative”. In this mode hash code collisions may cause lower priority items to be ejected from the cache even when there is otherwise enough memory.

The cache server can now store data in a “Durable” mode, which works more like a normal dictionary. Because durable mode needs to do rehashing in the bucket, it is 5-10% slower than speculative mode. This is hardly noticeable for most applications, but you’ll need to test to see what is best for your particular situation.

    //Specify TableOptions for ALL tables, make tables DURABLE
    cache.DefaultTableOptions = new TableOptions("*")
    {
      CollisionMode = CollisionMode.Durable
    };
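The difference between the two collision modes can be pictured with a toy fixed-size table. This is only a sketch of the behavior described above: ToyTable, its slot layout, and the use of linear probing for the durable case are assumptions made for illustration, and the real NFX cache tables may work differently.

```csharp
using System;

// Toy table contrasting the two modes; hypothetical, not the NFX cache implementation.
public sealed class ToyTable
{
    private struct Slot { public bool Used; public int Key; public string Value; public int Priority; }

    private readonly Slot[] m_Slots;
    private readonly bool m_Durable;

    public ToyTable(int size, bool durable)
    {
        m_Slots = new Slot[size];
        m_Durable = durable;
    }

    private int Home(int key) { return (key & int.MaxValue) % m_Slots.Length; }

    // Returns false when the item could not be stored.
    public bool Put(int key, string value, int priority)
    {
        int home = Home(key);
        var item = new Slot { Used = true, Key = key, Value = value, Priority = priority };

        if (!m_Durable)
        {
            // Speculative: only the home slot is considered, so a colliding key
            // either ejects a lower-or-equal priority occupant or is rejected.
            Slot occupant = m_Slots[home];
            if (occupant.Used && occupant.Key != key && occupant.Priority > priority) return false;
            m_Slots[home] = item;
            return true;
        }

        // Durable: probe further until a free slot or the same key is found.
        for (int i = 0; i < m_Slots.Length; i++)
        {
            int index = (home + i) % m_Slots.Length;
            if (!m_Slots[index].Used || m_Slots[index].Key == key)
            {
                m_Slots[index] = item;
                return true;
            }
        }
        return false; // every slot is taken by other keys
    }

    public bool TryGet(int key, out string value)
    {
        int home = Home(key);
        int probes = m_Durable ? m_Slots.Length : 1;
        for (int i = 0; i < probes; i++)
        {
            int index = (home + i) % m_Slots.Length;
            if (!m_Slots[index].Used) break; // nothing was ever stored past this point
            if (m_Slots[index].Key == key) { value = m_Slots[index].Value; return true; }
        }
        value = null;
        return false;
    }
}
```

With four slots, keys 1 and 5 share a home slot. In the speculative table the second Put ejects the first item even though three slots are free; in the durable table both survive, at the cost of extra probing on every operation.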

In-Place Object Mutation and Pre-allocation

It is now possible to alter objects at the existing PilePointer address. The new API Put(PilePointer...) allows for placing a different payload at the existing location. If the new payload does not fit in the existing block, then Pile creates an internal link to the new location (a la file system link in *nix systems) effectively making the original pointer an alias to the new location. Deleting the original pointer deletes the link and what it points to. The aliases are completely transparent and yield the target payload on read.

You can also pre-allocate more RAM for the future payload by specifying the preallocateBlockSize in the Put() call.

    //Implement linked list stored in Pile
    public class ListNode
    {
      public PilePointer Previous;
      public PilePointer Next;
      public PilePointer Value;
    }
    ...
    private IPile m_Pile;//big memory pile
    private PilePointer m_First;//list head
    private PilePointer m_Last;//list tail
    ...
    //Append a person instance to a person linked list stored in a Pile
    //returns last node; assumes the list already contains at least one node
    public PilePointer Append(Person person)
    {
      var newLast = new ListNode{ Previous = m_Last,
                                  Next = PilePointer.Invalid,
                                  Value = m_Pile.Put(person)};
      var newLastPtr = m_Pile.Put(newLast);//add new node to the tail to obtain its pointer
      var existingLast = (ListNode)m_Pile.Get(m_Last);
      existingLast.Next = newLastPtr;//link the old tail to the new node
      m_Pile.Put(m_Last, existingLast);//in-place edit at the existing ptr m_Last
      m_Last = newLastPtr;
      return m_Last;
    }
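The aliasing rule described above (overwrite in place when the payload fits, otherwise relocate and leave a link behind) can be modeled in a few lines. This is a toy under stated assumptions: ToyAliasStore, its Block class, and the integer pointers are hypothetical, and the model tracks capacities in a managed list rather than in raw memory as a real pile would.

```csharp
using System;
using System.Collections.Generic;

// Toy model of pointer aliasing; hypothetical, not how NFX lays out memory.
public sealed class ToyAliasStore
{
    private sealed class Block
    {
        public int Capacity;    // bytes reserved for this block
        public byte[] Payload;  // current content, or null when the block is only a link
        public int LinkTo = -1; // index of the block holding the real payload, or -1
    }

    private readonly List<Block> m_Blocks = new List<Block>();

    // Allocates a new block; preallocate reserves room for future growth.
    public int Put(byte[] payload, int preallocate = 0)
    {
        m_Blocks.Add(new Block { Capacity = Math.Max(payload.Length, preallocate), Payload = payload });
        return m_Blocks.Count - 1;
    }

    // Overwrites in place when the payload fits; otherwise relocates it and links to it.
    public void Put(int pointer, byte[] payload)
    {
        Block original = m_Blocks[pointer];
        Block target = original.LinkTo >= 0 ? m_Blocks[original.LinkTo] : original;
        if (payload.Length <= target.Capacity)
        {
            target.Payload = payload;
            return;
        }
        if (original.LinkTo >= 0) m_Blocks[original.LinkTo] = null; // release the old target
        original.Payload = null;
        original.LinkTo = Put(payload); // the original pointer becomes an alias
    }

    // Reads through the alias transparently.
    public byte[] Get(int pointer)
    {
        Block block = m_Blocks[pointer];
        return block.LinkTo >= 0 ? m_Blocks[block.LinkTo].Payload : block.Payload;
    }

    // Deleting the original pointer removes both the link and what it points to.
    public void Delete(int pointer)
    {
        Block block = m_Blocks[pointer];
        if (block.LinkTo >= 0) m_Blocks[block.LinkTo] = null;
        m_Blocks[pointer] = null;
    }

    // True when the pointer currently resolves through a link.
    public bool IsAlias(int pointer) { return m_Blocks[pointer].LinkTo >= 0; }
}
```

Pre-allocating is what keeps the first branch cheap: a block created with Put(payload, 64) can absorb any later payload of up to 64 bytes without ever turning into an alias.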

For more information see our video: .NET Big Memory Object Pile - Use 100s of millions of objects in RAM

About the Author

Dmitriy Khmaladze has over 20 years of IT experience in the US, with startups and Fortune 500 clients; Galaxy Hosted, pioneered SaaS for the medical industry in 1998; 15+ years of research: language and compiler design, distributed architecture; system programming and architecture, C/C++, .NET, Java, Android, iOS, web design, HTML5, CSS, JavaScript, RDBMSs and NoSQL/NewSQL.