It’s not unusual in financial service systems to have problems that require significant vertical, as opposed to horizontal, scaling. During his talk at QCon London, Peter Lawrey described the particular problems that occur when you scale a Java application beyond 32GB.

Starting from the observation that Java responds much faster if you can keep your data in memory rather than going to a database or some other external resource, Lawrey described the kind of problems you hit when you go above the 32GB range that Java is reasonably comfortable in. As you’d expect GC pause times become a major problem, but also memory efficiency drops significantly, and you have the problem of how to recover in the event of a failure.

Suppose your system dies and you want to pull in your data to rebuild that system. If you are pulling data in at around 100 MB/s, which is not an unreasonable rate (it’s about the saturation point of a gigabit line and if you have a faster network you may not want to be maxing out those connections because you still want to be able to handle user requests), then 10GB takes about 2 minutes to recover, but as your data sets get larger you are getting into hours or even days. A Petabyte is about 4 months which is obviously unrealistic, particularly if your data is also changing.
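The arithmetic behind these recovery times can be sketched as follows. This is only an illustration: the class and method names are made up for this example, and it assumes a constant 100 MB/s (decimal megabytes) with no protocol overhead.

```java
import java.time.Duration;

final class RecoveryTime {
    // Assumed sustained transfer rate, taken from the figure quoted in the talk.
    static final long BYTES_PER_SECOND = 100_000_000L;

    // Time needed to pull dataBytes back into memory at the assumed rate.
    static Duration timeToReload(long dataBytes) {
        return Duration.ofSeconds(dataBytes / BYTES_PER_SECOND);
    }

    public static void main(String[] args) {
        long gb = 1_000_000_000L;
        long[] sizes = {10 * gb, 1_000 * gb, 1_000_000 * gb}; // 10GB, 1TB, 1PB
        for (long size : sizes) {
            Duration d = timeToReload(size);
            System.out.printf("%,d GB -> %,d s (%d whole days)%n",
                    size / gb, d.getSeconds(), d.toDays());
        }
    }
}
```

At this rate 10GB reloads in 100 seconds, while a Petabyte needs ten million seconds, which is where the figure of roughly four months comes from.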

Generally this problem is solved by replicating what is going on in a database. Lawrey mentioned Speedment SQL Reflector as one example of a product that can be used to do this.

The memory efficiency problems stem from the observation that as a language Java tends to produce a lot of references. Java 7 has compressed oops turned on by default if your heap is less than 32GB. This means it can use 32 bit references, instead of 64 bit references, for every reference. If your heap is below about 4GB the JVM can use direct addressing. Above this it uses the fact that every object in Java is aligned to 8 bytes, which means that the bottom 3 bits of every address are always zero. Intel processors have an addressing mode that multiplies a number by 8 before it is used as an address, so there is hardware support for scaling a 32 bit reference by 8, allowing us to address 32GB instead of only 4GB.

Moving to Java 8 doubles the limit, since Java 8 adds support for an object alignment multiplier of 16, allowing it to address a 64GB heap using a 32 bit reference. However, if you need to go beyond 64GB in Java 8 then your only option is to use 64 bit references, which adds a small but significant overhead on main memory use. It also reduces the efficiency of CPU caches as fewer objects can fit in.
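To make the scaling trick concrete, here is a minimal sketch of how a 32 bit reference can cover 32GB or 64GB. It is not the JVM's actual implementation: the class, the method names and the `heapBase` parameter are hypothetical, and the code only models the shift-and-add arithmetic described above.

```java
final class CompressedOops {
    // Largest heap a 32 bit reference can cover for a given object alignment.
    static long maxHeapBytes(int alignmentBytes) {
        return (1L << 32) * alignmentBytes;
    }

    // Compress an aligned address by dropping the low bits that are always zero.
    static int encode(long address, long heapBase, int alignmentBytes) {
        int shift = Integer.numberOfTrailingZeros(alignmentBytes); // 3 for 8, 4 for 16
        return (int) ((address - heapBase) >>> shift);
    }

    // Expand a compressed reference: treat it as unsigned, scale it, add the base.
    static long decode(int compressed, long heapBase, int alignmentBytes) {
        int shift = Integer.numberOfTrailingZeros(alignmentBytes);
        return heapBase + ((compressed & 0xFFFF_FFFFL) << shift);
    }

    public static void main(String[] args) {
        System.out.println("8 byte alignment:  " + (maxHeapBytes(8) >> 30) + " GB");
        System.out.println("16 byte alignment: " + (maxHeapBytes(16) >> 30) + " GB");
    }
}
```

On HotSpot the alignment is controlled with the `-XX:ObjectAlignmentInBytes` option. The trade-off is that a larger alignment pads every object to a coarser boundary, so some of the memory saved on references is spent on padding instead.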

At this scale GC pause times can become a significant issue. Lawrey noted that Azul Zing is a fully concurrent collector with worst case pause times of around 1-10 ms. Zing uses an extra level of indirection which allows it to move an object whilst it is being used, and it will scale to 100s of GBs.

Another approach is to have a library that does the memory management for you but in Java code - a product like Terracotta BigMemory or Hazelcast High-Density Memory Store can cache large amounts of data either within a single machine or across multiple machines. BigMemory uses off heap memory to store the bulk of its data. The difference between these solutions and Zing is that you only get the benefit of the extra memory if you go through their library.
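As a rough illustration of what storing data off heap means, the sketch below keeps an array of longs in native memory using the standard `ByteBuffer.allocateDirect` call. It is not the API of BigMemory or Hazelcast; the `OffHeapLongArray` class is a hypothetical example, and a single direct buffer is limited to 2GB, so a real library would manage many such regions.

```java
import java.nio.ByteBuffer;

final class OffHeapLongArray {
    private final ByteBuffer buffer;

    OffHeapLongArray(int length) {
        // allocateDirect reserves native memory outside the Java heap, so the
        // garbage collector never has to scan or copy the stored values.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(length, Long.BYTES));
    }

    long get(int index) {
        return buffer.getLong(index * Long.BYTES); // absolute read, bounds checked
    }

    void set(int index, long value) {
        buffer.putLong(index * Long.BYTES, value); // absolute write, bounds checked
    }

    int length() {
        return buffer.capacity() / Long.BYTES;
    }

    public static void main(String[] args) {
        OffHeapLongArray prices = new OffHeapLongArray(1_000_000);
        prices.set(42, 123_456L);
        System.out.println(prices.get(42));
    }
}
```

The example also shows the limitation mentioned above: the data is only reachable through the wrapper's `get` and `set` methods, not as ordinary Java objects.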

Another limit you hit is that a lot of systems have NUMA regions limited to a terabyte. This isn’t set in stone, but Ivy Bridge and Sandy Bridge Xeon processors are limited to addressing 40 bits of memory. In Haswell this has been lifted to 46 bits. Each socket has “local” access to a bank of memory, however to access the other bank it needs to use a bus. This is much slower, and the GC in Java can perform very poorly if it doesn’t sit within one NUMA region, because the collector assumes it has random access to the memory - and so if you see a GC go across NUMA regions it can suddenly slow down dramatically and will also perform somewhat erratically.

To get beyond 40 bits a lot of CPUs support a 48 bit address space - both the Intel and AMD 64 bit chips do this. They use a multi-tiered lookup to find any given page in memory. This means the page in memory may be in a different place from where your application thinks it is, and in fact you can have more virtual memory in your application than you have physical memory. The virtual memory is generally in the form of memory-mapped files taken from disk. So this introduces a 48 bit limit for the maximum size of an application. Within CentOS that is 256TB, under Windows 192TB. The point really is that memory mappings are not limited to main memory size.
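The sizes behind these bit counts are simple powers of two. The following sketch, with made-up class and method names, converts the 40, 46 and 48 bit limits mentioned above into terabytes.

```java
final class AddressSpace {
    // Number of bytes that can be addressed with the given number of address bits.
    static long addressableBytes(int bits) {
        return 1L << bits;
    }

    // Convert bytes to binary terabytes (1TB = 2^40 bytes).
    static long toTerabytes(long bytes) {
        return bytes >> 40;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {40, 46, 48}) {
            System.out.println(bits + " bits -> "
                    + toTerabytes(addressableBytes(bits)) + " TB");
        }
    }
}
```

A 40 bit physical limit therefore corresponds to the 1TB NUMA region, and the 48 bit virtual address space to the 256TB quoted for CentOS.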

Multiple JVMs within the same machine can share the same shared memory. Lawrey described a design he had put together for a potential client where they needed a Petabyte JVM (50 bits). You can’t actually map a Petabyte all in one go, so you have to cache the memory mappings to fit within the 256TB limit. In this case the prospect is looking to attach a Petabyte of Flash drive to a single machine (and have more than one of these).
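One way to picture caching the memory mappings is a small least-recently-used cache of mapped file chunks, so that only a bounded number of mappings are live at any time. The sketch below is an assumption about how such a cache could look, not Lawrey's design: `MappedChunkCache` and its parameters are hypothetical, it is not thread safe, and it assumes values never straddle a chunk boundary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

final class MappedChunkCache {
    private final FileChannel channel;
    private final long chunkSize;
    private final Map<Long, MappedByteBuffer> chunks;

    // The channel must be open for both reading and writing.
    MappedChunkCache(FileChannel channel, long chunkSize, int maxChunks) {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must fit in one mapping");
        }
        this.channel = channel;
        this.chunkSize = chunkSize;
        // An access-ordered LinkedHashMap drops the least recently used chunk
        // once more than maxChunks mappings are held.
        this.chunks = new LinkedHashMap<Long, MappedByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, MappedByteBuffer> eldest) {
                return size() > maxChunks;
            }
        };
    }

    private MappedByteBuffer chunkFor(long position) {
        long index = position / chunkSize;
        MappedByteBuffer chunk = chunks.get(index);
        if (chunk == null) {
            try {
                // Mapping READ_WRITE past the end of the file grows the file.
                chunk = channel.map(FileChannel.MapMode.READ_WRITE,
                        index * chunkSize, chunkSize);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            chunks.put(index, chunk);
        }
        return chunk;
    }

    long readLong(long position) {
        return chunkFor(position).getLong((int) (position % chunkSize));
    }

    void writeLong(long position, long value) {
        chunkFor(position).putLong((int) (position % chunkSize), value);
    }

    int mappedChunks() {
        return chunks.size();
    }
}
```

Note that dropping a `MappedByteBuffer` from the cache does not unmap it immediately; the JVM releases the mapping only when the buffer is garbage collected, so a production design would need explicit control over unmapping.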

The resulting design runs on a machine with 6TB of memory and 6 NUMA regions. Since, as previously noted, we want to restrict the heap, or at least the JVM, to a single NUMA region, you end up with 5 JVMs with a heap of up to 64GB each, and memory mapped caches for both the indexes and the raw data, plus a 6th NUMA region reserved just for the operating system and monitoring tasks.

Replication is essential - restoring such a large machine would take a considerable length of time, and since the machine is so complex the chance of a failure is quite high.

In the Q&A that followed an attendee asked about the trade-offs of scaling this way versus horizontal scaling. Lawrey explained that in this use case they need something close to random access and for that they don't want to be going across the network to get their data. “The interesting thing,” observed Lawrey, “is that you could even consider doing this in Java”.