Motivation:

All this started with the realisation that I could not afford a big enough computer. Audio processing requires huge amounts of memory. Audacity, an amazing free audio processor, manages this using a file backed storage system. This is a common approach for such issues where we store a huge amount of information and want random access to it. So, I wanted to develop a system for Sonic Field (my pet audio processing/synthesis project) which gave the same powerful disk based memory approach but in pure Java.

I got this to work late last year and discussed it (briefly) in the Java Advent Calendar (http://www.javaadvent.com/2014/12/a-serpentine-path-to-music.html) overview of Sonic Field. Disk based memory allows Sonic Field to process audio systems which require huge amounts of memory on my humble 16 gigabyte laptop. For example, this recent piece took over 50 gigabytes of memory to create:

Whilst this was a breakthrough, it was also inefficient. Memory intensive operations like mixing were a bottle neck in this system. Here I turn Java into a memory power house by implementing the same system but very much more efficiently. I suspect I am getting near the limit at which Java is no longer at a performance disadvantage to C++.

Last year I gave a high level overview of the method; this year I am deep diving to implementation of performance details. In so doing I will explain how we can remove the overhead of traditional Java memory access techniques and then expand the ideas for a more general approach to sharing and persisting large memory systems in JVM programming.

What Is Segmented Storage?

I admit there are a lot of concepts here. The first one to get our heads around is just how inefficient normal memory management of large memory systems is in Java. Let me be very clear indeed, I am not talking about garbage collection. Years of experience with both Java and C++ has taught me that neither collected nor explicit heap management is efficient or easy to get right. I am not discussing this at all. The issues with the JVM’s management of large memory systems are because of its bounds checking and object model. This is thrown into sharp focus when working with memory pools.

As latency or throughput performance become more critical than memory use there comes a point where one has to break out memory pools. Rather than a memory system which mixes everything together in one great glorious heap, we have pools of same sized objects. This requires more memory than a pure heap if the pool is not fully used or if the elements being mapped into pool chunks are smaller than the chunks themselves. However, pools are very fast indeed to manage.

In this post I am going to discuss pool backed segmented storage. Segmented storage is based on a pool but allows allocation of larger storage containers than a single pool chunk. The idea is that a storage container (say 1 gigabyte) can be made up of a selection of chunks (say 1 megabyte each). The segmented storage region is not necessarily made up of contiguous chunks. Indeed, this is its most important feature. It is made up of equal sized chunks from a backing pool but the chunks are scatters across virtual address space and might not even be in order. With this we have something with the request and release efficiency of a pool but but closer to the memory use efficiency of a heap and without any concerns over fragmentation.

Let us look at what a pool looks like first; then we can come back to segmentation.

A pool, in this discussion, consists of these parts:

A pool (not necessarily all in one data structure) of chunks of equal sized memory. One or more lists of used chunks. One list of free chunks.

To create a segmented memory allocation from a pool we have a loop:

Create a container (array or some such) of memory chunks. Call this the segment list for the allocation. Take a chunk of memory off the free list and add it to the segment list. See if the segment list contains equal or more total memory than required. If not repeat from 2.

Now we have an allocation segment list which has at least enough memory for the requirement. When we free this memory we simply put the chunks back on the free list. We can see from this that very quickly the chunks on the free list will no longer be in order and even if we were to sort them by address, they still would not be contiguous. Thus any allocation will have enough memory but not in any contiguous order.

Here is a worked example:

We will consider 10 chunks of 1 megabyte which we can call 1,2…10 which are initial in order.

Start: Free List: 1 2 3 4 5 6 7 8 9 10

Allocate a 2.5 megabyte store: Free List: 1 2 3 4 5 6 7 Allocated Store A: 8 9 10

Allocate a 6 megabyte store: Free List: 1 Allocated Store A: 8 9 10 Allocated Store A: 7 6 5 4 3 2

Free Allocated Store A: Free List: 10 9 8 1 Allocated Store A: 7 6 5 4 3 2

Allocate a 3.1 megabyte store: Free List: Allocated Store A: 7 6 5 4 3 2 Allocated Store C:10 9 8 1

One can note that such an approach is good for some situations for systems like 64bit C++ but its true power is for Java. In current JVMs the maximum addressable array or ByteBuffer contains only 2**31 elements segmented storage offers an efficient way of addressing much greater quantities of memory and backing that memory with memory mapped files if required.. Consider that we need 20 billion doubles, we cannot allocate them into an array or a ByteBuffer; but we can use segmented memory so that we can achieve our goal.

Using anonymous virtual memory in Java for very large memory objects can be inefficient. In use cases where we want to handle very much more memory than the RAM on the machine, we are better off using memory mapped files than just using anonymous swap space. This means that the JVM is not competing with other programs for swap space (to an extent) but what is more important is that garbage collected memory distributes object access which is particularly poor for anonymous virtual memory. We want to concentrate access to particular pages in the time domain so that we attract as few hard page faults as possible. I have discuss other concepts in this area here: https://jaxenter.com/high-speed-multi-threaded-virtual-memory-in-java-105629.html.

Given this. if we narrow our requirement to 20 billion doubles as a memory mapped file then we are not even going to be able to use magic in sun.misc.Unsafe (see later) to help. Without JNI the largest memory mapped file ‘chunk’ we can manage in Java is just 2^31 bytes. It is this requirement for memory mapped files and the inherent allocation/freeing efficiency of segmented storage approaches that lead to me using it for Sonic Field (where I often need to manage over 100G of memory on a 16G machine).

Drilling Into The Implementation:

We now have a clear set of ideas to implement. We need mapped byte buffers. Each buffer is a chunk in a pool for free chunks. When we want to allocate a storage container we need to take some of these mapped byte buffer chunks out of the free pool and into our container. When the container is freed we return our chunks to the free pool. Simple, efficient and clean.

Also, one important thing is that the mapped byte buffers are actually java.nio.DirectByteBuffer objects with file back memory. We will use this concept later; for now we can just think of them as ByteBuffers.

On Sonic Field (which is the code for which I developed the technique of segmented storage using mapped byte buffers. – see https://github.com/nerds-central/SonicFieldRepo). In that code base I have defined the following:

private static final long CHUNK_LEN = 1024 * 1024;

To get the sample we can consider each chunk as a CHUNK_LEN ByteBuffer. The code for accessing an element from an allocated memory chunk was before my speedup work:

private static final long CHUNK_SHIFT = 20; private static final long CHUNK_MASK = CHUNK_LEN - 1; ... public final double getSample(int index) { long bytePos = index << 3; long pos = bytePos & CHUNK_MASK; long bufPos = (bytePos - pos) >> CHUNK_SHIFT; return chunks[(int) bufPos].getDouble((int) pos); }

So the allocated segment list in this case is an array of ByteBuffers:

Find the index into the list by dividing the index required by the chunk size (use shift for efficiency). Find the index into the found chunk by taking the modulus (use binary and for efficiency). Look up the actual value using the getDouble intrinsic method (looks like a method but the compiler knows about it an elides the method call).

All this looks fine, but it does not work out all that well because there are some fundamental issues with the way Java lays out objects in memory which prevent segmented access being properly optimised. On the face of it, accessing a segmented memory area should be a few very fast shift and logic operations and an indirect lookup but that does not work out so for Java; all the problems happen in this line:

return chunks[(int) bufPos].getDouble((int) pos);

This is what this line has to do:

Look up the chunks object from its handle. Bounds check. Get the data from its data area. From that object handle for the ByteBuffer look up actual object. Look up its length dynamically (it can change so this is a safe point and an object field lookup). Bounds check. Retrieve the data.

Really? Yes, the JVM does all of that which is quite painful. Not only is it a lot of instructions it also requires jumping around in memory will all the consequent cache line flushing and memory pauses.

How can we improve on this? Remember that our ByteBuffers are DirectByteBuffers, this means that their data is not stored on the Java heap; it is located in the same virtual address location throughout the object lifetime. I bet you have guessed that the key here is using sun.misc.Unsafe. Yes, it is; we can bypass all this object lookup by using offheap memory. To do so means bending a few Java and JVM rules but the dividends are worth it.

From now on everything I discuss is relevant to Java 1.8 x86_64. Future versions might break this approach as it is not standards compliant.

Consider this:

private static class ByteBufferWrapper { public long address; public ByteBuffer buffer; public ByteBufferWrapper(ByteBuffer b) throws NoSuchMethodException, SecurityException, IllegalAccessException, IllegalArgumentException, InvocationTargetException { Method addM = b.getClass().getMethod("address"); addM.setAccessible(true); address = (long) addM.invoke(b); buffer = b; } }

What we are doing is getting the address in memory of the data stored in a DirectByteBuffer. To do this I use reflection as DirectByteBuffer is package private. DirectByteBuffer has a method on it called address() which returns a long. On x86_64 the size of an address (64 bits) is the same as long. Whilst the value of long is signed, we can just used long as binary data and ignore its numerical value. So the long returned from address() is actually the virtual address of the start of the buffer’s storage area.

Unlike ‘normal’ JVM storage (e.g. arrays) the storage of a DirectByteBuffer is ‘off heap’. It is virtual memory just like any other, but it is not owned by the garbage collector and cannot be moved by the garbage collector; this makes a huge difference to how quickly and by which techniques we can access it. Remember, the address returned by address() never changes for a given DirectByteBuffer object; consequently, we can use this address ‘forever’ and avoid object lookups.

Introducing sun.misc.Unsafe:

Whilst it would be lovely to believe that calling getDouble(int) on a DirectByteBuffer is super efficient, it does not appear that it is so. The bounds check slows it down despite the method being intrinsic [a magic function which the JVM JIT compiler knows about and can replace with machine code rather than compiling in a normal fashion]. However, with our address we can now use sun.misc.Unsafe to access the storage.

Rather than:

b.getDouble(pos);

We can:

unsafe.getDouble(address+pos);

The unsafe version is also intrinsic and compiles down to pretty much the same machine code as a C compiler (like gcc) would produce. In other words, it is as fast as it can get; there are no object dereferences or bounds checks, it just loads a double from an address.

The store equivalent is:

unsafe.putDouble(address+pos,value);

What is this ‘unsafe’ thing? We get that with another reflection hack around:

private static Unsafe getUnsafe() { try { Field f = Unsafe.class.getDeclaredField("theUnsafe"); f.setAccessible(true); return (Unsafe) f.get(null); } catch (Exception e) { throw new RuntimeException(e); } } private static final Unsafe unsafe = getUnsafe();

It is important to load the unsafe singleton into a final static field. This allows the compiler to assume the object reference never changes and so the very most optimal code is generated.

Now we have very fast acquisition of data from a DirectByteBuffer but we have a segmented storage model so we need to get the address for the correct byte buffer very quickly. If we store these in an array we risk the array bounds check and the array object dereference steps. We can get rid of these by further use of unsafe and offheap memory.

private final long chunkIndex; ... try { // Allocate the memory for the index - final so do it here long size = (1 + ((l << 3) >> CHUNK_SHIFT)) << 3; allocked = chunkIndex = unsafe.allocateMemory(size); if (allocked == 0) { throw new RuntimeException("Out of memory allocating " + size); } makeMap(l << 3l); } catch (Exception e) { throw new RuntimeException(e); }

Again we use the ‘final’ trick to let the compiler make the very best optimisations. The final here is a long which is just an address. We can directly allocate offheap memory using unsafe. The imaginatively called function to do this is allocateMemory(long). This returns a long which we store in chunkIndex. allocateMemory(long) actually allocates bytes but we want to store what is effectively an array of longs (addresses); this is what the bit of bit twiddling logic is doing when it computes size.

Now that we have a chunk of offheap memory large enough to store the addresses for the DirectByteBuffer segments for our storage container we can put the addresses in and retrieve them using unsafe.

During storage construction we:

// now we have the chunks we get the address of the underlying memory // of each and place that in the off heap lookup so we no longer // reference them via objects but purely as raw memory long offSet = 0; for (ByteBufferWrapper chunk : chunks) { unsafe.putAddress(chunkIndex + offSet, chunk.address); offSet += 8; }

Which means our new code for getting and setting data can be very simple indeed:

private long getAddress(long index) { long bytePos = index << 3; long pos = bytePos & CHUNK_MASK; long bufPos = (bytePos - pos) >> CHUNK_SHIFT; long address = chunkIndex + (bufPos << 3); return unsafe.getAddress(address) + pos; } /* (non-Javadoc) * @see com.nerdscentral.audio.SFSignal#getSample(int) */ @Override public final double getSample(int index) { return unsafe.getDouble(getAddress(index)); } /* (non-Javadoc) * @see com.nerdscentral.audio.SFSignal#setSample(int, double) */ @Override public final double setSample(int index, double value) { unsafe.putDouble(getAddress(index), value); return value; }

The wonderful thing about this is the complete lack of object manipulation or bounds checking. OK, if someone asks for at sample which is out of bounds, the JVM will crash. That might not be a good thing. This sort of programming is very alien to many Java coders and we need to take its dangers very seriously. However, it is really quite fast compared to the original.

In my experiments, I have found that the default JVM inline settings are a little too conservative to get the best out of this approach. I have seen large speedups (up to a two times performance improvement) with the following command line tweaks.

-XX:MaxInlineSize=128 -XX:InlineSmallCode=1024

These just let the JVM make a better job of utilising the extra performance available through not being forced to perform bounds checks and object lookups. In general, I would not advise fiddling with JVM inline settings, but in this case I have real benchmark experience to show a benefit for complex offheap access work.

Testing – How Much Faster Is It?

I wrote the following piece of Jython to test:

import math from java.lang import System sf.SetSampleRate(192000) count=1000 ncount=100 def test(): t1=System.nanoTime() for i in range(1,ncount): signal=sf.Mix(+signal1,+signal2) signal=sf.Realise(signal) -signal t2=System.nanoTime() d=(t2-t1)/1000000.0 print "Done: " + str(d) return d signal1=sf.Realise(sf.WhiteNoise(count)) signal2=sf.Realise(sf.WhiteNoise(count)) print "WARM" for i in range(1,100): test() print "Real" total=0.0 for i in range(1,10): total+=test() print "Mean " + str(total/9.0) -signal1 -signal2

What this does is create some stored doubles and then create new ones and reading from the old into the new over and over. Remember that we are using segmented storage backed by a pool; consequently, we only truly allocate that storage initially and after that the ‘chunks’ are just recycled. This architecture means that our execution time is dominated by executing getSample and setSample, not allocation or any other paraphernalia.

How much faster is our off heap system? On my Macbook Pro Retina I7 machine with Java 1.8.0 I got these figures for the ‘Real’ (i.e. post warm up) operations (smaller is better):

For the unsafe memory model:

Done: 187.124

Done: 175.007

Done: 181.124

Done: 175.384

Done: 180.497

Done: 180.688

Done: 183.309

Done: 178.901

Done: 181.746

Mean 180.42

For the traditional memory model:

Done: 303.008

Done: 328.763

Done: 299.701

Done: 315.083

Done: 306.809

Done: 302.515

Done: 304.606

Done: 300.291

Done: 342.436

Mean 311.468

So our unsafe memory model is 1.73 times faster than the traditional Java approach!

Why Is It 1.73 Times Faster

We can see why.

If we look back at the list of things required to just read a double from the traditional DirectByteBuffer and array approach:

Look up the chunks object from its handle. Bounds check. Get the data from its data area. From that object handle for the ByteBuffer look up actual object. Look up its length dynamically (it can change so this is a safe point and an object field lookup). Bounds check. Retrieve the data.

With the new approach we have:

Retrieve the address of the chunk Retrieve the data from that chunk

Not only are there very many fewer machine instructions being issued, the memory access is much more localised which almost certainly enhances the cache usage during data processing.

The source code for the fast version of the storage system as described here is: https://github.com/nerds-central/SonicFieldRepo/blob/cf6a1b67fb8dd07126b0b1274978bd850ba76931/SonicField/src/com/nerdscentral/audio/SFData.java

I am hoping that you, the reader, have spotted one big problem I have not addressed! My code is allocating offheap memory when ever it creates a segmented storage container. However, this memory will not be freed by the garbage collector. We could try to free with with finalizers but there are many reasons why this is not such a great idea.

My solution is to use explicit resource management. Sonic Field uses try with resources to manage its memory via reference counts. When the reference count for a particular storage container hits zero, the container is freed which places it storage chunks back in the free list and uses unsafe to free the address lookup memory.

Other Uses And New Ideas

Nearly a year ago now I posted ‘Java Power Features To Stay Relevant‘; I guess it was a controversial post and not everyone I have spoken to about my ideas finds them agreeable (to say the least). Nevertheless, I still believe the JVM has a challenge on its hands. The complex multi-threaded model of Java and the JVM its self is not necessarily the huge benefit people think it should be in the world of multi-core computing. There is still a lot of interest in using multiple small processes which communicate via shared memory or sockets. With the slow but inevitable increase in RDMA based networking, these approaches will seem more and more natural to people.

Java and JVM languages seem to have managed to make themselves uniquely unable to take advantage of these shifts in thinking. By developing a ‘walled garden’ approach the JVM has become very efficient at working internally but not great at working with other processes. This is a performance issue and also a stability issue; no matter how hard we try, there is always a chance the JVM will crash or enter an unstable state (OutOfMemoryError anyone?). In production systems this often necessitates several small JVM instances working together so if one goes away the production system stays up. Memory mapped files are a great way to help with persisting data even when a JVM process goes away.

All these issues lead me to another reason I am very interested in efficient offheap, mapped file architectures for the JVM. This technology sits at the overlap of shared memory and mapped file technologies which are now driving forces behind high speed, stable production environments. Whilst the system I discussed here is for a single JVM, using offheap atomics (see here: http://nerds-central.blogspot.co.uk/2015/05/synchronising-sunmiscunsafe-with-c.html) we can put the free list offheap and share it between processes. Shared memory queues can then also give interprocess arbitration of segmented storage allocation and utilisation. Suddenly, the segmented storage model becomes an efficient way for multiple processes, both JVM and other technologies (Python, C++ etc) to share large, file persisted memory systems.

Right now there are some issues. The biggest of which is that whilst Java supports shared memory via memory mapped files it does not support that via pure shared memory. File mapping is an advantage if we are interested in large areas of memory (as in this example) but it is an unnecessary performance issue for small areas of rapidly changing memory which do not require persistence. I would like to see a true shared memory library in the JDK; this is unlikely to happen any time soon (see my point about a walled garden). JNI offers a route but then JNI has many disadvantages we well. Maybe project Panama will give the required functionality and finally break down the JVM’s walls.

To bring all this together the next trick I want to try is mapping files to a ramdisk (there is an interesting write up on this here: http://www.jamescoyle.net/knowledge/951-the-difference-between-a-tmpfs-and-ramfs-ram-disk). This should be quite easy on Linux and would let us place interprocess queues in a pure RAM shared memory areas without using JNI. With this piece done, a pure Java high speed interprocess shared memory model would be insight. Maybe that will have to wait for next year’s calendar?

Author: Alexander Turner Life long technologist and creative.

Like this: Like Loading...