Let’s walk through the different parts of that NMT output and see which of them are tunable.

Start with something familiar:

```
-                    GC (reserved=10991KB, committed=10991KB)
                        (malloc=10383KB #129)
                        (mmap: reserved=608KB, committed=608KB)
```

This accounts for GC native structures. The log says GC malloc-ed around 10 MB and mmap-ed around 0.6 MB. One should expect this to grow with increasing heap size, if those structures describe something about the heap: for example, marking bitmaps, card tables, remembered sets, etc. Indeed, it does:

```
# Xms/Xmx = 512 MB
-                    GC (reserved=29543KB, committed=29543KB)
                        (malloc=10383KB #129)
                        (mmap: reserved=19160KB, committed=19160KB)

# Xms/Xmx = 4 GB
-                    GC (reserved=163627KB, committed=163627KB)
                        (malloc=10383KB #129)
                        (mmap: reserved=153244KB, committed=153244KB)

# Xms/Xmx = 16 GB
-                    GC (reserved=623339KB, committed=623339KB)
                        (malloc=10383KB #129)
                        (mmap: reserved=612956KB, committed=612956KB)
```
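As a side note, snapshots like these are obtained with stock tooling: enable NMT at JVM startup and query the live process with `jcmd`. A minimal sketch, where `Main` is a placeholder for any simple workload:

```
# Enable NMT at startup ("summary" is cheaper than "detail"),
# pin the heap size, and put the process in the background
java -XX:NativeMemoryTracking=summary -Xms4g -Xmx4g Main &

# Query the running JVM; $! holds the PID of the backgrounded java
jcmd $! VM.native_memory summary
```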

Back to the numbers: quite probably, the malloc-ed parts are the C heap allocations for parallel GC task queues, and the mmap-ed regions are the bitmaps. Not surprisingly, they grow with the heap size, and take around 3-4% of the configured heap size (in the 16 GB case above, 623339 KB is roughly 3.7% of the heap). This raises deployment questions, like in the original question: configuring the heap size to take all available physical memory will blow past the memory limits, possibly causing swapping, possibly invoking the OOM killer.
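As a back-of-the-envelope sketch, assuming the 3-4% figure holds for your collector: under a hypothetical 4 GB container limit, the heap should be set noticeably below the limit, leaving headroom for the GC structures and every other category in the NMT output. The split below is an illustrative assumption, to be validated against your own NMT readings:

```
# Illustrative split for a 4 GB limit: a 3 GB heap leaves headroom
# for GC structures (~4% of heap), metaspace, thread stacks, code
# cache, and direct buffers
java -XX:NativeMemoryTracking=summary -Xms3g -Xmx3g Main
```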

But that overhead also depends on the GC in use, because different GCs choose to represent the Java heap differently. For example, switching to the most lightweight GC in OpenJDK, `-XX:+UseSerialGC`, yields this dramatic change in our test case:

```
-Total: reserved=1374184KB, committed=75216KB
+Total: reserved=1336541KB, committed=37573KB

--                 Class (reserved=1066093KB, committed=14189KB)
+-                 Class (reserved=1056877KB, committed=4973KB)
                         (classes #391)
-                        (malloc=9325KB #148)
+                        (malloc=109KB #127)
                         (mmap: reserved=1056768KB, committed=4864KB)

--                Thread (reserved=19614KB, committed=19614KB)
-                        (thread #19)
-                        (stack: reserved=19532KB, committed=19532KB)
-                        (malloc=59KB #105)
-                        (arena=22KB #38)
+-                Thread (reserved=11357KB, committed=11357KB)
+                        (thread #11)
+                        (stack: reserved=11308KB, committed=11308KB)
+                        (malloc=36KB #57)
+                        (arena=13KB #22)

--                    GC (reserved=10991KB, committed=10991KB)
-                        (malloc=10383KB #129)
-                        (mmap: reserved=608KB, committed=608KB)
+-                    GC (reserved=67KB, committed=67KB)
+                        (malloc=7KB #79)
+                        (mmap: reserved=60KB, committed=60KB)

--              Internal (reserved=9444KB, committed=9444KB)
-                        (malloc=9412KB #1373)
+-              Internal (reserved=204KB, committed=204KB)
+                        (malloc=172KB #1229)
                         (mmap: reserved=32KB, committed=32KB)
```

Note this improved both "GC" parts, because less metadata is allocated, and the "Thread" part, because fewer GC threads are needed when switching from Parallel (the default) to Serial GC. This means we can get a partial improvement by tuning down the number of GC threads for Parallel, G1, CMS, Shenandoah, etc., as sketched below. We’ll get to the thread stacks later. Note that changing the GC or the number of GC threads has performance implications: by changing them, you are selecting another point in the space-time tradeoff.
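The exact flags differ per collector; as a sketch, tuning down GC threads might look like this (the defaults are chosen ergonomically from the CPU count, and explicit values override them; the values here are illustrative, not recommendations):

```
# Parallel GC: cap the stop-the-world GC worker threads
java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xms16m -Xmx16m Main

# G1 additionally has concurrent marking threads, capped separately
java -XX:+UseG1GC -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1 -Xms16m -Xmx16m Main
```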

The switch to Serial GC also improved the "Class" parts, because the metadata representation is slightly different. Can we squeeze out something more from "Class"? Let us try Class Data Sharing (CDS), enabled with `-Xshare:on`:

```
-Total: reserved=1336279KB, committed=37311KB
+Total: reserved=1372715KB, committed=36763KB

--                Symbol (reserved=1356KB, committed=1356KB)
-                        (malloc=900KB #65)
-                        (arena=456KB #1)
+-                Symbol (reserved=503KB, committed=503KB)
+                        (malloc=502KB #12)
+                        (arena=1KB #1)
```

There we go, saved another 0.5 MB in internal symbol tables by loading the pre-parsed representation from the shared archive.
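For completeness, a sketch of the CDS workflow: the shared archive has to exist before `-Xshare:on` can load from it, and `java -Xshare:dump` regenerates the default archive for the base classes. `Main` is still a placeholder:

```
# Regenerate the default CDS archive (may need write access to the
# JDK installation directory)
java -Xshare:dump

# Require sharing; the JVM refuses to start if the archive is unusable
java -Xshare:on -XX:+UseSerialGC -Xms16m -Xmx16m Main
```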

Now let’s focus on threads. The log says:

```
-                Thread (reserved=11357KB, committed=11357KB)
                        (thread #11)
                        (stack: reserved=11308KB, committed=11308KB)
                        (malloc=36KB #57)
                        (arena=13KB #22)
```

Looking into this, you can see that most of the space taken by threads is taken by the thread stacks. You can try to trim the stack size down from the default (which appears to be 1 MB in this example) to something less with `-Xss`. Note that this yields a greater risk of `StackOverflowError`-s, so if you do change this option, be sure to test all possible configurations of your software to look out for ill effects. Adventurously setting the stack size to 256 KB with `-Xss256k` yields:

```
-Total: reserved=1372715KB, committed=36763KB
+Total: reserved=1368842KB, committed=32890KB

--                Thread (reserved=11357KB, committed=11357KB)
+-                Thread (reserved=7517KB, committed=7517KB)
                         (thread #11)
-                        (stack: reserved=11308KB, committed=11308KB)
+                        (stack: reserved=7468KB, committed=7468KB)
                         (malloc=36KB #57)
                         (arena=13KB #22)
```

Not bad, another 4 MB is gone. Of course, the improvement would be more drastic with more application threads, and in a real application thread stacks will quite probably be the second largest consumer of memory after the Java heap.
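Before committing to a smaller stack, it is worth running the entire test suite under it to flush out `StackOverflowError`-prone paths. One low-effort sketch, assuming a Gradle build (`JAVA_TOOL_OPTIONS` is picked up by HotSpot at startup, so every JVM the build spawns sees the reduced stack):

```
# All JVMs launched below inherit the reduced stack size
JAVA_TOOL_OPTIONS="-Xss256k" ./gradlew test
```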

Continuing with threads: the JIT compiler itself also has threads. This partially explains why, even though we set the stack size to 256 KB, the data above shows an average stack size of 7517 / 11 ≈ 683 KB: compiler and VM threads quite probably run with their own stack sizes, configured separately from `-Xss`. Trimming the number of compiler threads down with `-XX:CICompilerCount=1` and setting `-XX:-TieredCompilation` to leave only the final compilation tier yields:

```
-Total: reserved=1368612KB, committed=32660KB
+Total: reserved=1165843KB, committed=29571KB

--                Thread (reserved=7517KB, committed=7517KB)
-                        (thread #11)
-                        (stack: reserved=7468KB, committed=7468KB)
-                        (malloc=36KB #57)
-                        (arena=13KB #22)
+-                Thread (reserved=4419KB, committed=4419KB)
+                        (thread #8)
+                        (stack: reserved=4384KB, committed=4384KB)
+                        (malloc=26KB #42)
+                        (arena=9KB #16)
```

Not bad, three threads are gone, and their stacks are gone too! Again, this has performance implications: fewer compiler threads mean slower warmup.
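Putting the whole walkthrough together, the final configuration looks something like the sketch below; `Main` is a placeholder for the actual workload, and every flag is one discussed above:

```
java -XX:NativeMemoryTracking=summary \
     -Xms16m -Xmx16m \
     -XX:+UseSerialGC \
     -Xshare:on \
     -Xss256k \
     -XX:CICompilerCount=1 -XX:-TieredCompilation \
     Main
```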

Trimming down the Java heap size, selecting an appropriate GC, trimming down the number of VM threads, and trimming down the Java thread stack sizes and thread counts are the general techniques for reducing VM footprint in memory-constrained scenarios. With these, we have trimmed down our 16 MB Java heap test case to: