
Out of memory: Kill process or sacrifice child

It is 6 AM. I am awake, summarizing the sequence of events leading to my way-too-early wake-up call. As such stories start, my phone alarm went off. Sleepy and grumpy, I checked the phone to see whether I had really been crazy enough to set the alarm for 5 AM. No, it was our monitoring system indicating that one of the Plumbr services had gone down.

As a seasoned veteran in the domain, I made the first correct step towards a solution by turning on the espresso machine. With a cup of coffee I was equipped to tackle the problem. The first suspect, the application itself, seemed to have behaved completely normally before the crash. No errors, no warning signs, no trace of anything suspicious in the application logs.

The monitoring we have in place had noticed the death of the process and had already restarted the crashed service. But as I already had caffeine in my bloodstream, I started to gather more evidence. 30 minutes later I found myself staring at the following in /var/log/kern.log:

Jun 4 07:41:59 plumbr kernel: [70667120.897649] Out of memory: Kill process 29957 (java) score 366 or sacrifice child
Jun 4 07:41:59 plumbr kernel: [70667120.897701] Killed process 29957 (java) total-vm:2532680kB, anon-rss:1416508kB, file-rss:0kB
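To make the two log lines easier to read, here is a minimal Java sketch that pulls the individual fields out of them. The class name OomKillLogParser and both regular expressions are illustrative assumptions written against the exact message format shown above; other kernel versions word these messages differently, so the patterns would need adjusting there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the original investigation) for OOM killer log lines. */
final class OomKillLogParser {

    // "Out of memory: Kill process <pid> (<name>) score <n> or sacrifice child"
    private static final Pattern SELECTED = Pattern.compile(
            "Out of memory: Kill process (\\d+) \\(([^)]+)\\) score (\\d+) or sacrifice child");

    // "Killed process <pid> (<name>) total-vm:<n>kB, anon-rss:<n>kB, file-rss:<n>kB"
    private static final Pattern KILLED = Pattern.compile(
            "Killed process (\\d+) \\(([^)]+)\\) total-vm:(\\d+)kB, anon-rss:(\\d+)kB, file-rss:(\\d+)kB");

    /** Returns a one-line summary, or null if the line is not one of the two OOM killer messages. */
    static String describe(String line) {
        Matcher m = SELECTED.matcher(line);
        if (m.find()) {
            return "selected pid=" + m.group(1) + " name=" + m.group(2) + " score=" + m.group(3);
        }
        m = KILLED.matcher(line);
        if (m.find()) {
            // The kernel reports sizes in kB; integer division converts them to whole MiB.
            long totalVmMiB = Long.parseLong(m.group(3)) / 1024;
            long anonRssMiB = Long.parseLong(m.group(4)) / 1024;
            return "killed pid=" + m.group(1) + " name=" + m.group(2)
                    + " total-vm=" + totalVmMiB + "MiB anon-rss=" + anonRssMiB + "MiB";
        }
        return null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Jun 4 07:41:59 plumbr kernel: [70667120.897649] Out of memory: "
                + "Kill process 29957 (java) score 366 or sacrifice child",
            "Jun 4 07:41:59 plumbr kernel: [70667120.897701] Killed process 29957 (java) "
                + "total-vm:2532680kB, anon-rss:1416508kB, file-rss:0kB"
        };
        for (String line : lines) {
            System.out.println(describe(line));
        }
    }
}
```

Run against the two lines above, this reports that process 29957 (java) was selected with score 366 and killed while holding roughly 1383 MiB of anonymous resident memory out of about 2473 MiB of virtual memory.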

Apparently we had become victims of Linux kernel internals. As you all know, Linux is built with a bunch of unholy creatures (called ‘daemons’). Those daemons are shepherded by several kernel jobs, one of which seems to be an especially sinister entity. Apparently all modern Linux kernels have a built-in mechanism called the “Out Of Memory killer”, which can annihilate your processes under extremely low memory conditions. When such a condition is detected, the killer is activated and picks a process to kill. The target is picked using a set of heuristics that score all processes; the process with the worst score, which for the OOM killer means the highest one, is selected to be killed.
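The scoring can be observed on a live system: the kernel exposes each process's current score in /proc/<pid>/oom_score, and a higher number means a more likely victim. The following is a rough sketch, assuming a Linux /proc filesystem; the class and method names (OomScoreLister, Candidate, rank) are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists processes by their kernel-assigned OOM score. */
final class OomScoreLister {

    /** One process together with its current OOM score. */
    static final class Candidate {
        final int pid;
        final String name;
        final int score;

        Candidate(int pid, String name, int score) {
            this.pid = pid;
            this.name = name;
            this.score = score;
        }

        @Override
        public String toString() {
            return score + "\t" + pid + "\t" + name;
        }
    }

    /** Reads the first line of a small file, or returns null if it cannot be read. */
    static String readFirstLine(Path file) {
        try {
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            return lines.isEmpty() ? null : lines.get(0).trim();
        } catch (IOException e) {
            return null; // the process exited in the meantime, or access was denied
        }
    }

    /** Returns all processes found under procRoot, highest OOM score first. */
    static List<Candidate> rank(Path procRoot) {
        List<Candidate> candidates = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(procRoot)) {
            for (Path dir : dirs) {
                String pid = dir.getFileName().toString();
                if (!pid.matches("\\d{1,9}")) {
                    continue; // only numeric directory names are processes
                }
                String score = readFirstLine(dir.resolve("oom_score"));
                if (score == null || !score.matches("\\d{1,9}")) {
                    continue;
                }
                String name = readFirstLine(dir.resolve("comm"));
                candidates.add(new Candidate(
                        Integer.parseInt(pid), name == null ? "?" : name, Integer.parseInt(score)));
            }
        } catch (IOException e) {
            // No readable /proc (for example on a non-Linux system): return what we have.
        }
        candidates.sort((a, b) -> Integer.compare(b.score, a.score));
        return candidates;
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(Paths.get("/proc"));
        // Print the five processes the OOM killer would currently consider first.
        for (Candidate c : ranked.subList(0, Math.min(5, ranked.size()))) {
            System.out.println(c);
        }
    }
}
```

On a memory-hungry JVM host, the java process would be expected near the top of this list, which matches what the kernel log reported.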

Understanding the “Out Of Memory killer”

By default, Linux kernels allow processes to request more memory than is currently available in the system. This makes all the sense in the world, considering that most processes never actually use all of the memory they allocate. The easiest comparison is with cable operators. They sell every consumer a 100Mbit download promise, far exceeding the actual bandwidth present in their network. The bet is again that the users will not all use their allocated download limit simultaneously. Thus one 10Gbit link can successfully serve way more than the 100 users our simple math (10Gbit divided by 100Mbit) would permit.
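How much has been promised versus how much actually exists can be read from /proc/meminfo, where Committed_AS is the kernel's estimate of the memory currently promised to processes. The sketch below assumes a Linux system; the class MeminfoCommit and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper comparing promised memory with RAM plus swap. */
final class MeminfoCommit {

    /** Parses "Key:   12345 kB" style lines into a map of key to numeric value. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> values = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a key/value line
            }
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            try {
                values.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Ignore lines whose value is not a plain number.
            }
        }
        return values;
    }

    /** True when more memory has been promised than RAM plus swap could ever back. */
    static boolean promisedMoreThanExists(Map<String, Long> kb) {
        Long committed = kb.get("Committed_AS");
        Long ram = kb.get("MemTotal");
        Long swap = kb.get("SwapTotal");
        if (committed == null || ram == null) {
            return false; // not enough information to decide
        }
        return committed > ram + (swap == null ? 0L : swap);
    }

    public static void main(String[] args) {
        try {
            Map<String, Long> kb = parse(
                    Files.readAllLines(Paths.get("/proc/meminfo"), StandardCharsets.UTF_8));
            System.out.println("MemTotal     (kB): " + kb.get("MemTotal"));
            System.out.println("SwapTotal    (kB): " + kb.get("SwapTotal"));
            System.out.println("Committed_AS (kB): " + kb.get("Committed_AS"));
            System.out.println("Promised more than exists: " + promisedMoreThanExists(kb));
        } catch (IOException e) {
            System.out.println("/proc/meminfo is not available on this system");
        }
    }
}
```

A result of true is not an error in itself. It is the cable operator's bet in memory form, and it only becomes a problem when processes actually start touching what they were promised.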

A side effect of this approach shows up when one of your programs is on its way to depleting the system’s memory. This can lead to extremely low memory conditions, where no pages can be allocated to processes. You might have faced such a situation, where not even the root account can kill the offending task. To prevent such situations, the killer activates and identifies the process to be killed.

You can read more about fine-tuning the behaviour of the “Out of memory killer” in this article in the Red Hat documentation.


What was triggering the Out of memory killer?

Now that we have the context, it is still unclear what triggered the “killer” and woke me up at 5 AM. Some more investigation revealed that:

The configuration in /proc/sys/vm/overcommit_memory allowed overcommitting memory – it was set to 1, indicating that every malloc() should succeed.

The application was running on an EC2 m1.small instance. EC2 instances have swapping disabled by default.
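Both findings can also be checked from code. The sketch below assumes a Linux system: it reads /proc/sys/vm/overcommit_memory and looks at /proc/swaps, which contains one header line followed by one line per active swap area. The class name OvercommitReport is hypothetical, and the mode descriptions are a paraphrase of the kernel documentation for values 0, 1 and 2.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper reporting the overcommit mode and whether any swap is active. */
final class OvercommitReport {

    /** Translates the raw content of overcommit_memory into a human-readable description. */
    static String describeMode(String raw) {
        String mode = raw.trim();
        switch (mode) {
            case "0": return "heuristic overcommit (kernel default)";
            case "1": return "always overcommit";
            case "2": return "strict accounting, no overcommit beyond the commit limit";
            default:  return "unknown mode: " + mode;
        }
    }

    /** Returns true if the /proc/swaps content lists at least one swap area after the header. */
    static boolean hasActiveSwap(List<String> procSwapsLines) {
        int nonBlank = 0;
        for (String line : procSwapsLines) {
            if (!line.trim().isEmpty()) {
                nonBlank++;
            }
        }
        return nonBlank > 1; // the first non-blank line is the column header
    }

    /** Reads all lines of a file, or returns an empty list if it cannot be read. */
    static List<String> readLines(String path) {
        try {
            return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        List<String> mode = readLines("/proc/sys/vm/overcommit_memory");
        System.out.println("overcommit_memory: "
                + (mode.isEmpty() ? "unavailable" : describeMode(mode.get(0))));
        System.out.println("swap active: " + hasActiveSwap(readLines("/proc/swaps")));
    }
}
```

On a machine configured like the one described here, this would be expected to report "always overcommit" together with no active swap, which is exactly the combination that left the OOM killer as the only way out.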

Those two facts, combined with a sudden spike in traffic to our services, resulted in the application requesting more and more memory to support the extra users. The overcommit configuration allowed this greedy process to allocate more and more memory, eventually triggering the “Out of memory killer”, which did exactly what it is meant to do: it killed our application and woke me up in the middle of the night.

Example

When I described the behaviour to engineers, one of them was interested enough to create a small test case reproducing the error. When you compile and launch the following Java code snippet on Linux (I used the latest stable Ubuntu version):

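package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class OOM {
    public static void main(String[] args) {
        // Keep every array reachable so the garbage collector cannot reclaim it.
        List<int[]> l = new ArrayList<>();
        for (int i = 10000; i < 100000; i++) {
            try {
                // Each array holds 100 million ints, roughly 400 MB.
                l.add(new int[100_000_000]);
            } catch (Throwable t) {
                // Print the error (for example OutOfMemoryError) and keep allocating.
                t.printStackTrace();
            }
        }
    }
}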

then you will face the very same Out of memory: Kill process <PID> (java) score <SCORE> or sacrifice child message.

Note that you might need to tweak the swapfile and heap sizes. In my test case I used a 2 GB heap, specified via -Xmx2g, and the following configuration for swap:

swapoff -a
dd if=/dev/zero of=swapfile bs=1024 count=655360
mkswap swapfile
swapon swapfile

Solution?

There are several ways to handle such a situation. In our case, we just migrated the system to an instance with more memory. I also considered allowing swapping, but after consulting with engineering I was reminded that garbage collection on the JVM does not cope well with swapping, so this option was off the table.

Other possibilities would involve fine-tuning the OOM killer, scaling the load horizontally across several small instances or reducing the memory requirements of the application.

If you found the study interesting, follow Plumbr on Twitter or RSS; we keep publishing our insights about Java internals.