Reading /proc/pid/cmdline can hang forever

Back in August, I wrote that fork() can fail and it made a pretty big splash. Continuing with that general theme, I'll tell you about something else that can fail that you probably won't expect. It has a failure mode that you're probably not equipped to handle.

Here's the short version: ps, pgrep, top, and all of those things can hang. I'm talking state D, uninterruptible wait, just like when you "cat /something/on/a/nfs/mount" with the server down. Not even ^C will get you out of this one. It can't be killed.

Do your system maintenance scripts and one-off tools expect things like ps and pgrep to hang? What about pkill? How about stuff that randomly grovels around in /proc looking for bits of data? I bet they don't.

Now that I've told you what happens, let's work backwards into why.

Linux systems have this wonderful pseudo filesystem called /proc. You can get all sorts of neat data about what's running. /proc/pid/cmdline will tell you more or less what the argv looks like for a given process. /proc/pid/exe is probably a link to the actual executable that's running (even if it's been deleted). You get the idea.

Recent Linux systems also have these things called cgroups. You might call them containers. You can use them to enforce limits on certain resources that are smaller than the whole machine provides. A program might be limited to "only" 2 GB of memory instead of the whole 4 GB machine, for instance. You might put caps on how much CPU time it gets, which cores it runs on, or how much disk bandwidth it can consume.

The most common use of cgroups I've seen so far is the "memcg" for memory limits. Someone will come along and create a container, set a limit on it, then plop some processes into it. If they grow too big, the kernel's OOM killer will fire inside the container, something will die, and life goes on.

Assuming you don't run afoul of any kernel bugs in the OOM killer itself, you're fine. No, the problems start when you disable the kernel's OOM killer and try to act on your own.

Some process managers have decided they would rather be the ones who keep tabs on memory size and do the killing themselves. To do this, they run a task in a container, set a limit, and disable the kernel's OOM killer in that container. Then they wait for notifications of it getting too big and kill it themselves.

This is all well and good, but have you ever seen what happens when the process manager fails? Life gets ... interesting.

Now you have a container that's at its limit, and the kernel is saying "yup, you've gone and done it now", but it's not killing anything to enforce that limit, since you told it not to. All it will do now is stop accesses to that memory space until you do something about it.

What this means in practice is that calls which reference memory inside the container will hang forever. Remember /proc/pid/cmdline and /proc/pid/exe? Yeah, bad news. Those are encumbered by this and attempting to read them will get you stuck. Kiss ps, pgrep, top, and friends goodbye. You get to troubleshoot this the hard way. Certain things under /proc/pid will work, but others will not. You'll get used to doing "cat foo &" just to avoid losing yet another login shell.
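
If you have scripts that grovel around in /proc, one way to protect them is the same trick as "cat foo &": do the risky read in a throwaway child process and walk away from it after a deadline. Here's a minimal sketch of that idea in Python. The function name and the two-second default are my own inventions, and it assumes you're fine with abandoning a stuck child, since something in state D can't be killed anyway.

```python
import subprocess
import sys

# Tiny reader program run in the child: dump the file's bytes to stdout.
_READER = "import sys; sys.stdout.buffer.write(open(sys.argv[1], 'rb').read())"


def read_with_timeout(path, timeout=2.0):
    """Read `path` in a child process.

    Returns the file's bytes, or None if the child did not finish within
    `timeout` seconds or exited with an error. (Hypothetical helper, not
    part of any library.)
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _READER, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        out, _ = child.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Deliberately do NOT kill-and-wait here: if the child is stuck in
        # uninterruptible sleep, waiting for it would hang us too.
        return None
    return out if child.returncode == 0 else None


if __name__ == "__main__":
    data = read_with_timeout("/proc/self/cmdline")
    print("hung or failed" if data is None else data.split(b"\0"))
```

Note that this uses Popen directly. As far as I can tell, subprocess.run with a timeout kills the child and then waits for it to exit, which is exactly the wait you're trying to avoid here.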

The most direct method for dealing with this is to scan through your memory cgroups (probably under /sys/fs/cgroup/memory, but your mileage may vary) and see which of them are reporting being in an OOM condition (look for "under_oom 1" in memory.oom_control) and yet have the OOM killer disabled ("oom_kill_disable 1" in the same file). To verify this is your problem, read the "tasks" pseudo-file in there, get a PID of something in the container, then try to access /proc/that_pid/cmdline. If it hangs, that's at least part of your problem.
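
That scan is easy enough to script. What follows is a rough sketch, assuming a cgroup v1 memory hierarchy where each directory has a memory.oom_control file containing "oom_kill_disable N" and "under_oom N" lines. The function names and the default root path are assumptions on my part, so adjust them to match your machine.

```python
import os


def parse_oom_control(text):
    """Turn 'key value' lines from memory.oom_control into a dict of ints."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


def find_stuck_cgroups(root="/sys/fs/cgroup/memory"):
    """Return cgroup directories that are under OOM with the killer disabled."""
    stuck = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "memory.oom_control" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "memory.oom_control")) as f:
                fields = parse_oom_control(f.read())
        except OSError:
            continue  # cgroup vanished or is unreadable; skip it
        if fields.get("under_oom") == 1 and fields.get("oom_kill_disable") == 1:
            stuck.append(dirpath)
    return sorted(stuck)


if __name__ == "__main__":
    for cgroup_dir in find_stuck_cgroups():
        print(cgroup_dir)
```

Anything this prints is a candidate. Pull a PID out of that directory's "tasks" file and poke at its /proc entry (carefully, in the background) to confirm.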

Now that you know about the bad cgroup, what can you do? Well, you can raise the limit, and it'll get going again, at least until it grows again and hits the new limit. You can also switch the kernel's enforcer back on and let it lay waste to whatever it wants to kill. I guess you could also reboot the machine, but that's just goofy. It's up to you.
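
Both of the non-goofy options boil down to writing a value into a file in the cgroup's directory. Here's a small sketch, again assuming cgroup v1 file names (memory.limit_in_bytes for the limit, and writing 0 to memory.oom_control to turn the kernel's OOM killer back on). The helper names and the example path are made up for illustration, and you'll need to be root for this to work on a real cgroup.

```python
import os


def raise_limit(cgroup_dir, new_limit_bytes):
    """Set a new (presumably larger) memory limit on the cgroup."""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(new_limit_bytes))


def enable_kernel_oom_killer(cgroup_dir):
    """Hand enforcement back to the kernel: 0 means 'OOM killer not disabled'."""
    with open(os.path.join(cgroup_dir, "memory.oom_control"), "w") as f:
        f.write("0")


if __name__ == "__main__":
    # Hypothetical cgroup path; substitute one found to be stuck.
    stuck = "/sys/fs/cgroup/memory/some_container"
    raise_limit(stuck, 3 * 1024 ** 3)  # bump the limit to 3 GB
    # ...or, instead, let the kernel do the killing:
    # enable_kernel_oom_killer(stuck)
```

Pick one or the other based on whether you'd rather keep the processes alive a while longer or have the kernel clear the logjam right now.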

Obviously, you will want to work backwards and find the problems in your stack that lead both to the uncontrolled memory growth and the lack of handling by your process manager. Otherwise, you'll be right back here again soon.

So really, after this is all said and done and the fire is out, you have gained a new little nugget of data. When a machine starts acting strangely, the load average is climbing without bound, commands like "ps aux" are getting stuck at the same point every time, and you're using memory cgroups, you might just be in this situation.

Now you know how to spot it, and what to do about it.

Go make things better.