I have been having an occasional issue with my Linux web server, where the apache process goes out of control and eats up all the RAM over a period of a few hours. Eventually swap becomes exhausted, the dreaded oomkiller starts, and the server then becomes unresponsive. This is bad, since the server is colocated in a remote datacenter, and the only way I have to log in is via ssh. Once oomkiller starts doing its thing, I can't log in, so a hard reset is required. The last time this happened I ended up losing some files (recovered from backup, but a pain nonetheless).

So I wanted to keep an eye on things and stop the situation from degrading that far. I found various methods for monitoring memory from the command line, but what I wanted was a script that could be run automatically to check the health of the memory and take action as appropriate. This is what swapwatch is for.

File: swapwatch Type: a /usr/bin/perl script text executable Size: 3 KB

This utility is intended to be run regularly from cron, say every minute. I don't think this will impact the server performance at all, since the script doesn't do anything under normal circumstances except read and parse a very small text file.

The script reads /proc/meminfo, which contains stats on current memory usage, for example:

shell> more /proc/meminfo
MemTotal:        4063248 kB
MemFree:           55528 kB
Buffers:          339824 kB
Cached:          2456500 kB
SwapCached:        12480 kB
Active:          1651604 kB
Inactive:        2114848 kB
SwapTotal:       3863624 kB
SwapFree:        3806356 kB
Dirty:              6544 kB
Writeback:             0 kB
AnonPages:        967272 kB
Mapped:            16820 kB
Slab:             194364 kB
SReclaimable:     176824 kB
SUnreclaim:        17540 kB
PageTables:        15140 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5895248 kB
Committed_AS:    1834468 kB
VmallocTotal: 34359738367 kB
VmallocUsed:       26340 kB
VmallocChunk: 34359711767 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

The lines we are interested in are SwapTotal and SwapFree. Together, these can give us an idea of the "health" of the system, at least in terms of memory usage. On most production systems there really should not be much, if any, swap space in use during normal operation.
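To make the calculation concrete, here is a minimal Python sketch of how free swap as a percentage of total swap could be derived from the contents of /proc/meminfo. This is not the swapwatch script itself (which is Perl); the function names parse_meminfo, free_swap_percent and read_free_swap_percent are made up for illustration, and returning None when no swap is configured is an assumption.

```python
def parse_meminfo(text):
    """Return a dict mapping each field name to its integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def free_swap_percent(meminfo):
    """Free swap as a percentage of total swap, or None if there is no swap."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return None  # assumption: treat "no swap configured" as a special case
    return 100.0 * meminfo.get("SwapFree", 0) / total


def read_free_swap_percent(path="/proc/meminfo"):
    """Read the live file (Linux only) and return the free swap percentage."""
    with open(path) as f:
        return free_swap_percent(parse_meminfo(f.read()))
```

With the figures shown above (SwapTotal 3863624 kB, SwapFree 3806356 kB) this works out to roughly 98.5% free.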

The script takes a sequence of command line parameters which specify levels of free swap, and actions that should be taken if the free swap should fall below these levels. Here is the syntax:

swapwatch <level[:action]> ...

So the ':action' part is optional, and you can have multiple level:action tuples. The levels are percentages, for example 75 means 75%. This refers to free swap as a percentage of total swap.

For example if you say:

swapwatch 50 '25:apache restart'

This means issue a warning if the free swap level falls below 50% of total swap, and take action if the level falls below 25% (the action is 'apache restart'). Note that for command line parameters, you don't need quotes around a simple number, but you do if you have a command sequence that has any spaces. Quoting allows you to group together multiple words as one single command line param, which is what we want here.
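As an illustration of the parameter format, here is a small Python sketch of how one 'level[:action]' parameter could be split into its two parts. It is not taken from the Perl script; the name parse_rule is hypothetical, and treating levels as whole numbers between 0 and 100 is an assumption.

```python
def parse_rule(arg):
    """Split 'level[:action]' into (level, action); action is None if absent."""
    # Split on the first colon only, so the action itself may contain colons.
    level_text, sep, action_text = arg.partition(":")
    level = int(level_text)  # assumption: levels are whole-number percentages
    if not 0 <= level <= 100:
        raise ValueError(f"level must be between 0 and 100: {arg!r}")
    action = action_text.strip() if sep else ""
    return level, (action or None)
```

So '50' would become a warning-only rule, while '25:apache restart' would carry the command 'apache restart' along with the level 25.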

The level:action tuples are evaluated in numerical order, according to the level. That means the lowest swap levels (i.e. the worst conditions) are tested first. So if we have params for levels of 80, 50 and 25, then the 25 is evaluated first, then 50, then 80, regardless of the order in which these params are given on the command line. Also, only the first matching action is executed. This seems to make sense, because if we have, say, a last ditch action to reboot the server at 25% free swap space, then we don't also want to execute the other actions that are associated with higher levels.
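The evaluation order described above can be sketched in a few lines of Python. Again this is only an illustration, not the Perl source: the name select_rule is hypothetical, rules are assumed to be (level, action) pairs, and "falls below" is assumed to mean strictly less than.

```python
def select_rule(free_percent, rules):
    """Return the first (level, action) whose level exceeds free_percent.

    Rules are checked from the lowest level upwards, whatever order they
    were given in, and only the first match is returned.
    """
    for level, action in sorted(rules, key=lambda rule: rule[0]):
        if free_percent < level:  # assumption: strict "below" comparison
            return level, action
    return None  # free swap is at or above every level: nothing to do
```

For levels of 80, 50 and 25, a free swap figure of 20% would match only the 25 rule, and a figure of 60% would match only the 80 rule.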

Usually you would use swapwatch from /etc/crontab. For example, I have this line in mine:

*/1 * * * * root swapwatch 95 '90:apache restart' '50:apache stop; mysqladmin shutdown' '25:reboot'

So what happens here?

Let's imagine how a scenario might play out. Say some process has a memory leak and begins eating up RAM, eventually exhausting that and starting on the swap space. Our swapwatch script is being run every minute by cron. So the first condition it notices is probably the '95', i.e. the percentage of free swap has fallen below 95% of total swap. Using a high number for the first warning is probably a good idea, since on most servers, any swap being used at all (beyond a few megabytes) is probably a sign of something wrong. So this initial tripwire is a simple warning, with no associated action; it will just print out the warning text, which cron will forward to the sysadmin via email. Hopefully the admin sees this quickly, and checks the system manually to see what's going on.

If the situation continues (let's say the sysadmin is away or home asleep or whatever), then the next trip point is '90:apache restart'. This means when the free swap falls below 90% of total swap, we try restarting the apache web server. This is relevant for me, because it's apache that seems to be the problem. Usually, a simple restart should do the trick. The sysadmin will again get an email saying what happened.

Ok, next let's say the problem wasn't apache at all, but some other process. Well, on my server the other big, complex program that's running all the time is mysql, the database server. This is usually pretty stable, but very occasionally it'll run into a problem that makes it go haywire. So we have a check point for '50:apache stop; mysqladmin shutdown' - if the free swap falls below 50% of total swap, we stop the webserver and attempt to cleanly shut down mysql.

The final check point is a catch-all: '25:reboot' says that if we have less than 25% free swap space, then by now we've already tried shutting down apache and mysql, so there must be some other issue (or maybe the attempted shutdown of the problem process simply didn't work). We need to reboot the server in an orderly fashion, before the dreaded oomkiller starts to make it unresponsive and makes a hard reset necessary.
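For anyone curious what "print a warning, then run the action" might look like in code, here is a rough Python sketch of that step. It is an assumption about the general approach, not the actual Perl implementation: the name act_on_rule and the exact wording of the messages are made up, and the runner parameter exists only so the function can be tried without really executing anything.

```python
import subprocess


def act_on_rule(free_percent, level, action, runner=subprocess.run):
    """Print a warning and, if the rule has an action, run it via the shell.

    Anything printed here ends up in the email that cron sends when the
    job produces output.
    """
    print(f"swapwatch: free swap is {free_percent:.1f}%, below the {level}% level")
    if action is None:
        return None  # warning-only rule, e.g. a bare '95'
    print(f"swapwatch: running: {action}")
    # shell=True lets a sequence like 'apache stop; mysqladmin shutdown'
    # run as one command line, the way it was quoted in the crontab.
    return runner(action, shell=True).returncode
```

Passing the action through the shell is what allows several commands separated by ';' to be given as a single action.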

You can obviously choose your own warning levels and actions according to your situation; what the script does is fully configurable from the command line. If you want, you can just have it as a warning generator:

swapwatch 50

This will just issue warnings when the free swap goes below 50%. Or you could just do a simple command to give you some diagnostics when the condition arises:

swapwatch '50:netstat -a'

It's completely up to you.

This is a very simple Perl script, and should be hackable by anyone with knowledge of the language. It should only be used by experienced sysadmins, and it's presented here without any guarantee or warranty. It's free for you to use or hack for your own use - but use it at your own risk! All I ask is that if you republish it or adapt it, please give attribution back to me.

Possible enhancements might include making the script into a loop that runs continuously, for situations where running every minute from cron would be too infrequent (e.g. for situations where the memory can become exhausted very quickly). For me, it works well as-is, so I'll leave that as an exercise for the reader.
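As a starting point for that exercise, here is one way the continuous loop could be sketched in Python. This is purely hypothetical and not part of swapwatch: the name watch, the five second default interval, and the max_checks and sleep parameters (there only to make the loop easy to try out) are all assumptions.

```python
import time


def watch(check, interval_seconds=5.0, max_checks=None, sleep=time.sleep):
    """Call check() repeatedly, pausing interval_seconds between calls.

    Runs forever when max_checks is None; otherwise stops after that many
    checks and returns the number of checks performed.
    """
    done = 0
    while max_checks is None or done < max_checks:
        check()
        done += 1
        if max_checks is None or done < max_checks:
            sleep(interval_seconds)
    return done
```

One thing a looping version would have to deal with is that the same action would fire again on every pass for as long as the free swap stays below the level, so some kind of cool-down would probably be needed.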

Thanks,

Neil Gunton

March 22nd 2010