The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462      cmdline   kcore        self
10411  13112  23842  29813  5        cpuinfo   keys         slabinfo
...
12934  15260  26317  4      bus      irq       partitions   zoneinfo
12938  15262  26349  413    cgroups  kallsyms  sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.
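As a quick illustration of how scriptable this is, here's a small Python sketch of my own (not from the original article): a minimal ps built from nothing but directory listings and file reads. Every numeric directory name in /proc is a PID, and each one contains a cmdline file with the process's NUL-separated arguments.

```python
import os

def list_pids():
    """Return the PIDs of all running processes, straight from /proc."""
    return sorted(int(d) for d in os.listdir('/proc') if d.isdigit())

def cmdline(pid):
    """Read a process's command line; arguments are NUL-separated."""
    with open('/proc/%d/cmdline' % pid, 'rb') as f:
        return f.read().replace(b'\0', b' ').strip().decode('utf-8', 'replace')

if __name__ == '__main__':
    for pid in list_pids():
        try:
            print(pid, cmdline(pid) or '[kernel thread or zombie]')
        except OSError:
            pass  # the process exited between listing and reading
```

Note the try/except: since /proc is a live view, a process can vanish between the listdir and the open.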

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13833 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!
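You can watch the /proc/self magic from inside a program, too. This Python sketch (mine, not from the article) confirms that the symlink resolves to the reader's own PID, and pulls a field out of the per-process status file:

```python
import os

# /proc/self is a magic symlink: each process that dereferences it
# sees its own PID directory.
me = os.readlink('/proc/self')
assert me == str(os.getpid())

# The per-PID directory holds, among other things, a human-readable
# status file with the process name, state, and memory usage:
with open('/proc/self/status') as f:
    status = dict(line.split(':\t', 1) for line in f if ':\t' in line)
print(status['Name'].strip())   # the process's short command name
```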

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
^Z
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hard link). But the ln command and the associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!
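The same rescue can be reproduced in miniature from a program. Here's a self-contained Python sketch of my own (not from the article): create a file, keep a descriptor open, delete the name, then read the contents back through the /proc fd link.

```python
import os
import tempfile

# Create a scratch file with some "precious" contents.
fd, path = tempfile.mkstemp()
os.write(fd, b'precious data')
os.close(fd)

reader = open(path, 'rb')   # keep a descriptor open...
os.unlink(path)             # ...then "accidentally" delete the file

# The data is still reachable via proc, even though the name is gone.
# The symlink target will show the old path with " (deleted)" appended.
rescue = '/proc/self/fd/%d' % reader.fileno()
print(os.readlink(rescue))
with open(rescue, 'rb') as f:
    data = f.read()
reader.close()
print(data)   # b'precious data'
```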

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw. A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat crap-prog.py
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python crap-prog.py
IndexError: list index out of range
keegan@lyle$ echo hello | python crap-prog.py -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python crap-prog.py /proc/self/fd/0
hello

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan 6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan 6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan 6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem: specialer than proc, because it can't even be mounted.
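To show the trick end to end, here's a Python sketch of my own (not from the article): the parent pipes data to a child that, like crap-prog.py, insists on a named file, and the child reads its own standard input through /proc/self/fd/0.

```python
import subprocess
import sys

# A filename-only child program: it opens whatever path it is given.
child = "import sys; print(open(sys.argv[1]).read().strip().upper())"

# Point the child at /proc/self/fd/0, its own standard input, and
# feed it data through a pipe.
out = subprocess.run(
    [sys.executable, '-c', child, '/proc/self/fd/0'],
    input=b'hello\n', capture_output=True, check=True,
).stdout
print(out)   # b'HELLO\n'
```

Even though the child's stdin is an anonymous pipe with no real filename, opening /proc/self/fd/0 works thanks to proc's magic symlinks.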

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv. Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink $fd)
size=$(wc -c $fd | awk '{print $1}')
while [ -e $fd ]; do
  progress=$(cat $fdinfo | grep ^pos | awk '{print $2}')
  echo $((100*$progress / $size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat slow-reader.py
import sys
import time
f = file(sys.argv[1], 'r')
while f.read(1024):
    time.sleep(0.01)
keegan@lyle$ python slow-reader.py bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan 6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan 6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan 6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan 6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.
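The key ingredient, parsing the pos: field out of fdinfo, is easy to do from any language. Here's a Python sketch of my own (not from the article) that verifies fdinfo against a file offset we control:

```python
import os
import tempfile

def fd_position(pid, fd):
    """Parse the 'pos:' field of /proc/<pid>/fdinfo/<fd>."""
    with open('/proc/%d/fdinfo/%d' % (pid, fd)) as f:
        for line in f:
            if line.startswith('pos:'):
                return int(line.split()[1])
    raise ValueError('no pos field found')

# Seek partway into a scratch file and confirm fdinfo agrees with us.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b'x' * 1000)
    os.lseek(fd, 250, os.SEEK_SET)
    pos = fd_position(os.getpid(), fd)
    print(pos)   # 250
finally:
    os.close(fd)
    os.unlink(path)
```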

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing. The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918   /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918   /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0        [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241  /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241  /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241  /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243  /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243  /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243  /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115  /lib/libc-2.11.2.so
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115  /lib/libc-2.11.2.so
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115  /lib/libc-2.11.2.so
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115  /lib/libc-2.11.2.so
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114  /lib/libdl-2.11.2.so
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114  /lib/libdl-2.11.2.so
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114  /lib/libdl-2.11.2.so
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114  /lib/libdl-2.11.2.so
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128  /lib/ld-2.11.2.so
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128  /lib/ld-2.11.2.so
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128  /lib/ld-2.11.2.so
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0        [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0        [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment.

There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.
The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel determines the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls, such as gettimeofday, merely read information maintained by the kernel. The kernel stores this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.
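The version-mismatch check above is mechanical enough to automate. Here's a Python sketch of my own (not from the article) that parses a maps file and flags any library basename mapped from more than one directory, the exact red flag we spotted by eye:

```python
import os
from collections import defaultdict

def mapped_files(pid):
    """Return the set of files mapped into a process, from /proc/<pid>/maps."""
    files = set()
    with open('/proc/%d/maps' % pid) as f:
        for line in f:
            # Fields: address perms offset dev inode [pathname]
            parts = line.split(None, 5)
            if len(parts) == 6 and parts[5].startswith('/'):
                files.add(parts[5].rstrip('\n'))
    return files

libs = mapped_files(os.getpid())

# Group by basename; the same library name appearing under two
# different directories is worth a closer look.
by_name = defaultdict(set)
for path in libs:
    by_name[os.path.basename(path)].add(os.path.dirname(path))
for name, dirs in sorted(by_name.items()):
    if len(dirs) > 1:
        print('suspicious: %s loaded from %s' % (name, sorted(dirs)))
```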

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python slow-reader.py bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls.

Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
    while (1) {
    }
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!
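If you'd rather collect these counters programmatically than eyeball watch output, the parsing is straightforward. A Python sketch of my own (not from the article): read both counters, sleep briefly, and observe the voluntary count climb, since sleeping yields the CPU.

```python
import os
import time

def ctxt_switches(pid):
    """Return (voluntary, nonvoluntary) context-switch counts for a PID."""
    vol = nonvol = None
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('voluntary_ctxt_switches:'):
                vol = int(line.split()[1])
            elif line.startswith('nonvoluntary_ctxt_switches:'):
                nonvol = int(line.split()[1])
    return vol, nonvol

before = ctxt_switches(os.getpid())
time.sleep(0.2)    # sleeping yields the CPU: a voluntary switch
after = ctxt_switches(os.getpid())
print(before, after)
```

A sampling loop over these two numbers is all you need for a crude CPU-bound vs. IO-bound classifier.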