Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace , that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS.

It speaks in a single, common vocabulary of directories and files, and

the operating system translates that abstract vocabulary into whatever

has to be done with the actual underlying hardware you have.

Multiplexing means that your programs and mine each get fair access to

the hardware, and don't have the ability to step on each other -- which

means your program can't be permitted to skip the kernel, and speak raw

SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write() . Talk on the network? You need the syscalls socket() , connect() , and again read() and write() . Make more processes? First clone() (inside the standard C library function fork() ), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4() , kill() , pipe() , and a host of others. Even looking at the clock requires a system call, clock_gettime() . Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a

telltale system call is pure computation -- using the CPU and RAM and

nothing else. As a former algorithms person, that's what I used to

think was the fun part. Fortunately for us as sysadmins, very few

real-life programs spend very long in that pure realm between having to

deal with a file or the network or some other part of the system, and

then strace picks them up again.

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior



One day, I wanted to know which Git

commands take out a certain lock -- I had a script running a series of

different Git commands, and it was failing sometimes when run

concurrently because two commands tried to hold the lock at the same

time.

Now, I love sourcediving,

and I've done some Git hacking, so I spent some time with the source

tree investigating this question. But this code is complex enough that I

was still left with some uncertainty. So I decided to get a plain,

ground-truth answer to the question: if I run " git diff ", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock . Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3 rename(".git/index.lock", ".git/index") = 0 $ strace git diff 2>&1 >/dev/null | grep index.lock $

So git status takes the lock, and git diff doesn't.

Interlude: The Toolbox



To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want



Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

Times



When I use strace , I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt

to the microsecond. For system administration problems, this often

helps a lot in correlating the trace with other logs, or in seeing where

a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.

Data



By default strace already prints the strings that the

program passes to and from the system -- filenames, data read and

written, and so on. To keep the output readable, it cuts off the

strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0 .

Sometimes you want to see the full data flowing in just a few

directions, without cluttering your trace with other flows of data.

Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server,

and you want to see the SQL queries, but not the voluminous data that

comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026 Process 9026 attached - interrupt to quit read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52 read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52 [...]

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar , and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write() , which is 3 here. Pass that file descriptor to -e write= :

$ strace -p 9026 -e write=3 Process 9026 attached - interrupt to quit read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52 | 00000 30 00 00 00 03 53 45 4c 45 43 54 20 74 69 6d 65 0....SEL ECT time | | 00010 73 74 61 6d 70 20 46 52 4f 4d 20 61 72 74 69 66 stamp FR OM artif | | 00020 61 63 74 73 20 57 48 45 52 45 20 69 64 20 3d 20 acts WHE RE id = | | 00030 31 34 35 34 1454 |

and we see the whole query. It's both printed and in hex, in case

it's binary. We could also get the whole thing with an option like -s 1024 , but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.

Filtering the Output



Sometimes the full syscall trace is too much -- you just want to see

what files the program touches, or when it reads and writes data, or

some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write()

and friends, which mention file descriptors), or name individual system

calls by hand. We'll use this option in the next example.

Child Processes



Sometimes the process you trace doesn't do the real work itself, but

delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f :

$ strace -e trace=process,file,desc sh -c \ 'for d in .git/objects/*; do ls $d >/dev/null; done' [...] stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948 --- SIGCHLD (Child exited) @ 0 (0) -- wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \ 'for d in .git/objects/*; do ls $d >/dev/null; done' [...] stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0 clone(Process 10738 attached child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738 [pid 10682] wait4(-1, Process 10682 suspended [pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 [pid 10738] dup2(3, 1) = 1 [pid 10738] close(3) = 0 [pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0 [... setup of C standard library omitted ...] [pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 [pid 10738] getdents(3, /* 40 entries */, 4096) = 2480 [pid 10738] getdents(3, /* 0 entries */, 4096) = 0 [pid 10738] close(3) = 0 [pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482 [pid 10738] close(1) = 0 [pid 10738] close(2) = 0 [pid 10738] exit_group(0) = ? Process 10682 resumed Process 10738 detached <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738 --- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents() .

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff , which makes strace write each process's trace into a separate file.

Use #2: Why/Where Is A Program Stuck?



Sometimes a program doesn't seem to be doing anything. Most often,

that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067 Process 22067 attached - interrupt to quit flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock ( LOCK_EX ) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3 /tmp/foobar.lock

Aha, it's the file /tmp/foobar.lock . And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock command 21856 price 3uW REG 253,88 0 34443743 /tmp/foobar.lock command 22067 price 3u REG 253,88 0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why

21856 has been holding the lock for so long, whether 21856 and 22067

really need to grab the same lock, etc.

Other common ways the program might be stuck, and how you can learn more after discovering them with strace :

Waiting on the network. Use lsof again to see the remote hostname and port.

Use lsof again to see the remote hostname and port. Trying to read a directory. Don't laugh -- this can

actually happen when you have a giant directory with many thousands of

entries. And if the directory used to be giant and is now small again,

on a traditional filesystem like ext3 it becomes a long list of "nothing

to see here" entries, so a single syscall may spend minutes scanning

the deleted entries before returning the list of survivors.

Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors. Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of straces domain; good luck.

Uses #3, #4, ...



A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace

can tell you what it's doing right now -- and a little cleverness can

often tell you how far that places it in the overall task.

When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task. Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.

There's no better way to tell how long your application takes to talk to that remote server than watching it actually from the server, with as your stopwatch. Identifying hot spots. Profilers are great, but

they don't always reflect the structure of your program. And have you

ever tried to profile a shell script? Sometimes the best data comes

from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.

Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a run to a file, and picking through to see when each phase of your program started and finished. As a teaching and learning tool. The user/kernel

boundary is where almost everything interesting happens in your system.

So if you want to know more about how your system really works -- how

about curling up with a set of man pages and some output from strace ?

~price