We’ve had a number of people ask us recently what sort of procedures and tricks we use when hunting down problems on systems we maintain, as a lot of the work can seem magical at times. While there are no short answers to these sorts of questions (you could fill many, many pages on the topic), we thought we’d share some of the most commonly-used tools with you.

You may already know a couple of them, which is great. Their real value comes from knowing how and when to use them.

strace

This is the big one for us, because it so frequently tells us exactly what we need to know. strace hooks into a process using the kernel’s ptrace facility and prints a list of all the syscalls made by the process. Because processes can’t do much except crunch numbers without making syscalls, you very quickly get insight into just what a process is doing.

Common things like opening a file, mapping some memory, writing to a file descriptor, accessing the network – all these require syscalls. The output of strace tends to be very verbose, but you don’t need to understand everything for it to be helpful. Most of the time, depending on your problem, you’re looking for just a couple of important syscalls.
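A sketch of taming that verbosity, assuming a reasonably modern strace: the -e flag filters the trace down to a class of syscalls, and -o writes it to a file instead of mixing it with the program's own output.

```shell
# -e trace=file keeps only file-related syscalls (open, stat, etc.);
# -o sends the trace to a file so it doesn't tangle with cat's output.
strace -e trace=file -o /tmp/trace.log cat /etc/hostname
# Any failed lookups? ("|| true": finding none is fine too.)
grep ENOENT /tmp/trace.log || true
```

Once the trace is in a file you can grep for the syscall you care about at your leisure, which is usually faster than eyeballing the full firehose.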

As a really simple and contrived example, let’s say that you have a program (possibly named cat) that’s meant to read a file and print it out, but for some reason is completely silent and doesn’t tell you why it failed. You might run it with strace and look for signs of the problem near the end of the output, like so:

[email protected]:~% strace cat my_file
... lots of output here as the shell sets up the execution environment ...
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
open("my_file", O_RDONLY)               = -1 ENOENT (No such file or directory)
close(1)                                = 0
exit_group(1)                           = ?

We’ve trimmed a little bit of the output, but you can see that my_file doesn’t exist because the call to open() fails, and the program terminates without saying anything.

This is a heavily simplified example to demonstrate how you can get a peek under the hood and see what’s really happening when things don’t work as expected. A more realistic use-case might be to inspect a live process and see why it’s not going anywhere. Note that in this example, we use the -p flag to attach to an already-running process (PID 2501) and see what it’s in the middle of.

[email protected]:~% strace -p 2501
Process 2501 attached - interrupt to quit
flock(42, LOCK_EX

The process is blocked while waiting for an exclusive lock on file-descriptor 42. This may tell us what we need to know, or we may need to do a little digging in /proc to reveal what it’s waiting for. Once you’re familiar with your systems, the name of the syscall alone will often tell you what you need to know to continue your diagnosis.
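That digging in /proc is quick to sketch: /proc/&lt;pid&gt;/fd/ holds one symlink per open file descriptor, so reading the link names the file behind the descriptor. PID 2501 and fd 42 come from the strace output above; $$ (this shell) stands in here so the example runs anywhere.

```shell
# One symlink per open descriptor; the link target is the actual file.
ls -l /proc/$$/fd
# Name the file behind a specific descriptor (fd 0, i.e. stdin, here;
# it'd be fd 42 for the flock() example above).
readlink /proc/$$/fd/0
```

No special tools required, just the kernel telling you what it already knows.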

lsof

lsof is another firm favourite in the toolbox thanks to its versatility. lsof is short for “list open files”, which is exactly what it does. Without any arguments it’ll tell you about everything on the system, but that’s often not too useful.

We’re usually asking one of two questions when we use lsof:

1. What files does this process have open?
2. What files are open inside a given directory?

In the first instance we want to know what files are being read or written while the process is running; often they’re log files. This is great when you know a process should be logging its activity, but you can’t find the logfile because the config is a horrible mess of XML files. As a bonus, network connections also appear in the output from lsof, making it easy to look for activity.
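If the network side is all you care about, lsof can filter for it. A sketch, assuming lsof is installed: -a ANDs the selectors together, so this shows only the network sockets belonging to one PID, and -P/-n skip port- and host-name lookups so you see raw numbers. $$ (this shell) stands in for the PID you're actually chasing.

```shell
# -a: AND the selectors; -p: this PID; -i: internet sockets only;
# -P/-n: numeric ports and addresses, no slow DNS lookups.
# "|| true": a process with no open sockets prints nothing and exits 1.
lsof -a -p $$ -i -P -n || true
```

Without -a, lsof ORs its selectors together and you get far more than you bargained for.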

lsof also prints all the libraries in use by the process, but these can usually be ignored. The 4th column lists file-descriptor numbers, which is what we really care about. Here’s part of the output from the pidgin IM client:

[email protected]:~% lsof -p 4534
COMMAND  PID USER  FD  TYPE SIZE/OFF     NODE NAME
pidgin  4534 riri   0r  CHR      0t0      963 /dev/null
pidgin  4534 riri   1u  REG    10367  3962676 /home/riri/.xsession-errors (nfs1.shirakiin.net:/home)
pidgin  4534 riri   2u  REG    10367  3962676 /home/riri/.xsession-errors (nfs1.shirakiin.net:/home)
pidgin  4534 riri   3r unix      0t0  1253085 socket
pidgin  4534 riri   4u unix      0t0  1253086 socket
pidgin  4534 riri   7r IPv4      0t0      TCP ayakashi.shirakiin.net:35883->ldap.shirakiin.net:ldap (ESTABLISHED)
pidgin  4534 riri   8w unix      0t0  1253100 socket
pidgin  4534 riri   9r unix      0t0  1253108 socket
pidgin  4534 riri  10w IPv4      0t0      TCP ayakashi.shirakiin.net:33796->msg.shirakiin.net:xmpp (ESTABLISHED)
pidgin  4534 riri  15u FIFO      0t0  1253113 pipe
pidgin  4534 riri  16u FIFO      0t0  1253113 pipe
pidgin  4534 riri  24u IPv4      0t0      TCP ayakashi.shirakiin.net:40821->msg.shirakiin.net:xmpp-client (ESTABLISHED)
pidgin  4534 riri  30u  REG    23771 22667516 /home/riri/.purple/logs/jabber/[email protected]/[email protected]/2012-06-29.112723+1000EST.txt (nfs1.shirakiin.net:/home)
pidgin  4534 riri  31u  REG    26468 26479347 /home/riri/.purple/logs/jabber/[email protected]/[email protected]/2012-06-29.112723+1000EST.txt (nfs1.shirakiin.net:/home)

In addition to identifying the active network connections, lsof handily tells us which logfiles are open, at the end of the list.

As mentioned, lsof can also be used to find open files in a directory. This is useful when the directory is actually a mountpoint for a filesystem, and we’re trying to find out what’s stopping us from unmounting the filesystem.

[email protected]:~% umount /mnt/usb_drive
umount: /mnt/usb_drive: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
[email protected]:~% lsof +D /mnt/usb_drive
COMMAND   PID USER  FD TYPE SIZE/OFF NAME
zsh     15660 riri cwd  DIR    12288 /mnt/usb_drive
vim     15750 riri cwd  DIR    12288 /mnt/usb_drive
vim     15750 riri   5u REG    16384 /mnt/usb_drive/.love_letter.tex.swp

Whoops! I forgot I was in the middle of writing that…

ldd

ldd tells you which shared libraries a binary depends on. We use it pretty rarely but it’s handy to confirm your suspicions sometimes – let’s look at a real case we saw recently.

A customer’s server runs Postfix, like almost all of our machines, and was failing to deliver mail to some destinations. The Postfix logs showed that the smtp client process was bombing out, and a little more diagnosis revealed that the problem only occurred when the destination supported TLS encryption.

Google eventually led us to a post on the postfix-users mailing list describing the exact same symptoms. Wietse suggested checking the library dependencies using ldd, which revealed the same apparent culprit on our machine: the Percona mysql daemon.

Too easy! That pretty much nailed it as far as we were concerned, based on the rest of the thread. However, to verify the hypothesis, we’ll use the next scheduled maintenance window to try some vanilla mysql packages and see if the problem goes away. Until then, we’ve disabled opportunistic encryption for outbound mail and everything is working nicely.
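The technique itself is a one-liner. In the Postfix case we pointed ldd at the smtp daemon binary (its path varies by distro) and grepped for mysql; /bin/ls stands in here so the sketch runs anywhere.

```shell
# ldd lists every shared library a dynamically-linked binary will load,
# plus where the loader resolved each one. For the Postfix diagnosis the
# target would be the smtp binary and the grep pattern "mysql".
ldd /bin/ls | grep libc
```

An unexpected library in that list, or a "not found" next to one of them, is often the whole story.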

netstat

netstat is a handy little tool for inspecting network socket activity. It’s fairly well known thanks to its inclusion in Windows, but it’s much more useful on Linux.

We mostly use netstat to inspect listening ports on a system. One of its lesser-known features is reporting the process ID and name of each listener; an example will demonstrate this best.

We suspect this machine is running a hidden IRC server and participating in a botnet, thanks to a tip-off from our netsec division. The flags we use (-tnlp) respectively mean: TCP, Numeric output, Listening sockets only, Process details.

[email protected]:~# netstat -tnlp
Active Internet connections (only servers)
Proto Local Address      Foreign Address  State   PID/Program name
tcp   127.0.0.1:3306     0.0.0.0:*        LISTEN  1890/mysqld
tcp   0.0.0.0:21         0.0.0.0:*        LISTEN  1713/vsftpd
tcp   0.0.0.0:22         0.0.0.0:*        LISTEN  1863/sshd
tcp   127.0.0.1:25       0.0.0.0:*        LISTEN  2202/master
tcp   0.0.0.0:5666       0.0.0.0:*        LISTEN  1743/nrpe
tcp   0.0.0.0:6667       0.0.0.0:*        LISTEN  1720/.zz

That last process looks mighty suspicious! Poking through procfs, we can find the location of the executable and uproot it properly.

[email protected]:~# ls -lh /proc/1720/exe
lrwxrwxrwx 1 root root 0 Jun 27 21:12 /proc/1720/exe -> /tmp/a/.zz

Find any other mess it might’ve left behind, kill the process and then nuke it all from orbit. Mission accomplished.
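A bit more procfs spelunking helps with the "find any other mess" step. PID 1720 comes from the netstat output above; $$ (this shell) stands in so the sketch runs anywhere.

```shell
# The full command line the process was started with (NUL-separated,
# so tr makes it readable).
tr '\0' ' ' < /proc/$$/cmdline; echo
# The directory it's running from - often /tmp for this sort of guest.
ls -l /proc/$$/cwd
# Every file and socket it currently holds open.
ls /proc/$$/fd
```

Between exe, cwd, cmdline and fd, the kernel hands you a fairly complete picture of what the intruder left lying around before you kill it.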

tail

I know what you’re thinking, “tail isn’t that cool, it’s like having head but upside-down”. We say otherwise.

tail is invaluable for reading logfiles while performing testing, by invoking it with the -f flag to follow the end of the file as it’s written to.

It’s probably most commonly used here at Anchor to hunt down mail problems. Once running, we’ll go ahead and send test mails through the system, watching for them to be processed and logged.

tail -f /var/log/maillog

tail can be used to follow multiple logfiles simultaneously as well, which is helpful for processes that split their output to multiple files based on loglevel (errors vs. info) or destination (e.g. apache’s access_log and error_log). Used in the right situations, tail-following makes diagnosis much, much faster and more convenient for the sysadmin.
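When following several files, tail labels each burst of output with a "==> filename <==" header so the interleaved streams stay readable. The apache paths vary by distro, so two scratch files stand in here for the sketch.

```shell
# Stand-ins for access_log and error_log.
printf 'GET /\n'  > /tmp/access_log
printf 'oh no\n'  > /tmp/error_log
# -n +1 prints each file from the start; each one gets its own header.
tail -n +1 /tmp/access_log /tmp/error_log
# In real use you'd follow the live files instead:
#   tail -f /var/log/httpd/access_log /var/log/httpd/error_log
```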

There you have it, that’s our Top-5 for Linux systems. Unfortunately they can’t solve your problems for you, but being proficient in their usage will go a long way when it comes to diagnosing a problem without an obvious cause.

Disagree? Do you think there are other tools that should’ve taken a spot on this list? Leave us a comment or get in touch, we’d love to hear from you.