Even Linux servers can go haywire some days. Here are the first steps you should take in troubleshooting and fixing them.

I've seen plenty of Linux servers run day in and day out for years, with nary a reboot. But any server can suffer from hardware, software, and connectivity problems. Here's how to find out what's wrong so you can get them working again.

One pre-troubleshooting issue is the meta-question of whether you should fix the server at all.

When I started as a Unix system administrator in the 1980s—long before Linux was a twinkle in Linus Torvalds' eye—if a server went bad, you had a real problem. There were relatively few debugging tools, so it could take a long time to get a malfunctioning server back into production.

Why troubleshooting is different now

It's different today. One sysadmin told me quite seriously that he'd "blow it away and build another one."

In a world where IT is built around virtual machines (VMs) and containers, this makes sense. The cloud, after all, depends on being able to roll out new instances as needed.

Plus, DevOps tools such as Chef and Puppet make it easier to start over than to fix anything. With higher-level DevOps tools such as Docker Swarm, Mesosphere, and Kubernetes, your servers can go down and be brought back up before you even know they failed.

This concept has become so widespread that it has a name: serverless computing, which includes AWS Lambda, Iron.io, and Google Cloud Functions. With this technique, the cloud service handles all the capacity, scaling, patching, and administration of the server you need to run your program.

While serverless computing makes servers invisible to users and, to some extent, sysadmins, underneath all those layers of abstraction—VMs, containers, serverless—you still have physical hardware and the operating system. And at the end of the day, someone still has to fix them when things break.

As one system operator told me, "'Just reinstall it' is a terrible practice. It doesn't tell you anything about why the server broke or how to prevent it from breaking again. No halfway-decent admin should start with a reinstall."

I agree. Until you actually work out why a problem happened in the first place, the issue isn't resolved.

Here are my suggestions on how to start that process.


1. Check the hardware!

First—and I know this is going to sound really stupid, but do it anyway—check the hardware. In particular, go to the rack in person and make sure all the cables are plugged in correctly.

I cannot begin to count the number of times a problem could be tracked back to cables when just a quick glance at the blinkenlights could have told you the power was off or a network cable had come unplugged.

Of course, you don't have to look at the hardware. For example, this shell command tells you if your Ethernet device link is detectable:

$ sudo ethtool eth0

If the output includes "Link detected: yes," you know the port is talking to the network.
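When you want that check in a script rather than by eye, one option is to pull the "Link detected" field out of ethtool's output. The following is only a minimal sketch: the function name link_state is made up for illustration, and it assumes ethtool prints a line of the form "Link detected: yes" for the device.

```shell
# link_state (hypothetical helper): reads `ethtool <iface>` output on stdin
# and prints the value of the "Link detected" field, or "unknown" if the
# field is missing.
link_state() {
    awk -F': ' '
        /Link detected/ { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# Example use on a live system (the interface name eth0 is an assumption):
#   sudo ethtool eth0 | link_state
```

Because the function only parses text, you can also feed it saved ethtool output from a machine you can no longer reach.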

Yet it’s a good idea to physically look at the gear to make sure someone didn't pull the Big Red Switch and turn off the server or rack's power. Yes, this is simple, but it's amazing how many times you can thumb-finger a total system outage.

Other common hardware problems can't be spotted by a mark one eyeball. For example, bad RAM causes all kinds of problems. VMs and containers can hide these problems, but if you see a pattern of failures linked to a specific bare-metal server, check its memory.

To see what a server's BIOS/UEFI reports about its hardware, including memory, use the dmidecode command:

$ sudo dmidecode --type memory

If this looks right—it may not be, as SMBIOS data isn't always accurate—and you still suspect a memory problem, it's time to deploy Memtest86. This is the essential memory checking program, but it's slow. If you're running it on a server, don't expect to use that machine for anything else while the checks are running.

If you run into a lot of memory problems—which I've seen in places with dirty power—you should load the edac_core module. This Linux kernel module constantly checks for bad memory. To load it, use the command:

$ sudo modprobe edac_core

Wait for a while and then check to see if anything shows up when you type in the command:

$ sudo grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

This presents you with a list of the memory controller's row (DIMM) and error count. Combined with dmidecode data on memory channel, slot, and part number, this process helps you find the corrupted memory stick.
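On a machine with many DIMMs, most of those counters will be zero. As a rough sketch, you could filter the grep output down to the rows that have actually logged corrected errors. The helper name nonzero_ce_counts is hypothetical, and it assumes grep was run against several files so that each line has the form path:count.

```shell
# nonzero_ce_counts (hypothetical helper): reads "path:count" lines, as
# printed by grep over several ch*_ce_count files, and keeps only the
# entries whose count is greater than zero.
nonzero_ce_counts() {
    awk -F: '$NF + 0 > 0 { print $1 ": " $NF " corrected error(s)" }'
}

# Example use on a live system:
#   sudo grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count | nonzero_ce_counts
```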

2. Define the exact problem

OK, so your server has gone haywire, but there's no magic smoke coming out of it. Before you attempt to deal with the end result, you need to lock down exactly what the problem is. For example, if your users are complaining about a problem with a server application, first make sure it's not actually failing at the client.

For instance, a friend told me his users reported that IBM Tivoli Storage Manager had failed for them. Eventually, he discovered the problem wasn't on the server side at all. Instead, it was a bad Windows client patch, 3076895. But the manner in which the security patch messed up made it look like a problem on the server side.

You should also determine whether the problem is with the server per se or the server application. For example, a server program can go awry while the server keeps humming along.

There are numerous ways to check to see if an application is running. Two of my favorites are:

$ sudo ps -ef | grep apache2

$ sudo netstat -plunt | grep apache2

If it turns out that, say, the Apache web server isn't running, you can start it with this:

$ sudo service apache2 start
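One wrinkle with the ps and grep check is that the grep process itself can show up in the results, because its own command line contains the search word. A possible workaround, sketched here with a made-up helper name count_matching, is to list only command names and match them exactly. The process name apache2 is taken from the example above; on Linux, ps shortens long command names in this column, so very long names may not match.

```shell
# count_matching (hypothetical helper): reads one command name per line on
# stdin and prints how many lines exactly equal the name given as $1.
# `|| true` keeps the exit status at zero when grep finds no match.
count_matching() {
    grep -c -x -F -- "$1" || true
}

# Example use on a live system:
#   ps -eo comm= | count_matching apache2
```

A result of 0 means no process with that exact name is running, which is the cue to start the service.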

In short, before jumping in to work out what's wrong, make sure you work out which element is at fault. Only once you're sure you know what a problem is do you know the right questions to ask or the next level of troubleshooting to investigate.

I mean, sure, you know your car doesn't run, but first you need to make sure there's gas in the tank before hauling the car off to the shop for repairs.

3. Top

Another useful debugging step is to run top to check load average, swap, and which processes are using resources. Top shows all of a Linux server's currently running processes.

Specifically, top displays:

Line 1:

The time

How long the computer has been running

Number of users

Load average (the system load averaged over the last minute, last 5 minutes, and last 15 minutes)

Line 2:

Total number of tasks

Number of running tasks

Number of sleeping tasks

Number of stopped tasks

Number of zombie tasks

Line 3:

CPU usage as a percentage by the user

CPU usage as a percentage by system

CPU usage as a percentage by low-priority processes

CPU usage as a percentage by idle processes

CPU usage as a percentage by I/O wait

CPU usage as a percentage by hardware interrupts

CPU usage as a percentage by software interrupts

CPU usage as a percentage by steal time

Line 4:

Total system memory

Free memory

Memory used

Buffer cache

Line 5:

Total swap available

Total swap free

Total swap used

Available memory

This is followed by a line for each running process. It includes:

Process ID

User

Priority

Nice level

Virtual memory used by process

Resident memory used by process

Shareable memory

CPU used by process as a percentage

Memory used by process as a percentage

Time process has been running

Command

That’s a wealth of useful troubleshooting information. Here are some useful ways to get at it.

To find the process consuming the most memory, sort the process list by pressing the M key. To see which applications are using the most CPU, press P; and to sort by cumulative CPU time, press T. To more easily see which column you're using for sorting, press the x key, which highlights the sort column (b toggles that highlight between bold and reverse video).

You can also interactively filter top's results by pressing o or O, which displays the following prompt:

add filter #1 (ignoring case) as: [!]FLD?VAL

You can then enter a search for a particular process, for example, COMMAND=apache, whereupon top displays only Apache processes.

Another useful top command is to display each process’s full command path and arguments. To do this, press c.

A related top command is Forest mode, which you activate with V. This displays the processes in a parent-child hierarchy.

You can also view a specific user's processes with u or U, or get rid of the idle processes' display with i.

While top has long been the most popular Linux interactive activity viewer, htop adds even more features and has an easier graphical Ncurses interface. For example, with htop you can use the mouse and scroll the process list vertically and horizontally to see all processes and complete command lines.

I don’t expect top to tell me what the problem is; rather, I use it to find behavior that makes me say, “That’s funny,” and inspires further investigation. Based on what top tells me, I know which logs to look at first. The logs themselves I inspect using combinations of less, grep, and tail -f.
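The load averages on top's first line are easier to judge against the number of CPUs: a load of 4 is busy on a single-core box and unremarkable on a 16-core one. Here is a small sketch that divides the 1-minute load by a CPU count. The helper name load_per_cpu is hypothetical, and it assumes input in the format of /proc/loadavg, where the first field is the 1-minute load average.

```shell
# load_per_cpu (hypothetical helper): reads a /proc/loadavg-style line on
# stdin and prints the 1-minute load average divided by the CPU count
# passed as $1, to two decimal places.
load_per_cpu() {
    awk -v cpus="$1" '{ printf "%.2f\n", $1 / cpus }'
}

# Example use on a live system:
#   load_per_cpu "$(nproc)" < /proc/loadavg
```

As a rule of thumb rather than a hard limit, a value that stays well above 1.00 suggests more runnable work than the CPUs can keep up with.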

4. What's up with disk space?

Even today, when you can carry a terabyte in your pocket, a server can run out of disk space without anyone noticing. When that happens, really wonky problems can show up.

To track these down, the good old df command, whose name is short for “disk free,” is your friend. You use df to view a full summary of available and used disk space.

It's typically used in two ways:

$ sudo df -h presents data about your hard drives in a human-readable format. For example, it displays storage as gigabytes (G) rather than an exact number of bytes.

$ sudo df -i displays the number of used inodes and their percentage for the file system.

Another useful flag is -T. This displays your storage's file system types. So, $ sudo df -hT shows both the amount of used space in your storage and its file system type.
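If you would rather be warned than go looking, df output is easy to filter. The sketch below flags file systems at or above a usage threshold. The helper name full_filesystems and the 90 percent threshold are assumptions for illustration; it expects the POSIX output format from df -P, where the fifth column is the use percentage and the sixth is the mount point, and it will misread mount points that contain spaces.

```shell
# full_filesystems (hypothetical helper): reads `df -P` output on stdin and
# prints "mountpoint use%" for every file system whose use percentage is at
# or above the threshold passed as $1. The header line is skipped.
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# Example use on a live system:
#   df -P | full_filesystems 90
```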

If something seems off, you can look deeper by using the iostat command. This command is part of the sysstat advanced system performance monitoring tools collection. It reports on CPU statistics and I/O statistics for block storage devices, partitions, and network file systems.

Perhaps the most useful version of this command is:

$ iostat -xz 1

This displays the delivered reads, writes, read KB, and write KB per second to the device. It also shows you the average time for the I/O in milliseconds (await). The bigger the await number, the more likely it is that the drive is saturated with data requests, or it has a hardware problem. Which is it? You might use top to see if MySQL (or whatever DBMS you're using) is keeping your server busy. If there's no application burning the midnight oil, then chances are your drive is turning sour.

Another important result is found under %util, which measures device utilization. This shows how busy the device is. Values greater than 60% typically indicate poor storage performance. If the value is close to 100%, the drive is nearing saturation.

Be careful of what you're looking at. A logical disk device fronting multiple back-end disks with 100% utilization may just mean that some I/O is always being processed. What matters is what's happening on those back-end disks. So, when you're looking at a logical drive, keep in mind that the disk utilities aren't going to give you useful information.
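To pick the busy devices out of a long iostat report, you can filter on the %util column. The column layout of iostat -x differs between sysstat versions, so this sketch looks up the %util column from the header line instead of hard-coding a position. The helper name busy_devices is hypothetical, the 60 threshold simply echoes the figure above, and the sketch assumes the header line starts with "Device" and that numbers use a dot as the decimal separator.

```shell
# busy_devices (hypothetical helper): reads `iostat -x` output on stdin,
# finds the %util column from the "Device" header line, and prints
# "device util" for each device at or above the threshold passed as $1.
busy_devices() {
    awk -v limit="$1" '
        /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}

# Example use on a live system (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C iostat -xz 1 3 | busy_devices 60
```

The same caveat about logical devices applies: a high figure here is only meaningful for the physical disks behind them.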

5. Check the logs

Last, but never least, check the server logs. These are usually in /var/log in a subdirectory specific to the service.

For Linux newcomers, log files can be scary. They record in text files everything Linux or Linux-based applications do. There are two kinds of log records. One records what happens on a system or in a program, such as every transaction or data movement. The other records system or application error messages. Log files may contain both. They can be enormous files.

Log files tend to be cryptic, but you still need to learn your way around them. DigitalOcean's "How to View and Configure Linux Logs on Ubuntu and Centos" is an excellent introduction.

There are many tools to help you check logs.

One useful troubleshooting tool is dmesg. This displays all the kernel messages. That's usually way too many, so use this simple command to display the last 10 messages:

$ dmesg | tail

Want to see what's happening as it happens? I know I do when I'm troubleshooting. Then run tail with the -f option:

$ sudo tail -f /var/log/syslog

With the above command, tail keeps an eye on the syslog file and prints each new event as it is recorded to syslog.

Another handy one-liner is:

$ sudo find /var/log -type f -mtime -1 -exec tail -Fn0 {} +

This follows every log file modified in the last 24 hours and prints new entries as they arrive, so you can watch for problems across all active logs at once.

If your server uses systemd for its system and service management, as most current Linux distributions do, you need to use its built-in log tool, journalctl. Systemd centralizes log management with the journald daemon. Unlike older Linux logs, journald stores data in a binary rather than text format.

You can set journald to save logs from one reboot to the other with the command:

$ sudo mkdir -p /var/log/journal

You also need to enable persistent record keeping by editing /etc/systemd/journald.conf to include the lines:

[Journal]
Storage=persistent

The most common way to access this log data is with the command:

$ journalctl -b

This shows you all the journal entries since the most recent reboot. If your system required a reboot, you can see what happened before it by using the command:

$ journalctl -b -1

This looks at the log from the server's previous boot, which is available only if the journal is being stored persistently.

For more of an introduction on how to use journalctl, see "How to Use Journalctl to View and Manipulate Systemd Logs."

Logs can be huge and difficult to work with. So, while you can work through them with shell scripts using grep, awk, and other filters, you may also want to use a log-viewing tool.
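As one example of that kind of shell filtering, the sketch below counts how many lines each program has written to a syslog-style file, which is a quick way to spot a service that has suddenly become chatty. The helper name top_talkers is hypothetical. It assumes the traditional syslog line layout (month, day, time, host, then program[pid]:), so the program tag is the fifth field; logs written with other timestamp formats need a different field number.

```shell
# top_talkers (hypothetical helper): reads traditional syslog lines on stdin
# and prints "program count" pairs, busiest program first. The "[pid]"
# suffix and trailing colon are stripped from the program tag.
top_talkers() {
    awk '{
        tag = $5
        sub(/\[[0-9]+\]/, "", tag)
        sub(/:$/, "", tag)
        print tag
    }' | sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Example use on a live system:
#   sudo cat /var/log/syslog | top_talkers | head
```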

A favorite of mine is Graylog, an open source log management system. It collects, indexes, and analyzes both structured and unstructured data. To do this, it uses MongoDB for data and Elasticsearch for log file searches. Graylog makes it easy to track what's what with your servers. It makes working with logs easier than with Linux's built-in log tools. It also has the advantage of working with multiple DevOps programs, such as Chef, Puppet, and Ansible.

Maybe your servers will never reach all-time longevity records. But fixing problems and setting servers to be as stable as possible are always worthwhile goals. With all these methods, you should be well on your way to finding and fixing your problem.
