Troubleshooting Steps For When Your Hard Disk Is Giving You Trouble

When it comes to server performance, disk problems are among the hardest to troubleshoot. CPU load and memory usage can both be monitored quite easily, but disk overload tends to show up as short load peaks that are hard to spot and that, over time, greatly affect overall server performance. Before digging deeper into the performance of your storage system, it's a good idea to check the basics first: does the server have enough free storage space and free inodes? You can check both with the commands ‘df -h’ and ‘df -ih’, as shown below:

    root@serversuit:~# df -h
    Filesystem                 Size  Used Avail Use% Mounted on
    rootfs                      30G  1.3G   27G   5% /
    udev                        10M     0   10M   0% /dev
    tmpfs                      101M  144K  101M   1% /run
    /dev/disk/by-label/DOROOT   30G  1.3G   27G   5% /
    tmpfs                      5.0M     0  5.0M   0% /run/lock
    tmpfs                      201M     0  201M   0% /run/shm
    root@serversuit:~# df -ih
    Filesystem                Inodes IUsed IFree IUse% Mounted on
    rootfs                      1.9M   36K  1.9M    2% /
    udev                        124K   275  124K    1% /dev
    tmpfs                       126K   195  126K    1% /run
    /dev/disk/by-label/DOROOT   1.9M   36K  1.9M    2% /
    tmpfs                       126K     1  126K    1% /run/lock
    tmpfs                       126K     2  126K    1% /run/shm
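If you would rather script this check than read it by eye, the sketch below shows one possible approach. It is an illustration, not part of the original procedure: it uses Python's standard `os.statvfs` call to read block and inode counts for a mount point, and the function names and the 90% threshold are assumptions chosen for the example.

```python
import os


def percent_used(used, available):
    """Return used / (used + available) as a percentage; 0.0 if both are zero."""
    total = used + available
    if total == 0:
        return 0.0
    return 100.0 * used / total


def check_filesystem(path="/", threshold=90.0):
    """Report space and inode usage for the filesystem that holds `path`.

    `threshold` is an arbitrary example value, not a recommendation.
    """
    st = os.statvfs(path)
    # Blocks: "used" is total minus free; f_bavail is what unprivileged
    # users can still allocate.
    space_pct = percent_used(st.f_blocks - st.f_bfree, st.f_bavail)
    # Inodes: some filesystems report zero total inodes, which
    # percent_used() turns into 0.0 instead of dividing by zero.
    inode_pct = percent_used(st.f_files - st.f_ffree, st.f_favail)
    return {
        "path": path,
        "space_pct": round(space_pct, 1),
        "inode_pct": round(inode_pct, 1),
        "alert": space_pct >= threshold or inode_pct >= threshold,
    }


if __name__ == "__main__":
    print(check_filesystem("/"))
```

Running a check like this from cron for each mount point you care about would warn you before a full disk or exhausted inodes starts to look like a performance problem.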

ServerSuit also has a widget you can add to track disk space usage.

So, when is a good time to check your server's disk performance? If you get sudden lag spikes or high load average numbers on your server, or you see a high I/O wait (‘wa’) value in your ‘top’ output, you should probably look at more detailed disk performance information. We can start with the ‘iostat’ command at the bash console (you’ll need the ‘sysstat’ package installed). Let’s look at the output:

    root@serversuit:~# iostat -xcd
    Linux 3.2.0-4-amd64 (serversuit)  04/28/2016  _x86_64_  (1 CPU)
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.47    0.00    0.22    0.46    0.02   98.83
    Device: rrqm/s wrqm/s  r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await svctm %util
    vda      20.70  47.44 0.91 12.01  26.18 237.91    40.89     2.35 181.49  7.76 10.03
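To make a device line like this easier to work with, here is a minimal Python sketch that splits it into named fields and derives a few totals. The sample strings are copied from the report above; the function names are hypothetical, and the parsing assumes the column layout of this particular sysstat version (newer versions print different columns).

```python
# Header and device row copied from the iostat report above.
SAMPLE_HEADER = ("Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s "
                 "avgrq-sz avgqu-sz await svctm %util")
SAMPLE_ROW = "vda 20.70 47.44 0.91 12.01 26.18 237.91 40.89 2.35 181.49 7.76 10.03"


def parse_device_row(header, row):
    """Map each column name from the header to the float value in the row."""
    names = header.split()[1:]          # drop the leading "Device:" label
    fields = row.split()
    stats = dict(zip(names, map(float, fields[1:])))
    stats["device"] = fields[0]
    return stats


def summarize(stats):
    """Derive requests per second, throughput and average request size."""
    iops = stats["r/s"] + stats["w/s"]
    kb_per_s = stats["rkB/s"] + stats["wkB/s"]
    return {
        "iops": round(iops, 2),
        "throughput_kB_s": round(kb_per_s, 2),
        # Average size of one request issued to the device, in kB.
        "avg_request_kB": round(kb_per_s / iops, 2) if iops else 0.0,
    }


if __name__ == "__main__":
    stats = parse_device_row(SAMPLE_HEADER, SAMPLE_ROW)
    print(stats["device"], summarize(stats))
```

For the sample row this works out to roughly 13 requests per second at about 20 kB each, which matches the avgrq-sz column (40.89 sectors of 512 bytes).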

This output shows statistics averaged over the time since the server booted. That is no silver bullet, but it gives you a basic picture of what has been going on. To look at current activity, run ‘iostat -xcd -t 10’: the trailing ‘10’ is the reporting interval in seconds, and ‘-t’ adds a timestamp to each report. The first report still covers the time since boot; every report after that is a 10-second average.

You should pay attention to at least these metrics:

rrqm/s - read requests merged per second, i.e. reads queued by your apps that the kernel combined with adjacent requests before sending them to the device;
wrqm/s - write requests merged per second;
r/s - read requests actually issued to the storage device per second (after merging);
w/s - write requests actually issued to the storage device per second (after merging); r/s plus w/s gives you the IOPS reaching the device;
await - average time, in milliseconds, that a request takes to complete, including time spent waiting in the queue.

Looking at the numbers, we can draw some conclusions. Read requests were merged very effectively: about 20.7 read requests per second were merged, while fewer than 1 read request per second actually went to the device. Write requests were merged less effectively, probably because they were more random: 47 writes per second were merged, and 12 writes per second were still issued to the device. Average throughput was low (about 26 kB/s read and 238 kB/s written), so the load was probably small random IO, possibly on slow underlying storage. An average latency of 181 ms is not good, so it looks like we have a problem here.

You’ll need to leave ‘iostat’ running with an interval for some time to collect real-time data about your storage load, so that you can see IO peaks too, not just averages. Once you have enough data, you can examine your current storage activity and perhaps build graphs that show the load peaks. The hard part is that this tool can’t tell you which services are using the disk. That’s where the ‘iotop’ utility comes in: it shows processes and their IO activity in real time:

    Total DISK READ: 0.00 B/s | Total DISK WRITE: 23.42 K/s
      TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN     IO>  COMMAND
      137  be/3  root   0.00 B/s   15.62 K/s  0.00 %  0.05 %  [jbd2/vda1-8]
    30040  be/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  ssh
        1  be/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  init [2]
        2  be/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  [kthreadd]
        3  be/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  [ksoftirqd/0]
     2052  be/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  sh /usr/bin/mysqld_safe
        5  be/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  [kworker/u:0]
        6  rt/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  [migration/0]
        7  rt/4  root   0.00 B/s    0.00 B/s  0.00 %  0.00 %  [watchdog/0]
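If ‘iotop’ is not installed, a rough approximation of the same idea can be sketched in Python by reading the per-process counters in /proc/<pid>/io twice and comparing them. This is not how ‘iotop’ itself is implemented; it is a simplified illustration with hypothetical function names. It assumes a Linux kernel with per-task IO accounting enabled, and it needs root to see other users' processes.

```python
import os
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def snapshot():
    """Return {pid: (name, read_bytes, write_bytes)} for readable processes."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/io") as f:
                io = parse_proc_io(f.read())
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited, or we lack permission to read it
        result[int(entry)] = (name, io.get("read_bytes", 0), io.get("write_bytes", 0))
    return result


def top_writers(before, after, limit=5):
    """Rank processes present in both snapshots by bytes written in between."""
    deltas = []
    for pid, (name, rd, wr) in after.items():
        if pid in before:
            _, rd0, wr0 = before[pid]
            deltas.append((wr - wr0, rd - rd0, pid, name))
    deltas.sort(reverse=True)
    return deltas[:limit]


if __name__ == "__main__":
    first = snapshot()
    time.sleep(5)  # sampling window in seconds; an arbitrary example value
    for written, read, pid, name in top_writers(first, snapshot()):
        print(f"{pid:>7} {name:<20} read {read} B, written {written} B")
```

Unlike ‘iotop’, this only counts bytes that reached the storage layer during the sampling window and ignores kernel threads it cannot read, so treat it as a quick cross-check rather than a replacement.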

Disk performance troubleshooting is relatively difficult for any system administrator, but hopefully this guide has given you some useful options. As mentioned before, ServerSuit has basic disk monitoring available to ease your administrative burden. We're also currently working on a feature set for fully troubleshooting disk performance, so look for that in an upcoming release!

Until next time!