When developing a piece of software or monitoring a running system, both telemetry and context are important. Once you understand what normal behaviour looks like in a historical context, the two most pressing questions are often (1) what's changed? and (2) what's acting abnormally?

In this post I'm going to look at three popular tools often used for ad-hoc monitoring, as well as a simple solution for monitoring distributed systems.

top

In virtually any modern UNIX-like system you can type top and see a variety of system performance metrics updating every few seconds.

    $ top -b -n2 -d5

    top - 09:43:05 up  1:08,  0 users,  load average: 0.52, 0.58, 0.59
    Tasks:   4 total,   1 running,   3 sleeping,   0 stopped,   0 zombie
    %Cpu0  :  4.1 us, 22.2 sy,  0.0 ni, 72.3 id,  0.0 wa,  1.4 hi,  0.0 si,  0.0 st
    %Cpu1  :  4.3 us,  7.1 sy,  0.0 ni, 87.7 id,  0.0 wa,  0.9 hi,  0.0 si,  0.0 st
    %Cpu2  :  4.4 us,  9.0 sy,  0.0 ni, 85.3 id,  0.0 wa,  1.2 hi,  0.0 si,  0.0 st
    %Cpu3  :  3.6 us,  6.7 sy,  0.0 ni, 88.6 id,  0.0 wa,  1.0 hi,  0.0 si,  0.0 st
    KiB Mem:  33431016 total,  9521052 used, 23909964 free,    34032 buffers
    KiB Swap: 62455548 total,    27064 used, 62428484 free.   188576 cached Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        1 root      20   0    8304    132    104 S   0.0  0.0   0:00.14 /init ro
        3 root      20   0    8308     96     56 S   0.0  0.0   0:00.00 /init ro
        4 mark      20   0   17856   5308   5192 S   0.0  0.0   0:00.35 -bash
      228 mark      20   0   14452   1668   1172 R   0.0  0.0   0:00.01 top -b -n2 -d5

The binary running is most likely a version of top written by James Warner of Comcast. This version of top is entirely new and was built as a replacement for a previous version written by developers from a variety of organisations including Lockheed Martin and Heidelberg University. The top.c source code itself is reasonably simple and, as of this writing, was around 4,900 lines of C code.

Top is still in active development to this day and its source code can be seen with the rest of the procps repository on GitLab. Other utilities found in this repo include kill, ps, sysctl, uptime and watch.

The default layout feels timeless to me but over the decades I've been working with UNIX systems I've developed a muscle memory for typing zc1M every time I bring up top on a new machine. Top uses a monochrome display mode by default so z will toggle into a colour-mapping mode. The number 1 will display separate CPU states and does a good job at highlighting single CPU core-bound loads.
I like to view processes sorted by their pressure on memory capacity by typing M. In total there are 49 metrics top can view and sort on. Commands are truncated by default and typing c will give more extended information on their paths and arguments. My only complaint with this is that it's the end of the commands and arguments that is truncated; it would be more useful to keep the beginning and end of each command and argument string in order to differentiate between processes.

The changes to top's configuration will only last as long as the session. To keep them, type uppercase W and it'll save the current configuration to ~/.toprc by default. My only annoyance with this file is that it contains byte values above 0x7F and isn't easy to edit outside of top.

    $ hexdump -C ~/.toprc | head

    00000000  74 6f 70 27 73 20 43 6f  6e 66 69 67 20 46 69 6c  |top's Config Fil|
    00000010  65 20 28 4c 69 6e 75 78  20 70 72 6f 63 65 73 73  |e (Linux process|
    00000020  65 73 20 77 69 74 68 20  77 69 6e 64 6f 77 73 29  |es with windows)|
    00000030  0a 49 64 3a 69 2c 20 4d  6f 64 65 5f 61 6c 74 73  |.Id:i, Mode_alts|
    00000040  63 72 3d 30 2c 20 4d 6f  64 65 5f 69 72 69 78 70  |cr=0, Mode_irixp|
    00000050  73 3d 31 2c 20 44 65 6c  61 79 5f 74 69 6d 65 3d  |s=1, Delay_time=|
    00000060  33 2e 30 2c 20 43 75 72  77 69 6e 3d 30 0a 44 65  |3.0, Curwin=0.De|
    00000070  66 09 66 69 65 6c 64 73  63 75 72 3d a5 a8 b3 b4  |f.fieldscur=....|
    00000080  bb bd c0 c4 b7 ba b9 c5  26 27 29 2a 2b 2c 2d 2e  |........&')*+,-.|
    00000090  2f 30 31 32 35 36 38 3c  3e 3f 41 42 43 46 47 48  |/012568<>?ABCFGH|
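To see how much of the file falls outside the 7-bit ASCII range, a few lines of Python can list the offending offsets. This is only a sketch and not part of top; the helper name `high_byte_offsets` is hypothetical, and the sample bytes are the sixteen bytes at offset 0x70 of the hexdump above.

```python
from pathlib import Path
from typing import List


def high_byte_offsets(data: bytes) -> List[int]:
    """Return the offset of every byte whose value is above 0x7F."""
    return [offset for offset, value in enumerate(data) if value > 0x7F]


# Sixteen bytes copied from offset 0x70 of the hexdump above.
sample = bytes.fromhex("66 09 66 69 65 6c 64 73 63 75 72 3d a5 a8 b3 b4")
print(high_byte_offsets(sample))  # [12, 13, 14, 15]

# Scanning a real file; this assumes ~/.toprc exists on your machine.
toprc = Path.home() / ".toprc"
if toprc.exists():
    offsets = high_byte_offsets(toprc.read_bytes())
    print(f"{len(offsets)} bytes above 0x7F in {toprc}")
```

In the sample, the high bytes start immediately after `fieldscur=`, which suggests it is the field list rather than the whole file that is awkward to edit by hand.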

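The truncation idea mentioned above, keeping both the beginning and the end of a command string, can be sketched as follows. This illustrates the suggestion only; it is not how top behaves, and the function name `truncate_middle` and the two sample command lines are hypothetical.

```python
def truncate_middle(text: str, width: int, marker: str = "...") -> str:
    """Shorten text to at most `width` characters, keeping both ends."""
    if width < 0:
        raise ValueError("width must be non-negative")
    if len(text) <= width:
        return text
    if width <= len(marker):
        # No room for the marker plus any context, so fall back to a hard cut.
        return text[:width]
    keep = width - len(marker)
    head = (keep + 1) // 2  # the beginning gets the extra character
    tail = keep // 2
    return text[:head] + marker + text[len(text) - tail:]


# Hypothetical command lines that differ only near the end.
first = "/usr/bin/java -Xmx1G -jar /opt/app/worker-a.jar"
second = "/usr/bin/java -Xmx1G -jar /opt/app/worker-b.jar"

print(truncate_middle(first, 30))
print(truncate_middle(second, 30))
```

Cutting at the end would render both commands identically at 30 characters, whereas the middle cut keeps the binary name and the distinguishing suffix visible.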
Htop

In 2004, Hisham Muhammad began work on creating a distinctly different systems telemetry monitor. Htop puts a focus on how telemetry is organised on screen. There are bar charts for key CPU and memory metrics, processes can toggle between a flat list and a hierarchy via the F5 shortcut, fields can be sorted via mouse clicks and seven different colour modes are supported.

The software does a good job of keeping you within the application. If you want to inspect the files a process is using you can select the process and simply type l; if you want to run the process through strace, simply type s while running htop as a privileged user.

The following will install and run htop on Ubuntu 16.04.2 LTS.

    $ sudo apt install htop
    $ htop

      1  [                                        0.0%]   Tasks: 37, 145 thr; 1 running
      2  [                                        0.0%]   Load average: 0.03 0.05 0.07
      3  [                                        0.0%]   Uptime: 01:31:42
      4  [                                        0.0%]
      Mem[|||||||||||||||||||||||||||||||| 1.03G/3.84G]
      Swp[                                    0K/4.00G]

      PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
        1 root       20   0 37556  5668  4004 S  0.0  0.1  0:03.03 /sbin/init noprompt
    27884 clickhous  20   0 3716M  359M 49184 S  0.7  9.1  0:24.93 ├─ /usr/bin/clickhouse-server --config=/etc/cli
    29668 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:00.10 │  ├─ /usr/bin/clickhouse-server --config=/etc/
    29667 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:01.02 │  ├─ /usr/bin/clickhouse-server --config=/etc/
    29666 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:00.08 │  ├─ /usr/bin/clickhouse-server --config=/etc/
    29665 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:00.48 │  ├─ /usr/bin/clickhouse-server --config=/etc/
    29409 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:03.48 │  ├─ /usr/bin/clickhouse-server --config=/etc/
    29408 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:02.15 │  ├─ /usr/bin/clickhouse-server --config=/etc/

In terms of configuration, any changes made while using the software will be saved automatically to ~/.config/htop/htoprc by default.
This file is text-based but comes with the following warning:

    $ head -n2 ~/.config/htop/htoprc

    # Beware! This file is rewritten by htop when settings are changed in the interface.
    # The parser is also very primitive, and not human-friendly.

The source code is still quite small given the functionality on offer. As of this writing there's a total of ~12,000 lines of C code with other files making up a further ~3,000 lines of code.
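Given that warning, reading the file is safer than writing to it. The sketch below treats htoprc as plain key=value lines with `#` comments, which is an assumption about its layout rather than something the post confirms. The function name `parse_htoprc` and the two settings in the sample (`tree_view`, `color_scheme`) are illustrative and may not match the keys your htop version writes.

```python
from typing import Dict


def parse_htoprc(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:  # ignore anything that is not key=value
            settings[key.strip()] = value.strip()
    return settings


# The two comment lines come from the post; the settings are assumed examples.
SAMPLE = """\
# Beware! This file is rewritten by htop when settings are changed in the interface.
# The parser is also very primitive, and not human-friendly.
tree_view=1
color_scheme=0
"""

print(parse_htoprc(SAMPLE))
```

Because htop rewrites the file whenever settings change in the interface, a read-only parse like this is better suited to auditing a configuration than to managing it.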

Glances

Glances is a Python-based systems telemetry monitor. The project was started by Nicolas Hennion in 2011. Nicolas' LinkedIn profile states he works in the South of France as a Project Manager in the Satellite Control Centre Department for Thales Alenia Space.

When you launch Glances, in addition to the regular CPU, memory and process lists, you'll see the cloud virtual machine type as well as network, disk and Docker container activity, to name just a few items.

    $ glances

    ubuntu (Ubuntu 16.04 64bit / Linux 4.4.0-62-generic)                                  Uptime: 18:55:00

    CPU  [  1.7%]   CPU -     1.7%  nice:     0.0%  ctx_sw:    923   MEM -    53.1%   SWAP -     0.1%   LOAD    4-core
    MEM  [ 53.1%]   user:     0.8%  irq:      0.0%  inter:     587   total:   3.84G   total:    4.00G   1 min:    0.20
    SWAP [  0.1%]   system:   0.7%  iowait:   0.0%  sw_int:    786   used:    2.04G   used:     3.27M   5 min:    0.14
                    idle:    98.4%  steal:    0.0%                   free:    1.80G   free:     3.99G   15 min:   0.10

    NETWORK       Rx/s   Tx/s   TASKS 203 (349 thr), 1 run, 202 slp, 0 oth sorted automatically by CPU consumption
    ens33         152b    3Kb
    lo            59Kb   59Kb    CPU%  MEM%  VIRT   RES    PID USER       TIME+ THR  NI S R/s W/s Command
                                  2.6   4.5  524M  178M  16470 mark       35:48   1   0 S   0   0 /home/mark/.
    DISK I/O       R/s    W/s     2.3   0.6  372M 24.5M  14672 mark        0:01   1   0 R   0   0 /home/mark/.
    fd0              0      0     1.0  23.7 5.42G  931M  21151 root       13:00  71   0 S   ?   ? java -Xmx1G
    loop0            0      0     0.7   9.8 3.71G  385M  27884 clickhous   5:29  46   0 S   ?   ? /usr/bin/cli
    loop1            0      0     0.3   2.8 3.53G  109M  12883 zookeeper   1:36  20   0 S   ?   ? /usr/bin/jav
    loop2            0      0     0.3   0.2 31.4M 6.80M    333 root        0:53   1   0 S   ?   ? /lib/systemd
    loop3            0      0     0.3   0.1 13.8M 2.68M   4353 mark        1:07   1   0 S   0   0 watch ifconf
    loop4            0      0     0.0   0.3  186M 9.86M   1447 root        0:35   2   0 S   ?   ? /usr/bin/vmt
    loop5            0      0     0.0   0.2 75.2M 8.11M   1470 root        0:00   1   0 S   ?   ? /usr/bin/VGA
    loop6            0      0     0.0   0.2 90.6M 6.59M   4381 root        0:00   1   0 S   ?   ? sshd: mark [
    loop7            0      0     0.0   0.1  269M 5.75M    595 root        0:13   3   0 S   ?   ? /usr/lib/acc
    sda              0    78K     0.0   0.1 36.7M 5.37M      1 root        0:37   1   0 S   ?   ? /sbin/init n
    sda1             0    78K     0.0   0.1 64.0M 5.31M   4246 root        0:00   1   0 S   ?   ? /usr/sbin/ss
    sda2             0      0     0.0   0.1 44.3M 5.05M   3402 mark        0:00   1   0 S   0   0 /lib/systemd
    sda5             0      0     0.0   0.1 21.8M 5.04M   4403 mark       27:23   1   0 S   0   0 -bash
    sr0              0      0     0.0   0.1 21.8M 4.93M  21493 mark        0:10   1   0 S   0   0 /bin/bash
    sr1              0      0     0.0   0.1 21.7M 4.62M  16114 mark        0:03   1   0 S   0   0 /bin/bash
                                  0.0   0.1 21.7M 4.47M  21119 mark        0:00   1   0 S   0   0 /bin/bash
    FILE SYS      Used  Total     0.0   0.1 90.6M 4.14M   4402 mark        0:08   1   0 S   ?   ? 0
    / (sda1)     2.48G  15.6G     0.0   0.1  250M 3.97M    588 syslog      0:28   4   0 S   ?   ? /usr/sbin/rs
                                  0.0   0.1 21.8M 3.87M   3407 mark        0:04   1   0 S   0   0 -bash
    SENSORS                       0.0   0.1 51.5M 3.76M  21144 root        0:00   1   0 S   ?   ? sudo nohup /
    Physical id          100C     0.0   0.1 41.9M 3.64M    597 messagebu   0:00   1   0 S   ?   ? /usr/bin/dbu
    Core 0               100C     0.0   0.1 43.2M 3.45M    396 root        0:01   1   0 S   ?   ? /lib/systemd
    Core 1               100C     0.0   0.1 64.3M 3.21M   3377 root        0:00   1   0 S   ?   ? /bin/login -
    Core 2               100C     0.0   0.1 28.0M 2.91M    592 root        0:00   1   0 S   ?   ? /lib/systemd
    Core 3               100C     0.0   0.1 26.7M 2.86M  16113 mark        0:06   1   0 S   ?   ? SCREEN
                                  0.0   0.1 15.7M 2.81M    774 root        0:00   1   0 S   ?   ? /sbin/dhclie

Glances is written with ~10K lines of Python and ~25K lines of JavaScript, and relies on the psutil package for its telemetry collection. There is a huge variety of plugins, including support for monitoring GPUs, Kafka, RAID setups, folders and WiFi, to name a few.

In addition to the ncurses-based interface, Glances can also run as a web application. When you run glances via cmd.exe on Windows 10 it'll launch a Bottle-based web application on TCP port 61209. When you load up http://127.0.0.1:61209/ in a web browser you'll be greeted with an AngularJS-based application that mimics the ncurses interface.

There is an API exposed as well if you want to consume it with other tools.
    $ curl http://127.0.0.1:61209/api/3/all \
        | python -mjson.tool \
        | head -n50

    {
        "alert": [],
        "amps": [],
        "batpercent": [],
        "cloud": {},
        "core": {
            "log": 4,
            "phys": 4
        },
        "cpu": {
            "cpucore": 4,
            "ctx_switches": 182358,
            "idle": 82.9,
            "interrupts": 113134,
            "soft_interrupts": 0,
            "syscalls": 215848,
            "system": 12.5,
            "time_since_update": 8.532670974731445,
            "total": 9.8,
            "user": 3.1
        },
        "diskio": [
            {
                "disk_name": "PhysicalDrive6",
                "key": "disk_name",
                "read_bytes": 0,
                "read_count": 0,
                "time_since_update": 8.492774963378906,
                "write_bytes": 0,
                "write_count": 0
            },
            {
                "disk_name": "PhysicalDrive2",
                "key": "disk_name",
                "read_bytes": 0,
                "read_count": 0,
                "time_since_update": 8.492774963378906,
                "write_bytes": 0,
                "write_count": 0
            },
    ...

The default configuration file is somewhat lengthy but is friendly enough for human editing. Glances also supports exporting telemetry to over 16 different targets including statsd, Kafka, RabbitMQ, JSON, SVG, Elasticsearch, CSV as well as to bespoke RESTful APIs.
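As a rough illustration of consuming that API from another tool, the sketch below pulls the same `/api/3/all` document with Python's standard library and summarises the `cpu` section. The helper names `fetch_all` and `summarise_cpu` are hypothetical, the base URL is the one used in the curl example above, and the offline sample reuses values from the JSON shown earlier so the code runs without a Glances server.

```python
import json
from urllib.request import urlopen


def fetch_all(base_url: str = "http://127.0.0.1:61209", timeout: float = 5.0) -> dict:
    """Fetch the full stats document from a running Glances web server."""
    with urlopen(base_url + "/api/3/all", timeout=timeout) as response:
        return json.load(response)


def summarise_cpu(stats: dict) -> str:
    """Build a one-line CPU summary from the 'cpu' section of the stats."""
    cpu = stats["cpu"]
    return "cores={} user={:.1f}% system={:.1f}% idle={:.1f}%".format(
        cpu["cpucore"], cpu["user"], cpu["system"], cpu["idle"]
    )


# Offline sample built from the JSON output shown above.
SAMPLE = json.loads(
    '{"cpu": {"cpucore": 4, "idle": 82.9, "system": 12.5, "user": 3.1}}'
)
print(summarise_cpu(SAMPLE))  # cores=4 user=3.1% system=12.5% idle=82.9%

# With a Glances web server running locally you could instead call:
# print(summarise_cpu(fetch_all()))
```

Keeping the network call separate from the formatting makes the summary easy to test against a saved response.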