Linux Delay Accounting

Ever wondered how long your program spends waiting for I/O to finish? Or whether it spends a lot of time waiting for a turn to run on one of the CPUs? Linux provides delay accounting information that may help answer these and other questions. Delay information is available for several types of resources:

- waiting for a CPU (while being runnable)
- completion of synchronous block I/O initiated by the task
- swapping in pages
- memory reclaim

This information is available in nanoseconds, on a per pid/tid basis, and is useful for finding out whether your system resources are saturated by the number of concurrent tasks running on the machine. If they are, you can either reduce the amount of work being done on the machine by removing unnecessary processes, or adjust the priority (CPU priority, I/O priority and RSS limit) of important tasks.

Accessing delay accounting information

This information is available to userspace programs through the Netlink interface, which user-space programs on Linux use to communicate with the kernel. Netlink serves many purposes: managing network interfaces, setting IP addresses and routes, and so on.

The Linux source tree ships with an example, getdelays, that shows how to build tools that consume this information [2]. Running ./getdelays -d -p <PID> displays the delays experienced by a process while using different kinds of resources.

Side note: since this commit, Linux requires a process to run as root to be able to fetch delay accounting information. I plan to check whether this could be changed so that a user can check delay information for any process they own.

getdelays states that “It is recommended that commercial grade applications use libnl or libnetlink and use the interfaces provided by the library”, so I decided to rewrite part of getdelays using a higher-level library instead of handling the parsing and other details of the netlink protocol by hand.

Re-implementing getdelays using libnl

I found libnl to be a quite flexible library and was able to write this example in a couple of hours (and I didn’t have any prior experience with netlink). Their documentation on the Netlink protocol had everything I needed to understand the protocol.

The source code for my implementation is available on my GitHub and uses libnl to “talk” netlink. In the following sections I'll highlight the most important parts of the implementation.

1. Setup

    sk = nl_socket_alloc();
    if (sk == NULL) {
        fprintf(stderr, "Error allocating netlink socket\n");
        exit_code = 1;
        goto teardown;
    }

    if ((err = nl_connect(sk, NETLINK_GENERIC)) < 0) {
        fprintf(stderr, "Error connecting: %s\n", nl_geterror(err));
        exit_code = 1;
        goto teardown;
    }

    /* genl_ctrl_resolve returns the family id, or a negative error code */
    if ((family = genl_ctrl_resolve(sk, TASKSTATS_GENL_NAME)) < 0) {
        fprintf(stderr, "Error retrieving family id: %s\n", nl_geterror(family));
        exit_code = 1;
        goto teardown;
    }

The setup is pretty straightforward:

- we start by calling nl_socket_alloc() to allocate a netlink socket, required for communicating with the netlink interface
- the call to nl_connect connects our socket to the NETLINK_GENERIC protocol (depending on our needs, we can use other protocols, like NETLINK_ROUTE for routing operations)
- genl_ctrl_resolve is used to obtain the family id of taskstats. This is the “postal code” of the delay information holder
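To give a feeling for what the library saves us from, here is a minimal sketch of opening a generic netlink socket by hand with plain system calls. It is not how libnl is implemented, only an approximation of the same steps; the helper name open_genl_socket is made up for this illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Hypothetical helper: open and bind a generic netlink socket by hand.
 * Returns the file descriptor, or -1 with errno set on failure. */
int open_genl_socket(void)
{
    struct sockaddr_nl local;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof(local));
    local.nl_family = AF_NETLINK;
    local.nl_pid = 0; /* 0 asks the kernel to pick a unique port id for us */

    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Resolving the taskstats family id would still require sending a request to the generic netlink controller and parsing its reply, which is exactly the kind of work genl_ctrl_resolve takes care of.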

After the setup we are ready to prepare our netlink message.

2. Preparing our message

    if ((err = nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, callback_message, NULL)) < 0) {
        fprintf(stderr, "Error setting socket cb: %s\n", nl_geterror(err));
        exit_code = 1;
        goto teardown;
    }

    /* nlmsg_alloc returns NULL on failure and does not set err */
    if (!(msg = nlmsg_alloc())) {
        fprintf(stderr, "Failed to alloc message\n");
        exit_code = 1;
        goto teardown;
    }

    if (!(hdr = genlmsg_put(msg, NL_AUTO_PID, NL_AUTO_SEQ, family, 0,
                            NLM_F_REQUEST, TASKSTATS_CMD_GET, TASKSTATS_VERSION))) {
        fprintf(stderr, "Error setting message header\n");
        exit_code = 1;
        goto teardownMsg;
    }

    if ((err = nla_put_u32(msg, TASKSTATS_CMD_ATTR_PID, pid)) < 0) {
        fprintf(stderr, "Error setting attribute: %s\n", nl_geterror(err));
        exit_code = 1;
        goto teardownMsg;
    }

Libnl offers a bunch of callback hooks that can be used to handle different kinds of events.

- using nl_socket_modify_cb we register a custom callback (NL_CB_CUSTOM), callback_message, that will be called for all valid messages received from the kernel (NL_CB_VALID)
- nlmsg_alloc allocates a struct to hold the message that will be sent
- genlmsg_put sets the message header: NL_AUTO_PID and NL_AUTO_SEQ tell libnl to fill in the netlink port id (pid) and the message sequence number, required by the protocol; family is the taskstats family id; NLM_F_REQUEST indicates that this message is a request; TASKSTATS_CMD_GET is the command that we are sending to the taskstats interface, meaning that we want to get some information; and TASKSTATS_VERSION is used by the kernel to handle different versions of this interface
- nla_put_u32 sets the attribute TASKSTATS_CMD_ATTR_PID, which indicates that we are asking for the taskstats information of a particular pid, provided as the attribute value
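To make the message structure more concrete, the sketch below lays out the same request by hand: a netlink header, a generic netlink header and a single u32 attribute. It only builds the bytes in a buffer and relies on the standard alignment macros from the kernel headers; the helper name build_taskstats_request and the family id used in the usage note are made up for illustration, and libnl remains the easier way to do this.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

/* Hypothetical helper: lay out a TASKSTATS_CMD_GET request by hand.
 * Returns the total message length, or 0 if buf is too small. */
size_t build_taskstats_request(void *buf, size_t buflen, uint16_t family,
                               uint32_t seq, uint32_t port, uint32_t pid)
{
    size_t attr_len = NLA_HDRLEN + sizeof(pid);
    size_t total = NLMSG_LENGTH(GENL_HDRLEN + NLA_ALIGN(attr_len));
    struct nlmsghdr nlh;
    struct genlmsghdr genl;
    struct nlattr attr;
    unsigned char *p = buf;

    if (buflen < total)
        return 0;
    memset(buf, 0, total);

    /* Netlink header: what NL_AUTO_PID, NL_AUTO_SEQ, family and
     * NLM_F_REQUEST end up filling in. */
    memset(&nlh, 0, sizeof(nlh));
    nlh.nlmsg_len = (uint32_t)total;
    nlh.nlmsg_type = family;
    nlh.nlmsg_flags = NLM_F_REQUEST;
    nlh.nlmsg_seq = seq;
    nlh.nlmsg_pid = port;
    memcpy(p, &nlh, sizeof(nlh));
    p += NLMSG_HDRLEN;

    /* Generic netlink header: command and interface version. */
    memset(&genl, 0, sizeof(genl));
    genl.cmd = TASKSTATS_CMD_GET;
    genl.version = TASKSTATS_VERSION;
    memcpy(p, &genl, sizeof(genl));
    p += GENL_HDRLEN;

    /* One attribute: TASKSTATS_CMD_ATTR_PID carrying the pid as a u32. */
    attr.nla_len = (uint16_t)attr_len;
    attr.nla_type = TASKSTATS_CMD_ATTR_PID;
    memcpy(p, &attr, sizeof(attr));
    memcpy(p + NLA_HDRLEN, &pid, sizeof(pid));

    return total;
}
```

For example, build_taskstats_request(buf, sizeof(buf), 23, 1, 0, 1234) with a large enough buffer produces a 28-byte message: 16 bytes of netlink header, 4 bytes of generic netlink header and an 8-byte attribute. The family id 23 is arbitrary here; the real value has to come from genl_ctrl_resolve.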

3. Sending the message

    if ((err = nl_send_sync(sk, msg)) < 0) {
        fprintf(stderr, "Error sending message: %s\n", nl_geterror(err));
        exit_code = 1;
        goto teardownMsg;
    }

    if ((err = nl_recvmsgs_default(sk)) < 0) {
        fprintf(stderr, "Error receiving message: %s\n", nl_geterror(err));
        exit_code = 1;
        goto teardownMsg;
    }

- nl_send_sync sends a message using the socket and waits for an ack or an error message
- nl_recvmsgs_default waits for a message; this will block until the message is parsed by our callback

4. Receiving the response

Handling of the response is done by the callback_message function:

    int callback_message(struct nl_msg *nlmsg, void *arg) {
        struct nlmsghdr *nlhdr;
        struct nlattr *nlattrs[TASKSTATS_TYPE_MAX + 1];
        struct nlattr *nlattr;
        struct taskstats *stats;
        int rem, answer;

        nlhdr = nlmsg_hdr(nlmsg);

        if ((answer = genlmsg_parse(nlhdr, 0, nlattrs, TASKSTATS_TYPE_MAX, NULL)) < 0) {
            fprintf(stderr, "error parsing msg\n");
            return -1;
        }

        if ((nlattr = nlattrs[TASKSTATS_TYPE_AGGR_PID]) || (nlattr = nlattrs[TASKSTATS_TYPE_NULL])) {
            /* nla_next decrements rem, so start it at the nested payload length */
            rem = nla_len(nlattr);
            stats = nla_data(nla_next(nla_data(nlattr), &rem));
            print_delayacct(stats);
        } else {
            fprintf(stderr, "unknown attribute format received\n");
            return -1;
        }

        return 0;
    }

- nlmsg_hdr returns the actual message header from nlmsg
- genlmsg_parse parses a generic netlink message and stores the attributes in nlattrs
- we retrieve the attribute we are interested in: TASKSTATS_TYPE_AGGR_PID
- nla_data returns a pointer to the payload of the attribute; we need to use nla_next because the taskstats data is actually returned in the second nested attribute (the first one is used just to indicate that a pid/tid will be followed by some stats)
- print_delayacct is used to finally print the data; this function is the same one used by the Linux example
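The nested layout described above (a pid attribute followed by a stats attribute) can also be walked without libnl. The sketch below is one way to do it under that assumption: instead of relying on the stats being the second attribute, it searches the nested payload by attribute type and copies the struct out. The helper names find_attr_payload and copy_taskstats are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/taskstats.h>

/* Hypothetical helper: scan a run of netlink attributes for the first one
 * of the given type. Returns a pointer to its payload and stores the
 * payload length in *payload_len, or returns NULL if absent or malformed. */
const void *find_attr_payload(const void *attrs, size_t len, uint16_t type,
                              size_t *payload_len)
{
    const unsigned char *p = attrs;

    while (len >= (size_t)NLA_HDRLEN) {
        struct nlattr hdr;
        size_t step;

        memcpy(&hdr, p, sizeof(hdr)); /* copy the header to avoid alignment issues */
        if (hdr.nla_len < NLA_HDRLEN || hdr.nla_len > len)
            return NULL; /* truncated or corrupt attribute */

        if ((hdr.nla_type & NLA_TYPE_MASK) == type) {
            *payload_len = (size_t)hdr.nla_len - NLA_HDRLEN;
            return p + NLA_HDRLEN;
        }

        step = NLA_ALIGN(hdr.nla_len); /* attributes are padded to 4 bytes */
        if (step > len)
            break;
        p += step;
        len -= step;
    }
    return NULL;
}

/* Hypothetical helper: copy the struct taskstats found inside the payload of
 * a TASKSTATS_TYPE_AGGR_PID attribute. Returns 0 on success, -1 otherwise. */
int copy_taskstats(const void *aggr_payload, size_t len, struct taskstats *out)
{
    size_t n = 0;
    const void *data = find_attr_payload(aggr_payload, len,
                                         TASKSTATS_TYPE_STATS, &n);

    /* Conservative check: require a payload at least as large as the struct
     * this program was compiled against. */
    if (data == NULL || n < sizeof(*out))
        return -1;
    memcpy(out, data, sizeof(*out));
    return 0;
}
```

Searching by type costs a few more lines than calling nla_next once, but it does not depend on the order in which the kernel emits the nested attributes.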

Delay examples

Let’s try to visualize some of the delay types by crafting some examples and running getdelays.

CPU scheduling delay

In this example I’m going to use the stress utility to generate some workload on a VM that has 2 cores. Using the -c <N> flag, stress creates <N> workers (forks) running sqrt() to generate some CPU load. Since this VM has two cores, I will spin up two instances of stress with 2 workers each. By using the nice command, I’ll configure the niceness of the first instance to be 19, meaning that it will have a lower scheduling priority:

    $ sudo nice -n 19 stress -c 2 & sudo stress -c 2
    stress: info: [15718] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
    stress: info: [15719] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd

We can check with ps that we now have 6 processes running stress, the two parents and two forks for each of them:

    root 15718  0.0  0.0  7480  864 pts/2  SN  14:24  0:00 stress -c 2
    root 15719  0.0  0.0  7480  940 pts/2  S+  14:24  0:00 stress -c 2
    root 15720  1.4  0.0  7480   92 pts/2  RN  14:24  0:01 stress -c 2
    root 15721  1.4  0.0  7480   92 pts/2  RN  14:24  0:01 stress -c 2
    root 15722 96.3  0.0  7480   92 pts/2  R+  14:24  2:00 stress -c 2
    root 15723 99.0  0.0  7480   92 pts/2  R+  14:24  2:03 stress -c 2

With getdelays we can check their CPU delays (output truncated):

    $ ./getdelays -d -p 15722
    PID 15722
    CPU    count     real total   virtual total     delay total   delay average
            3386   130464000000    132726743949      4190941076         1.238ms

    $ ./getdelays -d -p 15723
    PID 15723
    CPU    count     real total   virtual total     delay total   delay average
            3298   136240000000    138605044896       550886724         0.167ms

    $ ./getdelays -d -p 15720
    PID 15720
    CPU    count     real total   virtual total     delay total   delay average
             533     2060000000      2084325118    142398167037       267.164ms

    $ ./getdelays -d -p 15721
    PID 15721
    CPU    count     real total   virtual total     delay total   delay average
             564     2160000000      2178262982    148843119281       263.906ms

Clearly, the processes with the higher niceness value experience much higher delays (their average delay is at least around 200x higher). If we ran both instances of stress with the same niceness, we would expect similar average delays across them.
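The delay average column is simply the delay total (in nanoseconds) divided by the count, converted to milliseconds. A small sketch of that arithmetic, assuming a zero count should yield zero rather than a division error (average_delay_ms is a name chosen for this example):

```c
#include <stdint.h>

/* Average delay per event in milliseconds; totals are reported in nanoseconds. */
double average_delay_ms(uint64_t delay_total_ns, uint64_t count)
{
    if (count == 0)
        return 0.0; /* no delay events recorded, avoid dividing by zero */
    return (double)delay_total_ns / 1e6 / (double)count;
}
```

With the numbers above, average_delay_ms(4190941076, 3386) gives roughly 1.238 for PID 15722 and average_delay_ms(142398167037, 533) gives roughly 267.164 for PID 15720, matching the getdelays output.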

Block I/O delay

Let’s try to experience some I/O delays while running a task. We can leverage docker to limit the I/O bps for our process using the --device-write-bps flag on docker run. First, let’s run dd without any limits:

docker run --name dd --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct

The following output shows the result obtained by running getdelays on the dd process:

    /home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2904
    print delayacct stats ON
    PID 2904
    CPU        count     real total   virtual total   delay total   delay average
                6255     1068000000      1879315354      22782428         0.004ms
    IO         count    delay total   delay average
                5988    13072387639             2ms
    SWAP       count    delay total   delay average
                   0              0             0ms
    RECLAIM    count    delay total   delay average
                   0              0             0ms

We can see that we are getting an average I/O delay of 2ms.

Now, let’s use --device-write-bps to limit writes to 1 MB per second:

docker run --name dd --device-write-bps /dev/sda:1mb --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct

The following output shows the result of running getdelays on the process:

    /home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2705
    print delayacct stats ON
    listen forever
    PID 2705
    CPU        count     real total   virtual total   delay total   delay average
                  71       28000000        32436630        600096         0.008ms
    IO         count    delay total   delay average
                  15    40163017300          2677ms
    SWAP       count    delay total   delay average
                   0              0             0ms
    RECLAIM    count    delay total   delay average
                   0              0             0ms

Since I/O is limited, dd takes much longer to write its output, and we can see that our average I/O delay is more than 1000 times higher than before.

Side note: the --device-write-<bps,iops> docker flags use Linux cgroups v1, which are only able to limit the amount of I/O if we open the files with the O_DIRECT, O_SYNC or O_DSYNC flags, but this deserves a blog post of its own.

Memory reclaim delay

In this example we can use, once more, the stress utility by using the --vm <N> flag to launch N workers running malloc/free to generate some memory allocation workload. Once again, this VM has 2 cores.

Using the default --vm-bytes , which is 256M, I was able to experience some delay on memory reclaim by running more than 2 workers. But the delay average was kept fairly small, below 1ms:

    PID 15888
    CPU        count     real total   virtual total   delay total   delay average
                2799    38948000000     39507647880   19772492888         7.064ms
    RECLAIM    count    delay total   delay average
                  11         278304             0ms

    PID 15889
    CPU        count     real total   virtual total   delay total   delay average
                3009    38412000000     38904584951   20402080112         6.780ms
    RECLAIM    count    delay total   delay average
                  22       16641801             0ms

    PID 15890
    CPU        count     real total   virtual total   delay total   delay average
                2954    39172000000     39772710066   19571509440         6.625ms
    RECLAIM    count    delay total   delay average
                  39        9505559             0ms

Since the 3 tasks are competing on a 2 core CPU, the CPU delays were much higher. Running with --vm-bytes with lower values produced even lower memory reclaim delays (in some cases, no delay is experienced).

Not many tools expose Linux delay accounting to the end user, but this information is available in cpustat. I’m currently working on a PR to add it to htop.