I recently discovered a Linux shell script that was running slowly due to an inefficiently implemented loop. This innocent investigation ended up uncovering misleading information from time and a bad interaction between the Linux thread scheduler and its CPU power management algorithms.

How often does an eight line shell script expose such rich information?

Conclusions first, explanations later

1. Using $(expr) to do math is over a hundred times slower than using bash internals. This is not a new discovery, but it's worth reiterating.
2. Excessive use of sub-processes (see point 1) can confuse the Linux kernel's power management algorithms, leading to an additional three-times slowdown. Interesting.
3. time gives accurate elapsed-time information, but its CPU-time information should not be trusted, especially if you make excessive use of sub-processes (see point 1). Interesting.
4. When time lies you can always use the Windows Task Manager to measure Linux CPU consumption. Amusing.
5. Looping in Windows batch files can be faster than looping in Linux shell scripts.
6. Process creation in Windows is slower than on Linux. Not surprising.

This is part of a series of posts about Linux.

The setup

We recently added a new shell script to our build process and I noticed that it was causing a two-to-three-times slowdown on our builds. Not good. One of my coworkers found that the use of $(expr) to loop over arguments was the culprit. A minimal repro is shown here:

# Warning: this code is excessively slow
ExprCount() {
  for (( i = $1 ; i > 0; i=$(expr $i - 1) )); do
    :
  done
  echo Just did $1 iterations using expr math
}
time ExprCount 1000

Note: this code copies/pastes much better on Chrome than on Internet Explorer. FWIW.

On my 3.2 GHz six-core twelve-thread Core i7-3930K this script took almost two seconds to run and gave results like this:

Just did 1000 iterations using expr math

real 0m1.883s

user 0m0.096s

sys 0m0.260s

This means that my machine can do only about 530 iterations per second using this technique, because each $(expr) statement starts up a child process just to decrement an integer. Linux process startup has a reputation for being fast, but this is a reminder that 'fast' is not the same as 'free'. So far so good.

I tested to see whether expr was particularly slow by having a loop that invoked $(nop) a thousand times, where nop was a C program that returns immediately. The cost to invoke $(nop) was measured to be about 20% lower than the cost to use $(expr) to do some math, so not enough faster to be interesting: the expense is dominated by process creation, not by expr itself.
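The original nop source isn't shown in the post; a minimal reconstruction of the experiment might look like this (nop.c and the compile step are my assumptions, timed the same way as the expr loop):

```shell
# Hypothetical reconstruction of the nop test. Any C program that
# returns immediately will do; the point is to pay only for process
# creation and teardown, with no work in between.
cat > nop.c <<'EOF'
int main(void) { return 0; }
EOF
cc -O2 -o nop nop.c

NopCount() {
  for (( i = $1 ; i > 0 ; i-- )); do
    $(./nop)    # launch one child per iteration; no math involved
  done
  echo Just did $1 iterations launching nop
}
time NopCount 1000
```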

$(expr) is commonly recommended for implementing counters in shell scripts, but this is bad advice.
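For completeness, here are the subprocess-free alternatives; these use only standard shell arithmetic and are my additions, not code from the investigation:

```shell
# Three ways to decrement a counter without spawning a child process.
i=10
i=$((i - 1))     # POSIX arithmetic expansion; works in plain sh too
echo $i          # prints 9
(( i-- ))        # bash arithmetic command
echo $i          # prints 8
let "i = i - 1"  # bash 'let' builtin
echo $i          # prints 7
```

Any of these keeps the loop inside the shell process, which is what makes BashCount() below hundreds of times faster.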

Faster looping

The solution is to use the math operations that are built into bash like this:

# This code performs quite well
BashCount() {
  for (( i = $1 ; i > 0 ; i-- )); do
    :
  done
  echo Just did $1 iterations using bash math
}
time BashCount 150000

On the same test machine this script gave results like this:

Just did 150000 iterations using bash math

real 0m0.716s

user 0m0.684s

sys 0m0.032s

In other words, using the built-in decrement operation ran about 400 times faster than using $(expr)! Not a huge surprise – experts know this already – but still, nice.

Anomaly the first, CPU usage

A less OCD person might have called a 400 times speedup sufficient, but I’d noticed a few peculiarities during the investigation that I wanted to investigate. Noticing and investigating anomalies is, I believe, a critical skill for a developer, as anomalies often reveal mistaken assumptions, bugs, or ‘jankiness’.

The results from time had been confusing throughout the investigation. They consistently showed that only about 15-25% of the elapsed time in the $(expr) loops was CPU time. This led me to avoid CPU profilers during the investigation and instead think about what could be causing stalls. Network delays? Disk I/O? Nothing made sense.

I eventually ran iostat 1 in another window and noticed that it reported CPU usage during my test runs of about 8.5%. On my twelve-hardware-thread (six cores, hyperthreaded) system that is roughly one fully busy hardware thread (100% / 12 ≈ 8.3%). In other words, total CPU time was equal to elapsed time, the script was CPU bound, and time was lying to me.

I then used a stopwatch to check that time’s elapsed time calculations were correct. Finally I used multiple other tools to confirm that my ExprCount() shell script, running on an otherwise idle machine, was indeed CPU bound:

Running the test script under perf stat showed 0.945 CPUs utilized

Running mpstat -P ALL 1 in another terminal showed one core worth of CPU time being consumed

Context switch graphing (see screenshot in the next section) showed that the task was CPU bound

It also made sense that the loop would be CPU bound. After the first invocation of $(expr) everything is in the disk cache so there should be no I/O latency. All of process creation, reading from the disk cache, process execution, and process teardown is done by the CPU, so the overall task should be entirely CPU bound. And it is.

So, be aware that time always underestimates CPU time, and that this underestimation is roughly proportional to how many sub-processes you create per second. Presumably the process creation time is being attributed to neither the parent nor the child process, even though it really should be charged to one of them, as sys time.
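You can see the effect yourself with a quick sketch; /bin/true stands in here for expr, and the exact user+sys-to-real ratio will vary by machine:

```shell
# Spawn many short-lived children and compare the CPU time that
# 'time' reports against the elapsed time. On an idle machine the
# work (fork/exec/exit) is all CPU, yet user+sys comes out far
# below real because the process-creation cost is charged to
# neither parent nor child.
SpawnLoop() {
  for (( i = 0 ; i < 500 ; i++ )); do
    /bin/true    # one child process per iteration
  done
}
time SpawnLoop
```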

This anomaly occurs both with the bash time builtin and with /usr/bin/time. I assume that the problem is ultimately a limitation of what the Linux kernel reports back to time, so the proper fix here is probably to acknowledge the limitation in the man page. I'm discussing this with the man-page maintainers.

Since time is mostly just a wrapper around the wait3() call it makes sense that both time versions give the same results – they are just reporting what the kernel tells them.

To be crystal clear, my complaint/concern is that when time is used to measure my ExprCount() loop it makes the loop seem like it is not CPU bound, when it is. This can lead a performance investigation astray. I believe that this should be fixed, or documented.

Anomaly the second, CPU frequency

I was about ready to declare the case closed when I noticed another anomaly. The ExprCount() time was usually extremely stable, but occasionally it would run three times faster: usually it was 400 times slower than BashCount(), but sometimes it was only ~133 times slower. Now that is an anomaly worth investigating.

I quickly discovered that if a busy-loop CPU hog was spinning away in another process then the ExprCount() loop would run three times faster, while the loop that used bash math was unaffected. I immediately suspected power management, and I eventually verified this by running the test scripts under perf stat. This was the most rewarding step because perf stat clearly shows the CPU frequency, and it was far lower when running the ExprCount() loop:

Performance counter stats for 'counttest.sh ExprTest':

   1780.553888 task-clock        # 0.979 CPUs utilized
         3,183 context-switches  # 0.002 M/sec
         1,119 CPU-migrations    # 0.001 M/sec
       319,758 page-faults       # 0.180 M/sec
 1,948,359,438 cycles            # 1.094 GHz

Performance counter stats for 'counttest.sh BashTest':

    716.316659 task-clock        # 0.997 CPUs utilized
            74 context-switches  # 0.000 M/sec
             0 CPU-migrations    # 0.000 M/sec
           354 page-faults       # 0.000 M/sec
 2,524,171,488 cycles            # 3.524 GHz

The perf stat results also hint at why the CPU frequency is lower on the ExprCount() tests. Perf stat shows that the ExprCount() loop does thousands of CPU migrations compared to zero in the BashCount() loop.

Once I started to suspect lots of CPU migrations it was interesting to see what other tools might be useful. Running mpstat -P ALL 1 while running the two different count loops showed dramatic results. When the counting was done with $(expr), the CPU usage was distributed relatively evenly across all of my CPU cores. However, when the counting was done with bash internals, CPU usage was localized entirely to one core.

As a final bit of investigative reporting (I enjoy this sort of stuff way too much) I used the Zoom profiler to record context switches while running the $(expr) loop. The results, color coded by process and put on per-CPU racetracks, are shown below:

The bash process is in red and we can see that it spends very little time running. The other colors are various invocations of expr which bash is usually waiting for. The fascinating thing is how often the processes migrate between CPUs. The bash process alternates between CPUs 2 and 5, and each expr process starts on one thread and is then forcibly migrated to another.

We can see that the overall task is CPU bound. The bash process runs for a fraction of a millisecond, then expr launches, then expr moves to another core, then expr terminates and the cycle repeats.

After discussions with some kernel developers it appears that new processes are placed on the least-busy core; however, that doesn't explain why bash keeps moving or why each expr instance moves once.

The net result is that none of the CPUs are fully loaded which, apparently, means that the power management algorithms don’t realize that a CPU bound task is running. It’s kind of crazy awesome, and it is an example of when a multi-core system actually runs (a lot) more slowly than a single-core system.

Setting thread affinity is usually a bad idea because it reduces the kernel's scheduling flexibility and may cause unexpected interactions with other processes. However, in the interests of science I knew I had to run my test script under taskset 1 to see what happens when it is forced onto a single CPU. Doing this caused the average CPU clock speed in ExprCount() to triple, and performance improved by the same factor.
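The affinity experiment boils down to one command. A self-contained sketch (the inline loop stands in for counttest.sh ExprTest from the end of this post; the mask 1 restricts execution to CPU 0):

```shell
# Force a subprocess-heavy loop, and every child it spawns, onto a
# single CPU (affinity mask 1 == CPU 0). With no migrations the
# power management sees one busy core and raises its clock speed.
ExprLoop='for (( i = 100 ; i > 0 ; i=$(expr $i - 1) )); do : ; done'
time taskset 1 bash -c "$ExprLoop"
```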

Please don’t work around this by setting your power management settings to High Performance. If every Linux user did that then the cumulative extra power draw would be significant and would make me regret writing this article. The behavior is an anomaly. It probably should be fixed, but in most situations I’m sure that the power management algorithms work fine. Just don’t go create hundreds of child processes per second.

I tested this on Ubuntu 12.04.1 (32-bit) and Ubuntu 12.04.2 (64-bit). The timings were slightly different, but the conclusions were fundamentally the same.

Measuring Linux performance using Windows

While trying to decide which tools to trust regarding CPU consumption I decided to let Windows be the tie breaker. I have a Linux VM on my laptop so I ran my tests in the VM and then monitored the CPU consumption of the VM with Task Manager. It correctly showed that the VM, and therefore the task running inside it, was CPU bound. Which made me laugh.

Measuring Windows Performance

In the interests of science I made some similar measurements on Windows. First, a simple batch file that counts down using internal commands.

@echo off

echo Starting for loop at %time%

for /L %%i in (400000,-1,0) do (

rem empty statement

)

echo Finishing at %time%

This ran in about a second, so 400,000 iterations per second.

Then I added a call to nop.exe to the loop to measure minimal process creation time (and reduced the iteration count to 1,000). Here's the new batch file:

@echo off

echo Starting for loop at %time%

for /L %%i in (1000,-1,0) do (

nop.exe

)

echo Finishing at %time%

This showed that a Windows batch file can launch about 133 processes per second. A lot slower than Linux.

Counting the easy way

Maybe it would be better if I did my counting in PHP instead:

https://github.com/Herzult/SimplePHPEasyPlus

Or maybe counting in assembly language (~14 billion subtractions a second?) would be better.

Full test program

Here’s my full test script:

#!/bin/bash
# Warning: this code is excessively slow
ExprCount() {
  for (( i = $1 ; i > 0; i=$(expr $i - 1) )); do
    :
  done
  echo Just did $1 iterations using expr math
}

# This code performs quite well
BashCount() {
  for (( i = $1 ; i > 0 ; i-- )); do
    :
  done
  echo Just did $1 iterations using bash math
}

ExprCountWrapper() {
  time ExprCount $1
}

BashCountWrapper() {
  time BashCount $1
}

case "$1" in
  "ExprTest") ExprCountWrapper 1000;;
  "BashTest") BashCountWrapper 150000;;
  "") perf stat counttest.sh ExprTest && perf stat counttest.sh BashTest;;
esac