I am studying the use of open source software in industry. One way to obtain empirical data is to look at the operating systems and browsers used by the Fortune 1000 companies by examining browser logs. I obtained a list of the Fortune 1000 domains and wrote a pipeline to summarize results by going through this site's access logs.

My first pipeline was the following.

fgrep -h -f fortune-1000-domains.txt access_log | sed 's/^\([^ ]*\).*"\(.[^"]*\)"$/\1 \2/' | sort | uniq -c >results.txt

lj512272.crawl.yahoo.net - - [01/Jan/2008:00:00:15 +0200] "GET /blog/20040121/index.html HTTP/1.0" 200 4953 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The Unix fgrep command is optimized for searching its input for (a possibly large number of) fixed strings. To perform the search fgrep builds a data structure that looks at each character of its input at most once, even when matching hundreds of patterns. The sed command I used takes each matching log line and prints from it only the domain name and the client details (browser and operating system). A typical web browser log line looks as follows.The sed substitution command works by matching the space-terminated name of the domain, and then matching a sequence of characters within double quotes at the end of the line.

I felt a bit uneasy about the regular expression pattern I used, because you may need to try various solutions until you determine that you have found the pattern at the end of the line. This requires backtracking and can be expensive.

Running top to see where my system was spending its time confirmed my suspicion.

PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND 63910 dds 1 132 0 1328K 728K RUN 15:01 86.18% sed 63913 dds 1 -8 0 1892K 1160K pipewr 0:09 0.10% fgrep

As you can see, fgrep, which was supposed to be doing the difficult work was spending just 0.1% of the system's time, while the sed's trivial post-processing step was taking 86% of the time. Obviously the pipeline was not properly balanced, and this would surely affect its performance.

I therefore rewrote the sed substitution command in a more imperative style.

sed 's/"$//;s/^\([^ ]*\).*"/\1 /'

This removes the final quotation mark, and then greedily replaces the client's address and all text up to the final quotation mark with just the client's address.

This time the balance between the two processes was a lot more even.

PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND 63709 dds 1 -8 0 1328K 728K piperd 0:14 46.55% sed 63708 dds 1 -4 0 1892K 1156K getblk 0:03 11.34% fgrep

The performance difference between the two options was startling. The first took 1004 s to complete, while the second one took just 114 s: an order of magnitude improvement. This demonstrates the performance advantage of a well-tempered pipeline.