Parallel processing with unix tools

There are various ways to use parallel processing in UNIX:
piping

An often under-appreciated aspect of the unix pipe model is that the components of the pipe run in parallel. This is a key advantage that is leveraged when combining simple commands that do "one thing well".
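As a quick illustration (a sketch assuming GNU coreutils and gzip are available), all stages of this pipeline run as separate processes at the same time, with data streaming between them as it is produced, so CPU-bound stages can occupy separate cores:

```shell
# seq, both gzips and wc run concurrently as four processes;
# the kernel schedules them onto separate cores while data
# streams through the pipes.
seq 5000000 | gzip -1 | gzip -d | wc -l    # → 5000000
```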

split -n, xargs -P, parallel

Note that programs invoked in parallel by these need to output atomically for each item processed, which the GNU coreutils are careful to do for factor, sha*sum, etc. Generally, commands that use stdio for output can be wrapped with the `stdbuf -oL` command to avoid intermixing lines from parallel invocations.
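For example (a sketch; the log file names are hypothetical), grep normally block-buffers its stdout when writing to a pipe, so line buffering each parallel invocation keeps lines from different processes whole:

```shell
# Run up to 4 greps concurrently; stdbuf -oL makes each one
# flush stdout at every newline, so output from different
# processes can interleave only at line boundaries.
find . -maxdepth 1 -name '*.log' -print0 |
  xargs -0 -P4 -n1 stdbuf -oL grep -H 'ERROR'
```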

make -j

Most implementations of make(1) now support the -j option to process targets in parallel. make(1) is generally a higher level tool, designed to process disparate tasks while avoiding reprocessing of already generated targets. For example, it is used very effectively when testing coreutils, where about 700 tests can be processed in 13 seconds on a 40 core machine.
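The dispatching can be sketched with a trivial generated Makefile (hypothetical targets): three independent one-second tasks finish in about one second under -j3, rather than three:

```shell
# a, b and c are independent targets, so with -j3 make runs
# all three recipes concurrently (total wall time ~1s).
cat > /tmp/parallel.mk <<'EOF'
all: a b c
a b c: ; @sleep 1; echo $@
EOF
time make -s -f /tmp/parallel.mk -j3
```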

implicit threading

This goes against the unix model somewhat, and definitely adds internal complexity to those tools. The advantages can be lower data copying overhead and simpler usage, though its use needs to be carefully considered. A disadvantage is that one loses the ability to easily distribute commands to separate systems. Examples are GNU sort(1) and turbo-linecount.
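GNU sort(1), for instance, exposes its internal threading through the --parallel option (a sketch; recent GNU sort already defaults to using the available processors for large inputs):

```shell
# Shuffle a million numbers then sort them numerically,
# explicitly capping sort's internal worker threads.
seq 1000000 | shuf | sort -n --parallel="$(nproc)" | tail -n1    # → 1000000
```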

Counting lines in parallel

The examples below will compare the above methods for implementing multi-processing, for the function of counting lines in a file. Note the following runs were done against cached files, and thus are not I/O bound. Therefore we limit the number of processes in parallel to $(nproc), though you would generally benefit from raising that if your jobs are waiting on network or disk etc.

First of all let's generate some test data. We use both long and short lines to compare the overhead of the various methods against the core cost of the function being performed:

$ seq 100000000 > lines.txt  # 100M lines
$ yes $(yes longline | head -n9) | head -n10000000 > long-lines.txt  # 10M lines

We'll also define the add() { paste -d+ -s | bc; } helper function to add a list of numbers.

wc -l

We'll use this command to count lines for most methods, so here is the base non multi-processing performance for comparison:

$ time wc -l lines.txt
real    0m0.559s
user    0m0.399s
sys     0m0.157s

$ time wc -l long-lines.txt
real    0m0.263s
user    0m0.102s
sys     0m0.158s

Note the distro version (v8.25) not being compiled with --march makes a significant difference, but only for the short line case. We'll not use the distro version in the following tests.

$ time fedora-25-wc -l lines.txt
real    0m1.039s
user    0m0.900s
sys     0m0.134s

turbo-linecount

turbo-linecount is an example of multi-threaded processing of a file.

$ time tlc lines.txt
real    0m0.536s  # third fastest
user    0m1.906s  # but a lot less efficient
sys     0m0.100s

$ time tlc long-lines.txt
real    0m0.146s  # second fastest
user    0m0.336s  # though less efficient
sys     0m0.110s

split -n

Note using -n alone is not enough to parallelize. For example this will run serially with each chunk, because --filter may write files, so the -n pertains to the number of files to split into rather than the number to process in parallel:

$ time split -n$(nproc) --filter='wc -l' lines.txt | add
real    0m0.743s
user    0m0.495s
sys     0m0.702s

$ time split -n$(nproc) --filter='wc -l' long-lines.txt | add
real    0m0.540s
user    0m0.155s
sys     0m0.693s

You can either run multiple invocations of split in parallel on separate portions of the file like:

$ time for i in $(seq $(nproc)); do split -n$i/$(nproc) lines.txt | wc -l& done | add
real    0m0.432s  # second fastest

$ time for i in $(seq $(nproc)); do split -n$i/$(nproc) long-lines.txt | wc -l& done | add
real    0m0.266s  # third fastest

Or split can do the parallel processing itself using round robin distribution, but that incurs huge overhead in this case. (Note also the -u option is significant with -nr):

$ time split -nr/$(nproc) --filter='wc -l' lines.txt | add
real    0m4.773s
user    0m5.678s
sys     0m1.464s

$ time split -nr/$(nproc) --filter='wc -l' long-lines.txt | add
real    0m1.121s  # significantly less overhead for longer lines
user    0m0.927s
sys     0m1.339s

Round robin would only be useful when the processing per item is significant.

parallel

parallel isn't well suited to processing a large single file, rather focusing on distributing multiple files to commands. It can't efficiently split to lightweight processing when reading sequentially from a pipe:

$ time parallel --will-cite --block=200M --pipe 'wc -l' < lines.txt | add
real    0m1.863s
user    0m1.192s
sys     0m2.542s

Though it does have support for processing parts of a seekable file in parallel with the --pipepart option (added in version 20161222):

$ time parallel --will-cite --block=200M --pipepart -a lines.txt 'wc -l' | add
real    0m0.693s
user    0m0.941s
sys     0m1.142s

We can use parallel(1) to drive split similarly to the for loop construct above. It's a little awkward and slower, but does demonstrate the flexibility of the parallel(1) tool:

$ time parallel --will-cite --plus 'split -n{%}/{##} {1} | wc -l' \
    ::: $(yes lines.txt | head -n$(nproc)) | add
real    0m0.656s
user    0m0.949s
sys     0m0.944s

xargs -P

Like parallel, xargs is designed to distribute separate files to commands, and with the -P option can do so in parallel. If you have a large file then it may be beneficial to presplit it, which could also help with I/O bottlenecks if the pieces were placed on separate devices:

$ split -d -n l/$(nproc) lines.txt l.
$ split -d -n l/$(nproc) long-lines.txt ll.

Those pieces can then be processed in parallel like:

$ time find -maxdepth 1 -name 'l.*' | xargs -P$(nproc) -n1 wc -l | cut -f1 -d' ' | add
real    0m0.267s  # joint fastest
user    0m0.760s
sys     0m0.262s

$ time find -maxdepth 1 -name 'll.*' | xargs -P$(nproc) -n1 wc -l | cut -f1 -d' ' | add
real    0m0.131s  # joint fastest
user    0m0.251s
sys     0m0.233s

If your file sizes are unrelated to the number of processors then you will probably want to adjust -n1 to batch together more files, to reduce the number of processes run in total. Note you should always specify -n with -P to avoid xargs accumulating too many input items, thus impacting the parallelism of the processes it runs.

make -j

make(1) is generally used to process disparate tasks, though it can be leveraged to provide low level parallel processing on a bunch of files. Note also the make -O option, which avoids the need for commands to output their data atomically, letting make do the synchronization. We'll process the presplit files as generated for the xargs example above, and to support that we'll use the following Makefile:

%: FORCE        # Always run the command
	@wc -l < $@
FORCE: ;
Makefile: ;     # Don't include Makefile itself

One could generate this and pass it to make(1) with the -f option, though we'll keep it as a separate Makefile here for simplicity. Note we use the POSIX specified "find ... -exec ... {} +" construct, rather than conflating the example with xargs. This construct, like xargs, will pass as many files to make as possible, which make(1) will then process in parallel:

$ time find -name 'l.*' -exec make -j$(nproc) {} + | add
real    0m0.269s  # joint fastest
user    0m0.737s
sys     0m0.292s

$ time find -name 'll.*' -exec make -j$(nproc) {} + | add
real    0m0.132s  # joint fastest
user    0m0.233s
sys     0m0.256s

This performs very well and matches the performance of xargs.

© Aug 20 2017