Simulating Symmetric Multi-Processing with fork()

This example shows how to use excl.osi:fork on non-Windows platforms to utilize all available processors for Lisp computations. It assumes that your application has compute-bound parts that can be run in parallel. In the source code linked at the end of this page, the "work" is simulated with a loop calling expt.

This technique is not appropriate for problems where the granularity of parallelism is very fine, because the overhead would be too large. With this framework, around 11,500 calls per second can be made on a 1.8GHz x86_64 machine.
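To make the granularity point concrete, the quoted call rate can be turned into a rough per-call cost and compared with the length of a task. This is only a back-of-the-envelope sketch in portable Common Lisp. The names `*calls-per-second*`, `per-call-overhead` and `overhead-fraction` are hypothetical and are not part of the example source, and treating the overhead as simply the inverse of the call rate is an assumption.

```lisp
;; Figure quoted in the text: about 11,500 framework calls per second.
;; (Hypothetical variable name, used only for this illustration.)
(defparameter *calls-per-second* 11500)

(defun per-call-overhead ()
  "Approximate framework overhead per call, in seconds.
Assumes the overhead is simply the inverse of the call rate."
  (/ 1.0d0 *calls-per-second*))

(defun overhead-fraction (task-seconds)
  "Fraction of the total time for one task that is framework overhead."
  (let ((overhead (per-call-overhead)))
    (/ overhead (+ overhead task-seconds))))
```

Under these assumptions the per-call cost is roughly 87 microseconds. For a hypothetical task that runs for 0.1 seconds, `(overhead-fraction 0.1d0)` is under 0.1 percent, while for a task of only 100 microseconds nearly half of the time would be overhead.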

The first part of the example code is a framework for executing the work units on different processors. The second part is a specific example using this framework.

This example does not use anything fancy to pass information between the parent and child processes, just the printer and reader. The less information passed, the less overhead there will be.
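The printer and reader round trip can be sketched in portable Common Lisp as follows. This is not the code from the example source. The function names `encode-result` and `decode-result` are hypothetical, and the transport between the processes is left out, so the sketch only shows a value being printed to a string and read back.

```lisp
(defun encode-result (value)
  "Print VALUE to a string using standard I/O syntax.
Signals an error if VALUE has no readable printed form."
  (with-standard-io-syntax
    (prin1-to-string value)))

(defun decode-result (string)
  "Read one object back from STRING.
*READ-EVAL* is bound to NIL so that #. forms are rejected."
  (with-standard-io-syntax
    (let ((*read-eval* nil))
      (values (read-from-string string)))))

;; Round trip of a small result, including a bignum produced by EXPT.
(defparameter *sample-result* (list :task 7 :result (expt 2 100)))
(defparameter *round-trip*
  (decode-result (encode-result *sample-result*)))
```

After loading this, `*round-trip*` is `equal` to `*sample-result*`. Only objects with a readable printed representation (numbers, strings, symbols, lists and so on) survive the trip, which is one more reason to keep the information passed small and simple.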

Terminology:

task : a unit of work, or in Lisp terms an expression that can be evaluated

CPU : an actual hardware processor

processor : an entity which can do work, or in terms of this example a Lisp subprocess which performs a task on a CPU

There can (and often will) be more processors than CPUs, though if there are many more processors than CPUs then tasks might take much longer than expected to complete.

In the "Example" section, there is an example which is run with a varying number of processors. For each run, there is an idea of the single processor time it would take to complete. This is labeled "WORK" in the test run below. In a theoretical sense, if there were 100 seconds of work to be done, 4 CPUs, and the tasks were the right size and independent of each other, you might get the work done in 25 seconds of real time.

Here is an example run on a 4-CPU Opteron system. Each CPU is running at 1.8GHz.

cl-user(2): (run)
Detected 4 CPUs
Iterations 40, processors 2, WORK: 8.0 seconds, REAL TIME: 4.164
Iterations 40, processors 3, WORK: 12.0 seconds, REAL TIME: 4.207
Iterations 40, processors 4, WORK: 16.0 seconds, REAL TIME: 4.18
Iterations 40, processors 5, WORK: 20.0 seconds, REAL TIME: 5.905
Iterations 40, processors 6, WORK: 24.0 seconds, REAL TIME: 7.075
Iterations 40, processors 7, WORK: 28.0 seconds, REAL TIME: 7.811
Iterations 40, processors 8, WORK: 32.0 seconds, REAL TIME: 8.717
Iterations 40, processors 9, WORK: 36.0 seconds, REAL TIME: 9.549
Iterations 40, processors 10, WORK: 40.0 seconds, REAL TIME: 10.525
Iterations 40, processors 11, WORK: 44.0 seconds, REAL TIME: 11.529
Iterations 40, processors 12, WORK: 48.0 seconds, REAL TIME: 12.535
nil
cl-user(3):

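We can see that the work is well distributed over the actual CPUs. With 4 or fewer processors the real time stays near 4 seconds, and with more processors than CPUs the real time to complete the work is roughly work / CPUs.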
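As a rough check on the figures above, the ideal real time can be estimated as the total work divided by the number of processors that can actually run at once, which is the smaller of the processor count and the CPU count. The function below is a hypothetical helper written for this illustration, not part of the example source, and it ignores all framework and scheduling overhead.

```lisp
(defun ideal-real-time (work-seconds processors cpus)
  "Idealized wall-clock time for WORK-SECONDS of total work.
At most (MIN PROCESSORS CPUS) tasks can make progress at once."
  (/ work-seconds (min processors cpus)))

;; Rows taken from the Opteron run above:
;; (ideal-real-time 8.0 2 4)    ; measured REAL TIME was 4.164
;; (ideal-real-time 20.0 5 4)   ; measured REAL TIME was 5.905
;; (ideal-real-time 48.0 12 4)  ; measured REAL TIME was 12.535
```

These three calls give 4.0, 5.0 and 12.0 seconds, so the measured times on the Opteron system sit within about a second of the ideal.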

Now, let's look at a dual 2.4GHz Xeon system. Due to hyperthreading, the Linux kernel believes there are 4 CPUs on this system.

cl-user(2): (run)
Detected 4 CPUs
Iterations 40, processors 2, WORK: 8.0 seconds, REAL TIME: 4.318
Iterations 40, processors 3, WORK: 12.0 seconds, REAL TIME: 5.452
Iterations 40, processors 4, WORK: 16.0 seconds, REAL TIME: 12.303
Iterations 40, processors 5, WORK: 20.0 seconds, REAL TIME: 14.495
Iterations 40, processors 6, WORK: 24.0 seconds, REAL TIME: 11.512
Iterations 40, processors 7, WORK: 28.0 seconds, REAL TIME: 17.971
Iterations 40, processors 8, WORK: 32.0 seconds, REAL TIME: 19.951
Iterations 40, processors 9, WORK: 36.0 seconds, REAL TIME: 26.855
Iterations 40, processors 10, WORK: 40.0 seconds, REAL TIME: 30.471
Iterations 40, processors 11, WORK: 44.0 seconds, REAL TIME: 27.072
Iterations 40, processors 12, WORK: 48.0 seconds, REAL TIME: 33.404
nil
cl-user(3):

These results are not nearly as good as those of the first system, which, as it happens, costs about 10 times as much. With 12 processors, the Opteron system was close to 3 times as fast (12.535 versus 33.404 seconds of real time).

As was said at the outset, this approach isn't for every application, but where it fits it can be very useful without a lot of trouble. The main benefit is that, in the presence of multiple CPUs, an application can increase efficiency without a lot of work. The downside is that debugging is more difficult using this approach, though this can be overcome with good programming techniques. This is not that big a deal, since any serious server application will need to employ the same kind of error recovery techniques anyway.

Source Code

View or download.