Parallel External Sorting

Using Scala on Desktop and Raspberry Pi 3B

This is the first exercise in a series planned to explore distributed computation on a Rasp Pi cluster. Since this exercise is mostly about getting the ball rolling, it explores parallel computation on a single desktop and a single Rasp Pi.

Problem

Sort N integers that do not all fit in the memory of a single node with multiple compute units (processors/cores).

Solution

From the description, the solution is to use external sorting:

1. Split the collection into k chunks (of p integers each) that each fit into the memory of a single node, i.e., N = k * p.
2. Sort each chunk.
3. Merge the sorted chunks into a sorted collection.
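As a rough illustration of steps 1 and 2, here is a minimal sketch that cuts an input into chunks of at most p integers, sorts each chunk in memory, and writes it to its own run file. It is not the implementation described in this post: the names `ChunkSorter`, `sortedRuns`, and `readRun`, the `run-i.txt` file naming, and the one-integer-per-line text format are all assumptions made for the example.

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

object ChunkSorter {
  // Steps 1 and 2: cut the input into chunks of at most p integers,
  // sort each chunk in memory, and write it to its own run file.
  def sortedRuns(input: Iterator[Long], p: Int, dir: Path): Vector[Path] =
    input.grouped(p).zipWithIndex.map { case (chunk, i) =>
      val run = dir.resolve(s"run-$i.txt") // hypothetical file naming
      Files.write(run, chunk.sorted.mkString("\n").getBytes("UTF-8"))
      run
    }.toVector

  // Reads a run file back as a lazy iterator of integers.
  // The file handle is left open here for brevity; real code should close it.
  def readRun(run: Path): Iterator[Long] =
    Source.fromFile(run.toFile, "UTF-8").getLines().map(_.toLong)
}
```

Only one chunk is held in memory at a time, because `grouped` pulls p integers from the iterator per step. The run files would then be the input to step 3.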

Assuming the three steps happen in sequence, the complexity of this solution is N + k * p * log(p) + N * f(k), where

- N units are needed to scan and split the collection into chunks (step 1),
- k * p * log(p) units are needed to sort k chunks of p integers each (step 2), and
- N * f(k) units are needed to merge the sorted chunks into the sorted collection (step 3).

While merging in step 3, the least remaining integers of the chunks are compared to determine the next least integer in the entire collection; this costs f(k). This can be done with a priority queue in which the least integer has the highest priority. Specifically, the queue is initialized with the least integer from each chunk. The least integer (highest-priority element) is removed from the queue and added to the sorted (output) collection. To facilitate the next comparison, the next least integer from the chunk to which the removed integer belonged is added to the queue, and the process is repeated until the queue is empty. Consequently, f(k) = log(k), since the queue never holds more than k integers, and the cost of step 3 is N * log(k).
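The priority-queue merge described above can be sketched as follows. This is an illustrative version, not the code from this exercise: `KWayMerge` and `merge` are hypothetical names, and the chunks are modelled as in-memory iterators rather than files.

```scala
import scala.collection.mutable

object KWayMerge {
  // Merges already-sorted iterators into one sorted iterator.
  // The queue holds at most one (value, source index) pair per chunk.
  def merge(chunks: IndexedSeq[Iterator[Long]]): Iterator[Long] = {
    // PriorityQueue dequeues the greatest element, so the ordering is
    // reversed to make the least integer the highest-priority element.
    val queue = mutable.PriorityQueue.empty[(Long, Int)](
      Ordering.by[(Long, Int), Long](_._1).reverse)

    // Initialize the queue with the least integer of each non-empty chunk.
    for ((chunk, i) <- chunks.zipWithIndex if chunk.hasNext)
      queue.enqueue((chunk.next(), i))

    new Iterator[Long] {
      def hasNext: Boolean = queue.nonEmpty
      def next(): Long = {
        val (least, i) = queue.dequeue()
        // Refill from the chunk the removed integer came from.
        if (chunks(i).hasNext) queue.enqueue((chunks(i).next(), i))
        least
      }
    }
  }
}
```

For example, `KWayMerge.merge(Vector(Iterator(1L, 4L, 9L), Iterator(2L, 3L))).toList` yields `List(1, 2, 3, 4, 9)`. Each `dequeue` and `enqueue` works on a queue of at most k elements, which is where the log(k) factor comes from.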

Since the nodes have multiple compute units, step 2 can be parallelized. If a node has l compute units, then the cost of step 2 reduces to k * p/l * log(p/l). Also, the l chunks sorted in parallel can be merged and written into a single file as opposed to being written into l files. With this change, the complexity of step 2 changes to k * (p/l * log(p/l) + p * log(l)) and the complexity of step 3 changes to N * log(k/l) as f(k) = log(k/l).
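One way to picture the parallel variant of step 2 is the sketch below: a chunk is split into about l sub-chunks, each sub-chunk is sorted on its own thread, and the sorted sub-chunks are merged so that the chunk can be written out as a single run. The use of Scala `Future`s on the global execution context, and the names `ParallelChunkSort` and `sortChunk`, are assumptions made for the example; the actual implementation may coordinate its sorters differently.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelChunkSort {
  // Sorts one chunk with up to l parallel sorters, then merges the sorted
  // sub-chunks so the chunk can be written out as a single run.
  def sortChunk(chunk: Array[Long], l: Int): Array[Long] = {
    val size = math.max(1, (chunk.length + l - 1) / l) // about p/l integers each
    val sorters = chunk.grouped(size).toVector.map(sub => Future(sub.sorted))
    val sorted = Await.result(Future.sequence(sorters), Duration.Inf)

    // l-way merge: the queue holds (value, sub-chunk index, position).
    val queue = mutable.PriorityQueue.empty[(Long, Int, Int)](
      Ordering.by[(Long, Int, Int), Long](_._1).reverse)
    for ((sub, i) <- sorted.zipWithIndex if sub.nonEmpty)
      queue.enqueue((sub(0), i, 0))

    val out = Array.newBuilder[Long]
    while (queue.nonEmpty) {
      val (least, i, pos) = queue.dequeue()
      out += least
      // Refill from the sub-chunk the removed integer came from.
      if (pos + 1 < sorted(i).length)
        queue.enqueue((sorted(i)(pos + 1), i, pos + 1))
    }
    out.result()
  }
}
```

The sort phase corresponds to the p/l * log(p/l) term and the merge phase to the p * log(l) term in the cost of step 2.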

Implementation

I implemented the above solution in Scala and tested it using ScalaCheck. While this was my first outing with Scala, it was rather easy. Scala library docs were pretty good. IntelliJ’s Scala plugin was helpful. And, I guess experience with functional programming, F#, and property-based testing helped :)

Experiment

To evaluate the implementation, I created 4 sets of numbers containing 5M, 10M, 50M, and 100M numbers in the range -1e18 to 1e18. These sets took up 93MB, 185MB, 925MB, and 1.9GB of space on the disk, respectively.
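A set like these could be produced with something along the lines of the sketch below. The text format with one integer per line is an assumption, though it is roughly in line with the reported sizes of about 19 to 20 bytes per number. `NumberSetGenerator` and `generate` are hypothetical names, not part of the original code.

```scala
import java.io.PrintWriter
import java.nio.file.Path
import java.util.concurrent.ThreadLocalRandom

object NumberSetGenerator {
  val Bound = 1000000000000000000L // 1e18, which fits in a Long

  // Writes n random integers in [-1e18, 1e18] to `out`, one per line.
  def generate(n: Int, out: Path): Unit = {
    val writer = new PrintWriter(out.toFile, "UTF-8")
    try {
      val rnd = ThreadLocalRandom.current()
      // nextLong's upper bound is exclusive, hence Bound + 1.
      for (_ <- 0 until n) writer.println(rnd.nextLong(-Bound, Bound + 1))
    } finally writer.close()
  }
}
```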

The target machines were

- Rasp Pi 3B with 1.2GHz 4-core Broadcom BCM2837 CPU, 1GB of RAM, and Class 10 SD card
- Desktop with 2.8GHz 8-core Intel i7 CPU, 16GB of RAM, and SSD drive

Given the hardware configuration of the machines, to force the implementation to process the integers in chunks, the max heap size of the JVM was constrained to 200M, 400M, and 800M.

To observe how the implementation scales with the number of compute units, the number of sorters was set to 1, 2, 4, and 8 (8 only on the desktop), with respective size factors of 104, 64, 48, and 40 that determine the number of integers per sorter (= heap size / # of sorters / size factor).
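To make the sizing formula concrete, here is a small sketch of the arithmetic. It assumes the heap size is expressed in bytes (e.g. 200M = 200 * 1024 * 1024) and uses integer division; `SorterSizing` and `integersPerSorter` are hypothetical names.

```scala
object SorterSizing {
  // Size factors paired with sorter counts, as listed in the experiment setup.
  val sizeFactors: Map[Int, Int] = Map(1 -> 104, 2 -> 64, 4 -> 48, 8 -> 40)

  // integers per sorter = heap size / # of sorters / size factor
  // Assumption: the heap size is given in bytes.
  def integersPerSorter(heapBytes: Long, sorters: Int): Long =
    heapBytes / sorters / sizeFactors(sorters)
}
```

Under that assumption, a 200M heap with 2 sorters gives `SorterSizing.integersPerSorter(200L * 1024 * 1024, 2)`, i.e. 1,638,400 integers per sorter.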

The script to execute the implementation under all of the above configurations is available here.

Observations

In terms of compute, a single Rasp Pi is no match for a desktop

The following are the (wall clock) run times on the desktop and the Rasp Pi for the various combinations of number of integers, number of sorters, and max heap size.