We have evaluated the performance of our EMAS skeleton on two different optimisation examples that typify agent-based evolutionary computations:

1. Finding the minimum of the Rastrigin function, a popular global optimisation benchmark [15] (the standard formulation is recalled below).
2. A complex urban traffic optimisation problem, a multi-variant simulation [21] that is applied to the mobile robotics domain. Our model anticipates possible situations on the road, preparing plans for dealing with them and applying those plans when the corresponding traffic conditions arise.
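For reference, the standard form of the \(n\)-dimensional Rastrigin function is
\[
f(\mathbf{x}) = A\,n + \sum_{i=1}^{n} \left( x_i^2 - A \cos(2 \pi x_i) \right), \qquad A = 10,
\]
with global minimum \(f(\mathbf{0}) = 0\).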

Since the algorithms are bounded by time, we measure the average number of reproductions that the algorithm achieves per second, which represents its throughput. We also measure the increase in throughput as the number of cores increases, which corresponds to the increase in the application’s performance (speedup). We ran each experiment a total of 10 times, with each experiment set to take 5 min. The number of islands that the algorithm uses is set to the number of cores. We prioritised the evaluation of the Hybrid configuration, since it has proved to be the best of the three parallel configurations, but have included the other configurations in our results where available. For all experiments, we have used version 18 of the Erlang system, except where otherwise highlighted. In order to evaluate the EMAS skeleton on a number of different architectures and in different settings, our experiments have been conducted on five different systems, ranging from a low-power small-scale ARM system (pi), through more powerful x86 (titanic, zeus) and OpenPower (power) multicores, to a standalone Intel Xeon Phi many-core accelerator. Full details of the systems that we have used are given in Table 1. We focus on measuring the scalability of the system rather than its sequential performance. We have carried out preliminary tests comparing the existing models to a purely sequential version; however, the performance differences were insignificant in comparison with the benefits introduced by parallelism.
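As a purely illustrative sketch of how these two metrics relate to the raw counters, the following Erlang helpers show the calculation; the module and function names are our own and are not part of the benchmark harness.

%% Illustrative sketch only: hypothetical helpers relating the reported
%% metrics to the raw reproduction counters.
-module(emas_metrics).
-export([rps/2, increase/2]).

%% Reproductions per second over a run lasting DurationSeconds.
rps(Reproductions, DurationSeconds) ->
    Reproductions / DurationSeconds.

%% Throughput increase of an N-core run relative to the one-core run.
increase(RpsOnN, RpsOnOne) ->
    RpsOnN / RpsOnOne.

For example, the Hybrid Rastrigin run on titanic reported below corresponds to increase(236140, 10038) \(\approx \) 23.52.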

Table 1 Experimental platforms

Fig. 4 Reproductions per second (rps, above) and throughput increase (below) for Traffic and Rastrigin on titanic

titanic Results

The titanic system is an example of a medium-scale parallel server, with 24 cores and 32 GB of RAM. Figure 4 shows the number of reproductions per second (rps) and the increases in throughput for Traffic and Rastrigin on this machine. For both applications, we observe that the Hybrid version achieves almost linear increases in throughput (and, therefore, performance). For Rastrigin, the Hybrid version achieves 236,140 rps on 24 cores versus 10,038 rps on one core, yielding 23.52\(\times \) the throughput (or 23.64\(\times \) the base sequential throughput). For Traffic, the Hybrid version on titanic achieves 227,509 rps on 24 cores versus 10,059 rps on one core, yielding 22.62\(\times \) the throughput (or 22.76\(\times \) the base sequential throughput of 9996 rps). The Skel and Concurrent versions initially also perform well, but start to tail off after about 16 cores (islands) for both applications. This is consistent with other applications we have run on this machine, and is probably due to cache contention. Although the Hybrid version is always best, the untuned Skel version always performs better than the Concurrent version. This shows the benefit of reducing the number of Erlang processes by grouping the computations that multiple agents perform into a single process.
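The following sketch illustrates the general idea only (it is not the skeleton’s actual code, and step/1 is a placeholder for one agent’s evaluation/reproduction step): the Concurrent version assigns one Erlang process to each agent, whereas the Skel and Hybrid versions let a single process step a whole group of agents.

%% Sketch only: contrasting one-process-per-agent with grouping a set of
%% agents into a single process.
-module(grouping_sketch).
-export([concurrent_step/1, grouped_step/1]).

%% One process per agent: every agent's step runs in its own process.
concurrent_step(Agents) ->
    Parent = self(),
    [spawn(fun() -> Parent ! {done, step(A)} end) || A <- Agents],
    [receive {done, Next} -> Next end || _ <- Agents].

%% One process per group: the same steps, run in sequence inside the
%% calling process, so the VM schedules far fewer processes.
grouped_step(Agents) ->
    [step(A) || A <- Agents].

%% Placeholder for one agent's evaluation/reproduction step.
step(Agent) ->
    Agent.

Fewer processes mean less scheduling and message-passing overhead per reproduction, which is consistent with the throughput ordering observed above.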

zeus Results

The zeus system is an example of a larger-scale multicore system, with 64 cores and 256 GB of RAM. Figure 5 shows the rps and corresponding throughput increase for all the tested versions. Here, Rastrigin achieves 349,226 rps on 64 cores versus 5752 rps on one core, for a throughput increase of 60.71\(\times \) over the sequential version. Traffic achieves 94,950 rps on 64 cores versus 1473 rps on one core, for a throughput increase of 64\(\times \) over the sequential version. There is clearly a tail-off in throughput increase for Traffic at 32 and 48 cores, but further experiments are required to explain the reasons for this, given the clear improvement with 64 cores. A slight dip can also be observed for Rastrigin at 32 cores (27.8\(\times \) throughput increase), but this recovers at 48 cores (46.13\(\times \) throughput increase). In comparison with titanic, the raw performance per core is lower for both applications (despite a nominally similar processor architecture), but zeus delivers higher total throughput.

Fig. 5 Reproductions per second (rps, above) and throughput increase (below) for Traffic and Rastrigin on zeus

phi Results

Our motivation for considering the phi system was to evaluate the performance of our skeletons (and, more generally, of the Erlang runtime system) on a many-core accelerator. We have therefore focused on using just the accelerator, without also using the multi-core host system. Figure 6 shows the reproductions per second and throughput increase for the Hybrid version of the two use cases. Unfortunately, we were unable to run experiments with the Concurrent and Skel versions on the Xeon Phi, since this would have required detailed adaptation of their code for that platform, and we wanted to retain code compatibility across the significant number of different processors used in our experiments. This is left as an area of future work. Note that the system has 61 physical cores, but can run 244 simultaneous threads via 4-way hyper-threading. Because resources are shared between multiple threads, we do not expect to see a 244-fold increase in throughput on this architecture. Indeed, we can clearly see that the performance improvement decreases above 61 cores and tails off when more than 122 threads are used. Even so, we were able to achieve very good results for both applications, with excellent improvements up to 61 cores and more than a 100-fold improvement over the sequential version when 244 threads were used. Here, Rastrigin achieves 188,240 rps on 244 threads versus 1681 rps on one core, for a throughput increase of 111.96\(\times \) over the sequential version. Traffic achieves 42,902 rps on 244 threads versus 302 rps on one core, for a throughput increase of 142.44\(\times \) over the sequential version.

Fig. 6 Reproductions per second (rps, above) and throughput increase (below) for Traffic and Rastrigin on phi

In the experiments on zeus and phi, the number of islands was fixed at the number of available logical cores. This meant that exactly the same algorithm was tested on different numbers of cores, with the core count adjusted by setting the number of schedulers in the Erlang VM (see the sketch below). A side effect of this approach was that initialising the islands took proportionally longer as the number of cores decreased, which influenced the measured throughput. This overhead caused the super-linearity effect observed on phi: the initialisation time was shorter for larger numbers of cores.
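The number of schedulers can be set either when starting the VM (erl +S N) or at runtime; a minimal sketch of the runtime variant follows (the module and function names are illustrative only, not part of the skeleton).

%% Sketch only: restrict the VM to N online schedulers while keeping the
%% island count fixed, as in the zeus and phi experiments. N must not
%% exceed the number of scheduler threads the VM was started with.
-module(scheduler_sketch).
-export([set_cores/1]).

set_cores(N) ->
    _Old = erlang:system_flag(schedulers_online, N),
    erlang:system_info(schedulers_online).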

Preliminary power Results

The power system represents the IBM Power8 architecture, which is designed to support highly multi-threaded chip implementations. Each core is capable of handling 8 hardware threads, which gives a total of 160 threads on a system with 20 physical cores. At 3.69 GHz, the clock frequency is also noticeably higher than for the other systems that we have considered. However, as with the Xeon Phi, resources are shared between multiple threads, and we therefore do not see a linear increase in throughput with the number of cores (islands). Due to problems with running Rastrigin, we only present results for the Traffic use case (Fig. 7). Since the latest version of Erlang (Erlang 18) is currently unsupported on the OpenPower architecture, our experiments use an older version, Erlang 16. Our tests on other systems suggest that this dramatically lowers the absolute throughput of the system (by about a factor of 3). Overall, we achieve 172,078 rps on 160 cores versus 1681 rps on one core, for a throughput increase of 39.53\(\times \) over the sequential version. We observe very good scaling up to about 40 threads (2 per physical core), achieving throughput improvements of 25.75\(\times \) on 32 cores and 34.76\(\times \) on 64 cores. Beyond that, as with the phi system, the performance improvements tail off rapidly, but smoothly. So far, we have only been able to port the Hybrid version of the EMAS skeleton to power; porting the other versions is planned as future work.

Fig. 7 Reproductions per second (rps, above) and throughput increase (below) for Traffic on power (Erlang 16)

Fig. 8 Reproductions per second (rps, above) and throughput increase (below) for Traffic and Rastrigin on pi

pi Results

The pi system is a quad-core Raspberry Pi 2, model B. We have chosen it as an example of a low-power parallel architecture of the kind intended for use in, e.g., high-end embedded systems. While we expect the absolute performance of this system to be significantly lower than that of the other, heavyweight, parallel systems we have studied, it is still interesting to evaluate how well our skeletons perform on such a system. Figure 8 shows the results for pi. We have considered all three versions of the EMAS skeleton. As on titanic, the Hybrid version gives the best performance, and the Skel version outperforms the Concurrent version. For Traffic, the Hybrid version achieves 5962 rps on 4 cores versus 1568 rps on one core, yielding 3.92\(\times \) the throughput. For Rastrigin, the Hybrid version achieves almost identical results of 5954 rps on 4 cores versus 1556 rps on one core, also yielding 3.92\(\times \) the throughput. We observe similar results for the other two versions, with throughput improvements of 3.54/3.54\(\times \) for the Skel version on Traffic/Rastrigin (5387/5401 rps) and 2.93\(\times \) for the Concurrent version (4453/4447 rps). As expected, the absolute peak performance (about 6000 reproductions per second) was orders of magnitude lower than on the other systems; however, the Pi's power consumption is just a fraction of that of the heavyweight servers, so it gives the best performance-per-watt ratio of all the systems we tested.