MultiLoop: Efficient Software Pipelining for Modern Hardware

By Christopher Kumar Anand and Wolfram Kahl

May 8, 2007

This paper is motivated by trends in processor design, of which the Cell BE is an exemplar, and by the need to apply multi-level code optimizations reliably in safety-critical code to achieve high performance and small code size. A MultiLoop is a loop specification construct designed to expose, in a structured way, the details of instruction scheduling needed for performance-enhancing transformations. We show by example how it can be used to make better use of underlying hardware features, including software branch prediction and SIMD instructions. In each case, MultiLoop transformations let us take full advantage of software branch prediction to eliminate branch misses entirely in our scheduled code, and to reduce loop overhead by using SIMD vector instructions. Given the novelty of our representation, it is important to demonstrate feasibility (of zero branch misses) and to evaluate performance (of the transformations) on a wide set of representative examples from numerical computation. We include simple loops, nested loops, sequentially composed loops, and loops containing control flow. In some cases we obtain significant benefits: halving execution time and halving code size. We provide as many details as possible for other compiler writers wishing to adopt our transformations, including instruction selection for SIMD-aware control flow.