Thanks to nine processors on a single silicon die, the Cell Broadband Engine (a processor jointly designed by IBM, Sony, and Toshiba and used in the PlayStation 3) promises lots of power. The good news is that the Cell is really fast: It provides enough computational power to replace a small high-performance cluster. The bad news is that it's difficult to program: Software that exploits the Cell's potential requires a development effort significantly greater than on traditional platforms. If you expect to port your application efficiently to the Cell via recompilation or threads, think again.

In this article, we present strategies we've used to make a Breadth-First Search on graphs as fast as possible on the Cell, reaching a performance that's 22 times higher than Intel's Woodcrest and comparable to a 256-processor BlueGene/L supercomputer, all with just a single Cell processor! Some techniques (loop unrolling, function inlining, SIMDization) are familiar; others (bulk synchronous parallelization, DMA traffic scheduling, overlapping of computation and transfers) are less so.

Computing Is Changing

In the last 10 years, processors have become faster mainly through higher clock frequencies and more complex architectures. This trend can't continue because fabrication technologies are reaching physical limits. Transistors are getting so small that a gate is only a few atoms thick. Additionally, smaller circuits mean higher heat production: It's more and more difficult to remove heat fast enough to avoid circuit burndown.

This is why the computing community is so interested in multicore architectures: IBM is pushing the Cell, while AMD and Intel are pushing quad-core processors. Intel has also shown its TeraScale prototype, a single chip with 80 cores. Architectures are changing fast, and developers have to keep up.