Note: in these posts, I’m not going to go into comprehensive detail about how ispc/volta works on a technical level, what the precursors were, or what all of the key design goals were. I’ll touch on some of that as is relevant to the narrative, but see the paper I wrote about ispc with Bill Mark for comprehensive discussion of all that. It’s worth a read (IMHO), if you’re taking the time to read this. (Bill will appear later in this story.)

With the defensive perimeter established, I was off. There was a long way to go to turn volta into something generally useful. Plenty of basic language functionality wasn’t implemented yet—for example, I’m pretty sure things like structs weren’t even working at that point.

Throughout the design and implementation of volta, I thought a lot about C. I re-read K&R for inspiration for the n’th time along the way.

It wasn’t just the availability of a C grammar that made me start psl with C: I like C a lot; it’s a wonderfully crisp language. It was obvious to me that I’d continue using C as a basis for volta, deviating from it as little as possible. Not only had Kernighan and Ritchie obviously gotten so many things right, but the language is well-known, and if volta was ever finished, it would be more easily adopted by people if the syntax was familiar.

I took much more than syntax from C—its design principles were much more important. C had a close mapping to the hardware of the day, and a good programmer can look at C code and pretty accurately guess which instructions a compiler will generate for that code. No mystery, no compiler magic, no innocuous looking statements that may explode into a slew of instructions (I’m looking at you, C++). I wanted volta to maintain that.

I imagined Kernighan and Ritchie designing C in today’s world. How would C be different if it was designed for today’s CPUs? Two big things had changed in CPU architecture: multi-core processing and SIMD vector units.

For multi-core, there were plenty of good ideas to draw from. Andrew Lauritzen, who knew about all of them and had thought a lot about this stuff, was a big fan of Cilk’s approach to multi-threading—functions could call other functions asynchronously, those could run in separate threads, and the compiler waited for them to all finish before the original function could return. This nicely enabled parallel composition.

So I added a launch keyword to volta; it used Cilk’s semantics. Put it before a function call and the function goes off to a thread pool:

launch foo(a, 6.3);

Even though that’s not that much syntactically cleaner than calling out to TBB or the like (especially now, with C++11 lambdas), it felt nice to have that as a first-class thing in the language. It was basically zero friction multi-threading, which seemed right for this era.

For SIMD, one obvious option would be to expose that capability of the CPU using explicit vector data types—this is basically what lots of people have done by hand, wrapping intrinsics around vec4f classes and the like. That would certainly be useful to have as a first-class feature of the language, and for some types of computation, explicit vectors end up being a cleaner way of expressing them.

As should already be clear by now, I really wanted to write programs with complex control flow that still ran on SIMD hardware; for that, explicit vectors aren’t very convenient, and so SPMD on SIMD it was. I think that is a good fit for the philosophy of C as well: straightforward and predictable, no deep compiler magic behind it.

Maybe K&R would have decided to made both options available, and having both often would have been nice; then, for example, one could write a program that targeted 16-wide AVX-512, with 4 SPMD program instances, each of which could do a 4-wide vector operation with each instruction. We’ll come back to that topic later, in the retrospective.

Making SPMD on SIMD happen

As I experienced in Sweden, vectorizing straight-line code is easy—there’s nothing to it if you’re not trying to prove that vectorization is safe first. The trickier part is implementing general control flow for SPMD programs. We’d like different SPMD program instances to take different paths through the program, and still compute correct results.

Intrinsics programmers know how this is done: when conditionally processing a values in a vector, you maintain an additional mask variable that records which of them should be modified. If you have something that’s logically like:

if (a < b) c = 0;

operating on vector-valued a , b , and c , then you store a mask that records the vector result of a < b and then use that to conditionally assign zero to the c vector. If there’s an else statement, then you negate the mask and then execute its code, minding the mask. Kayvon Fatahalian has a great set of slides that talks about this and how it’s handled on GPUs; it’s all surprisingly similar, with a little more help from the hardware.

More generally, loops, break and continue statements, even indirect function calls through pointers—all of that can be executed in vector form by using the same idea:

Maintain an execution mask that records which program instances (SIMD lanes) are active.

Execute vector instructions according to a conservative control flow path through the program. In other words, execute an instruction if any of the lanes needs to.

Make sure that no side effects are visible from inactive program instances—unintended memory writes or the like.

Each of these can be a bit fiddly to get right, but they’re conceptually straightforward principles.

The rules to maintain the execution mask for loops are only slightly more complicated than for an if statement. The value of the loop test gives the execution mask for the loop body and you run the loop until that’s false for all of the active program instances. A break in a loop just disables the active mask for any elements where the mask is active when the break statement executes; the active mask for them is restored after the loop finishes. A continue disables the mask for an instance until the end of the current iteration, at which point it’s restored. And so forth.

It took a while to get this mask maintenance stuff implemented properly in volta. It was thrilling once it was all really solid, especially since LLVM continued to be reliable, giving me good x86 assembly as long as I gave it good vectorized IR.

Here’s a small example with a volta/ispc program that raises a float to an integer power with an inefficient algorithm. (Note that this program is also valid C.)

float powi(float a, int b) { float r = 1; while (b--) r *= a; return r; }

And here’s the assembly code that comes out of the compiler today. (Note: AT+T syntax, with the destination as the last argument.) Here I’ve used AVX2, since it’s cleaner than SSE4, though SSE4 was the only ISA that volta initially supported.

LBB0_3: vpaddd %ymm5, %ymm1, %ymm8 vblendvps %ymm7, %ymm8, %ymm1, %ymm1 vmulps %ymm0, %ymm3, %ymm7 vblendvps %ymm6, %ymm7, %ymm3, %ymm3 vpcmpeqd %ymm4, %ymm1, %ymm8 vmovaps %ymm6, %ymm7 vpandn %ymm6, %ymm8, %ymm6 vpand %ymm2, %ymm6, %ymm8 vmovmskps %ymm8, %eax testl %eax, %eax jne LBB0_3

The first two instructions decrement b , using the active active vector mask to perform the assignment only for the active lanes. The next two multiply r by a , again using the mask. Then b is compared for equality with zero and the result is used to update the execution mask. (Updating the mask takes an extra instruction due the fact that powi() may have been called with a not-all-on execution mask, so we have to AND the mask at powi() ’s entry. In this case, it’d be fine if we skipped that and computed bogus results for the disabled lanes, but in general we need an accurate mask in case there’s something like a memory write that needs to be squashed for inactive lanes.) Finally, a quick check with movmsk to see if any lanes are still active before jumping to the start of the loop again.

And that’s it. I think it’s optimal, save for the unneeded precision in the maintenance of the execution mask. I’ll happily take the occasional stray extra instruction over having to write intrinsics by hand, especially for programs that are non-trivial.

Compiler optimizations versus transformations

Having seen this example, it’s easier to understand another super insightful point from Tim Foley: compiling a SPMD program for SIMD hardware is a compiler transformation, which is an entirely different thing than a compiler optimization.

His insight gets back to the problem with auto-vectorization: it’s a complex optimization, full of heuristics, and you can’t be sure where it’s going to end up. The compiler is trying to reason about the safety of vectorizing a loop—are there any loop-carried dependencies? One can only go so far with reasoning about arbitrary programs with a computer program (remember that pesky halting problem), so auto-vectorizers are doomed to be brittle and unpredictable to the user.

SPMD on SIMD? That’s a transformation. We just saw how to do it. It’s mechanical. If you’ve automated it, there’s no reason it won’t always work.

The nice thing about this is that it fits with the C philosophy: any programmer who understands the idea can accurately predict what code the compiler will generate, and it’s pretty much the code they would have written by hand; for performance-aware programmers, the first property is just as important as the second.

In the end, volta’s kind of a dumb compiler; there’s not really anything deeply clever in its implementation. Other than a lot of engineering work, the key was approaching the problem in the right way in the first place.

Next time, we’ll finally get to sharing first results with the compiler team.

Next: First benchmark results

notes