The state of the art in optimizing compilers today is such that for optimizing code, you need (1) a strong optimizing compiler and (2) a strong optimizing human. My rule of thumb is that (1) alone will yield 2x to 10x slower code. This is also what a person selling a (great) compiler "giving 80% of the optimal performance with no manual intervention" once told off-record to a roomful of programmers who pressed him into a corner, elevating my rule of thumb to a nobler plane of anecdotal evidence.

Now, I claim that this situation will persist, and in this post I'll try to close the fairly large gap between this claim and the mere acknowledgment of what the state of the art is today. The gap is particularly large for the believer in the possibility of strong AI – and while my position is a bit different, I do believe in fairly strong AI (can I say that? people keep telling that I can't say "nearly context-free". oh well.)

I realize that many people experienced in optimization feel that, on the contrary, there's in fact no gap large enough to justify an attempt as boringly rigorous (for a pop tech blog) at proving what they think is obvious as will shortly follow. But I think that many language geek discussions could benefit from a stronger bound on the power of a Sufficiently Smart Compiler than can be derived from (necessarily vague) doubts on the power of AI, and in this post I'll try to supply such a bound. I actually think a lot of (mainly domain-specific) things could be achieved by AI-ish work on compilation – closer to "identify bubble-sort and convert to quick-sort" than to traditional "analyze when variables are alive and assign them to registers" – and this is why it's useful to have a feeling when not to go there.

So, consider chess, where the state of the art is apparently quite similar to that in optimization: a strong human player using a strong computer program will take out both a human and a computer playing alone. However, it is conceivable that a program can be developed that doesn't need the help of a human, being able of completely simulating human thought processes or instead alternative processes which are consistently superior. Why can't it be the same with optimizing compilers?

(Chess and optimization are similar in another respect – few care about them; I readily acknowledge the insignificance of a 10x speed-up in a continuously expanding set of circumstances, I just happen to work in an area where it does count as a checkmate.)

I'll try to show that optimization is a fundamentally different game from chess, quite aside from the formal differences such as decidability. I'll use optimizing for VLIW SIMD processors to show where compilers outperform humans and vice versa. I'll be quoting a book by the inventor of VLIW called "Embedded Computing: A VLIW Approach" to support my position on the relative strength of humans and compilers in these cases. I'll then try to show that my examples are significant outside the peculiarities of current hardware, and attempt to state the general reason why humans are indispensable in optimization.

VLIW SIMD

First, we'll do the acronym expansion; skip it if you've been through it.

VLIW stands for "Very Long Instruction Word". What it really means is that your target processor can be told to execute several instructions in parallel. For example: R0=Add R1,R2 and R3=Mul R0,R1 and R1=Shift R5,R6. For this to work, the processor ought to be able to add, multiply and shift in parallel, that is, its execution hardware must be packed into several units, each getting distinct inputs. The units can be completely symmetric (all supporting the same operations); more often, different units support different instruction sets (so, for example, only one unit in a processor can multiply, but two of them can add, etc.) A stinky thing to note about VLIW instructions is the register semantics. In the example instruction above, R0 is mentioned both as an input and as an output. When it's mentioned as an input of Mul its old value is meant, and not the value computed by Add. This is somewhat natural since the whole point is to run Add and Mul in parallel so you don't want Mul to wait for Add; but it's confusing nonetheless. We'll come back to this shortly.

SIMD stands for "Single Instruction, Multiple Data" and is known much more widely than VLIW, being available at desktop and server processor architectures like x86 and PowerPC (VLIW reigns the quieter embedded DSP domain, the most commercially significant design probably being TI's C6000 family.) SIMD means that you have commands like R0=Add8 R1,R2, which does 8 additions given 2 inputs. The registers are thus treated as vectors of numbers – for example, uint8[16], or uint16[8], or uint32[4], assuming 16b registers. This establishes a preference for lower-precision numbers since you can pack more of them into a register and thus process more of them at a time: with uint16, you use Add8, but with uint8, you get to use the 2x faster Add16. We'll come back to this, too.

Optimizing for VLIW targets



The basic thing at which VLIW shines is the efficient implementation of "flat" loops (where most programs spend most time); by "flat", I mean that there are no nested if/elses or loops. The technique for implementing loops on VLIW machines is called modulo scheduling. The same technique is used on superscalar machines like modern x86 implementations (the difference from VLIWs being the instruction encoding semantics).

Since I couldn't find a good introductory page to link to, we'll run through a basic example of modulo scheduling right here. The idea is pretty simple, although when I first saw hardware designers doing it manually in a casual manner, I was deeply shocked (they do it for designing new hardware rather than programming existing hardware but it's the same principle).

Suppose you want to compute a[i]=b[i]*c+d on a VLIW processor with 4 units, 2 of them capable of load/store operations, 1 with an adder and 1 with a multiplier. All units have single-cycle latency (that is, their output is available to the next instruction; real VLIW units can have larger latencies, so that several instructions will execute before the result reaches the output register.) Let's assume that Load and Store increment the pointer, and ignore the need to test for the exit condition through the loop. Then a trivial assembly implementation of a[i]=b[i]*c+d looks like this:

LOOP:

R0=Load b++

R1=Mul R0,c

R2=Add R1,d

Store a++,R2

This takes 4 cycles per iteration, and utilizes none of the processor's parallelism as each instruction only uses 1 of the 4 execution units. Presumably we could do better; in fact the upper bound on our performance is 1 cycle per iteration, since no unit has to be used more than once to implement a[i]=b[i]*c+d (if we had two multiplications, for example, then with only 1 multiplying unit the upper bound would be 2 cycles/iteration.)

What we'll do now is blithely schedule all of the work to a single instruction, reaching the throughput suggested by our upper bound:

LOOP:

R0=LOAD b++ and R1=MUL R0,c and R2=ADD R1,d and STORE a++,R2

Let's look at what this code is doing at iteration N:

b[ N ] is loaded

] is loaded b[ N-1 ] (loaded at the previous iteration into R0) is multiplied by c

] (loaded at the previous iteration into R0) is multiplied by c b[ N-2 ]*c (computed at the previous iteration from the old value of R0 and saved to R1) is added to d

]*c (computed at the previous iteration from the old value of R0 and saved to R1) is added to d b[N-3]*c+d is saved to a[N]

This shows why our naive implementation doesn't work (it would be quite surprising if it did) – at iteration 0, b[N-1] to b[N-3] are undefined, so it makes no sense to do things depending on these values. However, starting at N=3, our (single-instruction) loop body seems to be doing its job just fine (except for storing the result to the wrong place – b ran away during the first 3 iterations). We'll take care of the first iterations by adding a loop header – instructions which implement the first 3 iterations, only doing the stuff that makes sense in those iterations:

R0=Load b++

R0=Load b++ and R1=Mul R0,c

R0=Load b++ and R1=Mul R0,c and R2=Add R1,d

LOOP:

R0=Load b++ and R1=Mul R0,c and R2=Add R1,d and Store a++,R2

For similar reasons, we need a loop trailer – unless we don't mind loading 3 elements past the end of a[], but I reckon you get the idea. So we'll skip the trailer part, and move to the more interesting case – what happens when the loop body won't fit into a single instruction. To show that, I can add more work to be done in the loop so it won't fit into the units, or I can use a weaker imaginary target machine to do the same work which will no longer fit into the (fewer) units. The former requires more imaginary assembly code, so I chose the latter. Let's imagine a target machine with just 2 units, 1 with Load/Store and one with Add/Mul. Then our upper bound on performance is 2 cycles per iteration. The loop body will look like this:

LOOP:

R0=Load b++ and R2=Add R1,d

R1=Mul R0,c and Store a++,R2

Compared to the single-instruction case, which was still readable ("Load and Mul and Add and Store"), this piece looks garbled. However, we can still trace its execution and find that it works correctly at iteration N (assuming we added proper header code):

At instruction 1 of iteration N, b[ N ] is loaded

of iteration N, b[ ] is loaded At instruction 2 of iteration N, b[ N ] (loaded to R0 by instr 1 of iter N) is multiplied by c

of iteration N, b[ ] (loaded to R0 by instr 1 of iter N) is multiplied by c At instruction 1 of iteration N, b[ N-1 ]*c (computed in R1 by instr 2 of iter N-1) is added to d

of iteration N, b[ ]*c (computed in R1 by instr 2 of iter N-1) is added to d At instruction 2 of iteration N, b[N-1]*c+d (computed in R2 by instr 1 of iter N) is stored to a[N]

In common VLIW terminology, the number of instructions in the loop body, known to the rest of humanity as "throughput", is called "initiation interval". "Modulo scheduling" is presumably so named because the instructions implementing a loop body are scheduled "modulo initiation interval". In our second example, the operations in the sequence Load, Mul, Add, Store go to instructions 0,1,0,1 = 0%2,1%2,2%2,3%2. In our first example, everything goes to i%1=0 – which is why I needed an example with at least 2 instructions in a loop, "modulo 1" being a poor way to illustrate "modulo".

In practice, "modulo scheduling" grows more hairy than simply computing the initiation interval, creating a linear schedule for your program and then "wrapping it around" the initiation interval using %. For example, if for whatever reason we couldn't issue Mul and Store at the same cycle, we could still implement the loop at the 2 cycles/iteration throughput, but we'd have to move the Mul forward in our schedule, and adjust the rest accordingly.

I've done this kind of thing manually for some time, and let me assure you that fun it was not. An initiation interval of 3 with 10-15 temporary variables was on the border of my mental capacity. Compilers, on the other hand, are good at this, because you can treat your input program as a uniform graph of operations and their dependencies, and a legal schedule preserving its semantics is relatively easy to define. You have a few annoyances like pointer aliasing which precludes reordering, but it's a reasonably small and closed set of annoyances. Quoting "Embedded Computing: A VLIW Approach" (3.2.1, p. 92): "All of these problems have been solved, although some have more satisfyingly closed-form solution than others." Which is why some people with years of experience on VLIW targets know almost nothing about modulo scheduling – a compiler does a fine job without their help.

The book goes on to say that "Using a VLIW approach without a good compiler is not recommended" – in other words, a human without a compiler will not perform very well. Based on my experience of hand-coding assembly for a VLIW, I second that. I did reach about 95% of the performance of a compiler that was developed later, but the time it took meant that many optimizations just wouldn't fit into a practical release schedule.

Optimizing for SIMD targets

I will try to show that humans optimize well for SIMD targets and compilers don't. I'll quote "Embedded Computing: A VLIW Approach" more extensively in this section. A book on VLIW may not sound like the best source for insight on SIMD, however, I somewhat naturally haven't heard of a book on SIMD stressing how compilers aren't good at optimizing for it. But then I haven't heard of a book stressing the opposite, either, and success papers I saw claimed at automatic vectorization was modest. Furthermore, the particular VLIW book I quote is in fact focusing on embedded DSP where SIMD is ubiquitous, and its central theme is the importance of designing processors in ways making them good targets for optimizing compilers. It sounds like a good place to look for tips on designing compilers to work well with SIMD and vice versa; and if they say they have no such tips, it's telling.

And in fact the bottom line of the discussion on SIMD (which they call "micro-SIMD") is fairly grim: "The ability of compilers to automatically extract micro-SIMD without hints (and in particular, without pointer alignment information) is still unproven, and manual code restructuring is still necessary to exploit micro-SIMD parallelism" (4.1.4, p. 143). This statement from 2005 is consistent with what (AFAIK) compilers can do today. No SIMD-targeted programming environment I know relieves you of the need to use intrinsics in your C code as in "a = Add8(b,c)", where Add8 is a built-in function-looking operator translated to a SIMD instruction.

What I find fascinating though is the way they singled out pointer alignment as a particularly interesting factor necessitating "hints". Sure, most newbies to SIMD are appalled when they find out about the need to align pointers to 16 bytes if you want to use instructions accessing 16 bytes at a time. But how much of a show-stopper can that be if we are to look at the costs and benefits more closely? Aligning pointers is easy, producing run time errors when they aren't is easier, telling a compiler that they are can't be hard (say, gcc has a __vector type modifier telling that), and alternatively generating two pieces of code – optimized for the aligned case and non-optimized for the misaligned case – isn't hard, either (the book itself mentions still other option – generating non-optimized loop header and trailer for the misaligned sections of an array).

There ought to be more significant reasons for people to be uglifying their code with non-portable intrinsics, and in fact there are. The book even discusses them in the pages preceeding the conclusion – but why doesn't it mention the more serious reasons in the conclusion? To me this is revealing of the difference between a programmer's perspective and a compiler writer's perspective, which is related to the difference between optimization and chess: in chess, there are rules.

For an optimizing programmer, SIMD instructions are a resource from which most benefit must be squeezed at any reasonable cost, including tweaking the behavior of the program. For an optimizing compiler, SIMD instructions are something that can be used to implement a piece of source code, in fact the preferable way to implement it – as long as its semantics are preserved. This means that a compiler obeys rules a programmer doesn't, making winning impossible. A typical reaction of a compiler writer is to think of this as not his problem – his problem ending where program transformations preserving the semantics are exhausted. I think this is what explains the focus on things like pointer alignment (which a compiler can in fact solve with a few hints and without affecting the results of the program) at the expense of the substantive issues (which it can't).

In the context of SIMD optimizations, the most significant example of rules obeyed by just one of the contestants has to do with precision, which the book mentions right after alignment in its detailed discussion of the problems with SIMD. "Even when we manipulate byte-sized quantities (as in the case of most pixel-based images, for example), the precision requirements of the majority of manipulation algorithms require keeping a few extra bits around (9, 12, and 16 are common choices) for the intermediate stages of an algorithm. …this forces us up to the next practical size of sub-word … reducing the potential parallelism by a factor of two up front." They go on to say that a 32b register will end up keeping just 2 16b numbers, giving a 2x speed-up – modest considering all the cases when you won't get even that due to other obstacles.

This argument shows the problems precision creates for the hardware implementation of SIMD. However, the precision of intermediate results isn't as hard a problem as this presentation makes it sound, because intermediate results are typically kept in registers, not in memory. So to keep the extra bits in intermediate results, you can either use large registers for SIMD operations and not "general-purpose" 32b ones, or you can keep intermediate results in pairs of registers – as long as you have enough processing units to generate and further process these intermediate results. Both things are done by actual SIMD hardware.

However, the significant problems created by precision lie at the software side: the compiler doesn't know how many bits it will need for intermediate results, nor when precision can be traded for performance. In C, the type of the intermediate results in the expression (a[i]*3+b[i]*c[i])>>d is int (roughly, 32b), even if a, b and c are arrays of 8b numbers, and the parenthesized expression can in fact exceed 16b. The programmer may know that b[i]*c[i] never exceeds, say, 20000 so the whole thing will fit in 16b. That C has no way of specifying precise ranges of values a variable can hold (as opposed to Lisp, of all rivals to the title of the most aggressively optimizing environment) doesn't by itself make an argument since a way could be added, just like gcc added __vector, not to mention the option of using a different language. Specifying the ranges of b[i] and c[i] wouldn't always suffice and we would have to further uglify the code to specify the range of the product (in case both b[i] and c[i] can be large by themselves but never together), but it could be done.

The real problem with having to specify such information to the compiler isn't the lack of a standard way of spelling it, but that a programmer doesn't know when to do it. If it's me who is responsible for the low-level aspects of optimization, I'll notice the trouble with an intermediate result requiring too many bits to represent. I will then choose whether to handle it by investigating the ranges of b[i] and c[i] and restricting them if needed, by moving the shift by d into the expression as in (a[i]*3>>d)+(b[i]*c[i]>>d) so intermediate results never exceed 16b, or in some other way. But if it's the compiler who's responsible, chances are that I won't know that this problem exists at all.

There's a trade-off between performance gains, precision losses and the effort needed to obtain more knowledge about the problem. A person can make these trade-offs because the person knows "what the program really does", and the semantics of the source code are just a rendering of that informal spec from one possible perspective. It's even worse than that – a person actually doesn't know what the program really does until an attempt to optimize it, so even strong AI capable of understanding an informal spec in English wouldn't be a substitute for a person.

A person can say, "Oh, we run out of bits here. OK, so let's drop the precision of the coefficients." Theoretically, and importantly for my claim, strong AI can also say that – but only if it operates as a person and not as a machine. I don't claim that we'll never reach a point where we have a machine powerful enough to join our team as a programmer, just that (1) we probably wouldn't want to and (2) if we would, it wouldn't be called a compiler, it would be called a software developer. That is, you wouldn't press a button and expect to get object code from your source code, you'd expect a conversation: "Hey, why do you need so many bits here – it's just a smoothing filter, do you really think anyone will notice the difference? Do you realize that this generates 4x slower code?" And then someone, perhaps another machine, would answer that yes, perhaps we should drop some of the bits, but let's not overdo it because there are artifacts, and I know you couldn't care less because your job ends here but those artifacts are amplified when we compute the gradient, etc.

This is how persons optimize, and while a machine could in theory act as a person, it would thereby no longer be a compiler. BTW, we have a compiler at work that actually does converse with you – it says that it will only optimize a piece of code if you specify that the minimal number of iterations executed is such and such; I think it was me who proposed to handle that case using conversation. So this discussion isn't pure rhetoric. I really wish compilers had a -warn-about-missed-optimization-opportunities switch that would give advice of this kind; it would help in a bunch of interesting cases. I just think that in some cases, precision being one of them, the amount and complexity of interactions needed to make headway like that exceeds the threshold separating aggressive optimization from aggressive lunacy.

To be sure, there are optimization problems that could be addressed by strong AI. In the case of SIMD, the book mentions one such area – they call it "Pack, Unpack, and Mix". "Some programs require rearranging the sub-words within a container to deal with the different sub-word layouts. From time to time, the ordering of the sub-words within a word (for example, coming from loading a word from memory) does not line up with the parallelism in the code… The only solution is to rearrange the sub-words within the containers through a set of permutation or copying operations (for example, the MIX operation in the HP PA-RISC MAX-2 extension)."

An example of this reordering problem is warping: computing a[i]=b[i*step+shift]. This is impossible to do in SIMD without a permutation instruction of the kind they mention (PowerPC's AltiVec has vec_perm, and AFAIK x86's SSE has nothing so you can't warp very efficiently). However, even if an instruction is available, compilers are AFAIK unable to exploit it. I see no reason why sufficiently strong AI couldn't manage to do such things with few hints in some interesting cases. I wouldn't bet my money on it – I side with Mitch Kapor on the Turing Test bet, but it is conceivable like the invincible chess playing program, and unlike transformations requiring "small" changes of the semantics.

Significance

There are areas of optimization that are very significant commercially but hardly interesting in a theoretical discussion (and this here's a distinctively theoretical discussion as is any discussion where the possibility of strong AI is supposed to be taken into account).

For example, register allocation for the x86 is exceedingly gnarly and perhaps an interesting argument could be made to defend the need for human intervention in this process in extreme cases (I wouldn't know since I never seriously optimized for the x86). However, a general claim that register allocation makes compiler optimization hard wouldn't follow from such an argument: on a machine with plentiful and reasonably uniform registers, it's hard to imagine what a human can do that a compiler can't do better, and almost everybody would agree that the single reason for not making hardware that way is a commercial one – to make an x86-compatible processor.

Now, I believe that both SIMD and VLIW instruction encodings don't have this accidental nature, and more likely are part of the Right Way of designing high-performance processors (assuming that it makes no sense to move cost from software to hardware and call that a "performance gain", that is, assuming that performance is measured per square millimeter of silicon). One argument of rigor worthy of a pop tech blog is that most high-end processors have converged to SIMD VLIW: they have instructions processing short vectors and they can issue multiple instructions in parallel; some do the latter in the "superscalar" way of having the hardware analyze dependencies between instructions at run time and others do it in the "actual VLIW" way of having the lack of dependencies proven and marked by the compiler, but you end up doing modulo scheduling anyway.

However, this can of course indicate uninformed consumer preference rather than actual utility (I type this on a noisy Core 2 Duo box running Firefox on top of XP, a job better handled by a cheaper, silent single-core – and I'm definitely a consumer who should have known better). So my main reasons for believing VLIW and SIMD are "right" are abstract considerations on building von Neumann machines:

You typically have lots of distinct execution hardware: a multiplier has little in common with a load/store unit. Up to a point, it will therefore make sense to support parallel execution of instructions on the different execution hardware. The cost of supporting it will be more I/O ports connecting the execution units with the register file – quite serious because of the multiplexers selecting the registers to read/write. However, the cost of not supporting it will be more execution hardware left unused for more time. So the optimum is unlikely to be "no parallel execution", it's likely "judicious parallel execution".

It is cheaper to have few wide registers and wide buses between the register file and the execution units than it is to have many narrow registers and buses. That's because the cost of the register file is proportional to the product of #registers and #buses to the execution units. It is thus significantly cheaper to have 1 unit with 4 8bx8b multipliers and 2 32b buses for the inputs then it is to have 4 units with 1 8bx8b multiplier in each and 8 8b buses for the inputs. It's also cheaper to keep 4 bytes in 1 32b register than in 4 8b registers. Likewise, it is cheaper to have 4 multipliers in 1 processor than to have 4 full-blown processor cores, because each core would have, say, its own fetch and decode logic and instruction cache – which are in fact pure overhead. So if you have a von Neumann machine with registers and buses and instruction cache, it makes sense (up to a point) to add SIMD to make the best of that investment, and this is why commercial VLIWs have SIMD, although the VLIW theory recommends more units instead.

Since I believe that both VLIW and SIMD are essential for maximizing hardware performance, I also tend to think that optimizations needed to utilize these features are "mainstream" enough to support a broad claim about optimization in general. And the main point of my claim is that compilers can't win in the optimization game, because part of that game is the ability to change the rules once you start losing.

Humans faced with a program they fail to optimize change the program, sometimes a little, sometimes a lot – I heard of 5×5 filters made 4×4 to run on a DSP. But even if we exclude the truly shameless cheating of the latter kind, the gentler cheating going into every serious optimization effort still requires to negotiate and to take responsibility in a way that a person – human or artificial – can, but a tool like a compiler can not.

Modulo scheduling is an example of the kinds of optimizations which in fact are best left to a compiler – the ones where rules are fixed: once the code is spelled, few can be added by further annotations by the author and hence the game can be won without much negotiations with the author; although sometimes a little interrogation can't hurt.