I have now had an opportunity to test the new Sandy Bridge processor from Intel, and the results are very interesting. There are many improvements - and a few drawbacks. I have updated my manuals with the details, but let me summarize the main findings here:

New micro-op cache
The decoders translate the CISC-style instructions to RISC-style micro-operations. The Sandy Bridge has a new cache for storing decoded micro-operations after the decoders, while the traditional code cache before the decoders is still there. The micro-op cache turns out to be very efficient in my tests. It is easy to obtain a throughput of 4, or even 5, instructions per clock cycle as long as the code fits into the micro-op cache.

Decoders
While the throughput is improved quite a lot for code that fits into the micro-op cache, it is not improved in situations where the critical code is too big for the micro-op cache (but not too big for the level-1 code cache). The decoders in the Sandy Bridge are almost identical to the design in previous processors, with the same limitation of 16 bytes per clock cycle. Here the maximum throughput of 4 or 5 instructions per clock cycle is rarely obtained. The difference in performance between code that fits into the micro-op cache and code that doesn't makes the micro-op cache a precious resource. It is so important to economize on the micro-op cache that I would advise never to unroll loops.

Macro-fusion
There is one improvement in the decoders, though. It is possible to fuse two instructions into one micro-op in more cases than before. For example, an ADD or SUB instruction can be fused with a conditional jump into one micro-op. This makes it possible to make a loop where the overhead of the loop counter and exit condition is just one micro-op.

Branch prediction
The branch predictor has bigger history buffers than in previous processors, but the special loop predictor is no longer there. The misprediction penalty is somewhat shorter for code that resides in the micro-op cache.

AVX instruction set
The new AVX instruction set extends the vector registers from 128 bits to 256 bits. The floating point execution units have full 256-bit bandwidth. This means that you can do calculations on vectors of eight single-precision or four double-precision numbers with a throughput of one vector addition and one vector multiplication per clock cycle. I found that this doubled throughput is obtained only after a warm-up period of several hundred floating point operations. In the "cold" state, the throughput is only half this value, and the latencies are one or two clocks longer. My guess is that the Sandy Bridge saves power by turning off the most expensive execution units when they are not needed, and turns on the full execution power only when the load is heavy. This is only a guess - I have found no official mention of this warm-up effect.



Another advantage of the AVX instruction set is that all vector instructions now have a non-destructive version with three operands where the destination is stored in a separate register. Instead of A = A + B, we now have C = A + B, so that the value of A is not overwritten by the result. This saves a lot of register moves.



A disadvantage of the AVX instruction set is that all vector instructions now have two versions, a non-destructive AVX version and a two-operand non-AVX version, and you are not supposed to mix these two versions. If the programmer inadvertently mixes AVX and non-AVX vector instructions in the same code, there is a penalty of 70 clock cycles for each transition between the two forms. I bet that this will be a very common programming error in the future - and an error that is quite difficult to detect because the code still works, albeit more slowly.

More memory ports
The Sandy Bridge has two memory read ports where previous Intel processors have only one. The maximum throughput is now 256 bits read and 128 bits written per clock cycle. The flip side of this coin is that the risk of contention in the data cache increases when there are more memory operations per clock cycle. In my tests, it was quite difficult to maintain the maximum read and write throughput without being delayed by cache bank conflicts.

Misaligned memory operands handled efficiently
On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except that a misaligned operand uses more cache banks, so the risk of cache conflicts is higher. Store-to-load forwarding also works with misaligned operands in most cases.

Register read ports
Previous Intel processors have a serious - and often neglected - bottleneck in the register read ports. Ever since the Pentium Pro processor back in 1995, the Intel family 6 processors have had a limitation of 2 or 3 reads from the permanent register file per clock cycle. This bottleneck has finally been removed in the Sandy Bridge.

Zeroing instructions
An instruction that subtracts a register from itself always gives zero, regardless of the previous value of the register. This is traditionally a common way of setting a register to zero, and many modern processors recognize that such an instruction doesn't have to wait for the previous value of the register. What is new in the Sandy Bridge is that it doesn't even execute the instruction. The register allocator simply allocates a new empty register for the result without sending anything to the execution units. This means that you can do four zeroing instructions per clock cycle without using any execution resources. NOPs are treated in the same efficient way, without using any execution unit.



This technique is not new, actually. It has been used for many years with the FXCH instruction (exchange floating point registers). There are special reasons for resolving the FXCH instruction in the register allocator/renamer, but it is funny that this technique hasn't been extended to other uses until now. It would be obvious to use it for register-to-register moves too, but so far we have not seen such an application.

Data transport delay
Most modern processors have different execution unit clusters or domains for different types of data or different types of registers, e.g. integer and floating point. Many processors have a delay of one or two clocks for moving data from one such domain to another. These delays are diminished in the Sandy Bridge, and in some cases completely removed. I found that it is possible to move data between integer registers and vector registers without any delay.

Writeback conflicts
When two micro-operations with different latencies run in the same execution port, they may both finish at the same time. This leads to a conflict when both need the writeback port and the result bus at the same time. Both Intel and AMD processors have this problem. The Sandy Bridge avoids most writeback conflicts by fixing execution latencies to standard values, by allowing simultaneous writeback to different execution domains, and by delaying writeback when there is a conflict.

Floating point underflow and denormal numbers
Denormal numbers are floating point numbers that are coded in a special, non-normal way, defined by the IEEE 754 standard, which is used when the value is close to underflow. Most processors are unable to handle floating point underflow, denormal numbers, and other special cases in the general floating point execution units. These special cases are typically handled by microcode exceptions at a cost of 150 - 200 clocks per instruction. The Sandy Bridge can handle many of these special cases in hardware without any penalty. In my tests, underflow and denormal numbers were handled just as fast as normal floating point numbers for addition, but not for multiplication.

My conclusion is that the Sandy Bridge processor has many significant improvements over previous processors. The most serious bottlenecks and weaknesses of previous processors have been removed. The micro-op cache turns out to be an important improvement for relatively small loops. Unfortunately, the poor performance of the decoders has not been improved. This remains a likely bottleneck for code that doesn't fit into the micro-op cache.

The decoding of instruction lengths has been a problem in Intel processors for many years. They tried to fix the problem with the trace cache in the Pentium 4, which turned out to be a dead end, and now with the apparently more successful micro-op cache in the Sandy Bridge. AMD have solved the problem of detecting instruction lengths in their processors by marking instruction boundaries in the code cache. Intel did the same in the Pentium MMX back in 1996, and it is a mystery to me why they are not using this solution today. There would hardly be a need for the micro-op cache if they had instruction boundaries marked in the code cache.

Whenever the narrowest bottleneck of a system is removed, the next less narrow bottleneck becomes visible. This is also the case here.
As the memory read bandwidth is doubled, the risk of cache bank conflicts is increased. Cache conflicts were actually the limiting factor in some of my tests.

It has struck me that the new Sandy Bridge design is actually under-hyped. I would expect a new processor design with so many improvements to be advertised aggressively, but the new design doesn't even have an official brand name. The name Sandy Bridge is only an unofficial code name. In Intel documents it is variously referred to as "second generation Intel Core processors", "2xxx series", and "Intel microarchitecture code name Sandy Bridge". I have never understood what happens in Intel's marketing department. They keep changing their nomenclature, and they use the same brand names for radically different technical designs. In this case they have no reason to obscure technical differences. How can they cash in on the good reputation of the Sandy Bridge design when it doesn't even have an official name?

[Corrected on June 08, 2011, and Mar 2, 2012]