Introducing a new, advanced Visual C++ code optimizer

Gratian

May 4th, 2016

We are excited to announce the preview release of a new, advanced code optimizer for the Visual C++ compiler backend. It provides many improvements for both code size and performance, bringing the optimizer to a new standard of quality expected from a modern native compiler.

This is the first public release, and we encourage people to try it and to send suggestions and feedback about potential bugs. The official release of the new optimizer is expected to ship in Visual Studio Update 3, while the release available today is unsupported and mostly for testing purposes.

How to try it out

The compiler bits with the new optimizer are very easy to get: just install the latest VisualCppTools package using NuGet. Details about how to do this are available in this blog post. Once installed, compile your applications the usual way – the optimizer is enabled by default on all architectures.

Update 06/10/2016: The new optimizer is now also available as part of Visual Studio Update 3 RC.

Reporting bugs and suggestions

We are hoping to get as much feedback as possible about bugs you have found or suggestions you may have. If you believe you found a bug, you can confirm it’s caused by the new optimizer by using the following undocumented flag to disable it: -d2SSAOptimizer-

In the Visual Studio IDE, add the flag to the project Property Pages -> C/C++ -> Command Line -> Additional Options text box

If you compile from command line using cl.exe, add the flag before any /link options

If the bug does not manifest anymore with -d2SSAOptimizer-, please follow the steps below:

Submit a bug report using the Connect website

Prefix the title with [SSA Optimizer]

Attach details such as the compiler version, compile flags, and the source code that reproduces the bug in the form of pre-processed files or a linkrepro. Bruce Dawson’s blog has a great post about producing high-quality bug reports

You can also send an email directly to gratilup@microsoft.com

Why a new optimizer?

The main motivation for a new optimizer framework was the desire to have more aggressive optimizations, such as ones that take advantage of more compile-time information and modern compiler developments. The design of some of the older optimization passes made it difficult to implement more advanced transformations and to make improvements at a faster pace. As the new framework was intended to be the basis of many future optimization efforts, a core design objective was to make it easier to implement, test and measure new optimizations.

Some of the main goals of the project:

Improving the code quality for both scalar and vector code

There are many cases where both performance and code size can be improved, sometimes quite substantially. The framework attempts to solve several deficiencies of the old optimizer:

The old expression optimizer has a small set of known transformations and a limited view of the function – this prevents discovering all the expressions that could be optimized.

Many small optimizations based on identifying patterns – known as peephole optimizations – are either missing or implemented only for certain target architectures.

Vector code – either from intrinsics or generated by the auto-vectorizer – can be optimized better.



The new optimizer takes advantage of the Static Single Assignment form, which allows handling more complex expressions that potentially span the entire function. Another advantage of the SSA form is that it makes it possible to write simpler and more efficient algorithms, eliminating the need for more complicated and slower techniques such as data-flow analysis.

Peephole optimizations can now be implemented in a target-independent way, using a pattern matching system that is very fast (based on template meta-programming) and which requires little code to be written. This allowed adding a large number of patterns in a fraction of the time it would take with the usual way of identifying patterns.

The same pattern matching mechanism can be used for vector operations, making it now possible to optimize expressions using both integer and float vector operations as easily as expressions with scalar operations. Note that this feature is not yet complete or enabled.
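To give a feel for what a template-based pattern matcher can look like, here is a minimal sketch. It is not the compiler's actual matcher, which this post does not show; every name in it (Expr, Any, ConstIs, Binary, is_rem_by_two) is hypothetical. Patterns are composed from small template types, so the compiler can inline the whole match into a few comparisons.

```cpp
#include <cassert>

// Hypothetical, tiny expression IR used only for this illustration.
enum class Op { Const, Var, And, Rem };

struct Expr {
    Op op;
    long value;       // meaningful for Op::Const only
    const Expr* lhs;  // operands, meaningful for binary operations only
    const Expr* rhs;
};

// Matches any expression and captures a pointer to it.
struct Any {
    const Expr*& out;
    bool match(const Expr* e) const { out = e; return true; }
};

// Matches one specific constant value.
template <long C>
struct ConstIs {
    bool match(const Expr* e) const { return e->op == Op::Const && e->value == C; }
};

// Matches a binary operation whose operands match the two sub-patterns.
template <Op O, typename L, typename R>
struct Binary {
    L l;
    R r;
    bool match(const Expr* e) const {
        return e->op == O && l.match(e->lhs) && r.match(e->rhs);
    }
};

template <Op O, typename L, typename R>
Binary<O, L, R> binary(L l, R r) { return {l, r}; }

// Pattern "x % 2": on success, `operand` points at x.
bool is_rem_by_two(const Expr* e, const Expr*& operand) {
    return binary<Op::Rem>(Any{operand}, ConstIs<2>{}).match(e);
}

// Builds a % 2 and a & 2, then checks that only the first one matches.
bool demo_match() {
    Expr a{Op::Var, 0, nullptr, nullptr};
    Expr two{Op::Const, 2, nullptr, nullptr};
    Expr rem{Op::Rem, 0, &a, &two};
    Expr band{Op::And, 0, &a, &two};
    const Expr* captured = nullptr;
    bool rem_ok = is_rem_by_two(&rem, captured) && captured == &a;
    const Expr* unused = nullptr;
    return rem_ok && !is_rem_by_two(&band, unused);
}
```

Adding a new pattern in this style is a one-line composition of existing matchers, which is the property the paragraph above describes.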

Designing a framework that allows easy development, with less potential for mistakes

Being able to quickly prototype ideas and move to a reliable implementation is one of the main advantages of the new framework. It includes various helpers for easier manipulation of the SSA form, pattern matching of expressions, building new expressions and doing safety checks in the presence of pointer aliasing and exception handling.

Performing better static analysis of the code

The new optimizer also adds new static analysis modules, including those that can identify when a value is Boolean (exactly either 0 or 1), when a value is always positive, and when a value cannot be zero. It also has a powerful module that can estimate known one/zero bits of a value, and the ranges a value could fall in. The results are either used as preconditions for certain optimizations, to eliminate some useless operations completely or to transform operations into a form that can be optimized better.

Strong emphasis on testing and correctness

Given the large scope of the project, ensuring and maintaining correctness was a top priority. This was achieved by using formal verification, testing with randomly-generated programs (fuzz testing) and popular programs and libraries such as Chrome, Firefox, CoreCLR and Chakra. See the Testing approach section below for more details.

Examples of implemented optimizations

The following is an example that illustrates just a few of the many new transformations the new optimizer implements. This sort of code is often found in codecs:

int test(int a) { return a % 2 != 0 ? 4 : 2; }

x64 assembly with old optimizer:

    ?test@@YAHH@Z PROC
        and   ecx, -2147483647
        jge   SHORT $LN3@test
        dec   ecx
        or    ecx, -2
        inc   ecx
    $LN3@test:
        test  ecx, ecx
        mov   eax, 2
        mov   edx, 4
        cmovne eax, edx
        ret   0

x64 assembly with new optimizer:

    ?test@@YAHH@Z PROC
        and   ecx, 1
        lea   eax, DWORD PTR [rcx*2+2]
        ret   0

The execution time with the old optimizer is approximately 5 cycles in the best case (this assumes out-of-order execution and perfect branch prediction) and at least 10 cycles in the worst case. With the new optimizer, execution time is always 2 cycles. Obviously, there are also important savings in code size.

Very interesting results can be achieved by combining multiple smaller transformations. In this case, there are two patterns applied to produce the final result:

a % 2 == 0 -> a & 1 == 0
Since the remainder is compared to zero, the sign of a does not affect the compare result and the remainder can be replaced by AND.

a<bool> ? C1 : C2 -> C2 + a*(C1-C2)
A ternary question operation selecting between two constants. The first requirement is that the condition value is Boolean, which the static analysis package can determine. The second is that C1-C2 is a power of two, so that a shift or LEA is generated instead of a multiplication.
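The two rewrites can be applied by hand and checked with a short C++ sketch. The function names (test_original, test_rewritten, forms_agree) are made up for this illustration; the loop only confirms that the source form and the rewritten form agree, including for negative inputs.

```cpp
#include <cassert>

// Source form: remainder compared against zero, then a select between constants.
int test_original(int a) { return a % 2 != 0 ? 4 : 2; }

// Hand-applied rewrite: (a & 1) is Boolean and C1 - C2 = 2 is a power of two,
// so the select becomes C2 + bit * (C1 - C2), which maps to a single LEA.
int test_rewritten(int a) {
    int bit = a & 1;           // a % 2 != 0   ->  (a & 1) != 0
    return 2 + bit * (4 - 2);  // bit ? 4 : 2  ->  2 + bit * 2
}

// Returns true when both forms agree on every value in [lo, hi].
bool forms_agree(int lo, int hi) {
    for (int a = lo; a <= hi; ++a) {
        if (test_original(a) != test_rewritten(a)) return false;
    }
    return true;
}
```

Note that the expression 2 + bit * 2 has the same shape as the address computation [rcx*2+2] in the new assembly.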

Let’s see a few more examples of interesting optimizations and patterns that are implemented. Focus was put especially on operations that were previously not optimized very well, such as comparisons, conversions, divisions, question and control-flow dependent expressions (PHI operations in SSA form). Although some examples might seem unlikely to be written like that in the source code, they do appear quite often after inlining and other transformations.

Improved optimization of arithmetic expressions, including scalar float operations

The SSA form exposes larger expressions, which can span the entire function – this allows discovering more optimization opportunities, especially when combined with expression reassociation. There are also dozens of new patterns added, such as the following ones:

(a / C1) / C2 -> a / (C1 * C2)
(a * C1) / C2 -> a * (C1 / C2)
a / (x ? C1 : C2) -> a >> (x ? log2(C1) : log2(C2)) // C1 and C2 must be power of two constants
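As a quick sanity check of two of these patterns, the sketch below compares the original and rewritten forms for unsigned 32-bit values. The choice of unsigned operands and of the constants 3, 5, 8 and 2 is an assumption made for the example; the post does not say which types each pattern applies to, and signed division by a power of two would need extra care for negative inputs.

```cpp
#include <cassert>
#include <cstdint>

// (a / C1) / C2 -> a / (C1 * C2), with C1 = 3 and C2 = 5.
// The product C1 * C2 must not overflow for the rewrite to be valid.
uint32_t div_chain(uint32_t a)  { return (a / 3u) / 5u; }
uint32_t div_merged(uint32_t a) { return a / 15u; }

// a / (x ? C1 : C2) -> a >> (x ? log2(C1) : log2(C2)), with C1 = 8 and C2 = 2.
uint32_t div_select(uint32_t a, bool x)   { return a / (x ? 8u : 2u); }
uint32_t shift_select(uint32_t a, bool x) { return a >> (x ? 3 : 1); }

// Returns true when every rewritten form matches its original below `limit`.
bool rewrites_agree(uint32_t limit) {
    for (uint32_t a = 0; a < limit; ++a) {
        if (div_chain(a) != div_merged(a)) return false;
        if (div_select(a, true) != shift_select(a, true)) return false;
        if (div_select(a, false) != shift_select(a, false)) return false;
    }
    return true;
}
```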

Most new float optimizations are enabled only under -fp:fast, but some of them are valid under the default -fp:precise. More information about the optimizations allowed under different floating point models is available in the documentation: Microsoft Visual C++ Floating-Point Optimization

Optimizing control-flow dependent expressions

I mentioned above that the SSA format simplifies handling larger, more complex expressions. One advantage is that it makes it easier to reason about variables that are either redefined, or defined with different values based on the path taken in the function. As its name implies, SSA solves this by creating a different version of the variable each time it is redefined; if there are points in the function where a variable has more than one possible value, a pseudo-operation known as PHI is inserted, merging all values.

Although building the SSA format is quite complicated, the example below should be simple enough to get a good intuition about SSA and the role of the PHI operations:

Original code:

    int test(int a, int b) {
        int x, y, z;
        if(a > 3) {
            x = 4;
            y = 1;
            z = b & 0xFF00;
        }
        else {
            x = 9;
            y = 2;
            z = b << 8;
        }
        int p = (x * y) * 4;
        int q = z & 0xF;
        return p >= 16 && q == 0;
    }

After SSA conversion:

    int test(int a1, int b1) {
        int x0, y0, z0; // undefined
        if(a1 > 3) {
            x1 = 4;
            y1 = 1;
            z1 = b1 & 0xFF00;
        }
        else {
            x2 = 9;
            y2 = 2;
            z2 = b1 << 8;
        }
        x3 = PHI (x1, x2)
        y3 = PHI (y1, y2)
        z3 = PHI (z1, z2)
        int p1 = (x3 * y3) * 4;
        int q1 = z3 & 0xF;
        return p1 >= 16 && q1 == 0;
    }

As can be seen in the version after SSA conversion, each variable is renamed to multiple versions (indicated by the number suffix). After the if-then-else statement, all three variables can have two different values, depending on the runtime result of a > 3, making it necessary to insert PHI operations.

The new optimizer is able to take advantage of the PHI operations and turn the entire function into the equivalent of return 1, all other code being removed by Dead Code Elimination. That’s 1 instruction compared to the 18 that were generated before on x64. For p1 >= 16 it computes every possible value and compares it with 16, which is the minimum possible value. For q1 == 0 it checks if the low bits are known to be zero in both z1 and z2.
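The reasoning for the two conditions can be replayed in a few lines of C++. This is only an illustration of the argument, not the optimizer's code; the function names are invented, and b is taken as unsigned here (an assumption) so that the left shift is well defined for every input.

```cpp
#include <cassert>
#include <vector>

// x3 = PHI(4, 9) and y3 = PHI(1, 2): test p1 >= 16 for every combination
// of incoming values, as described in the text.
bool p_always_at_least_16() {
    const std::vector<int> x3 = {4, 9};
    const std::vector<int> y3 = {1, 2};
    for (int x : x3) {
        for (int y : y3) {
            if ((x * y) * 4 < 16) return false;  // smallest product is 4 * 1 * 4 = 16
        }
    }
    return true;
}

// z1 = b & 0xFF00 and z2 = b << 8 both leave the low 8 bits zero,
// so z3 & 0xF is zero on either path, whatever b is.
bool q_always_zero(unsigned b) {
    unsigned z1 = b & 0xFF00u;
    unsigned z2 = b << 8;
    return (z1 & 0xFu) == 0 && (z2 & 0xFu) == 0;
}
```

With both conditions proven, p1 >= 16 && q1 == 0 folds to 1 and everything feeding it becomes dead code.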

The old expression optimizer is not able to reason about the larger expressions that involve these PHI operations – this causes it to miss many optimization opportunities, like the ones exemplified above. In the new optimizer, every operation and static analysis supports PHI. A few more examples:

(phi 3, 5) + 2 -> phi 5, 7 // constant-fold by pushing operand inside a PHI
(phi b+3, b+5) - b -> phi 3, 5 // eliminate operation by pushing operand inside a PHI
phi a+x, b+x -> (phi a, b) + x // extract a common operand from a PHI
(phi 1,2) + 3 < (phi 3,4) + 5 -> true // fold compare by testing all combinations
(phi 1,2) * (phi 2,3) > (phi 6,7) * (phi 2,3) -> false // similar to above example
(phi 1,0) * 5 > (phi 1,2) -> undecidable // 0 * 5 < (phi 1,2)
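The "testing all combinations" idea can be sketched as follows, modelling a PHI of constants as a plain list of values. The Phi alias, the Fold enum and both functions are hypothetical names for this example; the real optimizer works on its own SSA data structures.

```cpp
#include <cassert>
#include <vector>

using Phi = std::vector<int>;  // the constant values a PHI may merge

enum class Fold { AlwaysTrue, AlwaysFalse, Undecidable };

// Pushes a constant addition inside the PHI: (phi 3, 5) + 2 -> phi 5, 7.
Phi add_const(const Phi& p, int c) {
    Phi out;
    for (int v : p) out.push_back(v + c);
    return out;
}

// Folds "lhs < rhs" by testing every pair of incoming values.
// The compare folds only if all pairs give the same answer.
Fold fold_less(const Phi& lhs, const Phi& rhs) {
    bool seen_true = false;
    bool seen_false = false;
    for (int l : lhs) {
        for (int r : rhs) {
            if (l < r) seen_true = true;
            else       seen_false = true;
        }
    }
    if (seen_true == seen_false) return Fold::Undecidable;  // mixed results or empty input
    return seen_true ? Fold::AlwaysTrue : Fold::AlwaysFalse;
}
```

The number of pairs grows with the product of the PHI sizes, so a production implementation would presumably cap the number of combinations it is willing to test.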

The following is an interesting case found in Mozilla Firefox. A Boolean expression, spanning an if-then-else statement, is used in a negated form if(!expr). The new algorithm that tries to cancel an inverted Boolean operation by inverting every subexpression did the following transformation, eliminating the inversion:

(phi 0, (x ? 1 : 0)) ^ 1 -> phi 1, (x ? 0 : 1)

Better conditional move generation

Converting branches to CMOV produces more compact code that usually executes faster. The late CMOV generation phase is augmented by generating question operations in the new optimizer. In doing so, already-existing transformations can be applied, simplifying things even further. In the following examples, the left-hand side is a newly-detected CMOV pattern, and the right-hand side is the code after a transformation is applied:

a < 0 ? 1 : 0 -> a >> 31 // logical shift
a < 0 ? 4 : 0 -> (a >> 31) & 4 // arithmetic shift
a<bool> != b<bool> ? 1 : 0 -> a ^ b // a, b must be Boolean values
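Written out in C++ for 32-bit integers, the three rewrites look roughly like this. The function names are invented for the example. One assumption to keep in mind: right-shifting a negative signed value is an arithmetic shift on mainstream compilers and is guaranteed from C++20 on, but it is implementation-defined in earlier standards.

```cpp
#include <cassert>
#include <cstdint>

// a < 0 ? 1 : 0  ->  a >> 31 using a logical (unsigned) shift:
// the sign bit moves to bit 0 and every other bit becomes zero.
int32_t sign_as_bool(int32_t a) {
    return static_cast<int32_t>(static_cast<uint32_t>(a) >> 31);
}

// a < 0 ? 4 : 0  ->  (a >> 31) & 4 using an arithmetic shift:
// the shift yields all ones for negative values and zero otherwise.
int32_t sign_select_4(int32_t a) {
    return (a >> 31) & 4;
}

// a != b ? 1 : 0  ->  a ^ b, valid only when both inputs are 0 or 1.
int32_t bool_not_equal(int32_t a, int32_t b) {
    return a ^ b;
}
```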

CMOV performance can sometimes be hard to estimate, especially on modern CPUs with good branch prediction. To help in cases where a branch would be faster, when profile information is available, the CMOV is not generated if the branch is highly predictable (heavily biased as either taken or not-taken).

Improved optimization of compare operations

Comparisons are the operations with the most improvements. Since reducing the number of branches benefits both code size and performance, the focus was mainly on branch folding (eliminating a branch by proving that it is either taken or not-taken). Besides the usual tests for comparing constants, static analysis is used to estimate value ranges and known one/zero bits, making it possible to handle more complicated cases. Among the dozens of transformations that simplify comparisons, the following one is an example that reduces execution time substantially:

a / 12 == 15 -> a in range [180, 192) -> (a - 180) < 12 // unsigned compare

A division (20+ cycles) is replaced by a simple range check (2 cycles). Even when the “divide by constant” optimization is applied, it is still a few times slower than the range check.
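The bounds come from 15 * 12 = 180 and 16 * 12 = 192. A small sketch for unsigned 32-bit values (function names are illustrative) shows why a single unsigned compare is enough to test both ends of the range:

```cpp
#include <cassert>
#include <cstdint>

// Original form: one division and one compare.
bool div_compare(uint32_t a) { return a / 12u == 15u; }

// Rewritten form: for a < 180 the subtraction wraps around to a very large
// unsigned value, so one compare rejects both a < 180 and a >= 192.
bool range_check(uint32_t a) { return (a - 180u) < 12u; }

// Returns true when both forms agree on every value below `limit`.
bool compare_forms_agree(uint32_t limit) {
    for (uint32_t a = 0; a < limit; ++a) {
        if (div_compare(a) != range_check(a)) return false;
    }
    return true;
}
```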

Bit Estimator

This is a powerful static analysis that can be used to extract more compile-time information about values. Some of the provided features:

Estimating bits known to be one or zero

Proving that a value is not zero

Estimating the minimum and maximum value

Estimating value ranges

Improved overflow checks for addition and subtraction



Below is a simple example showing how the one/zero bits can be computed at compile time, even when nothing is known about the initial values (parameter a in the example below):

int test(unsigned char a) {
    short b = a;    // b: 00000000________, a: ________
    b <<= 4;        // b: 0000________0000
    b |= 3;         // b: 0000________0011
    return b != 0;  // -> return true
}
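One common way to implement this kind of analysis is to track two masks per value, one for bits known to be zero and one for bits known to be one, and to give each operation a transfer function. The sketch below does this for the three steps of the example. It is an assumption about the general technique, not the Bit Estimator's actual code, and the KnownBits type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Known-bits record for a 16-bit value: a set bit in `zeros` means that bit
// is known to be 0, a set bit in `ones` means it is known to be 1.
struct KnownBits {
    uint16_t zeros;
    uint16_t ones;
};

// Zero-extending an unknown 8-bit value: only the top 8 bits are known (zero).
KnownBits zero_extend_from_u8() { return {0xFF00, 0x0000}; }

// Shift left by n: known bits move up, the vacated low bits become known zero.
KnownBits shift_left(KnownBits k, unsigned n) {
    uint16_t low_mask = static_cast<uint16_t>((1u << n) - 1u);
    return {static_cast<uint16_t>((k.zeros << n) | low_mask),
            static_cast<uint16_t>(k.ones << n)};
}

// OR with a constant: bits set in c become known one, other bits keep their state.
KnownBits or_const(KnownBits k, uint16_t c) {
    return {static_cast<uint16_t>(k.zeros & ~c),
            static_cast<uint16_t>(k.ones | c)};
}

// A value with at least one known-one bit cannot be zero.
bool known_nonzero(KnownBits k) { return k.ones != 0; }
```

Chaining zero_extend_from_u8, shift_left by 4 and or_const with 3 reproduces the bit patterns in the comments of the example, and known_nonzero on the result justifies folding b != 0 to true.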

Some of the places where these features are currently used:

Converting signed instructions to unsigned: produces smaller code for division/remainder with constant, allows folding constants into LEA instructions, etc.

Folding comparisons and branches: comparisons are folded using both known bit and value range information. For example, given a == b, if a is known to have a bit set at a position where it is definitely not set in b, the two values cannot be equal. This can be applied to other conditions such as less-than by checking the sign bit. When using value ranges, every range of a is compared with every range of b.

Improved overflow checks: optimizing a + C1 < C2 into a < C2 - C1 is not valid in general, since a + C1 might overflow, giving a different result. Using the known bits or value ranges, it can be proven that the addition does not overflow, which makes the transformation safe. In practice, this usually happens when a is a zero-extension from a smaller type.

Discovering Boolean and positive values: used as pre-conditions for various optimizations, such as the ones applied on question operations. Another example is eliminating an ABS intrinsic if the value is already positive.

Removing redundant AND/OR instructions, eliding useless conversions:



a % C -> 0 if C is a power of two and the low bits in a are zero (a is a multiple of C)
a & C -> 0 if all bits that are one in C are known to be zero in a
a | C -> a if all bits that are one in C are known to be one in a
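Given known-zero and known-one masks for a value, the preconditions listed above reduce to a few mask tests. The sketch below is one possible formulation for 32-bit unsigned values; the KnownBits32 type and the function names are hypothetical and not taken from the compiler.

```cpp
#include <cassert>
#include <cstdint>

// A set bit in `zeros` is known to be 0; a set bit in `ones` is known to be 1.
struct KnownBits32 {
    uint32_t zeros;
    uint32_t ones;
};

// a == b folds to false if some bit is known one in one value
// and known zero in the other.
bool cannot_be_equal(KnownBits32 a, KnownBits32 b) {
    return ((a.ones & b.zeros) | (a.zeros & b.ones)) != 0;
}

// a & C -> 0 if all bits that are one in C are known to be zero in a.
bool and_folds_to_zero(KnownBits32 a, uint32_t c) {
    return (c & ~a.zeros) == 0;
}

// a | C -> a if all bits that are one in C are known to be one in a.
bool or_is_redundant(KnownBits32 a, uint32_t c) {
    return (c & ~a.ones) == 0;
}

// a % C -> 0 if C is a power of two and the low log2(C) bits of a are known zero.
bool rem_folds_to_zero(KnownBits32 a, uint32_t c) {
    bool power_of_two = c != 0 && (c & (c - 1)) == 0;
    return power_of_two && ((c - 1) & ~a.zeros) == 0;
}
```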

Improved Common Subexpression Elimination

Common Subexpression Elimination is an optimization that eliminates redundant operations by replacing them with the result of previous ones that compute the same value – this happens much more often than one may expect. The existing algorithm is augmented with one based on Global Value Numbering, which increases the number of expressions that are found to be equivalent. Although this is a quite simple initial implementation that will be made more powerful, it shows significant improvements for both code size and performance.

Eliminating redundant operations before doing the expression optimization also exposes more opportunities. For example, (a + b) - c -> a if b is found to be equivalent to c.
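The core idea of value numbering can be shown in a few dozen lines: every expression gets a number derived from its operator and the numbers of its operands, so two expressions that compute the same value end up with the same number. The ValueTable class below is a deliberately simple, local sketch with made-up names; the global algorithm in the compiler also has to deal with control flow, memory and PHI operations, none of which is modelled here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>

class ValueTable {
public:
    // Value number for a named input (for example a parameter), created on first use.
    int leaf(const std::string& name) {
        auto it = leaves_.find(name);
        if (it != leaves_.end()) return it->second;
        int id = next_++;
        leaves_[name] = id;
        return id;
    }

    // Value number for a binary operation over two value numbers.
    int binary(char op, int lhs, int rhs) {
        // Canonicalize commutative operators so that a + b and b + a match.
        if ((op == '+' || op == '*') && lhs > rhs) std::swap(lhs, rhs);
        auto key = std::make_tuple(op, lhs, rhs);
        auto it = exprs_.find(key);
        if (it != exprs_.end()) return it->second;  // redundant expression found
        int id = next_++;
        exprs_[key] = id;
        return id;
    }

private:
    int next_ = 0;
    std::map<std::string, int> leaves_;
    std::map<std::tuple<char, int, int>, int> exprs_;
};

// a * b + c and b * a + c receive the same number; a - b does not.
bool demo_redundant_expression_found() {
    ValueTable vt;
    int a = vt.leaf("a"), b = vt.leaf("b"), c = vt.leaf("c");
    int first  = vt.binary('+', vt.binary('*', a, b), c);
    int second = vt.binary('+', vt.binary('*', b, a), c);
    int other  = vt.binary('-', a, b);
    return first == second && first != other;
}
```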

Taking advantage of signed integer overflow being undefined

Historically, Visual C++ did not take advantage of the fact that the C and C++ standards consider the result of overflowing signed operations undefined. Other compilers are very aggressive in this regard, which motivated the decision to implement some patterns which take advantage of undefined integer overflow behavior. We implemented the ones we thought were safe and didn’t impose any unnecessary security risks in generated code.

A new undocumented compiler flag has been added to disable these optimizations, in case an application that is not standard-conformant fails: -d2UndefIntOverflow-. For security reasons, there are patterns that we chose not to optimize, even though the C and C++ standards would allow it by making the potential addition overflow undefined:

a + Constant > a -> true // Constant > 0
a + Constant <= a -> false // Constant > 0

These two tests (and the similar ones with subtraction) are frequently used to check for overflow in places such as file readers and memory allocators. While the use is non-conformant with the standard and a well-known issue, enabling these transformations could potentially break the security of those applications.
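For code that needs such a check, a standard-conformant alternative is to compare against the limits before doing the addition, so that no signed overflow ever takes place. The helper below is a generic sketch with a made-up name, not something taken from Visual C++ or from the applications mentioned above.

```cpp
#include <cassert>
#include <climits>

// Returns true if a + c would overflow the range of int.
// Unlike the idiom "a + c > a", no overflowing addition is performed,
// so the result does not depend on undefined behavior.
bool add_would_overflow(int a, int c) {
    if (c > 0) return a > INT_MAX - c;  // INT_MAX - c cannot overflow when c > 0
    return a < INT_MIN - c;             // INT_MIN - c cannot overflow when c <= 0
}
```

A caller would test add_would_overflow(a, c) first and only compute a + c when it returns false.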

Impact on code size

For most applications code size is reduced, but it can also increase due to interactions with other optimizations. For example, a smaller function is more likely to be inlined into multiple places, resulting in a size increase overall.

Below are some code size results from compiling several large applications on x64:

Application    Old optimizer (bytes)    New optimizer (bytes)    Reduction
Windows        1,112,545,269            1,112,096,059            438 KB
SQL Server     64,078,336               64,032,256               46 KB
Chakra         5,963,621                5,952,997                10 KB

The following table lists the number of instructions, split by category, for the Windows Kernel built for x64 with link-time code generation and profile information. It can be seen that the number of more expensive instructions, such as branches, divisions and multiplications, is reduced. The increase in CMOV and SETcc is a result of more branches being converted to conditional code.

Instruction type    Old optimizer    New optimizer    Difference
CONVERSION          28075            27301            -774
LEA                 87658            87395            -263
SHIFT               15266            15194            -72
SETcc               2222             2345             +123
JUMP                19797            19791            -6
BRANCH              143795           142591           -1204
MUL                 2115             1990             -125
DIV                 541              530              -11
CMOV                4192             5913             +1721

Impact on compiler throughput

For all these improvements, compile time remains mostly the same, with about +/- 2% difference, depending on the application being compiled. For example, Google Chrome shows a compile time slowdown of 1.7%, while compiling the Windows Kernel shows a 2.6% speed-up. The speed-up can be explained by having less code go through the old, slower optimization passes.

Testing approach

Based on previous experience and the scope of the project, it was clear from the start that extensive testing needs to take a central role to ensure correctness. Several testing approaches were used, some to prevent mistakes in the first place, others to catch implementation problems:

Preventing implementation bugs by formally verifying the patterns

Most patterns are quite simple, such as x & 0 => 0. But there are also patterns that require validation that is not always obvious, leaving room for mistakes. The most common validation bugs are:

Failing to check for input preconditions, such as requiring positive numbers, powers of two, numbers with the N top bits 0, etc

Failing to differentiate between signed and unsigned operations. This is especially dangerous for instructions such as CMP, DIV/REM and SHR.

Alive, a tool by Nuno Lopes from Microsoft Research, is a formal verification tool that was used to ensure the patterns and preconditions are correct before implementing them. It uses a language similar to LLVM IR and the Z3 theorem prover to verify if an input pattern is equivalent to the output pattern – if not, it prints a counterexample. Alive has already been used by the LLVM community with great success to discover many bugs. More details about Alive can be found on John Regehr’s blog: ALIVe: Automatic LLVM InstCombine Verifier.
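Alive proves equivalence with a theorem prover. As a much cruder stand-in that still shows what a counterexample looks like, a pattern over small integers can be checked exhaustively. The sketch below does this for all 8-bit signed values; it is not how Alive works, and the function names are invented. The second pattern is a deliberately wrong variant that illustrates the signed/unsigned pitfall listed above.

```cpp
#include <cassert>

// Pattern 1: a % 2 == 0  ->  (a & 1) == 0. Holds for every signed value.
bool rem_eq_zero_pattern_holds() {
    for (int a = -128; a <= 127; ++a) {
        if ((a % 2 == 0) != ((a & 1) == 0)) return false;
    }
    return true;
}

// Pattern 2: a % 2 == 1  ->  (a & 1) == 1. Wrong for signed values, because
// the remainder of a negative odd number is -1, not 1.
// Returns the first counterexample in [-128, 127], or 128 if there is none.
int rem_eq_one_counterexample() {
    for (int a = -128; a <= 127; ++a) {
        if ((a % 2 == 1) != ((a & 1) == 1)) return a;
    }
    return 128;
}
```

The counterexample found for the second pattern is the kind of output a verifier prints when a precondition such as "a must be non-negative" is missing.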

Covering and testing as many patterns as possible using random tests

Csmith is a random C program generator that has been used to discover a large number of bugs in various compilers. More than 15 million programs generated using Csmith have been tested, revealing several bugs in the new optimizer, plus bugs in other optimizer components. C-Reduce was very helpful in dealing with the huge failing tests: it was able to reduce 200KB tests to tests of 2-3KB in size, making it much easier to spot the place with the bug.

Testing every three-instruction expression

Opt-fuzz, a tool by John Regehr from the University of Utah, is able to generate every small integer expression with N instructions and a limited number of possible constants as LLVM IR. The Clang/C2 project made it possible to test all 250+ million tests generated for three-instruction expressions, which revealed several subtle bugs.

Using instrumentation and runtime checks

Complex components, such as the Bit Estimator and Value Numbering, were tested by instrumenting the compiled code with calls to a runtime library that verifies if the compile-time, static analysis results are actually valid. For example, in the case of the Bit Estimator, it would verify that the bits that were estimated to be always zero are zero at runtime. In the case of Value Numbering, it would ensure that two instructions that were assigned the same value number have the same value at runtime.

Testing with popular open-source projects

Exposing the compiler to more real-world code proved to be an effective way to find more bugs. This includes building and testing Google Chrome, Mozilla Firefox, CoreCLR and Chakra.

Future improvements

As I mentioned at the start of the blog post, the framework is designed to be the place where many of the future optimizer features will be implemented. Below are some of the optimizations that are very likely to be part of the next major Visual Studio release – it does not include any of the longer-term projects that are planned:

Complete and enable the optimization of vector operations

Better optimization of Boolean expressions in C++ code

Removal of operations with no effect on the expression result

Merging similar branches

Several Bit Estimator improvements

Closing remarks

Please try building and testing your applications with the new optimizer and report any problems that you might find. We are looking forward to your suggestions and opinions in the comment section. Let us know if you have examples of cases that could be optimized better and are not yet handled.

We are glad to finally be able to share this exciting new work with you! This marks the start of many optimizer improvements that will be added in the future releases of the compiler – we will keep you posted.

Thanks,
Gratian Lup
Visual C++ Optimizer team