A Clever Way To Find Compiler Bugs



Your comments are valuable, we thank you.



Xuejun Yang is a Senior Staff Engineer at FutureWei Technologies. He is the DFA on the 2011 paper, “Finding and Understanding Bugs in C Compilers.”

Today Ken and I discuss a clever idea from that paper.



The paper was brought to our attention just now in a meaty comment by Paul D. We thank him for it—the topic interests both of us. We don’t think Paul D. means to be anonymous, but in keeping with that we’ll give just a cryptic hint to his identity: The saying “a man on the make” is widely known, but for more than the millennium he has been the unique person in the world to whom it applies literally. Update 11/20: Turns out we (I, Ken) were wrong about the identity, see this.

Yang was made unique by being listed out of alphabetical order on the paper. This is notable because the most common practice in our field is to list alphabetically irrespective of prominence. Hence we’ve invented the term ‘DFA’ for “Designated” or “Distinguished” First Author. The other authors are Yang Chen, Eric Eide, and John Regehr, all from the University of Utah.

The Topic

Paul D.’s comment notes that there was evidence that verification methods could improve compiler correctness. By compiler we mean the program that transforms high level code into machine code. These programs are used countless times every day and their correctness is clearly very important.

Their correctness is tricky for several reasons. The main one is that almost all compilers try to optimize code. That is when they transform code into instructions they try to rewrite or rearrange the instructions to yield better performance. Compilers have been doing this forever. The trouble is that changing instructions to increase performance is dangerous. The changes must not affect the values that are computed. If they are not done carefully they can actually make the answers faster, but incorrect. This is the reason correctness is tricky.

Formal verification requires a lot of effort. The highest effort should go into mission-critical software. But compilers are mission-critical already, unless we know mission-critical software won’t be compiled on a particular one. Hence it is notable when formal verification makes a compiler more reliable.

The Paper

The idea in the paper Paul referenced is quite elegant. They built a program called Csmith. It operates as follows:

Suppose that is a compiler they wish to test. Then generate various legal C programs . For each of these let be the answer that yields. Here is the compiled program. Then check whether is correct.

For example:

int foo (void) { signed char x = 1; unsigned char y = 255; return x > y; }

Some compilers returned , but the correct answer is . There are further examples in a 2012 companion paper and these slides from an earlier version. The Csmith homepage has long lists of compiler bugs they found.

Of course if crashes or refuse to compile then the compiler is wrong. But what happens if is computed. How does Csmith know if the answer is correct? This seems to be really hard. This correctness testing must be automated: the whole approach is based on allowing tons of random programs to be tested. They cannot assume that humans will be used to check the outputs.

This is the clever idea of this paper. They assume that there are at least two compilers say and . Then let be the output of and let be the output of . The key insight is:

If is not equal to , then one of the compilers is wrong.

A very neat and elegant idea. For software in general it is called differential testing.

This at least alerts when there are problems with some compilers and some programs. One can use this trick to discover programs that cause at least some compilers to have problems. This is extremely valuable. It allowed Csmith to discover hundreds of errors in production compilers—errors that previously were missed.

Smart Fuzzing

Fuzzing is defined by Wikipedia as testing by “providing invalid, unexpected, or random data as inputs to a computer program.” An early historical example, Apple’s “Monkey” program, worked completely randomly. To ensure that the found bugs are meaningful and analyzable, Csmith needed a deeper, structured, “intelligent” design, not just the generation of Mayhem.

For one, Csmith needed to avoid programs than do not have deterministic behavior. The formal C standards itemize cases in which compilers are allowed to have arbitrary, even self-inconsistent, behavior. There are lots of them in C. A bug with dubious code could be dismissed out of hand.

For another, the probability that a program built haphazardly by the original Csmith version would reveal bugs was observed to peak at about 80KB source-code size, about 1,000 lines across multiple pages. Those don’t make great examples. So Csmith has its own routines to compress bug instances it has found. Simple tricks are shortening numerical expressions to use only the bug-sensitive parts. Others are lifting local variables out of blocks and bypassing pointer jumps.

A third goal is that the generator should branch out to all aspects of the language—in this case, C—not just the “grungy” parts that are ripe for finding compiler bugs. The paper talks about this at length. Regehr, who was Yang’s advisor, is also a blogger. His current post, dated November 4, is titled, “Helping Generative Fuzzers Avoid Looking Only Where the Light is Good, Part 1.” We guess that “Part 2” will go even more into details.

Formal Methods as Bugscreen

Regarding the formally-verified CompCert compiler, Paul D. quoted from the paper:

The striking thing about our CompCert results is that the middle-end bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users.

This August 2019 paper by Michaël Marcozzi, Qiyi Tang, Alastair Donaldson, and Cristian Cadar gives recent results involving Csmith and other tools. They have an interesting discussion on page 2, from which we excerpt:

In our experience working in the area […], we have found compiler fuzzing to be a contentious topic. Research talks on compiler fuzzing are often followed by questions about the importance of the discovered bugs, and whether compiler fuzzers might be improved by taking inspiration from bugs encountered by users of compilers “in the wild.” Some … argue that any miscompilation bug, whether fuzzer-found or not, is a ticking bomb that should be regarded as severe, or avoided completely via formal verification (in the spirit of CompCert).

They go on to say, however, that when a fully-developed compiler is used for non-critical software, the kinds of bugs typically found by fuzzing tend to have questionable importance. Their paper is titled, “A Systematic Impact Study for Fuzzer-Found Compiler Bugs.”

So far they have found definite results that seem to have mixed implications. In their future-work section they note that they have evaluated the impact of bugs in compilers on the intended function of programs they compile, but not on possible security holes—which as we noted in our Cloudflare post can come from (misuse of) simple code that is completely correct. This leads us further to wonder, coming full-circle, whether formal methods might help quantify the relative importance of aspects of a language and areas of a compiler to guide more-intelligent generation of test cases.

Open Problems

The above comment is interesting, but perhaps finding obscure bugs is important. Perhaps such bugs could be used to attack systems. That is perhaps some one could use them to break into a system. Security may be compromised by any error, even an unlikely one to occur in the wild.

What do you think?