LCA: Static analysis with GCC plugins

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

Taras Glek works for Mozilla, but he is not a browser hacker; instead, he works on GCC and other tools aimed at making the browser development process better. It is, he says, a good job. While carrying out his duties, Taras has been able to put a new GCC feature to work in ways which may prove to be useful well beyond Mozilla.

Development tools are important; they can help us to produce software more quickly and with far fewer problems. Unfortunately, Taras says, we are stuck in the stone age of software development, using tools from the 1970's. Our code base is growing, though, to the point that developers often cannot understand the entirety of even a single application. We need some way to amplify our capabilities so that we can continue to make more powerful applications; static analysis tools can bring some of those capabilities.

Static analysis, in essence, treats the code as data which is then the subject of further analysis. It has often been seen as a backwater, an area of primarily academic interest. When static analysis tools have found their way into more common use, it has generally been in their ability to find certain classes of bugs. But there's more that can be done with these tools: finding API abuse, generating library bindings, improved code base visualization, and more. Static analysis has been put to use with Mozilla to find dead code; thousands of lines of code have been found to be completely unused, despite the fact that engineers were putting their time into maintaining it.

The Mozilla project has an especially strong need for good tools. It is a huge code base (1.7 million lines of C++ and 1 million lines of JavaScript); humans just do not scale to that amount of code. This code base is under constant optimization work, so refactorings are frequent. Without some help, keeping this code in good condition is a major challenge.

Much of Taras's work seems to be aimed at mitigating some of the pains that come with C++ development. One of those pains is that the language is just about impossible to parse; the parser must actually instantiate types before it can complete its job. So anybody who wants to analyze C++ code must first find a decent parser for it. The available options are limited. The LLVM compiler is promising, but it's going to be another year or two before it's really ready for prime time. The Elsa tool can be used, but it's essentially unmaintained and not really guaranteed to be correct.

The one other option - one which is known to have a complete C++ parser - is GCC. But the GCC code has a bit of a nasty reputation, so Taras started off using Elsa for his work. Eventually, though, he turned back to GCC for something more solid, and hasn't looked back - the hairiness of GCC has, perhaps, been exaggerated. But, more to the point, the upcoming GCC 4.5 release is, he says, "the most exciting release ever." The reason for that is the long-delayed addition of the plugin API, which became possible once the runtime library license exemption finally went into place. With this API, analysis code can easily hook into the compiler and inspect code at whatever stage of the process best suits its needs.

Beyond plugins, GCC has a few other features which make it suitable for static analysis work. The ability to attach attributes to objects in the compiled code makes it easy to pass hints through to later processing steps. The new pass manager brings a relatively modern structure to a compiler which did not originally have one. And the GIMPLE intermediate representation provides much of the rest of what's needed for code which needs to inspect other code.

There are a few interesting plugins in the works. One of them is the LLVM compiler, which can be plugged in to perform the back-end functions for GCC. Another is milepost, which uses a brute-force approach to figure out the optimal settings of the command-line flags for a specific body of code. Then, there are "the hydras," which are Taras's work. These plugins take an interesting approach, in that the actual analysis work is done in JavaScript scripts. The idea was originally seen as amusing - "wouldn't it be fun to put Spidermonkey into GCC?" - but it has actually worked out well. JavaScript is a relatively nice, concise language which makes it easy to implement the needed capabilities.

The first plugin is Dehydra, so named because the control flow graph in Mozilla somewhat resembles a Hydra monster. Dehydra produces a JSON-like representation of the objects found in a C++ program; individual JavaScript scripts can then use this representation to analyze the program. The Treehydra plugin, instead, provides a JavaScript interface to the GIMPLE representation of the program; it can be used for more traditional sorts of static analysis tasks.

One of the pains that come with large C++ programs is that simply finding code can be difficult. It's not always clear which method will be invoked in a specific situation, even in the absence of things like macro tricks. To help with this problem, Dehydra has been used as the base of a source browsing tool called DXR; it's like LXR, but with a great deal of semantic information thrown in. DXR users can find types defined by macros, look up parent class information, and so on. There's also a call graph tool which can find all the callers of a specific method; that's important in C++, where overloading can make grep thoroughly unusable for this kind of task. It is, Taras says, "Eclipse-like stuff," except that, unlike Eclipse, it scales to a Mozilla-size code base.

Various other tools have been written. The final.js script (a dozen lines of code which can be seen on this page) looks for C++ methods tagged with the "final" attribute; any attempt to override those methods will result in a compilation error. It is, in other words, a port of the Java final keyword to C++. A checker which might be interesting in other environments - including the kernel - is flow.js , which can add a constraint that all exits from a function must flow through a specific label. Consider this common kernel pattern:

if (something wrong) goto out; /* Do some real work */ out: release_locks(); free_memory(); cancel_self_destruct() return something;

It's a common mistake to add a return statement to the middle of a function like this, shorting out the cleanup code; flow.js can catch errors like that at compile time.

Additional modules include must-override.js , which can mark methods which must be overridden (but which cannot be virtual); outparams.js , which ensures that any output function parameters have been set on a successful return from the function, and stack.js , which enforces a requirement that specific classes only be instantiated on the stack, since the garbage collector is not prepared to deal with them. Taras is also working on a checker for variables which shadow class members - a mistake which GCC does not catch now.

For the time being, this work is mostly used within the Mozilla project, though Taras would clearly like to see users from the wider community. He looks forward to a day when libraries are distributed with a plugin which ensures that the library is being used correctly. Another nice feature would be a distribution-wide DXR, enabling cross-package source browsing. For now, though, we have a set of tools that serves as a good proof of the concept that GCC plugins can be used for static analysis.

