Introduction

This article describes a tool that scans a set of source files looking for two types of functions declared with external linkage:

1. Those that are defined, but completely unused.

2. Those that are used only inside the same translation unit they are defined in.

As a result we want to get:

1. Diagnostics that functions of the first type are unused, which means that one of the following holds:

the function is not needed and can be removed;

there are duplicated pieces of code;

another function is used by mistake.

2. Diagnostics that functions of the second type can be marked as static (provided the code is not part of a library; we ignore that case here, as handling it can be added later), as they probably:

are simply not marked as static, which can be done harmlessly;

are no longer used outside the translation unit they reside in;

have a wrong name in a public header.

Not all static analyzers provide such diagnostics, and those that do don't advise giving static rather than external linkage to functions that are used only inside one translation unit. As existing tools don't do exactly this, the "programmer's way" of fixing it is to write our own tool (actually, a much better option is to implement this diagnostic in cppcheck and send a patch upstream, but we're mostly interested in learning more about Clang's AST and want to use it to solve the task; still, nothing stops one from contributing to cppcheck).

For simplicity, we're going to do this for C rather than C++ to avoid dealing with namespaces and methods. This lets us concentrate on new material about Clang's representation of source code and leaves extending the tool to cover more use cases out of scope.

This article also contains fewer listings: only the most interesting excerpts from the code are presented.

Matching

To do something useful we first need to find the elements of the AST that we're interested in. This time they are:

1. Function declarations.

2. Function calls.

3. Taking the address of a function.

Matchers used for the first item are very simple, and their names are easy to find in the documentation or in the AST matcher headers, or even just to guess:

    static DeclarationMatcher funcDecl = functionDecl().bind("func");

The last two items require additional investigation. Let's make it simple by asking clang to dump the AST of the following simple code to the screen (the file is named "func-ptr.c"):

    void
    func(void)
    {
    }

    int
    main(void)
    {
        void (*f)(void) = &func;
        f();
        return 0;
    }

Using this command:

    clang -Xclang -ast-dump -fsyntax-only func-ptr.c

Here is the full output:

    TranslationUnitDecl 0x2e86560 <<invalid sloc>>
    |-TypedefDecl 0x2e86a60 <<invalid sloc>> __int128_t '__int128'
    |-TypedefDecl 0x2e86ac0 <<invalid sloc>> __uint128_t 'unsigned __int128'
    |-TypedefDecl 0x2e86e10 <<invalid sloc>> __builtin_va_list '__va_list_tag [1]'
    |-FunctionDecl 0x2e86f20 <func-ptr.c:1:1, line:4:1> func 'void (void)'
    | `-CompoundStmt 0x2e86fc0 <line:3:1, line:4:1>
    `-FunctionDecl 0x2e870a0 <line:6:1, line:12:1> main 'int (void)'
      `-CompoundStmt 0x2ece4a0 <line:8:1, line:12:1>
        |-DeclStmt 0x2ece3e0 <line:9:5, col:28>
        | `-VarDecl 0x2ece340 <col:5, col:24> f 'void (*)(void)'
        |   `-UnaryOperator 0x2ece3c0 <col:23, col:24> 'void (*)(void)' prefix '&'
        |     `-DeclRefExpr 0x2ece398 <col:24> 'void (void)' Function 0x2e86f20 'func' 'void (void)'
        |-CallExpr 0x2ece438 <line:10:5, col:7> 'void'
        | `-ImplicitCastExpr 0x2ece420 <col:5> 'void (*)(void)' <LValueToRValue>
        |   `-DeclRefExpr 0x2ece3f8 <col:5> 'void (*)(void)' lvalue Var 0x2ece340 'f' 'void (*)(void)'
        `-ReturnStmt 0x2ece480 <line:11:5, col:12>
          `-IntegerLiteral 0x2ece460 <col:12> 'int' 0

Look at this part:

    |-DeclStmt 0x2ece3e0 <line:9:5, col:28>
    | `-VarDecl 0x2ece340 <col:5, col:24> f 'void (*)(void)'
    |   `-UnaryOperator 0x2ece3c0 <col:23, col:24> 'void (*)(void)' prefix '&'
    |     `-DeclRefExpr 0x2ece398 <col:24> 'void (void)' Function 0x2e86f20 'func' 'void (void)'

which corresponds to taking the address of the function:

    void (*f)(void) = &func;

Let's construct an AST matcher for it:

    static StatementMatcher funcAddrOp =
        unaryOperator(                 // any unary operator, e.g. *, &, --
            hasOperatorName("&"),      // exact unary operator: &
            hasUnaryOperand(           // whose operand is ...
                declRefExpr(           // ... a reference to a variable/declaration
                    to(                // of something that is ...
                        functionDecl(  // ... a function
                        ).bind("ref")  // bind matched func ref to "ref" name
                    )
                )
            )
        ).bind("op");                  // bind matched unary op to "op" name

It looks like a nice matcher, but we're not going to use it. The reason is that the address of a function can also be taken through an implicit conversion if one removes the & in front of the function name. That's why it makes sense to use a simpler and more general matcher, which is just an inner part of the one listed above:

    static StatementMatcher funcRef =
        declRefExpr(           // referencing a variable/declaration
            to(                // something that is ...
                functionDecl(  // ... a function
                )
            )
        ).bind("ref");         // bind matched func ref to "ref" name

This effectively matches the leaf node:

    |     `-DeclRefExpr 0x2ece398 <col:24> 'void (void)' Function 0x2e86f20 'func' 'void (void)'

As it's similar to the leaf node of a call expression, we're getting all referencing cases we want with only one matcher.
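To make the implicit case concrete, here is a small standalone C++ sketch. The names func, FuncPtr and addressBothWays are made up for illustration and are not part of the tool. Both initializations produce the same pointer, but only the first one has a UnaryOperator node in the AST, while both contain a DeclRefExpr that refers to the function.

```cpp
#include <cassert>

// Hypothetical function used only for illustration.
void func(void) { }

typedef void (*FuncPtr)(void);

// Takes the address of func with and without '&' and returns the pointer.
FuncPtr addressBothWays(void)
{
    FuncPtr explicitAddr = &func; // UnaryOperator '&' on top of a DeclRefExpr
    FuncPtr implicitAddr = func;  // implicit function-to-pointer conversion, no '&'

    // Both spellings refer to the same function.
    assert(explicitAddr == implicitAddr);
    return implicitAddr;
}
```

A matcher that insists on the & operator would see only the first initialization, which is why matching the DeclRefExpr itself is the more general choice.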

Note that funcDecl is of type DeclarationMatcher rather than the usual StatementMatcher. This is because each of the core components of the AST has its own hierarchy with a different root class, which means that such elements must be matched using different types of matchers.

Filtering

Someone might ask: how do we get function definitions if we're only looking for function declarations? It's easy to understand if you recall that every definition is also a declaration. So there is no such thing as a function definition in Clang's AST; there are declarations with bodies instead. To check whether a particular declaration has a body, use the isThisDeclarationADefinition() method. There are also methods that check whether a given function has a body at all; don't confuse them with the method we actually need.

On each match of a function declaration we want to make sure that the function is visible outside the current module, as we're not interested in static functions.

This can be done with the help of the isExternallyVisible() method.

If you think of checking general programs, the first external function that comes to mind is probably the main() function. We don't want to mark it as unused, so we filter it out by invoking the handy isMain() method.

A match of the funcRef matcher gives us a result of type DeclRefExpr, which we need to resolve to the function declaration it refers to. This is done by the following code:

    if (const FunctionDecl *func = ref->getDecl()->getAsFunction()) {
        // ...
    }

Here getDecl() returns a ValueDecl, which corresponds to a variable, function or enumeration constant declaration. Then we ask the obtained ValueDecl object whether it can be converted to a function and get it as a function if the answer is yes. The check of the return value is needed even if "it's definitely a function", because a null pointer can be returned in case of parsing errors (say, the code is correct, but some headers are missing).

Counting

Counting functions and references to them is trickier than finding them, for the following reasons:

Functions can be declared in any number of modules or can be declared multiple times in one translation unit.

A function can be referred to before it's defined.

Third-party and system functions are matched as well.

As we want to get the same results while scanning one file at a time in any order, the items listed above should be treated carefully.

The implementation addresses items listed above in the following way:

Functions are stored in a map indexed by their names. There will be no name conflicts as we match only external functions and there are no overloaded functions in C.

Each function information object stores a list of references.

Each function declaration and reference is associated with the name of the file it resides in, which is used to check whether the function is ever referenced outside the translation unit it is defined in.
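The bookkeeping described above can be sketched with nothing but the standard library. This is an illustration under assumptions, not the tool's actual code: FuncInfo, FuncTable, Verdict and their members are hypothetical names, and a set of file names stands in for the list of references.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical record kept per external function (names are illustrative).
struct FuncInfo {
    std::string definedIn;         // file with the definition, empty if none seen
    std::set<std::string> usedIn;  // files that contain references
};

enum class Verdict { Unknown, Unused, CanBeStatic, Ok };

class FuncTable {
public:
    // Record that a definition of the function was seen in the given file.
    void addDefinition(const std::string &name, const std::string &file)
    {
        assert(!name.empty());
        funcs[name].definedIn = file;
    }

    // Record a reference; it may arrive before the definition is seen.
    void addReference(const std::string &name, const std::string &file)
    {
        assert(!name.empty());
        funcs[name].usedIn.insert(file);
    }

    Verdict classify(const std::string &name) const
    {
        auto it = funcs.find(name);
        if (it == funcs.end() || it->second.definedIn.empty())
            return Verdict::Unknown;  // declared only: system or third-party code

        const FuncInfo &info = it->second;
        if (info.usedIn.empty())
            return Verdict::Unused;
        for (const std::string &file : info.usedIn)
            if (file != info.definedIn)
                return Verdict::Ok;   // referenced from another translation unit
        return Verdict::CanBeStatic;  // every reference is in the defining file
    }

private:
    std::map<std::string, FuncInfo> funcs;  // keyed by name, unique in C
};
```

Because definitions and references are only accumulated and the verdict is computed afterwards, the order in which files are scanned does not matter in this sketch. Functions for which no definition was ever seen are skipped rather than reported.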

Printing

To print the exact position of something, the SourceLocation class alone is not enough, as it lacks a connection with the actual source code. That's why a FullSourceLoc needs to be constructed from an instance of SourceLocation and a reference to SourceManager. Here's how retrieval of the source file name and line number can look:

    FullSourceLoc fullLoc(func->getNameInfo().getBeginLoc(), *sm);
    const std::string fileName = sm->getFilename(fullLoc).str();
    const unsigned int lineNum = fullLoc.getSpellingLineNumber();

Note the getNameInfo().getBeginLoc() part. Getting the location by calling getLocation() directly on an object of type FunctionDecl will return the location of the return type of the function. To be more programmer-friendly we want to guide one directly to the function name, which is more convenient in my opinion. If it's still unclear why, here are two samples:

    void func1(void) // <- getLocation() <- getNameInfo().getBeginLoc()
    ...
    void             // <- getLocation()
    func2(void)      // <- getNameInfo().getBeginLoc()
    ...

Blocking diagnostics output

One thing about Clang that is somewhat annoying in our use case is that by default it prints diagnostics on the code being analyzed. We want to suppress such diagnostic messages to leave only our own. The correct way of doing this is to call DiagnosticsEngine::setSuppressAllDiagnostics, but it's not clear how to get the instance of DiagnosticsEngine used by the tool while it builds ASTs. So we go another way and subclass DiagnosticConsumer to override its IncludeInDiagnosticCounts method and make it return false:

    class : public DiagnosticConsumer
    {
    public:
        virtual bool
        IncludeInDiagnosticCounts() const
        {
            return false;
        }
    } diagConsumer;
    tool.setDiagnosticConsumer(&diagConsumer);

This way such diagnostics are not counted as relevant when Clang tries to present parsing results to a user.

Testing

Assuming that you have successfully built the tool from the repository, let's give it a run over a couple of simple test files. The first file (main.c) looks like this:

    static void firstStatic(void);

    static void secondStatic(void);

    void firstExtern(void);

    extern void secondExtern(void);

    static void
    firstStatic(void)
    {
    }

    static void
    secondStatic(void)
    {
    }

    void
    firstExtern(void)
    {
    }

    void
    secondExtern(void)
    {
    }

    int
    main(void)
    {
        firstExtern();
        secondStatic();
        return 0;
    }

Let's check the output when the first file is analyzed alone (paths are truncated):

    > unused-funcs main.c --
    .../main.c:20:firstExtern:can be made static
    .../main.c:25:secondExtern:unused

Let's go over the file manually to check whether the obtained output is correct:

main() is treated separately.

firstStatic() and secondStatic() are both ignored because they are marked as static.

firstExtern() is not declared as static and is used only within the same translation unit, so it can be made static.

secondExtern() is declared as extern and isn't used at all.

Looks good.

Now add the second file (util.c):

    extern void firstExtern(void);
    void secondExtern(void);

    void
    thirdExtern(void)
    {
        firstExtern();
        secondExtern();
    }

And see what's changed in the output (paths are truncated):

    > unused-funcs main.c util.c --
    .../util.c:5:thirdExtern:unused

Expected changes are as follows:

Both firstExtern() and secondExtern() are now used outside their home module, so no diagnostics should mention them.

A new unused function (thirdExtern()) was introduced.

Looks correct too.

By the way, here's the output for the func-ptr.c test file from the "Matching" section above:

    > unused-funcs func-ptr.c --
    .../func-ptr.c:2:func:can be made static

Conclusion

As you've been warned, this implementation is more C-related than C++-related, but this limitation allowed for a concise description and a ready-to-use tool without putting that much effort into the implementation.

The resultant tool can be adjusted in multiple ways by changing matching/counting/output parts independently:

getting a list of all external function references from the code;

building a graph description of cross-module dependencies to be rendered by Graphviz (a fine-grained version could list the exact functions used);

collecting statistics like the ratio of provided extern functions to used extern functions, or the number of external usages for each function marked as extern;

previous bullet combined with some thresholds can be used to detect translation units with low cohesion/high coupling;

etc.

Note that as Clang takes macros into account, the tool can produce inaccurate results if conditional compilation is used. More precisely, it analyzes one particular combination of defines and ignores all others. That's why it's better to check updated code against several combinations of macro defines, or at least to keep them in mind. This is important for cross-platform applications or programs that allow disabling some of their features at compile time.
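A tiny sketch of the problem, with made-up names (helper, compute, checkConfiguration and the USE_HELPER macro are all hypothetical): the only reference to helper() exists in just one configuration, so an analysis run without USE_HELPER defined would see it as unused.

```cpp
#include <cassert>

// Hypothetical external function whose only caller is conditionally compiled.
int helper(void) { return 42; }

int compute(void)
{
#ifdef USE_HELPER
    return helper();  // the sole reference, present only with -DUSE_HELPER
#else
    return 0;         // in this configuration helper() looks unused
#endif
}

// Small self-check documenting which branch the current build takes.
void checkConfiguration(void)
{
#ifdef USE_HELPER
    assert(compute() == helper());
#else
    assert(compute() == 0);
#endif
}
```

Assuming the tool forwards compiler flags given after the -- separator, running it once with and once without the define would cover both configurations of this example.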

Resources