Clang's APIs evolve quickly, and this includes libclang and the Python bindings. Therefore, the samples in this post may no longer work. For working samples that are kept up-to-date with upstream Clang, check out my llvm-clang-samples repository on GitHub.

People who need to parse and analyze C code in Python are usually really excited to run into pycparser. However, when the task is to parse C++, pycparser is not the solution. When I get asked about plans to support C++ in pycparser, my usual answer is - there are no such plans, you should look elsewhere. Specifically, at Clang.

Clang is a compiler front-end for C, C++ and Objective C. It's a liberally licensed open-source project backed by Apple, which uses it for its own tools. Along with its parent project - the LLVM compiler backend - Clang is becoming a formidable alternative to gcc itself these days. The dev team behind Clang (and LLVM) is top-notch and its source is one of the best designed bodies of C++ code in the wild. Clang's development is very active, closely following the latest C++ standards.

So what I point people to when I'm asked about C++ parsing is Clang. There's a slight problem with that, however. People like pycparser because it's Python, and Clang's API is C++ - which is not the most high-level, hacking-friendly language out there, to say the least.

libclang

Enter libclang. Not so long ago, the Clang team wisely recognized that Clang can be used not only as a compiler proper, but also as a tool for analyzing C/C++/ObjC code. In fact, Apple's own Xcode development tools use Clang as a library under the hood for code completion, cross-referencing, and so on.

The component through which Clang enables such usage is called libclang. It's a C API that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST). More technically, libclang is a shared library that packages Clang with a public-facing API defined in a single C header file: clang/include/clang-c/Index.h.

Python bindings to libclang

libclang comes with Python bindings, which reside in clang/bindings/python, in module clang.cindex. This module relies on ctypes to load the dynamic libclang library and tries to wrap as much of libclang as possible with a Pythonic API.

Documentation?

Unfortunately, the state of documentation for libclang and its Python bindings is dire. The official documentation according to the devs is the source (and auto-generated Doxygen HTML). In addition, all I could find online is a presentation and a couple of outdated email messages from the Clang dev mailing list.

On the bright side, if you just skim the Index.h header file keeping in mind what it's trying to achieve, the API isn't hard to understand (and neither is the implementation, especially if you're a bit familiar with Clang's internals). Another place to look things up is the clang/tools/c-index-test tool, which is used to test the API and demonstrates its usage.

For the Python bindings, there is absolutely no documentation either, except the source plus a couple of examples that are distributed alongside it. So I hope this article will be helpful!

Setting up

Setting up usage of the Python bindings is very easy:

* Your script needs to be able to find the clang.cindex module. So either copy it appropriately or set up PYTHONPATH to point to it.
* clang.cindex needs to be able to find the libclang.so shared library. Depending on how you build/install Clang, you will need to copy it appropriately or set up LD_LIBRARY_PATH to point to its location. On Windows, this is libclang.dll and it should be on PATH.

That arranged, you're ready to import clang.cindex and start rolling.
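As a rough sketch of the second step, the lookup of the shared library can also be done from Python instead of through environment variables. The helper names below (libclang_filename, locate_libclang, configure_bindings) are made up for illustration, and the search directories are whatever you pass in. The call to clang.cindex.Config.set_library_file assumes a reasonably recent version of the bindings; older versions may not have it, in which case LD_LIBRARY_PATH or PATH remains the way to go.

```python
import os
import sys


def libclang_filename(platform=sys.platform):
    """Return the conventional libclang file name for a platform string."""
    if platform.startswith("win"):
        return "libclang.dll"
    if platform == "darwin":
        return "libclang.dylib"
    return "libclang.so"


def locate_libclang(search_dirs, platform=sys.platform):
    """Return the first existing libclang path in search_dirs, or None."""
    name = libclang_filename(platform)
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_bindings(search_dirs):
    """Point clang.cindex at the first libclang found; return its path or None.

    Assumption: the installed bindings provide Config.set_library_file.
    """
    path = locate_libclang(search_dirs)
    if path is not None:
        # Imported lazily so the helpers above work without the bindings.
        import clang.cindex
        clang.cindex.Config.set_library_file(path)
    return path
```

Calling configure_bindings with the lib directory of your own Clang build, before creating an index, has the same effect as exporting LD_LIBRARY_PATH for that one script.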

Simple example

Let's start with a simple example. The following script uses the Python bindings of libclang to find all references to some type in a given file:

    #!/usr/bin/env python
    """ Usage: call with <filename> <typename>
    """

    import sys
    import clang.cindex

    def find_typerefs(node, typename):
        """ Find all references to the type named 'typename'
        """
        if node.kind.is_reference():
            ref_node = clang.cindex.Cursor_ref(node)
            if ref_node.spelling == typename:
                print 'Found %s [line=%s, col=%s]' % (
                    typename, node.location.line, node.location.column)
        # Recurse for children of this node
        for c in node.get_children():
            find_typerefs(c, typename)

    index = clang.cindex.Index.create()
    tu = index.parse(sys.argv[1])
    print 'Translation unit:', tu.spelling
    find_typerefs(tu.cursor, sys.argv[2])

Suppose we invoke it on this dummy C++ code:

    class Person {
    };


    class Room {
    public:
        void add_person(Person person)
        {
            // do stuff
        }

    private:
        Person* people_in_room;
    };


    template <class T, int N>
    class Bag {
    };


    int main()
    {
        Person* p = new Person();
        Bag<Person, 42> bagofpersons;

        return 0;
    }

Executing it to find references to the type Person, we get:

    Translation unit: simple_demo_src.cpp
    Found Person [line=7, col=21]
    Found Person [line=13, col=5]
    Found Person [line=24, col=5]
    Found Person [line=24, col=21]
    Found Person [line=25, col=9]

Understanding how it works

To see what the example does, we need to understand its inner workings on 3 levels:

* Conceptual level - what is the information we're trying to pull from the parsed source and how it's stored
* libclang level - the formal C API of libclang, since it's much better documented (albeit only in comments in the source) than the Python bindings
* The Python bindings, since this is what we directly invoke

Creating the index and parsing the source

We'll start at the beginning, with these lines:

    index = clang.cindex.Index.create()
    tu = index.parse(sys.argv[1])

An "index" represents a set of translation units compiled and linked together. We need some way of grouping several translation units if we want to reason across them. For example, we may want to find references to some type defined in a header file, in a set of other source files. Index.create() invokes the C API function clang_createIndex.

Next, we use Index's parse method to parse a single translation unit from a file. This invokes clang_parseTranslationUnit, which is a key function in the C API. Its comment says:

"This routine is the main entry point for the Clang C API, providing the ability to parse a source file into a translation unit that can then be queried by other functions in the API."

This is a powerful function - it can optionally accept the full set of flags normally passed to the command-line compiler. It returns an opaque CXTranslationUnit object, which is encapsulated in the Python bindings as TranslationUnit. This TranslationUnit can be queried; for example, the name of the translation unit is available in the spelling property:

    print 'Translation unit:', tu.spelling

Its most important property is, however, cursor. A cursor is a key abstraction in libclang: it represents some node in the AST of a parsed translation unit. The cursor unifies the different kinds of entities in a program under a single abstraction, providing a common set of operations, such as getting its location and children cursors. TranslationUnit.cursor returns the top-level cursor of the translation unit, which serves as the starting point for exploring its AST. I will use the terms cursor and node interchangeably from this point on.

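To make the remark about compiler flags more concrete, here is a minimal sketch of passing flags to the parser. build_clang_args and parse_with_flags are hypothetical helpers written for this illustration. The flags themselves (-std=, -I, -D) are ordinary compiler options, and the sketch assumes that Index.parse accepts them through its args parameter, as it does in the versions of the bindings I have looked at.

```python
def build_clang_args(std=None, include_dirs=(), defines=None):
    """Build a list of command-line style flags for the parser."""
    args = []
    if std:
        args.append("-std=" + std)
    for directory in include_dirs:
        args.append("-I" + directory)
    # Sort the macro names so the resulting list is deterministic.
    for name, value in sorted((defines or {}).items()):
        if value is None:
            args.append("-D" + name)
        else:
            args.append("-D%s=%s" % (name, value))
    return args


def parse_with_flags(path, args):
    """Parse one translation unit, forwarding compiler flags to libclang."""
    # Imported lazily so build_clang_args is usable without the bindings.
    import clang.cindex
    index = clang.cindex.Index.create()
    return index.parse(path, args=args)


example_args = build_clang_args(
    std="c++11",
    include_dirs=["include"],
    defines={"NDEBUG": None, "VERSION": 2},
)
```

With libclang installed, parse_with_flags(sys.argv[1], example_args) would return a TranslationUnit parsed as if those flags had been given on the compiler command line.
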
Working with cursors

The Python bindings encapsulate the libclang cursor in the Cursor object. It has many attributes, the most interesting of which are:

* kind - an enumeration specifying the kind of AST node this cursor points at
* spelling - the source-code name of the node
* location - the source-code location from which the node was parsed
* get_children - its children nodes

get_children requires special explanation, because this is a particular point at which the C and Python APIs diverge.

The libclang C API is based on the idea of visitors. To walk the AST from a given cursor, the user code provides a callback function to clang_visitChildren. This function is then invoked on all descendants of a given AST node.

The Python bindings, on the other hand, encapsulate visiting internally, and provide a more Pythonic iteration API via Cursor.get_children, which returns the children nodes (cursors) of a given cursor. It's still possible to access the original visitation APIs directly through Python, but using get_children is much more convenient. In our example, we use get_children to recursively visit all the children of a given node:

    for c in node.get_children():
        find_typerefs(c, typename)
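The difference between the two styles is easier to see on a toy model. The sketch below does not use libclang at all: Node, visit_children, get_children and walk are stand-ins invented for this illustration. The callback return values follow the C API convention (0 stops the traversal, 1 continues with the next sibling, 2 recurses into the children). It only shows how an iteration API can be layered on top of a visitor API, roughly in the spirit of what the bindings do.

```python
BREAK, CONTINUE, RECURSE = 0, 1, 2


class Node(object):
    """Toy AST node standing in for a cursor."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def visit_children(node, callback, data):
    """Visitor-style traversal; returns True if a callback asked to stop."""
    for child in node.children:
        result = callback(child, node, data)
        if result == BREAK:
            return True
        if result == RECURSE and visit_children(child, callback, data):
            return True
    return False


def get_children(node):
    """Iterator-style API built on the visitor: direct children only."""
    collected = []

    def collect(child, parent, out):
        out.append(child)
        return CONTINUE  # do not descend, just move to the next sibling

    visit_children(node, collect, collected)
    return iter(collected)


def walk(node):
    """Recursive walk written in terms of get_children."""
    yield node
    for child in get_children(node):
        for descendant in walk(child):
            yield descendant


tree = Node("tu", [
    Node("Person"),
    Node("Room", [Node("add_person"), Node("people_in_room")]),
    Node("main"),
])
child_names = [c.name for c in get_children(tree)]
all_names = [n.name for n in walk(tree)]
```

Here get_children yields only the three top-level nodes, while walk reaches every node in the tree, which is the same shape as the recursion in find_typerefs.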

Some limitations of the Python bindings

Unfortunately, the Python bindings aren't complete and still have some bugs, because they are a work in progress. As an example, suppose we want to find and report all the function calls in this file:

    bool foo()
    {
        return true;
    }

    void bar()
    {
        foo();
        for (int i = 0; i < 10; ++i)
            foo();
    }

    int main()
    {
        bar();
        if (foo())
            bar();
    }

Let's write this code:

    import sys
    import clang.cindex

    def callexpr_visitor(node, parent, userdata):
        if node.kind == clang.cindex.CursorKind.CALL_EXPR:
            print 'Found %s [line=%s, col=%s]' % (
                node.spelling, node.location.line, node.location.column)
        return 2 # means continue visiting recursively

    index = clang.cindex.Index.create()
    tu = index.parse(sys.argv[1])
    clang.cindex.Cursor_visit(
        tu.cursor,
        clang.cindex.Cursor_visit_callback(callexpr_visitor),
        None)

This time we're using the libclang visitation API directly. The result is:

    Found None [line=8, col=5]
    Found None [line=10, col=9]
    Found None [line=15, col=5]
    Found None [line=16, col=9]
    Found None [line=17, col=9]

While the reported locations are fine, why is the node name None? After some perusal of libclang's code, it turns out that for expressions, we shouldn't be printing the spelling, but rather the display name. In the C API it means clang_getCursorDisplayName and not clang_getCursorSpelling. But, alas, the Python bindings don't have clang_getCursorDisplayName exposed!

We won't let this stop us, however. The source code of the Python bindings is quite straightforward, and simply uses ctypes to expose additional functions from the C API. We add these lines to bindings/python/clang/cindex.py:

    Cursor_displayname = lib.clang_getCursorDisplayName
    Cursor_displayname.argtypes = [Cursor]
    Cursor_displayname.restype = _CXString
    Cursor_displayname.errcheck = _CXString.from_result

And we can now use Cursor_displayname. Replacing node.spelling by clang.cindex.Cursor_displayname(node) in the script, we now get the desired output:

    Found foo [line=8, col=5]
    Found foo [line=10, col=9]
    Found bar [line=15, col=5]
    Found foo [line=16, col=9]
    Found bar [line=17, col=9]

Update (06.07.2011): Inspired by this article, I submitted a patch to the Clang project to expose Cursor_displayname, as well as to fix a few other problems with the Python bindings. It was committed by Clang's core devs in revision 134460 and should now be available from trunk.
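The ctypes pattern used to expose clang_getCursorDisplayName (pick the function from the loaded library, then set argtypes, restype and errcheck) is generic. Here is a small self-contained sketch of the same pattern applied to toupper from the C standard library, so it can be tried without libclang. It assumes a POSIX system where the C library can be loaded this way. The names c_toupper and _char_from_result are made up for the example, and _char_from_result plays the role that _CXString.from_result plays in the bindings: converting the raw C result into a convenient Python value.

```python
import ctypes
import ctypes.util

# Load the C standard library. Assumption: a POSIX system; if find_library
# returns None, CDLL(None) falls back to the symbols of the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))


def _char_from_result(result, func, args):
    """errcheck hook: ctypes passes the raw result, the function, its args."""
    return chr(result)


c_toupper = libc.toupper
c_toupper.argtypes = [ctypes.c_int]     # int toupper(int c)
c_toupper.restype = ctypes.c_int
c_toupper.errcheck = _char_from_result  # post-process every return value

upper = c_toupper(ord("a"))
```

After this, upper holds a one-character Python string rather than an integer, just as Cursor_displayname returns a Python string rather than a raw CXString.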

Some limitations of libclang

As we have seen above, limitations in the Python bindings are relatively easy to overcome. Since libclang provides a straightforward C API, it's just a matter of exposing additional functionality with appropriate ctypes constructs. To anyone even moderately experienced with Python, this isn't a big problem.

Some limitations are in libclang itself, however. For example, suppose we wanted to find all the return statements in a chunk of code. Turns out this isn't possible through the current API of libclang. A cursory look at the Index.h header file reveals why.

enum CXCursorKind enumerates the kinds of cursors (nodes) we may encounter via libclang. This is the portion related to statements:

    /* Statements */
    CXCursor_FirstStmt                     = 200,
    /**
     * \brief A statement whose specific kind is not exposed via this
     * interface.
     *
     * Unexposed statements have the same operations as any other kind of
     * statement; one can extract their location information, spelling,
     * children, etc. However, the specific kind of the statement is not
     * reported.
     */
    CXCursor_UnexposedStmt                 = 200,

    /** \brief A labelled statement in a function.
     *
     * This cursor kind is used to describe the "start_over:" label statement in
     * the following example:
     *
     * \code
     *   start_over:
     *     ++counter;
     * \endcode
     *
     */
    CXCursor_LabelStmt                     = 201,

    CXCursor_LastStmt                      = CXCursor_LabelStmt,

Ignoring the placeholders CXCursor_FirstStmt and CXCursor_LastStmt, which are used for validity testing, the only statement recognized here is the label statement. All other statements are going to be represented with CXCursor_UnexposedStmt.

To understand the reason for this limitation, it's instructive to ponder the main goal of libclang. Currently, this API's main use is in IDEs, where we want to know everything about types and references to symbols, but don't particularly care what kind of statement or expression we see.

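The role of the two placeholders can be illustrated with a few lines of Python. This is only a toy model using the values from the enum excerpt above; is_statement is a hypothetical helper, although the C API offers a similar predicate (clang_isStatement), which I would expect to be implemented as the same kind of range check.

```python
# Values copied from the CXCursorKind excerpt above.
CXCURSOR_FIRST_STMT = 200
CXCURSOR_UNEXPOSED_STMT = 200
CXCURSOR_LABEL_STMT = 201
CXCURSOR_LAST_STMT = CXCURSOR_LABEL_STMT


def is_statement(kind):
    """Range check between the First/Last placeholders (toy model)."""
    return CXCURSOR_FIRST_STMT <= kind <= CXCURSOR_LAST_STMT
```

Because the check only looks at the range, any new statement kind added between the two placeholders is automatically treated as a statement.
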
Fortunately, from discussions in the Clang dev mailing lists it can be gathered that these limitations aren't really intentional. Things get added to libclang on a per-need basis. Apparently no one needed to discern different statement kinds through libclang yet, so no one added this feature. If it's important enough for someone, they can feel free to suggest a patch to the mailing list. In particular, this specific limitation (lack of statement kinds) is especially easy to overcome. Looking at cxcursor::MakeCXCursor in libclang/CXCursor.cpp, it's obvious how these "kinds" are generated (comments are mine):

    CXCursor cxcursor::MakeCXCursor(Stmt *S, Decl *Parent,
                                    CXTranslationUnit TU) {
      assert(S && TU && "Invalid arguments!");
      CXCursorKind K = CXCursor_NotImplemented;

      switch (S->getStmtClass()) {
      case Stmt::NoStmtClass:
        break;

      case Stmt::NullStmtClass:
      case Stmt::CompoundStmtClass:
      case Stmt::CaseStmtClass:

      ... // many other statement classes

      case Stmt::MaterializeTemporaryExprClass:
        K = CXCursor_UnexposedStmt;
        break;

      case Stmt::LabelStmtClass:
        K = CXCursor_LabelStmt;
        break;

      case Stmt::PredefinedExprClass:

      ... // many other statement classes

      case Stmt::AsTypeExprClass:
        K = CXCursor_UnexposedExpr;
        break;

      ... // more code

This is simply a mega-switch on Stmt.getStmtClass() (which is Clang's internal statement class), and only for Stmt::LabelStmtClass is there a kind that isn't CXCursor_UnexposedStmt. So recognizing additional "kinds" is trivial:

1. Add another enum value to CXCursorKind, between CXCursor_FirstStmt and CXCursor_LastStmt
2. Add another case to the switch in cxcursor::MakeCXCursor to recognize the appropriate class and return this kind
3. Expose the enumeration value in (1) to the Python bindings
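The effect of steps 1 and 2 can be modeled in a few lines of Python. This is a toy stand-in for the switch, not libclang code: make_cursor_kind and the dictionaries are invented for the illustration, and the value 202 for a return statement kind is purely hypothetical (it is not taken from Index.h).

```python
UNEXPOSED_STMT = 200
LABEL_STMT = 201
RETURN_STMT = 202  # hypothetical new enum value, for illustration only

# Stand-in for the switch: statement class name -> exposed cursor kind.
EXPOSED_KINDS = {"LabelStmt": LABEL_STMT}


def make_cursor_kind(stmt_class, exposed=EXPOSED_KINDS):
    """Return the exposed kind, or the 'unexposed' fallback (toy model)."""
    return exposed.get(stmt_class, UNEXPOSED_STMT)


# Before the change, a return statement is reported as unexposed.
before = make_cursor_kind("ReturnStmt")

# "Adding a case to the switch" amounts to one more mapping entry.
patched = dict(EXPOSED_KINDS, ReturnStmt=RETURN_STMT)
after = make_cursor_kind("ReturnStmt", patched)
```

In this model, before is the unexposed kind and after is the new one; everything not listed keeps falling back to the unexposed kind, which mirrors how the real switch treats statement classes it doesn't single out.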