Extracting the abstract syntax tree from GCC

Richard Stallman recently revived a nearly year-old thread on the emacs-devel mailing list, but the underlying issue has been around a lot longer than that. It took many years before the GNU Compiler Collection (GCC) changed its runtime library exception in a way that allowed for GCC plugins, largely because of fears that companies might distribute proprietary, closed-source plugins. But efforts to use the plugin API to add features to another GNU project mainstay, Emacs, seem to be running aground on that same fear, even though there has never been any real evidence of much interest in circumventing the runtime library exception to provide proprietary backends to GCC.

Last March, a debate about exporting the abstract syntax tree (AST) information from GCC to Emacs (and other programs) was ongoing in various mailing lists. That discussion was about adding an Emacs feature to auto-complete various program-specific identifiers (e.g. variable names, types, structure member names, etc.). It was an offshoot of another wide-ranging discussion that we covered back in January 2014. When the conversation dropped last March, Stefan Monnier had responded to Stallman:

[...] for Emacs the issue is to get detailed info out of GCC, which is a different problem. My understanding is that you're opposed to GCC providing this useful info because that info would need to be complete enough to be usable as input to a proprietary compiler backend.

On January 2, Stallman renewed the conversation by noting that he hoped "we can work out a kind of 'detailed output' that is enough for what Emacs wants, but not enough for misuse of GCC front ends". He was looking for people to help define that "detailed output", but instead found a number of people who felt that exporting the full AST information would be more sensible.

Stallman is concerned that proprietary backends could take the AST output and generate code from it. While no one in the thread wanted to see that happen, most also saw it as an unlikely outcome. David Engster said that he had been working on a way to get the AST information out of GCC for Emacs and noted that there was no technical barrier to doing so:

Anyone can write a GCC plugin that simply outputs the AST in some form. It's not that hard. The plugin itself would have to export the symbol 'plugin_is_GPL_compatible', but of course you can't control what's done with the output. No [one] bothers with this though, because everyone just uses libclang.

While the original discussion was largely about auto-completion, Engster and others would eventually like to go further than that. Their vision is to turn Emacs into a more full-featured integrated development environment (IDE), which would require all of the AST information, at least in their eyes. Stallman would prefer providing far less information: "just enough to do the completion and other operations that Emacs needs, and NO MORE."

Engster replied that he understood Stallman's concerns, but felt that there was no real problem:

For almost five years now (since gcc 4.5 introduced plugins), access to GCC's AST is wide open for everyone. However, in all that time (and to my knowledge) no one has used that to feed non-free backends, and that is in my opinion enough evidence that your worries are unfounded. They might have been valid in the past, but not since LLVM and clang have joined the scene.

He went on to say that if Stallman was opposed to using the AST in Emacs, he would drop the work he was doing in that area (for auto-completion and other IDE-like features). Stallman remained unconvinced of the need for the AST information, saying "it is so important to avoid the full AST that I'm not going to give up on it just because someone claims that is necessary".

But Perry E. Metzger noted that he is doing "a bunch of complicated refactoring work in conjunction with my current academic research" and that the lack of AST information from GCC forced him to use Clang/LLVM. He would like to see Emacs gain more IDE-like features that other tools, such as Apple's Xcode, have:

The libclang AST manipulation functionality was originally created because Apple wanted it to enable sophisticated IDE and refactoring capabilities in XCode's editor. I have wanted to have all that stuff in Emacs forever, but it is hard without having access to tools that generate a full AST of the code being examined. For Emacs to be able to compete with non-free tools, it is important that it have access to similar capabilities as the non-free tools have.

Stallman would like to see some kind of investigation to determine what pieces of information are needed for which purposes (e.g. auto-completion, refactoring, and so on), but that seems short-sighted to some. Metzger gave several examples of things that could only be done easily by having the full AST. Object-oriented languages, in particular, have programs with complex inter-relationships that can only be untangled by using all of the information that the compiler has collected in the AST.

Even if various subsets of the AST information could be defined to enable particular IDE features, new IDE features come about regularly. As David Kastrup pointed out, adding new interfaces for each new set of requirements makes little sense and would require that the Emacs IDE features and GCC versions be tightly coupled. Either that or the GCC plugin interface needs to be stable and provide a highly general API into the guts of the compiler, which could also be used in "bad" ways:

If we want the editing modes to be able to evolve without being constrained to the GCC release cycles, we need either rather generically useful data, or generically useful GCC extension interfaces. Keeping all of our babies while throwing out all of the bath water for evolving notions of babies and bathwater is not likely to work well. I don't really see that we have much of a chance not to be hobbling us ourselves more in the process than currently hypothetical leeches.

There appear to be a few disconnects in the conversation. Stallman is focused on auto-completion, and believes that it can be done without access to the full AST. Others disagree, especially for languages with operator overloading such as C++. Stallman said that he has never used C++, so he is trying to understand what would be needed to support auto-completion for C++ in Emacs. The discussion has also covered support for more than just auto-completion, but Stallman is not ready to look into additional features yet. Even if it is possible to handle auto-completion for C++ (and the other languages GCC supports) without all of the AST information, there are plenty of examples in the thread (mostly from Metzger) of IDE features that do require the AST (or enough information in other forms that it would be functionally equivalent to the AST).

It is clear that Stallman has not used any of the "competing" IDEs (e.g. Xcode, IntelliJ IDEA), which is not a huge surprise, but he evidently feels browbeaten about the issue: "Rather you are trying to pressure me to do what you want, at the expense of something I consider important." But Metzger and others have stated plainly that they understand (and, in general, agree with) Stallman's concerns; they just see the tradeoff differently than he does. In fact, Metzger said, there is a freedom issue at stake:

By forbidding the editor from having advanced knowledge of the parse tree, you are intentionally crippling the ability of smart people to build better and better free software tools. You are preventing smart hackers from going in and making the system as good as they can make it, from expressing their creativity in such a way as to build the best development environment they know how.

That is, of course, a hot-button issue for Stallman, who takes umbrage at that characterization. Furthermore, he wants to take some time to study the problem(s), without being pressured:

What I intend to do is investigate these issues thoroughly _one by one_ to see what options exist for each, and what is good or bad about them. I will think about refactoring when I understand it well enough to be able to judge arguments myself. First I will learn about it from people who are not trying to pressure me about it.

Metzger volunteered to help with the process, but noted that it goes well beyond auto-completion. Emacs is a powerful tool that, unlike most of the other IDEs, gives its users the ability to reprogram the way it works. But that requires flexibility:

I recognize that you don't want me to "change the subject" to refactoring, but I don't see this as a change of subject. The concern isn't as such code completion, that's just a detail. The underlying concern is being able to make Emacs the very best programmer's editor it can possibly be. Modern development environments have astonishing power, and there is an enormous desire on the part of certain parts of the Emacs community to have that power or even more available in Emacs.

Beyond that, Kastrup and others felt that Stallman was being unfair to Metzger by characterizing his comments as "changing the subject". Karl Fogel pointed out that Metzger and others have all acknowledged Stallman's concerns, but that Stallman has not done the same:

For some reason you refuse to acknowledge that others are cognizant of this tradeoff. They have been very patiently making a detailed argument for why, in this particular case, you are choosing the wrong side of that tradeoff -- the side that will be *less* effective at accomplishing our shared goal. They make this argument, with impressive clarity, and then you accuse them of bad faith. This would be poor behavior even if those people were wrong. I think they're actually right, though, which makes it even worse, because now our goal is being damaged too.

It would seem that the goal of more IDE-like features for Emacs has suffered a setback. Based on Stallman's responses, Engster said he would not continue working on his project to incorporate AST information into Emacs:

It was you who told me to abandon libclang and choose GCC instead. And now that I'm working on that, I only get confronted with vague restrictions like "you may only export what you need for completions".

The problem that Stallman foresees is that the AST information could be used by a non-GCC backend that wouldn't use libgcc and would thus evade the GCC plugin restrictions that were added to the runtime library exception. But as Óscar Fuentes pointed out, that has already been done by the DragonEgg project, which "is now abandoned, mostly because Clang is a better front-end than GCC". Because LLVM does not have the freedom-respecting requirements that Stallman holds dear, it takes a much more modular approach that is easier for other projects to interface with.

Another potential problem is the possibility of a fork of Emacs to support using a GCC plugin that exports the AST, which would be legal under both the runtime library exception and the Emacs license. The only barrier would be if Stallman were unwilling to accept the code into Emacs. Monnier said that, under those circumstances, he would be "willing to consider a fork if that's what it takes".

Kastrup noted that Stallman has laid out a plan for getting the understanding he needs to proceed with these features. That will take time, however, which may have other negative impacts:

Richard said he'll discuss the issue with people he trusts. And frankly, this sounds like a responsible thing to do even if it is frustrating to people who'd like to see a more prompt reaction to their input. It would be a pity if by the time this process comes to a conclusion there is nobody interested in making it count any more.

There appears to be a fairly wide chasm between the two sides of the debate. It is hard to see how an Emacs IDE mode for GCC can compete with the proprietary alternatives without some mechanism to extract the AST from the compiler. To Stallman, at least, that is not of paramount importance, while others see things differently. The risk is that Emacs and GCC both lose users and developer mindshare while some kind of solution is being worked out. That would seem to make it worth coming to a solution sooner rather than later, no matter which side of the debate one is on.