Robert Nagy (robert@) wrote in with a fascinating story of hunting down a recent problem with ports:

You might have noticed the number of commits to ports regarding autoconf and nested functions and asked yourself… what the hell is this all about?

I was hanging out at my friend Antoine (ajacoutot@)'s place just before EuroBSDCon 2017 started. We were having drinks when he told me about a weird bug where Gnome hangs completely after just a couple of seconds of usage and the gnome-shell process just sits in the fsleep state. This started to happen around the time when inteldrm(4) was updated, the default compiler was switched to clang(1) and futexes were turned on by default.

The next day we started to look at the issue. Since the process was hanging in fsleep, it seemed clear that the cause must be futexes, so we started bisecting the base system, which resulted in random success and failure. In the end we figured out that it was neither futex nor inteldrm(4) related, so the only thing left was the switch to clang.

Now the problem was figuring out which part of the system needed to be built with clang to trigger the issue, so we kept going and systematically recompiled the base system with gcc until everything was ruled out … and it kept on hanging.

We were drunk and angry that we now had to go and check hundreds of ports, because gnome is not a small standalone port. So between two bottles of wine a build VM was fired up to do a package build with gcc, because manually building all the dependencies would just take too long and we had spent almost two days on this already.

The next day ~200 packages were available to bisect and figure out what was going on. After a couple of tries it turned out that the hang was caused by the gtk+3 package, which is bad since almost everything uses gtk+3. Now it was time to figure out which file of the gtk+3 source, when built by clang, was causing the issue. (Compiler optimizations had already been ruled out at this point.) So another round of bisecting happened, building each subdirectory of gtk+3 with clang and waiting for the hang to manifest … and it did not. What the $f?

Okay, so something else was going on. Maybe the configure script of gtk+3 was doing something weird with different compilers, so I quickly did two configure runs, one with gcc and one with clang, and simply diff'd the two directories. Snippets from the diff:

    -GDK_HIDDEN_VISIBILITY_CFLAGS = -fvisibility=hidden
    +GDK_HIDDEN_VISIBILITY_CFLAGS =
    -lt_cv_prog_compiler_rtti_exceptions=no
    +lt_cv_prog_compiler_rtti_exceptions=yes
    -#define _GDK_EXTERN __attribute__((visibility("default"))) extern
    -lt_prog_compiler_no_builtin_flag=' -fno-builtin'
    +lt_prog_compiler_no_builtin_flag=' -fno-builtin -fno-rtti -fno-exceptions'

Okay, okay that's something, but wait … clang has symbol visibility support so what is going on again? Let's take a peek at config.log:

    configure:29137: checking for -fvisibility=hidden compiler flag
    configure:29150: cc -c -fvisibility=hidden -I/usr/local/include -I/usr/X11R6/include conftest.c >&5
    conftest.c:82:17: error: function definition is not allowed here
    int main (void) { return 0; }
                    ^
    1 error generated.

Okay that's clearly an error but why exactly? autoconf basically generates a huge shell script that will check for whatever you throw at it by creating a file called conftest.c and putting chunks of code into it and then trying to compile it. In this case the relevant part of the code was:

    | int
    | main ()
    | {
    | int main (void) { return 0; }
    |   ;
    |   return 0;
    | }

That is a nested function definition, which is a GNU extension that clang does not support. That's okay, but the question is why the hell you would use nested functions to check for simple compiler flags. The next step was to check configure.ac to see how the configure script is generated. In the gtk+3 case the following snippet is used:

    AC_MSG_CHECKING([for -fvisibility=hidden compiler flag])
    AC_TRY_COMPILE([], [int main (void) { return 0; }],
                   AC_MSG_RESULT(yes)
                   enable_fvisibility_hidden=yes,
                   AC_MSG_RESULT(no)
                   enable_fvisibility_hidden=no)

According to the autoconf manual the AC_TRY_COMPILE macro accepts the following parameters:

    AC_TRY_COMPILE (includes, function-body, [action-if-found], [action-if-not-found])

        Create a test program in the current language (see Language Choice) to see whether a function whose body consists of function-body can be compiled. If the file compiles successfully, run shell commands action-if-found, otherwise run action-if-not-found.

That clearly states that only a function body has to be specified, because the surrounding function definition is already provided automatically. So doing AC_TRY_COMPILE([], [int main (void) { return 0; }], ...) instead of AC_TRY_COMPILE([], [], ...) will result in a nested function definition, which works just fine with gcc, even though the autoconf usage is wrong.
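The wrapping described here can be sketched in a few lines of shell. This is only an illustration of the mechanism, not autoconf's generated code: try_compile is a hypothetical helper, and the conftest.c layout is copied from the config.log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not part of autoconf): mimics how a configure check
# wraps a snippet in main() before handing it to the compiler.
try_compile() {
    compiler=$1   # compiler command plus flags, e.g. "cc -fvisibility=hidden"
    body=$2       # text that ends up *inside* main()
    dir=$(mktemp -d) || return 2
    cat > "$dir/conftest.c" <<EOF
int
main ()
{
$body
  ;
  return 0;
}
EOF
    # $compiler is deliberately unquoted so that flags are split into words.
    if $compiler -c "$dir/conftest.c" -o "$dir/conftest.o" 2>/dev/null; then
        result=yes
    else
        result=no
    fi
    rm -rf "$dir"
    echo "$result"
}

# Example calls (commented out so that sourcing this file runs nothing):
# try_compile "cc -fvisibility=hidden" ""
# try_compile "cc -fvisibility=hidden" "int main (void) { return 0; }"
```

The first call passes an empty body, which is what the macro expects. The second reproduces the nested function, so the answer then depends on whether the compiler accepts that GNU extension rather than on the flag being tested.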

A quick example:

    AC_INIT(foobar, 1.0)
    AC_PROG_CC
    CFLAGS="-Wall"
    AC_MSG_CHECKING([for -Wall compiler flag])
    AC_TRY_COMPILE([], [int main (void) { return 0; }],
        AC_MSG_RESULT(yes), AC_MSG_RESULT(no))

GCC output:

    checking for gcc... gcc
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether gcc accepts -g... yes
    checking for gcc option to accept ISO C89... none needed
    checking for -Wall compiler flag... *yes*

Clang output:

    checking for gcc... clang
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether clang accepts -g... yes
    checking for clang option to accept ISO C89... none needed
    checking for -Wall compiler flag... *no*

The above example clearly shows that switching to clang as the default compiler triggered what amounts to undefined behaviour in autoconf: people do not use autoconf the way it was intended, and they only got away with it because they were using GCC.

After fixing the autoconf macro in gtk+3 and rebuilding the complete port from scratch with clang, the hang went away completely, as the proper CFLAGS and LDFLAGS were picked up by autoconf for the build.
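Applied to the foobar example from earlier, the same kind of fix means leaving the function-body argument empty. The sketch below writes such a configure.ac into a directory of your choice. The helper name write_fixed_configure_ac is made up for illustration, and the file is a corrected variant of the example, not the actual gtk+3 patch.

```shell
#!/bin/sh
# Hypothetical helper: writes a corrected variant of the foobar example
# into the directory given as the first argument.
write_fixed_configure_ac() {
    outdir=$1
    cat > "$outdir/configure.ac" <<'EOF'
AC_INIT(foobar, 1.0)
AC_PROG_CC
CFLAGS="-Wall"
AC_MSG_CHECKING([for -Wall compiler flag])
dnl Empty function body: the macro wraps it in main() on its own.
AC_TRY_COMPILE([], [],
    AC_MSG_RESULT(yes), AC_MSG_RESULT(no))
EOF
}
```

Running autoconf and then ./configure in that directory, once with CC=gcc and once with CC=clang, is one way to check that the result no longer depends on nested function support.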

At this point we realized that most of the ports tree uses autoconf, so this issue might be a lot bigger than we thought. I asked sthen@ to grep the ports object directory for "function definition is not allowed here", which turned up about 60 additional affected ports.
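The exact command sthen@ ran is not recorded here, but a search of that kind could look roughly like the following sketch. The helper name and the placeholder path are assumptions. It only looks at files named config.log, which is where configure records compiler errors.

```shell
#!/bin/sh
# Hypothetical helper: lists every config.log below a directory that
# contains clang's complaint about a nested function definition.
find_nested_function_errors() {
    objdir=$1   # top of the ports object directory (site specific)
    find "$objdir" -type f -name config.log \
        -exec grep -lF 'function definition is not allowed here' {} +
}

# Example call (placeholder path, adjust to your setup):
# find_nested_function_errors /path/to/ports/objdir
```

Each path printed belongs to one build directory, so the list still has to be mapped back to ports and checked by hand for false positives.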

Out of that list of ports only two were false positives: these were actually trying to test whether the compiler supports nested functions. The rest were a combination of several autoconf macros used in the wrong way, e.g. AC_TRY_COMPILE and AC_TRY_LINK. Most of them were fixable by just removing the extra function declaration or by switching to other autoconf macros like AC_LANG_SOURCE, where you can actually declare your own functions if need be.

Another gem from one of the ports as the last example :)

    | int
    | main ()
    | {
    |
    | #include "stdio.h"

The conclusion is that this issue was a combination of two things: people copy/pasting autoconf snippets instead of reading the documentation and using the macros the way they were intended, and the fact that switching to a new compiler is never easy, with bugs or undefined behaviour always lurking in the dark.

Thanks to everyone who helped fix all the ports this quickly! Hopefully all of the changes can be merged upstream, so that others can benefit as well.