C, Fortran, and single-character strings

The calling interfaces between programming languages are, by their nature, ripe for misunderstandings; different languages can have subtly different ideas of how data should be passed around. Such misunderstandings often have the effect of making things break right away; these are quickly fixed. Others can persist for years or even decades before jumping out of the shadows and making things fail. A problem of the latter variety recently turned up in how some C programs are passing strings to Fortran subroutines, with unpleasant effects on widely used packages like LAPACK.

The C language famously does not worry much about the length of strings, which simply extend until the null byte at the end. Fortran, though, likes to know the sizes of the strings it is dealing with. When strings are passed as arguments to functions or subroutines, the GCC Fortran argument-passing conventions state that the length of each string is to be appended to the list of arguments. Consider a Fortran subroutine defined something like this:

    subroutine foo(i, s)
        integer i
        character s
        ...

When that subroutine is called from other Fortran code, the length of s will be added by the compiler as a third, "hidden" argument. The C compiler will do no such thing, though, so the proper call to that function from C would look like:

    int i;
    char *s = "bar";

    foo(&i, s, strlen(s));

From C, the length of s must be passed explicitly at the end of the list of arguments.
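As a minimal sketch of that convention, the following C code uses a stand-in function, written in C, with the shape that a Fortran subroutine taking (i, s) presents to a C caller. The symbol name foo_ (lowercase with a trailing underscore, as gfortran does by default), the size_t type of the hidden length (older GCC releases used int), and the call_foo() wrapper are assumptions made for illustration; none of them come from a real library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * C stand-in with the shape that "subroutine foo(i, s)" presents to a C
 * caller. Assumptions: the symbol is named foo_, and the hidden length is
 * a trailing size_t (older GCC releases used int).
 */
void foo_(int *i, const char *s, size_t s_len)
{
    assert(i != NULL && s != NULL);

    /* Fortran strings are not NUL-terminated: trust s_len only. */
    int count = 0;
    for (size_t k = 0; k < s_len; k++) {
        if (s[k] != ' ')
            count++;
    }
    *i = count;   /* number of non-blank characters */
}

/* Hypothetical wrapper that always supplies the hidden length. */
int call_foo(const char *s)
{
    int i = 0;
    foo_(&i, s, strlen(s));
    return i;
}
```

Routing every call through a wrapper like call_foo() keeps the length argument in one place, so it cannot be forgotten at individual call sites.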

At some distant point in the past, though, somebody decided that the hidden length argument should be omitted for single-character strings (those that are declared "character*1" in the called function, for example). It is not clear that any Fortran compiler anywhere ever implemented that behavior, but developers writing calls from C developed the habit of leaving out the length in that situation. As long as the called code knew that it was getting a single-character string, it would not need to check the (missing) hidden length parameter; everything worked, even though the calling standards were being violated. Various LAPACK subroutines expect single-character strings, and packages like CBLAS and LAPACKE duly leave out the length argument when calling them.

Once again, this is not how these functions are supposed to be called, but things worked anyway. At least, until they broke. It seems that the problem was originally worked out by Thomas Kalibera in the R language community: a fix for an unrelated ABI issue caused crashes with some LAPACK calls. After, seemingly, a great deal of analysis work, Kalibera figured out where things go wrong. A subroutine taking a single-character string would call another just prior to returning, passing the same string. The compiler would optimize that call into a tail call (or more properly a "sibling call" using the same parameters); prior to making the jump, it would helpfully store the string length at the end of the argument list. But that length wasn't there to begin with, and no space had been allocated for it, so the result was an unsightly stack traceback. The problem can be worked around by compiling the Fortran code with the ‑fno‑optimize‑sibling‑calls option.
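The shape of the code involved can be sketched in C. Here outer_ and inner_ are hypothetical stand-ins for two Fortran subroutines that each take a single-character string and an integer result; the trailing size_t is the assumed hidden length. When the caller supplies that length, as call_outer() does, the call in tail position is harmless. The failing callers are the ones that leave the last argument off, which is undefined behavior in C and is deliberately not shown as executable code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Fortran subroutine with a character*1 dummy. */
void inner_(const char *uplo, int *info, size_t uplo_len)
{
    assert(uplo != NULL && info != NULL);
    (void)uplo_len;   /* a one-character argument never needs its length */
    *info = (uplo[0] == 'U' || uplo[0] == 'u') ? 1 : 0;
}

/*
 * The call to inner_ is in tail position with the same parameter list, so
 * an optimizing compiler may turn it into a jump that reuses the caller's
 * argument area, including the slot for uplo_len. That slot only exists
 * if whoever called outer_ actually passed the length.
 */
void outer_(const char *uplo, int *info, size_t uplo_len)
{
    inner_(uplo, info, uplo_len);
}

/* Correct C-side call: the length (1) is passed explicitly at the end. */
int call_outer(char uplo)
{
    int info = -1;
    outer_(&uplo, &info, 1);
    return info;
}
```

The fix on the C side is only the extra trailing 1 in call_outer(); the called routines do not change.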

This behavior was reported as a GCC bug on May 3. Thomas Koenig responded:

OUCH. So, basically, people have been depending on C undefined behavior for ages, and this includes recent developments like LAPACKE. Only an accident of calling conventions has kept this "working". Oh my...

The code that fails with new compilers is widely understood to actually have been broken (if "working") for years. Richard Biener suggested that the solution was to tell the affected users to fix their code. That suggestion did not go far, though, and the GCC developers took the problem seriously; it is not a good thing for a compiler update to break code that was working before. So a solution had to be found, but it wasn't clear what the best solution would be. Simply reverting the ABI fix was not an option, since that would reintroduce a real bug of its own.

After some discussion of options that were shown not to be real solutions, Koenig returned to the use of ‑fno‑optimize‑sibling‑calls which, he said, "restores the status quo because things would go back to being fragile, nonconforming, and they would work again." He suggested that this option could be enabled by default for GCC versions 7, 8, and 9, since the code in question used to work when built with those versions. For the upcoming GCC 10 release, developers would be warned and would have around a year to fix their code.

Unfortunately, as Jakub Jelinek pointed out, that solution was not viable either. There are programs performing recursive tail calls to a significant depth; turning those tail calls into real function calls would cause them to run out of stack space and crash. He suggested instead trying to avoid the tail (or sibling) calls only when there are string arguments involved. Koenig took this work and attached it to a new ‑fbroken‑callers option, which would be enabled by default in updates to older GCC releases.

Jelinek didn't like the name of that option, though, so he reworked it into one called ‑ftail‑call‑workaround. Setting that option to two gives the same behavior as ‑fbroken‑callers, while setting it to one (the default value) limits the workaround to calls to functions without explicit prototypes. Setting it to zero disables the workaround entirely. This code has been backported for the (future) 8.4 and 9.2 releases (so far), and will appear in 10.1 as well, perhaps with a different default value.

Weinberg's second law states that "if builders built houses the way programmers built programs, the first woodpecker to come along would destroy civilization". Situations like this, where the interface between functions has been misunderstood for years, would appear to be a case in point. There is a lot of code on our systems that appears to work fine, but which is really just waiting for a woodpecker to come along and poke a hole in the right place. The GCC developers have worked out how to patch over this particular problem, but there is certainly still plenty of code out there that should never have worked, but which seems to — for now.

[Thanks to Dave Williams for the heads-up on this issue.]

