The Compact C Type Format in the GNU toolchain

The Compact C Type Format (CTF) is a way of representing information about a binary program; it can be seen as a simpler alternative to the widely used DWARF format. While CTF has been around for some years, it has not seen much use in the Linux world. According to Elena Zannoni, who talked about CTF at the 2019 Open Source Summit Japan, that situation may be about to change; work is underway to bring CTF support to the GNU tools shipped universally with Linux systems.

Compiling a program into its binary form discards a lot of information found in the source code; that information can be needed when the time comes to track down a bug in the compiled program. To facilitate this work, compilers create debugging information that records the names and types of the variables used by a program, along with function names, the line numbers in the source program, and more; this information is then stored in one of many formats. DWARF is by far the most commonly used format on Unix-like systems, but it is not the only one.

Given the dominance of DWARF, one might wonder why anyone would want to work on alternatives. One problem is that DWARF is complex; rather than containing straightforward information about a program, a DWARF entry is essentially a program in its own right that can be run to generate the needed information. That makes DWARF flexible, but it's also complicated and verbose; the DWARF data associated with a program can be huge. That size means that, on most systems, the DWARF data for the installed programs is relegated to "debuginfo" packages that are not even present unless the owner has gone out of their way to install them.

CTF was created out of a desire to be able to perform most debugging tasks even in the absence of debuginfo packages, and to be able to do so in a simpler and faster way. DWARF can also expose a lot of information about the program source that some companies might wish to keep to themselves; CTF contains a lot less unneeded information. The CTF format was first created for the Solaris system, but has been used with the Linux DTrace port since 2012. There is, she said, nothing DTrace-specific about CTF, though. It is also available on FreeBSD and macOS.

The key difference between CTF and DWARF, perhaps, is that CTF limits itself to modeling the type system and managing the mapping from symbol-table entries to specific types. DWARF is much more ambitious, she said, modeling everything relating to the C language and how it maps to the hardware. CTF's simplicity means it can omit location lists, stack machines, and a lot of other machinery.

Bringing CTF to Linux requires contributing a lot of code upstream. One piece of the puzzle is adding support to GCC; the new -gt option will cause the compiler to generate CTF data. Getting that data into the final executable requires support in the binutils package as well; this includes enhancing the linker as well as adding support to tools like objdump and readelf. Work is also being done to get CTF support into the GDB debugger. Zannoni showed some sample output from the size utility; the CTF data (stored in an ELF section named .ctf) required about 213KB of space, as compared to over 4MB for the DWARF data for the same program. DWARF data for the kernel requires 1.6GB; the CTF data fits in just under 7MB.

CTF and DWARF data can coexist in the same ELF file, she said, since the CTF data has its own dedicated section. The CTF data is naturally smaller, but the format also includes compression to reduce the size requirements further. The result is that this data, unlike DWARF information, need not be stripped to get the executable file down to a reasonable size. Thus, while DWARF data is normally shipped in separate debuginfo packages, CTF data is easily included in the binary package and can be always available.

Internally, CTF data is stored in a structure called a "container" or a "dictionary". Each dictionary contains a header and a number of subsections dedicated to data like function information, variable information, types, and a string table for names not already present in the ELF string table. The header starts with a magic number (useful for determining the endianness of the rest of the data), a version number, and a set of flags. Version one is the original Solaris CTF, while version two was created during the porting of DTrace to Linux. It mostly increases a number of limits found in the first version. The third version is still being defined; it will include a number of header changes among other things. There is even a fourth version in an "initial planning stage". The intent is to keep this data ABI compatible, though, she said.
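As a rough illustration of how a magic number can reveal the byte order of the data that follows, here is a minimal sketch in C. The structure layout, the sketch_ names, and the magic value 0xdff2 are assumptions made for this example; they should be checked against the C header file that currently serves as the format's specification rather than taken as the definitive layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed magic value for illustration; verify against the real header. */
#define SKETCH_CTF_MAGIC 0xdff2u

/* Hypothetical, simplified preamble: magic number, version, flags. */
struct sketch_preamble {
    uint16_t magic;
    uint8_t version;
    uint8_t flags;
};

enum sketch_order { SKETCH_NATIVE, SKETCH_SWAPPED, SKETCH_INVALID };

/* Swap the two bytes of a 16-bit value. */
static uint16_t sketch_swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Read the preamble from a raw buffer.  If the magic number matches as
 * read, the data uses the reader's byte order; if it matches only after
 * swapping, the data was written on a machine of the opposite
 * endianness and every multi-byte field would need swapping too.
 */
enum sketch_order sketch_read_preamble(const unsigned char *buf, size_t len,
                                       struct sketch_preamble *out)
{
    uint16_t magic;
    enum sketch_order order;

    assert(buf != NULL && out != NULL);
    if (len < 4)
        return SKETCH_INVALID;

    memcpy(&magic, buf, sizeof magic);
    if (magic == SKETCH_CTF_MAGIC) {
        order = SKETCH_NATIVE;
    } else if (sketch_swap16(magic) == SKETCH_CTF_MAGIC) {
        order = SKETCH_SWAPPED;
    } else {
        return SKETCH_INVALID;
    }

    out->magic = SKETCH_CTF_MAGIC;
    out->version = buf[2];   /* single bytes need no swapping */
    out->flags = buf[3];
    return order;
}
```

A caller would pass the first bytes of the .ctf section to sketch_read_preamble() and, on a SKETCH_SWAPPED result, byte-swap the remaining multi-byte header fields before using them.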

Returning to linker support, Zannoni noted that GCC will place a single .ctf section in each object file it creates. The linker then has to take these sections and merge them into a larger section, removing any duplicate information. There is a potential problem, though, in that different object files may define conflicting objects using the same names. When this happens, the linker will create a child dictionary associated with a specific translation unit for the conflicting data. Most of the time, though, a linked executable will contain one large shared CTF dictionary and perhaps a small number of tiny subdictionaries.
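The merging behavior described above can be sketched roughly as follows. This is not the binutils implementation: the types, the function names, and the idea of comparing definitions as strings are all simplifications invented for this example, and the child dictionaries are modeled as a single list of entries tagged with a translation-unit number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TYPES 64

/* Hypothetical type record: a name, a textual definition, and the
 * number of the translation unit that contributed it. */
struct type_entry {
    const char *name;
    const char *definition;
    int unit;
};

struct dict {
    struct type_entry entries[MAX_TYPES];
    size_t count;
};

/* One shared dictionary plus per-unit "child" entries for conflicts. */
struct merged {
    struct dict shared;
    struct dict children;
};

enum merge_result { MERGE_DUPLICATE, MERGE_SHARED, MERGE_CHILD, MERGE_FULL };

/* Append an entry; returns 0 if the dictionary is full. */
static int dict_add(struct dict *d, int unit, const char *name,
                    const char *definition)
{
    if (d->count == MAX_TYPES)
        return 0;
    d->entries[d->count].name = name;
    d->entries[d->count].definition = definition;
    d->entries[d->count].unit = unit;
    d->count++;
    return 1;
}

/*
 * Merge one type from a translation unit:
 *  - same name and same definition already shared: drop the duplicate;
 *  - same name but a different definition: record it in the child
 *    entries for that unit;
 *  - otherwise: add it to the shared dictionary.
 */
enum merge_result merge_type(struct merged *m, int unit, const char *name,
                             const char *definition)
{
    size_t i;

    assert(m != NULL && name != NULL && definition != NULL);
    for (i = 0; i < m->shared.count; i++) {
        const struct type_entry *e = &m->shared.entries[i];

        if (strcmp(e->name, name) != 0)
            continue;
        if (strcmp(e->definition, definition) == 0)
            return MERGE_DUPLICATE;
        return dict_add(&m->children, unit, name, definition)
                   ? MERGE_CHILD : MERGE_FULL;
    }
    return dict_add(&m->shared, unit, name, definition)
               ? MERGE_SHARED : MERGE_FULL;
}
```

In this sketch, feeding the same definition of a type from many object files leaves a single shared entry, which matches the common case Zannoni described; only a genuinely conflicting definition ends up in a per-unit child.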

A libctf library that implements the ability to read and write CTF data is being added to the binutils package; it is used by the compiler, the linker, and the debugger. This library, along with the readelf and objdump changes, was merged into the binutils trunk in May; the linker changes have been posted but need some more work before they can be merged. The hope is that all of the CTF support will land in binutils for the upcoming 2.33 release.

The GCC patches have been posted a few times; they too are being modified in response to review comments. One piece that has not yet been posted is link-time optimization support, but it is coming soon. With luck, she said, all of this support will be merged in time for the GCC 10 release due in 2020. GDB support is also under discussion on the project mailing list; the GDB 8.4 release is being targeted for this work.

Zannoni closed with a look at where things go from here. There is a set of discussions planned for the Toolchains microconference at the 2019 Linux Plumbers Conference. There are, evidently, still optimizations that can be made to further reduce the size of CTF data. There is also a fairly significant gap in that backtrace support is not yet present for CTF. An expansion to languages other than C is on the horizon. Then, there is that perennial lowest priority for development teams: documentation. The "specification" for the format lives in a C header file for now; that will clearly need to change in the future.

[Your editor thanks the Linux Foundation for supporting his travel to the event.]

