March 2001/Navigating Linux Source Code

Introduction

Like many programmers, Ive had to maintain and enhance very large software products not of my own creation. (In my case, each of these products had at least 1.5 million lines of code.) Frequently, the code was poorly documented and consisted of numerous interrelated programs whose source was distributed throughout a multitude of directories. As is often the case with large, complex, multi-platform software systems, using an IDE (Integrated Development Environment), with its attendant conveniences, is not a realistic option. Moreover, familiarizing yourself with such code bases is a daunting task in which even finding a function, type, or macro definition can be quite time consuming. Fortunately, there are several open source tools available, all derivatives of the Unix ctags program, to assist you in understanding and navigating large software systems. These tools include the original Unix ctags program, the ID Utilities, and the GLOBAL source code browser.

All of the tools are based on the concept of a tag. A tag is an index entry, stored in a tags file, for a programming language identifier. The tags indicate the locations (in terms of file/line number pairs) where the identifier is defined and referenced. Programming language identifiers include functions, structure and union names, type names, and variable names (essentially, anything that is entered into a compiler's symbol table), as well as macros. As an example of how tags are used, the ctags program (which parses source code files and generates a file of tags) allows these identifiers to be quickly located while you are editing source code within a text editor. For instance, if you're coding in the vi text editor and you need to see the definition of a particular C function, the tags mechanism can immediately take you to it with a Ctrl-] keystroke.

While the GLOBAL, ID Utilities, and ctags programs are all built upon a tags core and are all designed to help you rapidly navigate source code, the circumstances under which each tool is best applied differ significantly. Source code browsers such as GLOBAL, which map your source code to HTML, are most useful for an initial examination of existing code, whereas ctags and the ID Utilities are essential for your day-to-day programming work. I'll begin by examining the Unix ctags program.

Using Tags

The Unix ctags command creates a tags file, which contains index entries for programming language identifiers. These entries specify the file/line number pairs that indicate where the identifiers are defined or referenced. The tags file is used by the vi editor's :tag command to position the editor's cursor on the identifier's definition within the appropriate file. The ctags command may be run either at the colon prompt within vi or outside the editor from the shell command line. (Note that tags may be used within Emacs and a variety of other popular editors for Linux, Unix, and Windows. I use vi as a convenient and ubiquitous example.)
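The lookup an editor performs against a tags file can be sketched in a few lines of Python. This is only an illustration, not vi's implementation. It assumes the traditional tags layout of one entry per line, with the identifier, the file name, and an ex address (a line number or search pattern) separated by tabs. The function names load_tags and find_tag are hypothetical.

```python
def load_tags(tag_lines):
    """Map each identifier to a list of (file, address) pairs."""
    index = {}
    for line in tag_lines:
        # Exuberant ctags writes header lines that begin with "!_TAG_".
        if not line.strip() or line.startswith("!_TAG_"):
            continue
        # Split on the first two tabs only; the address may contain more.
        name, filename, address = line.rstrip("\n").split("\t", 2)
        index.setdefault(name, []).append((filename, address))
    return index


def find_tag(index, identifier):
    """Return the first recorded location of identifier, or None."""
    locations = index.get(identifier)
    return locations[0] if locations else None
```

Given an entry such as do_fork, kernel/fork.c, and a search pattern, find_tag returns the file to open and the address to jump to within it.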

Creating the Tags File

When executing ctags from a Unix shell (or MS-DOS) command line, the current working directory should be a source code directory. For example, to create a tags file for C source and header files on a Unix system, execute the following command: ctags -t *.c *.h. Many versions of the ctags command support the -t option, which creates tag file entries for user-defined types. You should always use this option when it is available.

The ctags command universally supports C files (and usually Pascal and Fortran as well), but, with rare exceptions, C++ support is not available. Fortunately, there are extended versions of ctags, notably Darren Hiebert's Exuberant ctags, that support C++ in addition to C, Fortran, Java, and even Eiffel. For more information, please refer to the Exuberant ctags home page: http://home.hiwaay.net/~darren/ctags/.
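To make the idea of generating a tags file concrete, here is a toy Python sketch of what a tag generator does: scan each source file and emit one entry per definition it recognizes. It is not how ctags is implemented. The regular expression is a deliberately crude assumption that only catches C function definitions written on one line starting in column 0, and make_tags is a hypothetical name.

```python
import re

# Toy pattern for a C function definition that starts in column 0, such as
# "int main(void)" or "static struct page *get_page(int n)". Lines that
# contain a semicolon (prototypes, statements) are rejected. A real tag
# generator uses a proper scanner; this regex is only an illustration.
FUNC_DEF = re.compile(r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*$")


def make_tags(sources):
    """Return sorted tag lines for a {filename: source_text} mapping.

    Each line has the layout: name<TAB>file<TAB>line-number.
    """
    tags = []
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = FUNC_DEF.match(line)
            if match:
                tags.append(f"{match.group(1)}\t{filename}\t{lineno}")
    return sorted(tags)
```

Sorting the entries mirrors the fact that tags files are normally kept in sorted order so that an editor can search them quickly.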

Using Tags within the vi Editor

From within the vi editor, the command :tag identifier will position the editor's cursor on the definition of the identifier, which may be either a function name or a user-defined type. A more convenient method is to use the Ctrl-] key sequence from within the vi editor's command mode. Position the cursor on the first character of the identifier while in command mode (not insert mode) and press Ctrl-].

Returning to your original location within the editor is just as important as moving to the definition of the identifier. The :tag command (or the Ctrl-] key) may transition the editor to an entirely new file, or it may reposition the cursor within the same file. To return to the original file you were editing, use the :e # command. To return to your original position within the same file, type :'' (a colon followed by two single quotes). In Vim and other vi clones that keep a tag stack, Ctrl-t returns you to the location from which you jumped.

As with any Unix command, the ctags command may be executed from within vi while in command mode by preceding the command with an exclamation mark, as in: :!ctags *.c. The exclamation mark tells vi to create a shell and treat the command that follows as a shell command. (Actually, the colon invokes ex, the editor upon which vi is built.)

These simple vi commands can save you a great deal of time and trouble traversing your source code. Fortunately, if you need to maintain legacy code on even the oldest of Unix workstations, ctags will most likely be present (or you can obtain a version of ctags for that workstation). I've found it on every Unix platform I've used. Considering its ubiquity, utility, and longevity, I'm surprised at how few programmers use ctags or are even aware of it. While ctags is an excellent tool and well worth the brief time needed to learn it, the more recent ID Utilities, which may be thought of as an augmented ctags, offer you even more functionality and convenience.

Using the ID Utilities

The ID Utilities (ID stands for Identifier) are a set of text-search programs that query an ID database for tokens such as C identifiers (variables, functions, user-defined types), strings, and constant literals. The ID database is a binary file that contains a list of tokens along with a list of files (and line numbers within those files) in which the tokens appear. The ID database is essentially a tags file; it implements a sparse matrix that associates file/line numbers with tokens. The ID Utilities are invaluable when you're searching your code base for type definitions, string literals, or references to variables and functions, particularly when the source files span many directories. The search times are remarkably fast as well. (The sizes of the source code bases I've dealt with have precluded using the Unix find and grep commands in combination; this approach is unacceptably slow.)
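The token-to-location association described above can be pictured as an inverted index. The Python sketch below is a toy, in-memory version under stated assumptions: it treats only C-style identifiers as tokens and stores line numbers per file. It says nothing about the actual binary layout of the ID database, and build_index and grep_token are hypothetical names.

```python
import re
from collections import defaultdict

# Assumption: a token is a C-style identifier. The real scanners also
# record numbers and strings.
TOKEN = re.compile(r"\b[A-Za-z_]\w*")


def build_index(sources):
    """Map token -> {filename: [line numbers]} for a {filename: text} dict."""
    index = defaultdict(lambda: defaultdict(list))
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            # set() records each line once per token, even if it repeats.
            for token in set(TOKEN.findall(line)):
                index[token][filename].append(lineno)
    return index


def grep_token(index, sources, token):
    """Return 'file:line: text' strings for every line containing token."""
    hits = []
    for filename in sorted(index.get(token, {})):
        lines = sources[filename].splitlines()
        for lineno in index[token][filename]:
            hits.append(f"{filename}:{lineno}: {lines[lineno - 1].strip()}")
    return hits
```

The index is built once, so each query is a dictionary lookup rather than a scan of every file, which is the reason a prebuilt database beats find and grep on large trees.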

The ID Commands and Aliases

The ID Utilities include the following commands:

lid - queries the ID database for tokens and then reports matching filenames with matching lines.

fid - lists all tokens recorded in the database for given files or tokens common to two files.

fnid - matches the filenames in the database, rather than the tokens.

The following very useful aliases are also provided:

gid - an alias for 'lid -R grep', lists all lines containing the requested pattern.

aid - an alias for 'lid -ils', treats the requested pattern as a case-insensitive literal substring.

eid - an alias for 'lid -R edit', invokes an editor on all files containing the requested pattern and, if possible, initiates a text search for that pattern.

The three commands Ive found to be indispensable are gid, lid, and fnid.

The gid Alias

The most useful of the ID Utility commands is gid. Let's say that you're interested in finding all uses of the INXl_pg_typ user-defined type. Executing the command gid INXl_pg_typ will produce a list of files (with line numbers) and the text of each source line where INXl_pg_typ appears.

A sample output line is shown below:

/rcs_source/INXl_tabmaint.c:297: INXl_pg_typ *gp;

The lid Command

The most versatile of the ID Utility commands is lid. The lid command's many options (from which the gid alias is derived) are displayed when the --help argument is used (note that two dashes are used). (The --help option may be used with the other ID Utility commands as well.)

An extremely useful option to lid is -l, which interprets a specified pattern as a literal string. Suppose you want to find all uses of the string "doc". The lid -l doc command will produce output similar to the following:

doc /rcs_source/rdb/inc/RDB.h /rcs_source/rdb/inc/retrieve.h

The specified string is followed by several entries indicating the files where it occurs.

Other options to lid allow the use of regular expressions, the matching of decimal or hex numbers, and finding identifiers based on the frequency of their use.

The fnid Command

The fnid command locates the subdirectory where a source file can be found. For example, the command fnid docimport.c returns /rcs_source/os/docimport/src/docimport.c, indicating that the file docimport.c is located in the os subsystem in its own subdirectory of my source code tree. This command is extremely fast. Using the find command and piping it to grep to accomplish this is unacceptably slow for large code bases, but fnid does the job quite handily.
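The speed difference comes from consulting a prebuilt list of filenames instead of walking the directory tree on every query. A minimal Python sketch of that idea follows; it is an illustration rather than fnid's implementation, and build_name_index and find_file are hypothetical names.

```python
import posixpath
from collections import defaultdict


def build_name_index(paths):
    """Map each base filename to every full path that ends with it."""
    index = defaultdict(list)
    for path in paths:
        index[posixpath.basename(path)].append(path)
    return index


def find_file(index, name):
    """Return all recorded paths whose base filename equals name."""
    return index.get(name, [])
```

The list of paths is gathered once, when the database is built, so each later lookup costs a single dictionary access no matter how many directories the tree contains.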

Setting up the ID Utilities

Installing the ID Utilities on Linux and Unix platforms is accomplished by running a configuration script and then executing the command make install, a procedure common to many open source programs. The ID database is built by invoking the remarkably fast mkid command. The mkid command should be scheduled as a cron job to regularly update the ID database during off-peak periods. (The ID Utilities have been ported to Windows NT, but the port was not yet readily available at the time of writing.)

The ID Utilities include scanners for C and C++ and provide support for lex and yacc. A scanner for assembly language files is also provided. You may also define your own language scanners. The rules for mapping scanners to filename extensions are defined in the file id-lang.map, allowing the appropriate scanner to be invoked for files with non-standard extensions. (All of the legacy systems I've worked on have used non-standard extensions for header files devoted to specific purposes. For example, .defs has frequently been used to denote header files containing type and macro definitions.)
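The idea of mapping filename patterns to scanners can be sketched as an ordered rule table in which the first matching pattern wins. The Python below is only an illustration of that idea. The rules, the scanner labels, and the pick_scanner function are all hypothetical, and the table does not reproduce the actual syntax of id-lang.map.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern-to-scanner rules, checked in order. The ".defs"
# entry shows how a non-standard header extension could be routed to
# the C scanner.
RULES = [
    ("*.defs", "C"),
    ("*.h", "C"),
    ("*.c", "C"),
    ("*.cpp", "C++"),
    ("*.s", "asm"),
]


def pick_scanner(filename, rules=RULES, default="text"):
    """Return the scanner label of the first rule whose pattern matches."""
    for pattern, scanner in rules:
        if fnmatchcase(filename, pattern):
            return scanner
    return default
```

With these rules, a file such as inc/types.defs is handled by the C scanner, while a file with no matching rule falls back to the default.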

The ID Utilities are an indispensable part of my programming toolkit. The speed and versatility of the ID Utility programs make them ideal for large legacy systems, where they supplement (and largely supplant) ctags and outperform the grep and find programs. I've found using the ID Utilities (and ctags) much more convenient than the pull-down menus of an IDE to access function references or definitions. The ID Utilities are available from www.gnu.org.

Familiarizing yourself with the source code of a large legacy system can be overwhelming. Both ctags and the ID Utilities can assist you in doing this, but more suitable tags-based tools are available. Source code browsers that map your code to HTML using an underlying tags mechanism are ideal for this task.

Source Code Browsers

There are many source code browsers available that convert source code to HTML and use a tags-like system to create hyperlinks between the use and definition of identifiers. Within the generated HTML, a declaration involving a user-defined type will be linked to its definition in, for example, a header file. These tools are extremely valuable when you are first familiarizing yourself with a large code base. They usually display the source code in a manner similar to that of a context-sensitive editor, with keywords highlighted in bold, comments and strings displayed in different colors, and so on.

GLOBAL

An excellent example of a tags-based source code browser is GLOBAL. GLOBAL has been used to generate source code in HTML form (suitable for standard web browsers) for both the Linux and FreeBSD operating systems' C sources. Figure 1 shows the main index page that GLOBAL created for Linux version 2.0.35, displayed in the Netscape browser. It indicates the files that contain main functions (and thus, the files that are top-level files for various executable programs). It provides a complete function index and an index of the files that comprise each subsystem.

Figure 2 shows the source code for the Linux kernel's do_fork function. Note that C keywords are shown in bold and comments in green, and that function definitions are hyperlinked to a list of locations (filename/line number) from which they are called. Function calls are hyperlinked to their definitions. User-defined types are hyperlinked in a similar manner.
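The linking of function calls to their definitions can be sketched in Python as follows. This is a simplified illustration, not GLOBAL's implementation. It handles only the use-to-definition direction, and the anchor names, the ".html" file naming, and the function names link_line and render_file are all assumptions made for the example.

```python
import html
import re

IDENT = re.compile(r"\b[A-Za-z_]\w*")


def link_line(line, filename, lineno, definitions):
    """Return one source line as HTML, linking known identifiers.

    definitions maps an identifier to the (file, line) of its definition.
    """
    parts, pos = [], 0
    for match in IDENT.finditer(line):
        # Escape the text between identifiers (operators, brackets, ...).
        parts.append(html.escape(line[pos:match.start()]))
        name = match.group(0)
        target = definitions.get(name)
        if target is None or target == (filename, lineno):
            parts.append(name)  # unknown identifier, or the definition itself
        else:
            parts.append(f'<a href="{target[0]}.html#L{target[1]}">{name}</a>')
        pos = match.end()
    parts.append(html.escape(line[pos:]))
    return "".join(parts)


def render_file(filename, text, definitions):
    """Render a whole file, giving every line an anchor to link to."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        rows.append(f'<a name="L{lineno}"></a>'
                    + link_line(line, filename, lineno, definitions))
    return "<pre>\n" + "\n".join(rows) + "\n</pre>"
```

The definitions dictionary plays the role of the tag database: once it exists, producing the hypertext is a single pass over each source file.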

A navigation bar is embedded into a C comment at the beginning of each function. It allows you to go forward or back one function in the file, go to the top or bottom of the file, access the function or file index page, or go to the navigation help page. The navigation bar is shown below.

/* [<][>][^][v][top][bottom][index][help] */

GLOBAL 3.52 has had difficulty with some of the macros used in my source code, in particular any that expanded to white space, and it has generated a few incorrect function references and hyperlinks. This is a minor problem and doesn't compromise GLOBAL's utility.

Generating an HTML Version of a Source Directory

Once the GLOBAL executables and libraries are built, you're ready to generate an HTML version of the source code. There are two GLOBAL commands used to generate the HTML: gtags and htags. gtags is very similar to the ctags command discussed above. It generates the tag database (GTAGS, GRTAGS, and GSYMS files), which is then used by the htags HTML generator command. The commands are as follows:

> gtags    # make the tag database (GTAGS, GRTAGS, GSYMS)
> htags    # make the hypertext (HTML/)

htags creates a subdirectory called HTML in the directory from which it is invoked. You can then browse the source code by starting at HTML/index.html. Expect the HTML to require about five times the disk space of the original source code, so you may want to generate it in another directory. The -d <tagdir> option may be used to specify the directory where the tag database resides. A directory pathname given as a plain argument, without an option flag, is taken to be the target directory in which the HTML directory will be placed. For example:

> cd /rcs_source/is/src
> gtags
> htags /home/jbonang/INX

In this case, the HTML for the source code in directory /rcs_source/is/src will be generated in directory /home/jbonang/INX. Note that gtags is an executable (a file named gtags.exe on Win32 platforms) and that htags is a Perl script (a file named htags.pl).

Once the HTML is generated, you can begin browsing your source code. The initial page will display the main programs of your software system. I usually begin studying a new software system by first getting a quick overview of the executable programs. After ranking them in order of importance, I use GLOBAL to explore the major functions and data types of each.

GLOBAL is available for Linux and most Unix platforms, as well as Win32. It requires Perl and, for Win32 systems, the following Unix command line utility programs: sed.exe, sort.exe, and uniq.exe. For more information on GLOBAL, please refer to the following website: http://www.tamacom.com/Unix/index.html#global.

On the Horizon

GLOBAL is an excellent HTML source-code browser generator and is the precursor of yet more powerful programming environments known as Hypercode environments. A Hypercode environment will allow you to edit source code represented as HTML within a browser and much more. You will be able to create links from source code to on-line formal documentation; to other source code files; to technical reports that, for instance, describe an algorithm you used; or even to email, chat transcripts, and videos. (Microsoft's yet-to-be-released C# programming language uses attributes that allow hyperlinks to external documentation to be specified within source code files. I anticipate the open source community will introduce similar functionality into established languages and development tools in the near future.) For more information on Hypercode environments, visit the MIT Artificial Intelligence Lab website: http://www.ai.mit.edu/projects/transit/rc-demo/demo.html.

Summary

With the ctags command and the ID Utilities, you can find function, macro, typedef, and variable definitions, as well as the location of string literals, quickly and conveniently from within your favorite text editor. You can also find where particular source files reside within a complex directory structure. Source code browsers, such as GLOBAL, convert your source code to HTML, allowing you to conveniently peruse it in a standard browser and understand it rapidly. Taking advantage of these tag-based tools will ease your software maintenance chores and expedite your new development projects. You can also look forward to the introduction of Hypercode environments that will make developing and supporting large software products significantly easier.

References

[1] Linda Lamb and Arnold Robbins. Learning the vi Editor, 6th Edition (O'Reilly & Associates, Inc., 1998).

[2] Darren Hiebert's Exuberant ctags home page: http://home.hiwaay.net/~darren/ctags/.

[3] The GLOBAL home page: http://www.tamacom.com/unix/index.html#global.

[4] The GNU home page (where the ID Utilities may be obtained): http://www.gnu.org.

[5] The MIT Artificial Intelligence Lab's Hypercode page: http://www.ai.mit.edu/projects/transit/rc-demo/demo.html.

James Bonang is a Principal Engineer at FileNET Inc.