Reading Code From Top to Bottom

This is a particularly simple example of a style of code that I often see (this example is a piece of prman, sanitized to hide the function of the code it comes from):

int match(struct table *tab, char *arg)
{
    int i, found = 0;

    if (tab != NULL) {
        for(i=0; i<tab->nitems; ++i)
            if (tab->item[i].name == arg)
                break;
        found = (i < tab->nitems);
    }
    return found;
}


I would write the above like this:

int match(struct table *tab, char *arg)
{
    int i;

    if(tab==NULL) return 0;
    for(i=0;i!=tab->nitems;i++)
        if(tab->item[i].name==arg) return 1;
    return 0;
}

Note that the flag variable found has disappeared.

Most important, it reads from top to bottom, by which I mean that each step follows from the preceding step, and the code stops (that is, returns) as soon as it knows the answer. It tells a simple story:

If you want to find an element of tab->item with name==arg,

First check that tab points to something.

Then check each element of the pointed-to array, returning if you find a match.

If you didn't find it, it's not there.

The original tells a more roundabout story:

If you want to find an element of tab->item with name==arg,

First, assume it's not there.

Then check that tab points somewhere.

If so, go through the array until you see what you're after.

When you're done looking, you found a match if and only if you didn't go through the whole array.

Having done all that, inform the caller about what you determined.
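Both versions can be exercised side by side with a small harness. The struct layouts below are guesses reconstructed from the code (the sanitized original doesn't show them), and since the comparison name==arg is by pointer, the harness assumes the names are interned -- it looks up the very pointer stored in the table. The names match_old, match_new, and match_demo are mine.

```c
#include <stddef.h>

/* Hypothetical layout inferred from the code above. */
struct item { char *name; };
struct table { int nitems; struct item *item; };

/* The original version. */
int match_old(struct table *tab, char *arg)
{
    int i, found = 0;

    if (tab != NULL) {
        for(i=0; i<tab->nitems; ++i)
            if (tab->item[i].name == arg)
                break;
        found = (i < tab->nitems);
    }
    return found;
}

/* The rewritten, top-to-bottom version. */
int match_new(struct table *tab, char *arg)
{
    int i;

    if(tab==NULL) return 0;
    for(i=0;i!=tab->nitems;i++)
        if(tab->item[i].name==arg) return 1;
    return 0;
}

/* Builds a two-entry table and checks that the versions agree;
   returns 1 if every check passes. */
int match_demo(void)
{
    static char alpha[] = "alpha", beta[] = "beta";
    static struct item items[2];
    struct table t;

    items[0].name = alpha;
    items[1].name = beta;
    t.nitems = 2;
    t.item = items;
    return match_old(&t, beta) == 1          /* present: same pointer */
        && match_new(&t, beta) == 1
        && match_old(&t, alpha) == match_new(&t, alpha)
        && match_old(NULL, alpha) == 0       /* NULL table */
        && match_new(NULL, alpha) == 0;
}
```

The NULL-table case is the one place the two versions structure the answer differently, so it's worth checking explicitly.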

During the Structured Programming Era, programmers were often taught a style very much like the old version. The language they were trained in was normally Pascal, which only allowed a single return from a procedure, more or less mandating that the return value appear in a variable. Furthermore, teachers, influenced by Böhm and Jacopini (Flow Diagrams, Turing Machines and Languages with Only Two Formation Rules, Comm. ACM 9#5, 1966, pp. 366-371), often advocated reifying control structure into Boolean variables as a way of assuring that programs had reducible flowgraphs.

But reducible flowgraphs can still be deeply nested and therefore hard to read. Long chains of reasoning or long sequences of actions are not a particular problem to understand, especially for people of a mathematical or otherwise analytic bent. Sequences that branch, so that a particular step may have multiple antecedents, all but one of which will be far from it on the page, are a different matter. Reading them requires keeping in mind a list of suspended options while paying attention to something else, a combination of mental activities that people are notoriously bad at. You should try as much as possible to structure your programs so that they proceed step by step. Whenever nesting gets deep, or code refers to results computed more than a few steps previously, you're headed for trouble.

A common indicator of trouble with excessive nesting is narrow tab stops. Deeply nested code, at eight spaces per tab (as is usual on UNIX) quickly migrates over to the right margin. The easy way to fix this is to indent four spaces at a time (or even two!). The right way is to rewrite it so that it proceeds from top to bottom, not from outside to inside, left to right.

A similar phenomenon happens in mathematical proofs. Most proofs involve a chain of implications, each usually following from its immediate predecessor. When the proof requires a step that follows from some non-immediate predecessor, we are brought up short -- violation of the orderly step-by-step progress disturbs the narrative rhythm of the proof, and is a likely signal of a difficulty, either in the proof or in the reader's understanding. Mathematicians routinely hide these narrative anomalies in lemmas -- proof subroutines.

In fact, there is evidence that mathematicians choose the theorems that they work on to be amenable to proof using simple structures. For example, when Doug McIlroy investigated the problem of automatically proving theorems of Euclidean geometry (personal communication, 1984 -- I don't believe this work was ever published) he discovered that when theorems are rewritten as systems of simultaneous equations (sometimes inequalities are required as well) the systems are usually `triangular' (they're not generally linear, so they're not strictly triangular, but you know what I mean) and the proofs mostly proceed immediately by back-substitution.

Else Considered Harmful

It is often useful to eschew the else clause of the if statement. Instead of writing

int f(int x)
{
    int v;

    if(b(x)) v=f1(x);
    else v=f2(x);
    return v;
}

I would write

int f(int x)
{
    if(b(x)) return f1(x);
    return f2(x);
}


The payoff is larger when the nesting is deeper:

int g(int x)
{
    if(b1(x)){
        if(b2(x)){
            if(b3(x)){
                if(b4(x)){
                    if(b5(x)) return g1(x);
                    return g2(x);
                }
                return g3(x);
            }
            return g4(x);
        }
        return g5(x);
    }
    return g6(x);
}

With the else branches removed, the same function reads straight down:

int g(int x)
{
    if(!b1(x)) return g6(x);
    if(!b2(x)) return g5(x);
    if(!b3(x)) return g4(x);
    if(!b4(x)) return g3(x);
    if(!b5(x)) return g2(x);
    return g1(x);
}

Another Example

Here is the Ackermann function, compressed into conditional expressions:

int a(int i, int j)
{
    return i?a(i-1, j?a(i, j-1):1):j+1;
}

int ackerman(int i)
{
    return a(i, i);
}

Expanding the ?: expressions in a into if statements gives:

int a(int i, int j)
{
    if(i){
        if(j) return a(i-1, a(i, j-1));
        return a(i-1, 1);
    }
    return j+1;
}

Reordered to read from top to bottom, it becomes:

int a(int i, int j)
{
    if(i==0) return j+1;
    if(j==0) return a(i-1, 1);
    return a(i-1, a(i, j-1));
}
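The two expansions really are equivalent; renamed a1 and a2 so both can live in one file (the names are mine), they can be checked against each other and against known small values:

```c
/* The nested expansion, renamed a1. */
int a1(int i, int j)
{
    if(i){
        if(j) return a1(i-1, a1(i, j-1));
        return a1(i-1, 1);
    }
    return j+1;
}

/* The top-to-bottom expansion, renamed a2. */
int a2(int i, int j)
{
    if(i==0) return j+1;
    if(j==0) return a2(i-1, 1);
    return a2(i-1, a2(i, j-1));
}

int ackerman(int i)
{
    return a2(i, i);
}
```

For this recurrence a(2,j) works out to 2j+3 and a(3,j) to 2^(j+3)-3, so a2(2,3) is 9 and ackerman(3) is 61. Keep the arguments small: the values (and the recursion depth) explode.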

An Aesthetic Conundrum

I don't like side effects in conditionals (or embedded in most expressions, but let's just think about conditionals for now). Code almost always reads better if you separate the side effects from the tests. For example, I write

do
    c=getchar();
while(strchr(" \t\n", c));

rather than

while(strchr(" \t\n", c=getchar()));

Nevertheless, today (Sep 4, 1998) I found myself writing a function that read like this:

char *skip(char *s, int c)
{
    while(*s!='\0' && *s++!=c)
        ;
    return s;
}

It returns a pointer just past the first occurrence of c in s, or a pointer to the terminating '\0' if c doesn't occur.

I did this after examining several alternatives, like

char *skip(char *s, int c)
{
    while(*s!='\0' && *s!=c) s++;
    if(*s==c) s++;
    return s;
}

and

char *skip(char *s, int c)
{
    for(;*s!='\0';s++)
        if(*s==c) return s+1;
    return s;
}


I wound up using the first version in spite of the side effect in the test because it struck me as less infelicitous than the others -- that doesn't mean I'm happy with it.
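The three candidates do agree on ordinary inputs. Here they are renamed skip1, skip2, and skip3 (my names, in the order shown above) so a quick check can run them side by side. One subtlety worth knowing: for the pathological argument c=='\0', skip2 steps past the terminator where the others stop at it, so the check sticks to nonzero characters.

```c
/* The version with the side effect in the test. */
char *skip1(char *s, int c)
{
    while(*s!='\0' && *s++!=c)
        ;
    return s;
}

/* First alternative: separate the increment from the test. */
char *skip2(char *s, int c)
{
    while(*s!='\0' && *s!=c) s++;
    if(*s==c) s++;
    return s;
}

/* Second alternative: a for loop with an early return. */
char *skip3(char *s, int c)
{
    for(;*s!='\0';s++)
        if(*s==c) return s+1;
    return s;
}
```

All three return a pointer to "de" for the input ("abcde", 'c'), and a pointer to the terminating NUL when the character is absent.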

Another Problematic Fragment

This loop scans a file for an occurrence of the sequence /*% :

for(;;){
    c=getc(f);
    if(c==EOF){
        /* error code goes here */
    }
    if(c=='/'){
        if((c=getc(f))=='*' && (c=getc(f))=='%')
            break;
        ungetc(c, f);    /* might be another slash! */
    }
}

I didn't like the side effects in the if tests, so I tried rewriting it without them:

while((c=getc(f))!=EOF){
    if(c=='/'){
        c=getc(f);
        if(c=='*'){
            c=getc(f);
            if(c=='%'){
                /* code that reacts to finding /*% */
            }
            else ungetc(c, f);
        }
        else ungetc(c, f);
    }
}
/* error code goes here */

It's obviously written inside out -- the code reads horizontally, not vertically.

I deliberately avoided side effects in the if tests, but in a moment of weakness retained while((c=getc(f))!=EOF). I don't like that it behaves oddly if ungetc(EOF, f) does the wrong thing. The standard says that it has no effect -- a buggy stdio might push EOF&0xff back. Besides, ungetc is a silly thing to do with a character that's just going to reappear in c immediately.

So I tried a version with gotos:

for(;;){
    c=getc(f);
MightBeSlash:
    if(c==EOF){
        /* error code goes here */
    }
    if(c!='/') continue;
    c=getc(f);
    if(c!='*') goto MightBeSlash;
    c=getc(f);
    if(c!='%') goto MightBeSlash;
    break;
}

But surely this is no good. The gotos are fairly benign, but still grate. Most egregious is the loop whose last statement is break;!

I let this code sit for an hour (it was lunch time) and then rewrote it. I can't remember what I eventually used, but it was not very satisfying. Looking at the various versions today, the one that strikes me as best is the original, with all of its calls to getc inside the tests. (Of course, if I were writing this code professionally, I'd probably call fgets and strstr rather than paw through the file byte by byte.)
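A sketch of that professional version might look like the following. The function name is mine, and it leans on two simplifying assumptions the byte-by-byte versions don't need: the marker never spans a line boundary, and no line overflows the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Scan f line by line for the marker; returns 1 if found, 0 at EOF.
   Assumes the marker doesn't straddle a line or a buffer boundary. */
int findmarker(FILE *f)
{
    char buf[BUFSIZ];

    while(fgets(buf, sizeof buf, f) != NULL)
        if(strstr(buf, "/*%") != NULL)
            return 1;
    return 0;    /* error code goes here */
}

/* Writes a tiny file with the marker on its second line and scans it;
   returns findmarker's answer, or -1 if the temp file can't be made. */
int findmarker_demo(void)
{
    FILE *f = tmpfile();
    int r;

    if(f == NULL) return -1;
    fputs("int x;\n/*% a directive\n", f);
    rewind(f);
    r = findmarker(f);
    fclose(f);
    return r;
}
```

The line-at-a-time structure also puts the EOF handling in exactly one place, after the loop, which is where the top-to-bottom style wants it.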

What does that say about all my preaching? Only that truly general principles are hard to come by, and ultimately every case is a different case. Now that I have two examples of side effects inside conditionals that I think read better than the alternatives, maybe I'd better think harder about what principles are involved.

What About goto?

If you want to go somewhere,

goto is the best way to get there.

-Ken Thompson

How can I say that a program with two gotos is `a net improvement' over one with none? The problem with goto is that it promotes spaghetti -- when you're trying to understand a program, every time you see a label, you must look at the code at each goto that targets the label to see what the sequence of operations might be. This is the same problem that deep nesting causes -- code at any number of far-away places might immediately precede the code at a label, and it is often difficult to keep in mind all the things that might have occurred when you reach the label.

During the structured programming era, goto was a major bugaboo, so much so that most programmers today, with little or no experience thinking about code that uses goto, nevertheless recoil in revulsion at its use.

The whole structured programming debate was started by Edsger W. Dijkstra's essay Go To Statement Considered Harmful, Communications of the ACM, Vol. 11, No. 3, March 1968, pp. 147-148. One of the oft-cited bolsters of the argument to eschew the goto was that it appeared to be very difficult to describe goto in terms of C. A. R. Hoare's system of axiomatic semantics. Hoare's system involved specifying a predicate that gives the initial assumptions (pre-conditions) about a program. For each statement type there is a rule of inference that describes how to derive a post-condition from the statement's pre-condition. For example, let's think about the statement while(B) S, where B is an arbitrary Boolean (with no side effects!) and S is any statement (that doesn't transfer control out of the loop!). Hoare writes P{S}Q to mean that if proposition P is true before S is executed, then you can infer that Q is true afterwards. So, Hoare's rule for the while statement is:

if P and B {S} P

then P {while(B) S} P and not B

That is, if executing S when both P and B hold leaves P true, then the loop, entered with P true, terminates with P still true and B false.
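Here is a concrete instance of the rule, with names of my own choosing: take P to be `total is the sum of a[0..i-1] and 0 <= i <= n', and B to be i != n. The body re-establishes P, so on exit P holds and B doesn't, which is exactly what the caller needs:

```c
/* P: total == a[0] + ... + a[i-1] and 0 <= i <= n
   B: i != n */
int sumarray(int *a, int n)
{
    int i = 0, total = 0;   /* P holds: the empty sum is 0 */

    while(i != n){          /* P and B hold here */
        total += a[i];
        i++;                /* the body re-establishes P */
    }
    return total;           /* P and not B: i == n, so total sums all n elements */
}
```

Reading the comments top to bottom is precisely an application of the rule: P and B before the body implies P after it, therefore P and not B on exit.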

As I said, people were fond of pointing out that goto was ``obviously'' hard to formulate in terms of Hoare's semantics. So imagine everyone's surprise when Hoare discovered the following simple rule:

First, for each label L in the program, we attach a predicate P_L that is to be true whenever we reach L. Then, the goto rule is

P_L {goto L;} false

That is, P_L must hold before any goto L;, and since control never falls through a goto, any postcondition at all -- even false -- holds at the point just after it.

We also need a rule that says what happens at a label. This is just

if P_L {S} Q

then P_L {L: S} Q

For me, the Hoare semantics is a good way of thinking about goto: a goto is a way of getting to somewhere that you can do the same things as where you are now. The main practical consequences of this view are that I think very carefully about whether the points at the label and at the goto are really semantically equivalent, and that I try to choose labels that say what the corresponding predicate is. For example, in the loop in the previous section, the label MightBeSlash suggests that we've read a character into c, but have not yet checked whether or not it's a slash. (Of course, if you look at the code, you see that another possibility is that it might be an EOF, but I took the view that MightBeSlashOrEOF was an awfully long name for a label that's never mentioned more than 6 lines away from its definition. Maybe I should have called it ExamineNewCharacter or something like that.)