Isolating A Superbug

How could it possibly be that concatenating preamble and output would always work with one strategy but not another?



Let's recap what has happened so far:

Compiling my program would occasionally produce error messages that seemed unrelated to the program. Compiling exactly the same program again would usually work. I thought I had traced the problem to one phase of the compiler behaving inconsistently. However, when I tried to reproduce the problem, I learned that the real problem was that the previous phase was sometimes producing incorrect output.

Finally, I discovered that the problem wasn't in the compiler at all! Concatenating a copy of the compiler output to a copy of a preamble file that the compiler always put at the beginning:

cp preamble result
cat output >>result

would occasionally cause random characters in result to be changed to '0'. This misbehavior happened even though

cp nullfile result
cat output >>result

and

cat preamble output >result

both worked every time.
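An intermittent failure like this only shows up when each strategy is run many times and checked against a known-good result. The following is a minimal sketch of such a harness, not the procedure I actually used. The file names preamble, output, and result come from the commands above; the function names and the trial count are made up for illustration, and the reference copy is built with the single-cat strategy because that one never failed.

```shell
#!/bin/sh
# Hypothetical harness: function names and trial counts are illustrative
# assumptions, not part of the original investigation.

# Strategy that sometimes failed: copy the preamble, then append the output.
copy_then_append() {
    cp "$1" "$3" && cat "$2" >>"$3"
}

# Strategy that always worked: one cat with ordinary output redirection.
single_cat() {
    cat "$1" "$2" >"$3"
}

# Run a strategy repeatedly and print how many runs produced a result
# that differs from the reference concatenation.
count_mismatches() {
    strategy=$1 preamble=$2 output=$3 trials=$4
    dir=$(mktemp -d) || return 1
    cat "$preamble" "$output" >"$dir/expected"
    bad=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        "$strategy" "$preamble" "$output" "$dir/result"
        cmp -s "$dir/expected" "$dir/result" || bad=$((bad + 1))
        i=$((i + 1))
    done
    rm -rf "$dir"
    echo "$bad"
}
```

For example, count_mismatches copy_then_append preamble output 1000 prints the number of bad runs out of 1000. Passing an empty file as the preamble covers the nullfile case. On a correctly working system, every strategy should report 0.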

This last misbehavior was genuinely puzzling: How could it possibly be that concatenating preamble and output would always work with one strategy but not another? The version that worked used the cat command; the one that failed used both the cat and cp commands. Could something be wrong with the cp command? No, because the part of the output that the cp command copied was always correct; it was the part of the output copied by the cat command that was wrong.

So something in the cat command was misbehaving, and that misbehavior depended on the contents of the preamble file. I knew that an empty preamble file worked and the particular one I was using failed. What about other files? After some experimentation, I had a bunch of examples of files that worked and others that failed. What did the ones that failed have in common?

After staring at them for some time, and constructing files to test various hypotheses, I realized something important:

Every preamble file that caused a failure had an odd number of characters.

Moreover, even when the preamble had an odd number of characters,

cat preamble output >result

always worked. What could possibly cause this command to behave differently from

cp preamble result
cat output >>result

Of course! The second one appends to a file. And how does appending work? By opening the file, seeking to the end, and writing. If preamble has an odd number of characters, then that seek will go to an odd offset, whereas no seek happens at all in

cat preamble output >result

In other words, I had a new hypothesis: Seeking to an odd position in a file, then writing, occasionally caused spurious characters to appear.
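To make the seek explicit, the append can be emulated with dd, which can position its output before writing. This is only a sketch of the open, seek, write sequence described above: append_with_seek is a hypothetical helper name, and many current systems implement >> with an append mode in which the kernel positions each write, so the shell itself may never issue a seek.

```shell
#!/bin/sh
# Sketch: emulate "cat output >>result" with an explicit seek, using dd.
# append_with_seek is a hypothetical helper, not something from the story.
append_with_seek() {
    src=$1 dest=$2
    # Current size of the destination in bytes; the arithmetic expansion
    # strips any padding that wc may print around the number.
    offset=$(( $(wc -c <"$dest") ))
    # bs=1 makes seek= count in bytes; conv=notrunc keeps what is
    # already in the destination instead of truncating it.
    dd if="$src" of="$dest" bs=1 seek="$offset" conv=notrunc 2>/dev/null
    # Print the offset so the caller can see whether it was odd or even.
    echo "$offset"
}
```

With a three-character preamble already copied into result, append_with_seek output result writes starting at offset 3, which is exactly the odd-offset case the hypothesis singles out.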

This was an easy hypothesis to test: Write a program that creates a file, seeks to an odd position, and writes data to the file. Sure enough, once in a while, the file would pick up characters that didn't belong there. Finally, I had isolated this bug accurately enough that I could send it to the operating-system group.
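A test along those lines might look like the sketch below. It is not the program I wrote at the time; make_prefix and check_offset are hypothetical names, the payload is arbitrary, and the sketch assumes that dd honors seek= by repositioning its output before writing. Each trial creates a file of the given length, writes a payload at that offset, and compares the file with a copy built without any seeking.

```shell
#!/bin/sh
# Sketch of the seek-then-write test; all names and values are illustrative.

# Print a string of $1 'x' characters (stands in for a preamble).
make_prefix() {
    n=$1
    s=
    while [ "$n" -gt 0 ]; do
        s=${s}x
        n=$((n - 1))
    done
    printf '%s' "$s"
}

# Write a payload at byte offset $1, $2 times, and print how many trials
# left the file different from the expected contents.
check_offset() {
    offset=$1 trials=$2
    dir=$(mktemp -d) || return 1
    prefix=$(make_prefix "$offset")
    payload='The quick brown fox'
    # Expected contents, built by plain sequential writing (no seek).
    printf '%s%s' "$prefix" "$payload" >"$dir/expected"
    bad=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        # Create the file, then seek to the offset and write the payload.
        printf '%s' "$prefix" >"$dir/result"
        printf '%s' "$payload" |
            dd of="$dir/result" bs=1 seek="$offset" conv=notrunc 2>/dev/null
        cmp -s "$dir/expected" "$dir/result" || bad=$((bad + 1))
        i=$((i + 1))
    done
    rm -rf "$dir"
    echo "$bad"
}
```

Running check_offset 7 1000 and check_offset 8 1000 side by side compares an odd offset with an even one. A healthy system should print 0 for both; the hypothesis predicts an occasional nonzero count only for the odd offset on the affected system.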

This story contains several lessons that are not obvious at first glance:

You don't always need the source code for a program to isolate bugs in it.

Bugs in one program sometimes behave in ways that lead you to think they're in another program entirely.

If something is happening that appears to be impossible, what's wrong is your understanding of what is happening. So you need to correct your understanding.

Every time you can rule out part of a program as the source of your bug, you're that much closer to finding it.

Next week, I'll reveal how it is possible for an operating-system bug to cause such a bewildering symptom.