Pilot fish is working for a client at the same time that a consulting firm is developing a new system for the support team. And fish is nearby when the job fails during one of the early production runs. The support staff looks it over, but is quickly lost, so fish is called on for help.

Fish finds that the system consists of nine programs that run in sequence within a single shell script. The final step had failed, resulting in a non-zero return code.

So fish reviews Program 9’s output, and finds that the file from Program 8 was missing. He and the support team lead then start looking through Program 8 and its log to determine why it didn’t produce the file, only to discover that the file from Program 7 was missing too. They repeat the exercise for Programs 6, 5, 4 and 3, and finally work back to Program 2, where the chain of failures began. The real problem: There was no error checking between the programs, so the script kept running every remaining step after the first one failed.

Fish explains that error checking at each stage would have saved hours: first in runtime, since the problem would have become known several steps before the job’s completion, and second in research time, because no one would have had to trace the failure back through every downstream program.
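
For readers who want to see what fish is describing, here is a minimal sketch of per-stage checking, assuming the script is written in bash. The names `run_stage`, `run_pipeline`, `program1` through `program3` and the `out1` through `out3` files are hypothetical stand-ins, not the client's actual programs, and three stages stand in for the nine. Each stage is checked for a non-zero return code and for a missing or empty output file before the next one starts.

```shell
#!/usr/bin/env bash
# Sketch only: all stage names, paths and return codes here are hypothetical.

workdir="$(mktemp -d)"   # scratch directory for the stand-in output files

# Stand-ins for the real programs.
program1() { echo "data" > "${workdir}/out1"; }
program2() { return 3; }                                  # simulated failure, writes no file
program3() { cat "${workdir}/out2" > "${workdir}/out3"; }

# run_stage NAME OUTFILE COMMAND [ARGS...]
# Runs COMMAND, then fails if it returned non-zero or left no usable OUTFILE.
run_stage() {
    local name="$1" outfile="$2" rc
    shift 2
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "${name}: exited with return code ${rc}" >&2
        return "$rc"
    fi
    if [ ! -s "$outfile" ]; then
        echo "${name}: expected output ${outfile} is missing or empty" >&2
        return 1
    fi
}

# Run the stages in order and stop at the first one that fails.
run_pipeline() {
    run_stage program1 "${workdir}/out1" program1 || return $?
    run_stage program2 "${workdir}/out2" program2 || return $?
    run_stage program3 "${workdir}/out3" program3 || return $?
}
```

Calling `run_pipeline` in this sketch stops at `program2`, reports its return code on standard error, and never starts `program3`, so the first error message points at the stage where things went wrong. Bash's `set -e` option is a blunter alternative that stops a script on many command failures, though it has exceptions and does not check for missing output files.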

Feed the Shark! Send me your true tales of IT life at sharky@computerworld.com.