If I were specifying a word count program in natural language, I might think of this series of steps:

1. Read standard input (fd 0) into some kind of input buffer.
2. Iterate over that buffer, checking for breaks (a.k.a. whitespace) between words.
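As a rough illustration of those two steps, here is a minimal sketch in Python (chosen only for brevity; it is not the scsh version this post works toward). The name `count_words` and its `stream` parameter are made up for this example.

```python
import io
import sys

def count_words(stream=sys.stdin):
    """Count whitespace-separated words in a text stream."""
    buffer = stream.read()       # step 1: read everything into a buffer
    count = 0
    in_word = False
    for ch in buffer:            # step 2: walk the buffer looking for breaks
        if ch.isspace():
            in_word = False      # a break ends the current word, if any
        elif not in_word:
            in_word = True       # first character of a new word
            count += 1
    return count

# Demo on an in-memory stream; call count_words() with no argument
# to read from standard input instead.
print(count_words(io.StringIO("two words\n")))  # prints 2
```

Tracking an "in a word" flag, rather than counting whitespace characters, means that runs of several spaces or newlines still count as a single break.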

A string like the following might prove tricky enough (feel free to replace the reference to “Satan” with your preferred $EVIL_DEITY):

Every good dog goes to heaven – and if not? – well, I hear

Satan has de-

licious bones to chew!

By inspection, the sentence above has eighteen words. (The implementation of wc in scsh described here eventually says 19, and GNU’s wc says 21. Both are incorrect, but more on that below.) There are several edge cases to note in this sentence:

- A space at the beginning of the sentence.
- A clause inside em-dashes.
- A hyphenated word occurring at the end of a line.
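To see where a count of 21 can come from, here is a small Python check that splits the sample on whitespace alone. That is an assumption about what a purely whitespace-based counter does, not a description of GNU wc's internals. The leading space in `SAMPLE` is restored by hand to match the first edge case.

```python
# The sample text. The leading space is an assumption, restored to match
# the "space at the beginning of the sentence" edge case.
SAMPLE = (
    " Every good dog goes to heaven – and if not? – well, I hear\n"
    "Satan has de-\n"
    "licious bones to chew!"
)

tokens = SAMPLE.split()   # split on runs of whitespace, nothing else
print(len(tokens))        # 21

# The three extra "words" relative to the eighteen real ones:
print(tokens.count("–"))        # 2: each free-standing dash is a token
print(tokens[16], tokens[17])   # de- licious: one word counted twice
```

So the two dashes account for two of the extra words, and the word broken across the line end accounts for the third.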

Some things we should probably do, given the above:

1. Ignore spaces at the beginnings of lines.
2. Ignore punctuation in general, such as periods, exclamation points…
3. …except hyphens. Hyphens join two or more words into one. This should work across newlines; that would take care of the “hyphenated word at line’s end” case.
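One way to make these three rules concrete is the following Python sketch. It is only an executable restatement of the list, under my own assumptions: a word is a run of letters or digits with optional internal hyphens, and a hyphen directly before a newline glues the two halves together. The name `count_words` is hypothetical, and the leading space in `SAMPLE` is restored by hand.

```python
import re

SAMPLE = (
    " Every good dog goes to heaven – and if not? – well, I hear\n"
    "Satan has de-\n"
    "licious bones to chew!"
)

def count_words(text):
    # Rule 3: a hyphen at the end of a line joins the two halves of a word.
    joined = re.sub(r"-\n\s*", "", text)
    # Rules 1 and 2: whitespace and punctuation never match \w, so they are
    # skipped; an internal hyphen keeps "well-known" as a single word.
    # Apostrophes get no special treatment in this sketch.
    return len(re.findall(r"\w+(?:-\w+)*", joined))

print(count_words(SAMPLE))  # 18
```

The free-standing dashes vanish because they contain no word characters, and "de-" plus "licious" collapse into one word, which brings the count back to eighteen.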

Of the three items above, the last sounds the trickiest. I think we need to take care of the hyphen case eventually, but for now let’s punt on it and worry about getting something basic working. Starting at a high level, here’s my take on the program flow: