Ed is an EE, PE, and author in Poughkeepsie, New York. You can contact him at [email protected]

"...I had grasped the significance of the silence of the dog, for one true inference invariably suggests others."

--Arthur Conan Doyle, The Memoirs of Sherlock Holmes: Silver Blaze

The traffic signal at a busy intersection between our home and the mall was flashing red in both directions and, as we approached the traffic snarl, I commented, "Looks like the opposing green detector caught a bug."

Exactly on cue, a small voice from the back seat asked, "Dad, what's an opposing green detector?"

I explained that if two signal lamps ever allow vehicles to enter the intersection from opposing directions, a monitoring circuit inside the controller begins flashing red lights in all directions. That condition must be cleared manually after a human figures out what went wrong.

"Oh, I get it," said the small voice. She's accustomed to our discussions of computer software and hardware problems, although we think that, for at least a while, she really did believe a "computer bug" had six or eight legs.

The opposing-traffic detector, as it's properly called, must be completely separate from the signal controller because, by definition, the controller should never allow opposing traffic into the intersection. If opposing greens or yellows light up, the signal controller has failed and cannot be allowed a vote on its own sanity.

None other than Alan Turing proved that a Turing Machine (and, thus, all the computers we're interested in these days) cannot decide whether a given program will halt or run continuously. By extension, no program can detect all errors or run-time faults in another program.

Turing's results apply to classical computing, where a program always produces the same result given the same inputs. Embedded systems live in a different universe where analog noise corrupts inputs, digital outputs get disconnected, and even the hardware fails unpredictably.

Sometimes the problem lies not in determining that a system has halted, but in figuring out when the system has gone nuts. Sound familiar? Let's see how it works.

Good Code Gone Bad

As you can tell by the ads in any magazine catering to programmers, there's money to be made in finding and preventing programming errors. Much of this activity centers on the C programming language, its lineal descendants, and its mutant offspring, because they allow truly high-risk programming behavior and they're the predominant languages in use today.

Even when written with the best of intentions, a small C-oid semantic slip can wreak amazingly complex, often long delayed, damage without triggering any syntactic alerts. Piling additional layers of three- and four-letter acronyms atop the base code can produce working programs faster while simultaneously obfuscating the system's logic and providing a rich compost that nurtures those "it fell in the cracks" errors. Trying to figure out which set of acronyms contributed to a given error reveals just how bad things have become, even without partial template specialization.

Languages developed to avoid C's shortcomings suffer from many of the same compost-in-the-cracks issues. We're finding that the complexity of our systems trumps our ability to hide that complexity in layers of Other People's Code, as errors can occur anywhere in the pile.

As a rule of thumb, though, such errors produce incorrect results rather than complete system failures. Most software testing, therefore, revolves around verifying program logic and establishing test coverage, with the underlying assumption that the language and hardware will perform as described by their specifications.

Pop Quiz: How often have you been sure you found a hardware problem? Essay: How did you feel after the hardware folks got done proving you wrong?

The success of stack-smashing attacks, where a deliberate buffer overrun plunks executable code or a bogus return address on the stack, shows how brittle those testing assumptions can be. At least until recent years, apps programmers never thought to verify a function's return address.
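The canary idea behind the defenses against those attacks can be shown in miniature. Here's a sketch (the struct, names, and magic value are all invented for illustration) that plants a known value just past a buffer and checks whether an unchecked copy trampled it:

```c
#include <string.h>
#include <stdint.h>

#define CANARY 0xDEADBEEFu

/* A stand-in for a stack frame: a buffer with a canary living just past it. */
struct guarded {
    char buf[8];
    uint32_t canary;
};

/* Plant the canary, do a deliberately unchecked copy starting at buf
 * (buf is the first member), then report whether the canary survived.
 * Returns 1 if intact, 0 if the copy smashed it. */
static int copy_and_check(struct guarded *g, const char *src, size_t len) {
    g->canary = CANARY;
    memcpy(g, src, len);
    return g->canary == CANARY;
}
```

StackGuard applies the same trick to the neighborhood of the saved return address; the struct here merely stands in for a frame so the overrun stays within defined behavior.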

Similarly, while one Embedded Systems Conference speaker observed that nobody checks printf()'s return value, I'm certain even fewer programmers expect that value to be occasionally wrong.

What can you do when you can't trust anything? That's a situation embedded programmers have faced for decades, and some of their techniques are making it back to the mainstream.

Running in Place

A traffic-signal controller has both a straightforward job and easily decoded output values, which make an external sanity-checking circuit practical. That's not generally the case, so the only way to figure out if a system has produced an incorrect answer is to compare it with a known-good system. Because we lack a "known-good" system (if we had one, we'd use it instead of the "maybe-good" system), we must compare the outputs of three systems and assume any disagreement marks a failure in the odd system.
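In code, that three-way comparison reduces to a bitwise majority vote. A minimal sketch (the function names are mine, not from any real voting circuit):

```c
#include <stdint.h>

/* Each output bit is set when at least two of the three inputs agree. */
static uint16_t vote(uint16_t a, uint16_t b, uint16_t c) {
    return (a & b) | (a & c) | (b & c);
}

/* Identify the dissenter: returns 0, 1, or 2 for the odd system out,
 * or -1 when all three agree. Assumes at most one unit has failed. */
static int odd_one_out(uint16_t a, uint16_t b, uint16_t c) {
    uint16_t v = vote(a, b, c);
    if (a == b && b == c) return -1;
    if (a != v) return 0;
    if (b != v) return 1;
    return 2;
}
```

Hardware voters do exactly this, one output bit at a time, which is why triple redundancy costs three systems plus the voter.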

That's fine for life-support equipment where you can justify the cost of three systems doing the work of one, but many embedded projects have trouble justifying an additional 8 KB of internal program space in a single-chip microcontroller. We must then resort to various heuristics, a fancy word meaning "it doesn't work quite as well, but it's a whole bunch cheaper."

If you've witnessed the cursor suddenly stop moving on your desktop GUI of choice, you've experienced an effective crash-detection heuristic: When the outputs don't change, the system has died. Unfortunately, the actual video signal changes constantly even when the image on the display has stalled, so detecting a frozen cursor in a stream of pixels requires relatively complex logic.

Embedded applications absorb real-world inputs and emit real-world outputs, often in the form of an infinite loop that reads input ports, performs a computation, and writes the results to some output ports. Remember that 3/4 of the microprocessors sold for embedded applications have a 4- or 8-bit data path with a few KB of code and you'll realize why the complexities of multitasking, real-time operating systems lack relevance: small processors run one big loop with few frills.

The simplest defense against a lockup involves toggling a single output bit each time the program goes around the main loop, then feeding that bit to a watchdog timer on the circuit board. In Silver Blaze, the dog stayed silent because it knew the visitor; an electronic watchdog is likewise silent as long as it receives a steady series of pulses from another circuit.

When those pulses cease, the watchdog circuit restarts the system in the hopes that things will run correctly the next time. The underlying heuristic, that a steady series of pulses means the program is executing correctly, can be wrong, but in general, the cessation of pulses from the main loop is a Bad Sign.
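In code, the whole defense amounts to one toggle at the bottom of the loop. Here's a sketch with the hardware reduced to a plain variable (wdog_pin stands in for a real output port bit; no actual register names are implied):

```c
#include <stdint.h>

static uint8_t wdog_pin;   /* stands in for one dedicated output port bit */

/* One pass of the classic embedded loop: read, compute, write, kick. */
static void main_loop_pass(void) {
    /* read input ports ... */
    /* perform the computation ... */
    /* write output ports ... */
    wdog_pin ^= 1;   /* the edge proves we made it all the way around */
}
```

On real hardware, the external watchdog times the interval between edges on that pin and yanks the reset line when the interval runs too long.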

Quite often one of the chip's normal outputs will change state fairly regularly, in which case you don't need a dedicated watchdog bit. We're talking about systems designed with such severe hardware cost constraints that not dedicating a bit to an overhead function can be a make-or-break cost consideration.

What can go wrong with a simple system like that? Plenty! There's a lot more than just a bit, a chip, and a few lines of code, to the extent that I refer to the whole process as "shaking the dice to see if a better combination comes up."

Good Hardware Gone Bad

Embedded programmers must develop a deep distrust of the machinery, a healthy attitude for systems where a "software crash" can produce an audible and sometimes expensive crunch. The trick is to prevent a crash in the first place, minimize the consequences when one does happen, and get the system operating again as quickly as possible, all without human intervention.

Good (that is, paranoid) design and coding practices can go a long way toward the first two objectives. Achieving the third requires external hardware and periodic reality checks.

For example, you might think that you could use a timer-tick interrupt as a watchdog to save the cost of an external circuit. The interrupt handler checks to see if the main-line code is making progress and, if it isn't, jumps to the CPU's power-on reset vector.

That's better than nothing, but mimicking a power-on reset doesn't perform all the startup functions handled by the actual hardware during a real reset. I've been suckered into attempting that and can tell you right up front it's just not possible.

Worse, what happens when timer interrupts mysteriously become disabled? The main-line code probably keeps chugging along while the program as a whole stops functioning. Why would interrupts become disabled? That's why you need an external watchdog: you don't know!
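For the record, the software-only scheme looks something like this (all names invented, and the jump to the reset vector reduced to a flag); its Achilles' heel is that the ISR stops running right along with everything else once interrupts are disabled:

```c
#include <stdint.h>

static volatile uint16_t loop_count;   /* bumped by the main loop */
static uint16_t last_seen;             /* the ISR's private copy */
static int reset_requested;            /* stands in for vectoring to reset */

/* One pass of the main loop: do the work, then show signs of life. */
static void main_loop_pass(void) {
    /* ... real work ... */
    loop_count++;
}

/* Called from the timer-tick interrupt. If the count hasn't moved since
 * the last tick, assume the main line is wedged and force a restart. */
static void timer_tick_isr(void) {
    if (loop_count == last_seen)
        reset_requested = 1;   /* a real system would jump to reset here */
    last_seen = loop_count;
}
```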

Incidentally, watchdogs must reset the system rather than generate an interrupt. Because you can't tell why the system stopped running, it's best to restart it from scratch, rather than mess around with unjustified assumptions.

Verily: Hell hath no fury like that of an unjustified assumption.

Shortly after transistors stopped being really expensive, microcontrollers began including on-chip watchdog timers to reduce overall system costs. Because those transistors weren't really cheap yet, the first watchdogs used the microcontroller's crystal oscillator to measure time. That worked great, right up to the point where the oscillator stopped working, at which point both the processor and the watchdog froze. Whoops! Recent microcontrollers generally include a dedicated watchdog oscillator.

Although resetting a CPU might not restart the oscillator, the watchdog can also send a failure signal to an external device, light a front-panel indicator, or flip a traffic controller's lamps into all-red flashing mode. Of course, an on-chip watchdog needs a pin to get the failure signal off the chip, which negates most of the justification for squeezing everything onto a single chip in the first place.

Scary enough?

Failing in Place

Whether failures arise due to intermittent hardware glitches or software errors, shaking the dice generally does come up with a better combination. You can get reasonable performance from a system that occasionally starts afresh with a clean reset, at least when problems don't occur too often.

However, even a clean reset can cause problems. All programs perform one-time housekeeping chores after a hard reset, and embedded systems may go through extensive hardware checks before starting normal operations. Early watchdogs had a fixed timeout interval, generally chosen to be the longest time a system could remain jammed, which could be too short for the startup code.

Modern watchdogs have either a selectable rate that allows a protracted startup or remain idle until they see a signal transition. As long as the startup code pulses the watchdog output before turning control over to the main program or operating system, either will work quite well.

All parts of the code must be aware of the watchdog timer and ensure that it's always receiving the attention it demands, which can pose a real problem during hardware testing. For example, what happens when the code stops at a debugging breakpoint? Yup, the system resets shortly thereafter.

Pop Quiz: Would you trust a system with a watchdog timer that could be disabled during operation? Essay: Discuss the virtues and faults of old-fashioned hardware jumpers.

Some systems cannot tolerate the response gaps caused by sporadic resets. In those situations, you must take other measures to ensure continued service, measures that don't include simply omitting the watchdog. Remember, the watchdog does not cause the glitches; it's a workaround we use because we can't (or won't) fix them!

A watchdog timer can simplify your error-handling code if used properly. When you detect a failure, write some tracking information into nonvolatile memory, and enter a spin loop that doesn't send pulses to the watchdog (pronounced "falling on the ball"). After the system resets, your power-on diagnostics should verify that all the hardware is functional and, if possible, report the error to a responsible party.
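A hedged sketch of that sequence, with nvram[] standing in for whatever nonvolatile storage the board provides (the magic number and error codes are invented placeholders):

```c
#include <stdint.h>

#define NVRAM_MAGIC 0xDEADu

static uint16_t nvram[4];   /* stands in for battery-backed RAM or flash */

/* Stash enough context for the power-on diagnostics to report later. */
static void record_failure(uint16_t code, uint16_t detail) {
    nvram[0] = NVRAM_MAGIC;  /* marks the entry as valid after the reset */
    nvram[1] = code;
    nvram[2] = detail;
}

/* "Falling on the ball": log the error, then stop kicking the watchdog
 * so the external hardware forces a clean reset. */
static void fall_on_the_ball(uint16_t code, uint16_t detail) {
    record_failure(code, detail);
    for (;;)                 /* no more pulses: the dog bites shortly */
        ;
}
```

After the reset, the startup code checks nvram[0] for the magic value, reports the stored code and detail to whoever is listening, then clears the entry.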

In any event, make sure that your code knows what to do after finding an error. Simply falling on the ball won't solve the long-term reliability problems we face, even if it can prevent the system from causing further harm.

So, if you're building embedded code, think about what might make the dog bark in the night. Your product will be the better for it.

Contact Release

Long ago in a universe far away, one of my EE professors assigned our class the problem of detecting when a signal controller permitted opposing-traffic flow. We had to consider green and yellow lamps, turn-lane arrows, pedestrian controls, and so forth. I recall that we did okay, but the design was a wonder to behold.

Find more about traffic intersection design at http://www.urbandalelibrary.org/Urban_Design_Standards/chapter13.pdf. The New York State specification for traffic-signal controllers lives at http://www.dot.state.ny.us/traffic/files/dotmes.pdf. Start at page III.1.11 for the opposing-traffic detection logic.

The Halting Problem is described in more detail at http://www.wikipedia.org/wiki/Halting_problem, with a lighter view at http://www.netfunny.com/rhf/jokes/new89/halting.760.html.

Maxim/Dallas offers a line of hardware watchdog circuits at http://www.maxim-ic.com/MaximProducts/uPSupervisors/WatchdogTimers.htm. If you think watchdogs aren't enough fun, check out the Grenade Timer at http://www-lce.eng.cam.ac.uk/publications/files/tr.2000.8.pdf.

I have the highest regard for the folks at Gimpel Software and have used PC-Lint on many occasions. Just reading their errata notices, though, gives you an image of late nights and serious caffeine dependency. See http://www.gimpel.com/html/pub80/bugfix80 for reasons why (I think) C and its ilk have gotten too complex for words.

Review a classic mechanism to detect stack smashing at http://www.cse.ogi.edu/DISC/projects/immunix/StackGuard/usenixsc98_htm/paper.html. That it's both classic and we still have stack problems says something sad, doesn't it?

The June 2003 Scientific American covers Self-Repairing Computers at http://www.sciam.com, with a list of web-server failure statistics. Everything old seems to be new again, indeed.

The history of NASA's computer use up through the mid-1980s is at http://www.hq.nasa.gov/office/pao/History/computers/contents.html. Take note of the discussion of the Shuttle's avionics in Chapter Four.

Alert reader Rene Tschaggelar reports that I overstated the death rates from cancer and heart disease by a factor of two in the June column. It's actually worse than that: I misinterpreted the date range in the CDC text files as being from 1999 up to 2000, but it actually includes both 1999 and 2000. If you divide everything by two, you'll come out about right and the numbers remain staggeringly high even in the face of a classic iteration-counting error. I have been reset and restarted.

DDJ