</scorpion>: FAIL

Anyone out there see the TV pilot of Scorpion? Genius hacker squad meets Homeland Security in a fast-paced thriller to save hundreds of airplanes from crashing after an LAX air traffic control software upgrade fails and nobody saved a backup of the old version (ZOMG!!!), so thousands of people are going to die because the planes… well, they just can’t land! They just can’t. Even if the weather is sunny and calm, and there could quite possibly be some good old-fashioned radios or other nearby air traffic control centers to help guide them down. Can’t land. Not unless the genius hacker squad can find a copy of the old software. So many plot holes, so little time…

Anyway, despite all the technical inaccuracies, I was somewhat entertained. The hackers would try one solution and it would almost work… except they made a mistake, and their approach failed, so they had to think of something else. Then they’d try that and it would almost work, but success would elude their grasp… lather, rinse, repeat. This part resonated with my experiences as an engineer, especially recently.

Mistakes and Round-Trip Delay

I work with my manager and two other engineers in Arizona, and two engineers in Romania. For seven months out of the year, we have a ten-hour time zone difference. This makes for some interesting issues when we work on high-priority projects together. We have a narrow window of real-time communication, an hour or two at the most, and that’s only because when we do work closely together, we get started at 7:00am or 7:30am (5:00pm or 5:30pm their time) and they work until 6:30pm or 7:00pm (8:30am or 9:00am our time). That doesn’t allow for a lot of iteration. Typically it’s just enough for us to exchange critical information, come up with any new strategies, and say Good Evening / Good Day.

So our progress becomes discretized. If we can do work independently, it’s potentially beneficial: they can put a day’s work in, tell us what happened, and then we put a day’s work in. But if we get stuck and have to wait for each other, progress stops until the next day. Recently I’ve been working on some technology transfer with them: we have some complex test scripts to get things done, and last week these were ready (we thought) to send to Romania so they could reproduce our work and investigate some other things. At first I didn’t hear anything back, except that one of them was working on it, so everything (we thought) was fine.

Then on Monday I got an email message saying there were some errors. I don’t wake up early every day, and Monday was one of those days: I didn’t get into the office until 8:00am. So I scrambled to figure out what was going on, and asked some questions while the guys from Romania were still around. I got a little information before they left, and later I wrote them some suggestions. I set up a conference call for Tuesday at 7:00am.

The next day we worked intensely and discovered a few problems and found solutions for all of them, and things seemed to work well.

On Wednesday I woke up early, but forgot to check my email inbox until about 7:30am, and there was a message saying they ran into some new problems. I managed to start a conference call with them by 8:00am and we got things working by 9:00am.

As I’m writing this now, it’s Friday and their setup is mostly working, but we’ve still been helping them with minor glitches. What should have taken a couple of hours if they were on-site has taken almost two weeks, because (a) we’ve each made mistakes, and (b) we have this nasty round-trip delay where, unless we happen to be talking to them during that narrow window in our early morning and their early evening, it takes 24 hours to complete one cycle of communication.

Everybdoy Everybody Makes Mistakes

The human brain is notoriously unreliable. We forget things. We make unreasonable assumptions. We make arithmetic errors. We are overconfident. We underestimate risks. We mis-prioritize. And so on.

But that’s okay, for the most part. Because when we make errors we usually catch them soon afterwards, and we learn from them. Sometimes we catch the errors so quickly we don’t even notice them. So instead of obsessing over being 100% correct, we just deal with errors as a normal occurrence. Life is fairly fault-tolerant.

Except when it’s not. In the situation with our Romanian colleagues, the effects of errors were magnified by the round-trip delay. In other cases, the consequences of errors can be very high, especially if they involve nuclear power or military actions or large-scale financial decisions. And in these kinds of situations we have to be more careful.

Expect Errors: Systematic Approaches for Dealing With Failure

I talked to a few acquaintances who have served in the U.S. Armed Forces, figuring the Army or Navy knew how to deal with keeping errors from spiraling out of control. And they told me a few things that were helpful.

One thing that everyone mentioned was the buddy system: instead of working alone, servicemen generally work in pairs, which has the advantage of catching and correcting errors more quickly.

Aside from people failing, the military also expects components to fail, so they study and understand failure rates and make heavy use of redundancy.

The military has a practice of holding after-action reviews to analyze situations and figure out how to prevent failures from reoccurring and improve their operations.

They also have a way of systemizing just about everything through standard procedures.

Preventing Mstiakes Mistakes

So what should we be doing? All of these methods are effective in institutional settings, where procedures dominate and day-to-day business is essentially the same. In a research and development setting, every day is different. And the extra costs of redundancy are hard to justify in businesses where the goal is to get something to market first. So what works for the U.S. Army won’t necessarily work for engineering startups, or even for a well-established engineering company.

Lessons from Reliability Engineering

We can learn some additional lessons from the techniques used to assess reliability. One of these techniques is the Failure Mode and Effects Analysis (FMEA). I’m not going to give a full explanation of an FMEA, but one of the tasks involves enumerating foreseeable failure modes and assessing them in three ways: Probability, Severity, and Detectability. Failures that are common, have severe consequences, and are hard to detect are the most urgent indicators that we have a design problem. Failures that are extremely rare, have mild consequences, and are easy to detect are usually not a concern. These represent three (mostly) independent axes that contribute to the risk assessment of a particular failure mode.

The FMEA is a well-established tool for improving reliability of product design. We can learn from the FMEA methodology and apply it to engineering processes as well.
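In classical FMEA practice, the three ratings are often multiplied together into a single Risk Priority Number (RPN) to rank failure modes. Here’s a minimal sketch of the idea in Python; the 1-10 rating scale is conventional, and the failure modes themselves are made up purely for illustration:

```python
# Illustrative FMEA-style risk ranking: each axis is rated 1-10
# (probability of occurrence, severity, detectability -- where a HIGHER
# detectability rating means the failure is HARDER to detect), and the
# ratings multiply into a Risk Priority Number (RPN).
failure_modes = [
    # (description,                         prob, sev, det)
    ("script fails silently on bad input",     6,   7,   9),
    ("typo in config crashes at startup",      4,   5,   2),
    ("rare race condition corrupts data",      2,   9,   8),
]

def rpn(prob, sev, det):
    """Risk Priority Number: product of the three 1-10 ratings."""
    return prob * sev * det

# Highest RPN first: the silent, common, severe failure tops the list.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
for desc, p, s, d in ranked:
    print(f"RPN {rpn(p, s, d):4d}  {desc}")
```

Note how the silent failure outranks the crash-at-startup one even though neither is the most probable: a crash is obnoxious but easy to detect, which is exactly the distinction the detectability axis captures.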

Probability — If a particular type of mistake is common, that’s a liability. I talked in an earlier article about garden rakes:

A little while ago, I wrote about what I call the “garden rakes” syndrome in software, where there are little bugs or pitfalls lying around like sloppy garden rakes that no one has put away, and when you use these software programs, instead of zooming around getting things done, you’re either tripping over the garden rakes or carefully trying to avoid them. Either way, you lose focus on what you’re really trying to work on, and that causes a big hit in productivity.

Fix the garden rakes, especially if they are easy fixes! Even though they might not be particularly bad problems, they erode your ability to be reliable, and your colleagues may lose confidence in a solution you propose and reject it.

Severity — Think about the consequences of your actions, and anticipate failures. If something really bad can happen, even if it’s unlikely, make sure your colleagues are aware of it. You may need to come up with a plan in advance for dealing with the consequences.

Detectability — Silent problems are not good, because they tend to compound. Reliability theory says that if failures are detectable, then the likelihood of more than one happening at a time is remote, whereas if they are undetectable, then they can accumulate until they combine to form a more serious outcome. Again, make sure your colleagues are aware of things they need to know, so that problems can be addressed immediately. One good practice in software design is the creation of log files. There are many good logging libraries out there. Use them — if your software creates a step-by-step record of what it’s doing, this is immensely valuable in making any problems easier to diagnose. Similarly, you can benefit from keeping accurate records in a notebook of what you’re doing, so you can pinpoint specific problems after the fact.

High Reliability Organizations

A few years ago, I posted a list of suggestions for Organizational Reliability. Now’s the time to pull out the list again and emphasize some of those items, as well as add some new ones.

Treat time as your most precious resource. Do not squander others’ time. This is a big one. Think about leverage here. When you have a choice of taking ten minutes extra to test or document something, think about the cost to others if you don’t take that ten minutes. If you’re working with 10 other people, maybe it will take them 2 minutes each to work around a problem, because it’s an easy problem to recognize. Or maybe it will take them an hour each to work around the problem, because they’re unfamiliar with the issue. Either way, the aggregate cost of their time exceeds the amount of time it would have taken you to document the issue.

When you engage in a real-time meeting with others, that’s when things become really precious. It may be difficult to schedule all the people in the room together. Or you may be dealing with a time-zone constraint, where there’s only a limited window of opportunity. If you have a 2-hour window (120 minutes) and you’re five minutes late, you’ve just wasted 4% of that time. So be punctual. Or if you need to make an audiovisual presentation, make sure everything is ready so you can start on time.

Finally, when it’s your turn to present information, be concise. Maybe you’ve just spent a week learning how to complete a task. If you’re presenting that information, don’t give your colleagues a full brain dump. What they need is an awareness of a few highlights and the key concepts involved. Take the time to create a short and clear summary. If they need to know more details, they’ll ask.

Be clear and explicit. (Paraphrased from a number of items on the list.) When you are presenting information, be very clear about it. Don’t just say “the software didn’t work.” Take the time to mention exactly which software you were using, and what details didn’t work. If your software has a log file, send that log file along with your bug report. If you have an issue tracking system, like JIRA or Redmine or Bugzilla, use it to create a detailed report. If you don’t have an issue tracking system, make it a priority to get one.

If you are reporting details of what did work, be very clear about the procedures you are using, so that others can repeat them. Test scripts are more helpful than interactive work, since you can run a test script repeatedly to get the same results. And it’s almost always useful to gather data in a graphical form that lets your team visualize what is really going on. Saying “the controller wasn’t very stable” isn’t as helpful as recording a graph of control system inputs and outputs, and showing an underdamped system response.

All of this takes time! So plan ahead and allocate time to record your work and be clear about it. Quick actions to try something can be helpful at detecting problems quickly, but a single anecdote isn’t really enough evidence to change your team’s approach.

Avoid making assumptions. This is the corollary to being clear and explicit. On the receiving end, do not make assumptions about the information you get. If any of the details are vague, don’t assume you know what the sender meant. Ask clarifying questions.

Don’t rely on email to maintain important data. Email is a great notification tool. We can write something up and bam! a minute later it gets sent to 73 of our closest colleagues. But email is not a good way to retain important information. Many organizations have legal requirements on their email retention policy to delete information by default after some period of time. Today’s email software still doesn’t really provide very good organization and retrieval features. If you’re sending out an important report by email, think of what you’re imposing on the recipients, especially if it’s followed up by discussion over email. Do you really expect them to maintain copies of all of this information, and be able to find it again? It’s much better to publish the report, in a place internal to your organization, where it can be easily located a year from now. Then use email to let people know about it — but expect that this email message will be deleted. If you’re not writing a report, but instead are working on diagnosing a problem, open an issue in the issue tracking system and keep your discussions there instead of email.

The original copy of the Organizational Reliability list mentioned high-reliability organizations, in lowercase. In researching this article, I found that the High Reliability Organization (HRO) is actually a well-established concept. (It even has its own website.) I also found an article from a hospital on the subject, which talks in depth about the five critical traits of an HRO and how to encourage institutional participation:

Preoccupation with failure

Reluctance to simplify interpretations

Sensitivity to operations

Commitment to resilience

Deference to expertise

While this is really targeted at critical operations like aviation, nuclear power, and hospitals, I think as engineers we can learn these concepts and improve our reliability by applying some of them to our ongoing activities.

Mistakes and Embedded Systems Design

There are some specific areas that are relevant to us as embedded systems designers or programmers.

Learn from your mistakes. One of the things I learned fairly early in my career is that certain component packages lend themselves to design or assembly mistakes. For example, resistors and unpolarized capacitors can be inserted in either direction and it doesn’t matter. But two-terminal diodes and electrolytic capacitors have a polarity. These are prone to assembly errors, so both circuit boards and components need to have clear indicators of the correct orientation.

On the other hand, pay particular attention to three-terminal transistor packages like the SOT-23. These are not difficult to assemble correctly, but they are prone to being designed incorrectly when you are translating from datasheets that use lettered pins (EBC or SGD) rather than numbers. There appears to be a standard pinout for the SOT-23 (pin 3 stands alone; from the top view, the pins are numbered 1, 2, 3 in counterclockwise order), although I could swear that 15 years ago different manufacturers were not consistent in SOT-23 pin numbering, so we had to be very careful when reviewing our pinouts. In any case, if you become aware that a type of mistake can easily occur, be systematic about looking for it in design reviews.

Use a revision control system! Software development without revision control is a huge risk, and if you see it, raise the issue quickly. Even if you’re just prototyping, revision control systems give you an easy way of recording exactly which software you are using for a particular test.

Event logging — this is a tough one in the embedded world. Desktop and server PCs have the luxury of a filesystem, and can easily create detailed event logs for later analysis. The 50-cent embedded processor doesn’t have a filesystem, which means you’re stuck either not having event logs, or cobbling together your own event log system from the few resources you have available. I don’t have a good solution; the easiest thing is probably just to spit data out of a UART pin, but then you’ll only have an event log recorded when your device is actually hooked up to some external data recorder, whether that’s a special-purpose device or a PC.
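One common compromise, when you have a little spare RAM, is a small circular buffer of timestamped event codes that can be dumped over the UART on demand. Here’s a sketch of the idea, written in Python for readability; on a real part this would be a fixed array of small structs, and the EventLog class and its method names are my own inventions:

```python
# Sketch of a tiny circular event log, the kind you might squeeze into
# a resource-limited micro. The buffer is fixed-size; once full, the
# oldest entries are overwritten, so you always keep the most recent
# history leading up to a failure.
class EventLog:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.events = [None] * capacity   # fixed storage, no filesystem
        self.count = 0                    # total events ever logged

    def record(self, timestamp, event_code):
        # Overwrite the oldest entry once the buffer wraps around.
        self.events[self.count % self.capacity] = (timestamp, event_code)
        self.count += 1

    def dump(self):
        """Return events oldest-first, e.g. to send out a UART on request."""
        if self.count <= self.capacity:
            return self.events[:self.count]
        start = self.count % self.capacity
        return self.events[start:] + self.events[:start]
```

The appeal of this scheme is that recording an event is cheap enough to leave enabled in production code, and the dump only happens when someone actually plugs in and asks for it.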

Interface specifications – We deal a lot with analog-to-digital conversion. I’m surprised how many systems I’ve worked on that didn’t have a specification for input signals. Yeah, the design information was all there: if you look at the schematic, it tells you which resistors are used in a voltage divider or an amplifier, and it tells you what the ADC reference voltage is, and if you look at the software it tells you what multiplication factors are used. But this information is so critical! It should really be in an interface specification or a design specification: batteryVoltage has a 36V fullscale value and is encoded as a Q15 signed integer, motorCurrent has a ±2.5A fullscale value and is encoded as a Q15 signed integer. When you’re debugging your software, it’s not really useful to look at a program variable and see that motorCurrent is 12946 counts; you need to convert it to 0.9877A in order to be able to understand whether that value seems correct or not.
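As a sketch, the conversion such a spec pins down is just a scale factor; the helper name below is my own, but the numbers match the motorCurrent example above:

```python
def q15_to_real(counts, fullscale):
    """Convert a Q15 signed integer (range -32768..32767) to engineering
    units, given the value that 32768 counts would represent."""
    return counts * fullscale / 32768.0

# motorCurrent: +/-2.5A fullscale, encoded as Q15
print(q15_to_real(12946, 2.5))   # -> approximately 0.9877 (amps)
```

Having this one scale factor written down in an interface spec, rather than buried across a schematic and a source file, is what lets you glance at a raw count during debugging and know whether it’s plausible.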

Wrapup and Feedbcak Feedback

You will make mistakes! There’s no getting around it. You aren’t perfect, and neither are your coworkers. And when you work together, sometimes you will amplify each other’s mistakes. So try to be as systematic as you can. Anticipate failure, and make the extra effort to root it out of your engineering development efforts.

High Reliability Organizations like the military, aviation, hospitals, and the nuclear power industry have processes in place to help reduce the chances of failure. Learn from them, and apply these techniques to your team.

In the fast-paced R&D world, it’s difficult to stay agile while also making the extra effort to be reliable. I’d appreciate any feedback you might have: how has your engineering team been able to eliminate mistakes? What techniques have you used to work more reliably together, without sacrificing the speed of development? Let me know!

© 2014 Jason M. Sachs, all rights reserved.