State machines are a great way to design software but they can be difficult to implement well.To illustrate this I’ll develop a simple state machine then increase the complexity to demonstrate some of the difficulties

We’ve all washed dishes before - it’s easy isn’t it? Scrub, rinse, dry, scrub, rinse dry. Scrub the dish until all of the gunk is off of it, rinse until the soap is off, put it in the drying rack. If you want to design software to implement this you have options. You could use a flowchart (those are fun!) or you could use a state diagram. Both are equally valid methods of designing software - the flowchart is very procedural while the state diagram is… stateful. The software you write with either design method can implement the same functionality but in different ways. For this article, of course, I’m going to use a state diagram to design the software. Below is the state diagram that implements washing dishes:

This depiction is fairly typical for state machine documentation. Each circle represents a state - in this case, SCRUB, RINSE, DRY. The arrows represent transitions between states. When the conditions to exit a state are not met, the arrow loops around to the same state it’s in. Otherwise, it moves on to the next state. Each arrow is labeled with the condition that pertains to it. The arrow labeled ‘Initialization’ shows the transition from external control into the state machine, and the arrows from CLEANUP show exit conditions for the state machine.

If you want to implement this state machine as code (and if you’re reading this article I’m guessing the answer is Yes) you might end up with some code that looks like the code below:

typedef enum{ INIT, SCRUB, RINSE, DRY, CLEANUP } t_state_machine; enum { EXIT_NORMAL =0, EXIT_UNKNOWN, EXIT_LL_INIT_FAILURE, EXIT_INTERNAL_FAULT, EXIT_BATHROOM, NORMAL_OPERATION = 0x7F } static t_state_machine state = INIT; int exit_code = EXIT_UNKNOWN; //Defensive programming - assume failure interrupt bathroom_break(void){ exit_code = EXIT_BATHROOM; } void main(void) { wash_item * dish = NULL; /* Low-level initialization - * Configure hardware * Allocate memory * Etc. */ exit_code = ll_init(); //Either returns NORMAL_OPERATION or EXIT_LL_INIT_FAILURE while (exit_code == NORMAL_OPERATION) { //State-independent code //None //State-dependent code switch(state) { case INIT: get_towels(); fill_sink(): stack_dishes(); dish = pick_up_dish(); if(!dish) { state = CLEANUP; } else { state = SCRUB; } break; case SCRUB: while(*dish == DIRTY) { scrub(&dish); } state = RINSE; break; case RINSE: while(*dish == SOAPY) { rinse(&dish); } state = DRY; break; case DRY: while(*dish == WET) { dry(&dish); } put_away_dish(&dish); dish = pick_up_dish(); if(!dish) { state = CLEANUP; } else { state = SCRUB; } break; case CLEANUP: exit_code = EXIT_NORMAL; break; //Error Conditions - these should never happen, so shut down immediately. default: exit_code =EXIT_INTERNAL_FAULT; if(dish) { return_dish(&dish); //Unsure what state the dish is in, return it to the dirty pile } break; } } return exit_code; }

This is so high-level as to be almost pseudo-code so don’t take it too literally. It’s meant to give you an idea of what goes into the state machine portion of such a design rather than all of the ancillary functions (i.e., pick_up_dish, dry_dish, etc.).

There are some interesting aspects of this design which I’ll discuss the reasons behind.

Enumerated Type

First, I’d like to draw your attention to the t_state_machine enumerated type. Each state has an entry in the enumeration. In the main loop, the state variable is a variable of this enumerated type. Many state machines simply use an integer or an unsigned character as the variable type for the state machine. This is not incorrect, but it lacks the main advantage of an enumerated type: type safety. If I had used an unsigned character to hold the state I would only be using 6 of the possible 256 values that the variable could take on. In C, I could legally set the state variable to any of those values and the compiler wouldn’t care even though my application will only recognize those specific 6 values. Using an enumerated type, the compiler will warn me if I attempt to assign a value other than one of the enumeration values to the state. It’s not an absolute safety net but it helps to prevent silly errors.

It should be noted that there are several drawbacks to this approach. The first among them is variable size. Generally, in C enumerations are the same as the int type. On most compilers this is 32-bits wide although it does vary. If you use this trick you can’t explicitly define the width of the state variable. The tradeoff here is reduced control over the low-level details versus greater checks against invalid state values. Another potential disadvantage is reusability. If you want to pass the state to a function you’ll need to define the function prototype with the correct state variable type - the enumerated type specific to the state machine. This is fine if you only have one state machine in your design but when you add multiple different state machines to the design you’ll have to add multiple functions to handle each different state machine and its custom type. You might be able to avoid this by casting the custom types to an integer value but then you lose the type safety which was the driver behind the custom type in the first place. There are no perfect solutions to this problem in C (although C++ has better tools to handle this issue cleanly).

Main Loop/Low-Level Init

One of the first things that is done by the code is to call the ll_init function. In every application that you work with there will be some need for low-level initialization. In higher-level systems you may need to initialize frameworks you’re using or allocate memory for objects before general operation. In a low-level embedded application you will definitely need to initialize hardware, run built-in tests or do any number of checks to ensure you can proceed to normal operation. In this example, the ll_init function returns a status which is assigned to exit_code. If ll_init succeeds, NORMAL_OPERATION is returned and the state machine is entered. There are a number of possible exit codes - one specifies normal exit with no errors (EXIT_NORMAL), another that low-level initialization failed (EXIT_LL_INIT_FAILED) and so on. In a regular desktop-style program, error codes would be utilized by the calling program. The situation for embedded software is different - the error codes would have to be sent to a higher-level executive to be acted upon. Most likely this would be an external system, or at the very least the embedded system would be enter a permanent failure mode and the error code would be saved to non-volatile memory for future failure analysis.

Another notable aspect of the main loop is the space before the case statement for state-independent code. There currently isn’t any but we will be adding some shortly.

The implementation of this state machine is fairly basic. There is plenty of room for improvement.

Redundant Behavior

Redundant behavior is undesirable in software design - mainly because it’s a maintainability issue. If you’re doing the same thing in two places with two different (yet essentially identical) pieces of code if you want to change what you’re doing you have to change it in two places. What I’d like to draw your attention to two snippets of similar code in the STARTUP and DRY states. Both of those states try to determine whether there are any more dishes available to wash via a pick_up_dish call. This is the redundant behavior. The simplest way to remedy this particular redundant behavior is to make a new state which looks for a dish to wash: GET_DISH. This state will contain the only call to pick_up_dish. If there is a dish, the state machine will proceed to SCRUB and so on. If not, it will go straight to CLEANUP. This change will reduce the redundant code in the design and limit the amount of code that needs to be changed if the requirements for this software change.

Asynchronous Behavior

The implementation of this state machine is for the most part synchronous - everything that happens in the program happens in the main loop. This is not representative of the vast majority of embedded applications - typically you’ll have interrupts or other tasks that are running at the same time which can generate events that affect your state machine. Additionally, many of the techniques which apply to state machines used in asynchronous embedded programs also apply to multithreaded applications as well. As an example of asynchronous behavior I added an interrupt - bathroom_break. Whenever you need to go to the bathroom, the interrupt will execute and set the exit code to EXIT_BATHROOM which will break you out of the main loop and exit the program, letting the caller know that you have to go to the bathroom. However, there are several issues with the implementation of this state machine and asynchronous behavior.

Blocking

Take a look at the SCRUB, RINSE, and DRY states in the state machine. They all implement code similar to this:

while(*dish == DIRTY) { scrub(&dish); }

This is an example of blocking code - while the dish is still dirty, the dish washer will keep scrubbing and do nothing else. It doesn’t matter if it takes ten seconds or five years - until the dish is clean the washer won’t stop for anything - including bathroom breaks. Even if the bathroom_break interrupt executes and sets the exit_code the main loop won’t run until the SCRUB state stops blocking. The dish washer will dutifully hold his or her bladder until the dish is clean - possibly precipitating an embarrassing situation.

Blocking behavior is usually the most direct and simplest implementation of any given algorithm but it assumes that the code has one responsibility and one execution path. This is an acceptable assumption to make in some situations, but not often in embedded programming and particularly not in this example. Since this code is responsible for handling washing dishes and bathroom breaks non-blocking code should be used. What does non-blocking code look like? The one major rule you should follow is: no loops within loops. Since there is a main loop in this program, there should be no loops in any of the functions that are called during the main loop. This is a dramatic approach, certainly, and is more of an ideal than a hard and fast rule. You can usually achieve the same functionality without loops as you can with them but in some cases it may be preferable to sneak a for loop or a while loop inside one of your functions to preserve code clarity and simplicity. Exactly where the line should be drawn is a topic for another article.

In this particular example we can avoid blocking behavior rather simply - the superloop allows us to avoid looping within a state and just use an if...else like this:

if(*dish == DIRTY) { scrub(&dish); } else { state = RINSE; }

Every time the main loop executes and the dish is dirty it will scrub it. When it’s clean it will switch the state to RINSE. This code also allows the main loop to continually execute and check the exit_code just in case a bathroom break is needed - potentially saving a great amount of embarrassment.

Race Conditions

Asynchronous behavior is often plagued by race conditions - situations where one output is being modified by two different portions of code unpredictably. When a race conditions occurs you don’t know what state the output will be in. Since deterministic (reproducible) behavior is generally a goal for programming this is considered to be a bad thing. The race condition in this case is the setting of exit_code. Both the bathroom_break function and main loop set this variable, but the bathroom_break function could execute at any time since it’s an interrupt. This could lead to the following situation:

case CLEANUP: -------------bathroom_break executes--------------------- exit_code = EXIT_BATHROOM; -----------Main Loop resumes------------------------------ exit_code = EXIT_NORMAL; break;

The main loop could potentially overwrite the exit code to indicate a normal exit after the interrupt sets it to indicate a bathroom break is necessary. I don’t need to tell you what this would mean in real life - we all know it wouldn’t be pretty to forget that you have to go to the bathroom for any length of time.

Race conditions are an important issue to be aware of in software environments where code can execute asynchronously (typically with interrupts in embedded systems or in multithreaded environments) but race conditions are exacerbated when variables are shared between different execution contexts. With state machines it’s actually common to have race conditions with the state variable. Usually the state is a global variable which is set by complex logic spread out over multiple functions and multiple execution contexts. This makes it difficult to keep track of the places where the state is modified to ensure they don’t conflict and create a race condition. Ideally, important variables (such as the state variable) should be written to in one and only one place and communication between execution contexts should be well-structured to prevent race conditions. One good inter-task communication method to use would be a queue. The state variable would be fed by a queue that is posted to by multiple functions in different execution contexts, but the state variable itself would only ever be written to inside of the main loop, thus avoiding race conditions.

Resource Handling

Another issue with the bathroom break functionality is that it might exit the dish-washing state machine in the middle of washing a dish - without finishing cleaning the dish or even putting it down before going to the bathroom. Yes, it’s silly to think that a person would just stop washing dishes and visit the bathroom, plate in hand, to do his business but computers will only do what you tell them. This is the eternal bane of computer programmers. As odd as programmers might be, computers are much, much odder if you don’t program them correctly. The problem in this case is that the bathroom-break functionality is not properly synchronized to the general state-machine operation. Rather than simply executing the bathroom-break functionality independent of whatever is happening inside of the main state machine, it must be handled within the context of the main state machine. So, rather that simply setting the exit code and pretending that this will solve all primary and secondary concerns more complex operations are required.

Scoping

This state machine implementation utilizes global variables for state and exit_code. By themselves, there’s nothing wrong with global variables - provided they’re used properly and respected. In this example there is a good amount of respect: the state variable is written by only one function - main. The exit_code variable is not as well respected - it’s written by the interrupt and by the main loop. We’ve already seen how this can cause problems with race conditions and generally the problem with global variables don’t end there. While global variables are easy to use they are just as easy to abuse. It’s for this reason that it’s generally considered good programming practice to limit variables to the minimum scope necessary.

A Better Implementation

If we consider all of the criticisms of the initial state machine implementation (blocking behavior, race conditions, poor resource handling, scoping issues and redundant behavior) we can develop a much better one - seen below:

typedef enum{

INIT,

GET_DISH,

SCRUB,

RINSE,

DRY,

CLEANUP

} t_state_machine; enum {

EXIT_NORMAL =0,

EXIT_UNKNOWN,

EXIT_LL_INIT_FAILURE,

EXIT_INTERNAL_FAULT,

EXIT_BATHROOM,

NORMAL_OPERATION = 0x7F

} static volatile int bathroom_break_needed = 0; //Flag to indicate the bathroom_break interrupt executed. interrupt bathroom_break(void){

bathroom_break_needed =1;

} void main(void) { wash_item * dish = NULL;

int exit_code = EXIT_UNKNOWN; //Exit code - defensive programming - assume failure



/* Low-level initialization -

* Configure hardware

* Allocate memory

* Etc.

*/ exit_code = ll_init(); //Either returns NORMAL_OPERATION or EXIT_LL_INIT_FAILURE while (exit_code == NORMAL_OPERATION) {

static t_state_machine state = INIT; //State variable - limit scope to within the main loop



//State-independent code

if(bathroom_break_needed)

{

//Dispose of any dish currently being washed

if(dish) {

return_dish(&dish);

}

exit_code = EXIT_BATHROOM;

continue; //Do not execute the rest of the loop - return to the top to allow the loop to break

}



//State-dependent code

switch(state) {

case INIT:

get_towels();

fill_sink():

stack_dishes();

state = GET_DISH;

break;

case GET_DISH:

dish = pick_up_dish();



if(!dish) {

state = CLEANUP;

}

else

{

state = SCRUB;

}

break;

case SCRUB:

if(*dish == DIRTY) {

scrub(&dish);

}

else

{

state = RINSE;

}

break;

case RINSE:

if(*dish == SOAPY) {

rinse(&dish);

}

else

{

state = DRY;

}

break;

case DRY:

if(*dish == WET) {

dry(&dish);

}

else {

put_away_dish(&dish);

state = GET_DISH; } break;

case CLEANUP:

exit_code = EXIT_NORMAL;

break;

//Error Conditions - these should never happen, so shut down immediately.

default:

exit_code =EXIT_INTERNAL_FAULT;

if(dish) {

return_dish(&dish); //Unsure what state the dish is in, return it to the dirty pile

}

break;

}

}

return exit_code;

}

This implementation improves upon the the previous implementation in several ways:

The GET_DISH state has been added to replace the redundant behavior in the DRY and INIT states.

The bathroom break functionality has been modified to remove race conditions and improve resource handling. Now, the interrupt sets a flag which is checked in the state-independent portion of the main loop. If the flag is set any dish currently being washed will be put back in the stack, the exit_code will be set and the rest of the main loop skipped to prevent a race condition.

The state variable has been made a main-loop static local variable instead of a global variable. This is a particularly interesting feature in C: brackets (‘{‘ and ‘}’) are actually scope operators - any time you use brackets you can define variables that are only valid within those brackets. This prevents any code outside of the main loop from modifying the state variable - this makes for a very safe state machine implementation. However, this scoping also prevents any code outside the main loop from reading the variable either. This may not work well with your application, so use care when deciding on the scope for the state variable.

All of the blocking loops have been removed from within the states and replaced with if..else statements. The functionality of the code is the same, but the main loop will continue to execute and evaluate the state-independent code and the state-dependent code at the same time.



State machines can be useful in designing software but a good developer should remember that design is not implementation. There are a variety of ways a state machine can be implemented and some of the most basic ways can produce undesirable side-effects. With some simple strategies, however, these side-effects can be mitigated and the implementation of a state machine can more accurately reflect the intended design of the code.