Announced during re:Invent 2016, AWS Step Functions is a service for creating state machines. It is the spiritual descendant of the not-so-simple-to-use Simple Workflow (SWF) service. It addressed many of its predecessor’s usability issues and made AWS Lambda the centerpiece. In this post, we will review all you need to know about AWS Step Functions, including a hands-on tutorial.

What is AWS Step Functions?

Step Functions is an orchestration service that allows you to model workflows as state machines. You design your state machine using a JSON-based specification language (the Amazon States Language), and then you can start executions of it in several ways: directly through the StartExecution API (via the AWS SDK or CLI), through API Gateway, or on a schedule with CloudWatch Events.
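As an illustration, here is how an execution might be started from code with the AWS SDK for Python. This is a sketch under assumptions: the state machine ARN, execution name, and input payload are hypothetical placeholders, not values from this article.

```python
import json

# Sketch of starting a Step Functions execution via the StartExecution API.
# The ARN and execution name below are hypothetical placeholders.
# The actual call would be made with boto3:
#   boto3.client("stepfunctions").start_execution(**params)
def build_start_execution_params(state_machine_arn, name, input_payload):
    return {
        "stateMachineArn": state_machine_arn,
        "name": name,  # must be unique in your account/region for 90 days
        "input": json.dumps(input_payload),  # the input must be a JSON string
    }

params = build_start_execution_params(
    "arn:aws:states:us-east-1:123456789012:stateMachine:my-machine",
    "run-0001",
    {"bucket": "my-bucket", "key": "data.csv"},
)
```

Note that the execution input must be serialized to a JSON string; passing a raw dict to the API would fail.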

The Step Functions service manages the execution state, and either handles errors or performs retries as specified.

In most cases, each state in the machine invokes a Lambda function. You can also incorporate branching logic, perform tasks in parallel, or even create SWF-style activities to integrate with external systems.

A significant benefit of AWS Step Functions is the ability to wait an arbitrary amount of time between states. That is difficult to do in an elegant, cost-efficient way with AWS Lambda.

Step Functions also allows users to visualize the state machine at both design time and execution time.

For example, a pipeline to ingest a sizeable S3 file into DynamoDB might look something like this:

Sadly, it’s not yet possible to design workflows visually. Instead, you define the workflow in JSON and use the visualization tool to validate the design visually.

Azure Logic Apps and IBM’s Node-RED both offer the ability to design workflows visually. Hopefully, Step Functions will follow suit shortly.

When Should AWS Step Functions Be Used?

Step Functions charges based on the number of state transitions. At $25 per million state transitions, plus the cost of the Lambda invocations, it is a comparatively expensive service.

Considering that you can also use a single AWS Lambda function to implement workflows as code, when should you use Step Functions?

I typically reserve Step Functions for three types of workflows:

Business Critical Workflows

Consumers are happy to pay a premium to insure expensive purchases such as a car or a house against unexpected failures. Similarly, engineers are glad to pay a premium for workflows that they need to succeed.

Good examples are payment and subscription flows—the things that earn money.

For these business-critical workflows, it makes sense to pay a little extra to have more flexibility around error handling and retries, to give the workflows the best chance to succeed.

Complex Workflows

For complex workflows that involve many different states and branching logic, the visual workflow is a robust design and diagnostic tool.

For example, an application support team can look at the workflow diagram for a running or completed execution and understand what happened. The group can intuitively understand the state of the system and how it got there without knowing the ins and outs of its implementation.

It is possible because the critical design decisions in the workflow have been lifted out of the code and made explicit in a visual format that anyone can follow.

Equally, if a product person (or any other non-technical user) reviews this diagram, they would understand it without knowing how the underlying code works. That makes collaboration much easier, and misunderstandings surface quickly when everyone is on the same page.

Long-Running AWS Workflows

For workflows that cannot complete within the five-minute execution limit for Lambda, you should also consider using Step Functions.

The AWS Lambda team discourages the use of recursive Lambda functions because it’s easy to get them wrong. In such cases, you should use an orchestration service like Step Functions.

You can put explicit branching checks in place and enforce timeouts at the workflow level, which helps prevent accidental infinite recursion.
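As a sketch of this idea (the state names and Lambda ARN here are hypothetical), a workflow-level TimeoutSeconds at the top of the state machine definition fails the entire execution with a States.Timeout error if it runs longer than the limit:

```json
{
  "Comment": "Fail the whole execution after one hour",
  "StartAt": "PollStatus",
  "TimeoutSeconds": 3600,
  "States": {
    "PollStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234556788:function:poll-status",
      "Next": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
```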

How Does AWS Step Functions Work?

At the heart of a Step Functions state machine are the state definitions, which describe what each state does and how inputs propagate from one state to the next.

AWS States

A state is how you tell the state machine to “do something.” Here are the seven state types you can use:

Task

Executes the AWS Lambda function identified by the Resource field. The output from the Lambda function is then passed on to the next state as its input.

"TaskState": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:1234556788:function:hello-world", "Next": "NextState", "TimeoutSeconds": 300 }

One caveat to remember is that TimeoutSeconds defaults to 60 if not specified. The state would then fail with a States.Timeout error after 60 seconds, even if the Lambda function is still running!

Equally, if the function itself times out before the TimeoutSeconds value, then Step Functions is not able to distinguish the timeout error from other types of errors.

This makes handling specific errors more difficult. It’s good practice to always match TimeoutSeconds to the timeout setting of the function itself.

Pass

Passes input to output without doing any work.

Wait

Causes the state machine to wait before transitioning to the next state.
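For example, a minimal Wait state might look like this (the state names are hypothetical). Besides a fixed Seconds value, a Wait state can also wait until an absolute Timestamp, or read either value from the input using SecondsPath or TimestampPath:

```json
"WaitTenMinutes": {
  "Type": "Wait",
  "Seconds": 600,
  "Next": "NextState"
}
```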

Succeed

Terminates the state machine successfully.

Fail

Terminates the state machine and marks it as a failure.

Choice

Adds branching logic to the state machine.
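A minimal Choice example (the field and state names are hypothetical) that branches on a numeric comparison, falling through to the Default state when no rule matches:

```json
"CheckFileSize": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.fileSize",
      "NumericGreaterThan": 1048576,
      "Next": "SplitFile"
    }
  ],
  "Default": "ProcessFile"
}
```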

Parallel

Performs tasks in parallel.
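A sketch of a Parallel state with two fixed branches (the state names and Lambda ARNs are hypothetical). Each branch is its own mini state machine, and the Parallel state’s output is an array containing the output of every branch:

```json
"FanOut": {
  "Type": "Parallel",
  "Next": "Summarize",
  "Branches": [
    {
      "StartAt": "ResizeImage",
      "States": {
        "ResizeImage": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:1234556788:function:resize",
          "End": true
        }
      }
    },
    {
      "StartAt": "ExtractMetadata",
      "States": {
        "ExtractMetadata": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:1234556788:function:extract-metadata",
          "End": true
        }
      }
    }
  ]
}
```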

Input and Output

When an execution starts, its input is described by a JSON document. That input is bound to the symbol $ and passed as the input to the first state in the state machine.

By default, the output of each state is bound to $ and becomes the input of the next state. However, you can use the ResultPath field to bind a state’s result to a path on $ instead, preserving the other fields on $.

For example, suppose the input to a Task state is the following:

```json
{ "x": 42 }
```

If the Task’s Lambda function returns 84, you can specify ResultPath as $.y:

```json
"DoubleInput": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234556788:function:double",
  "Next": "NextState",
  "ResultPath": "$.y"
}
```

The output of this Task state would then be:

```json
{ "x": 42, "y": 84 }
```

Similarly, if you don’t want to present the entire JSON object $ as input to a Lambda function, you can use InputPath to select parts of $ instead.
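To make these semantics concrete, here is a toy simulation in Python. This is not the real Step Functions engine; it supports only "$" and simple "$.field" paths, and the "double" task is hypothetical:

```python
# Toy illustration of InputPath/ResultPath semantics; only "$" and
# simple "$.field" paths are handled in this sketch.
def select(data, path):
    """InputPath: pick the part of $ that the task receives."""
    if path == "$":
        return data
    return data[path[2:]]  # "$.x" -> data["x"]

def bind_result(data, result, result_path):
    """ResultPath: merge the task's result back into $."""
    if result_path == "$":
        return result
    merged = dict(data)
    merged[result_path[2:]] = result  # "$.y" -> merged["y"] = result
    return merged

state = {"x": 42}
task_input = select(state, "$.x")  # the task sees 42
task_output = task_input * 2       # a hypothetical "double" task returns 84
state = bind_result(state, task_output, "$.y")
print(state)                       # {'x': 42, 'y': 84}
```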

As $ passes through a number of states, you can use InputPath to carefully select values from $ as input to each Task state, and ResultPath to bind the outputs to new fields on $.

Error Handling

You can specify how to retry a state by adding a Retry field to its definition.

"TaskState": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:1234556788:function:hello-world", "Next": "NextState", "Retry": [ { "ErrorEquals": [ "ErrorA", "ErrorB" ], "IntervalSeconds": 1, "BackoffRate": 2.0, "MaxAttempts": 2 }, { "ErrorEquals": [ "ErrorC" ], "IntervalSeconds": 5 } ] }

If this TaskState fails with ErrorA or ErrorB, the execution engine retries the state up to two more times. IntervalSeconds specifies the delay before the first retry attempt.

Subsequent retries would multiply the delay by BackoffRate. For example, with an IntervalSeconds of 1s and BackoffRate of 2.0, the delays between retries would be 1s, 2s, 4s, 8s, and so on.

If they are not specified, then the following default values would be used:

IntervalSeconds: 1

BackoffRate: 2.0

MaxAttempts: 3
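The delay arithmetic above can be sketched in a few lines of Python (a model of the documented behavior, not Step Functions’ actual implementation):

```python
# Delay schedule implied by a Retry block: the i-th retry waits
# IntervalSeconds * BackoffRate**i seconds.
def retry_delays(interval_seconds=1, backoff_rate=2.0, max_attempts=3):
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]

print(retry_delays())   # default Retry values -> [1.0, 2.0, 4.0]
print(retry_delays(5))  # IntervalSeconds of 5 -> [5.0, 10.0, 20.0]
```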

Once the retry attempts are exhausted, the execution fails with the last error unless you add a Catch field.

"TaskState": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:1234556788:function:hello-world", "Next": "NextState", "Retry": [ { "ErrorEquals": [ "ErrorA", "ErrorB" ], "IntervalSeconds": 1, "BackoffRate": 2.0, "MaxAttempts": 2 }, { "ErrorEquals": [ "ErrorC" ], "IntervalSeconds": 5 } ], "Catch": [ { "ErrorEquals": [ "ErrorA", "ErrorB", "ErrorC" ], "Next": "RecoveryState" }, { "ErrorEquals": [ "States.ALL" ], "Next": "TerminateMachine" } ] }

As with the Retry field, you can specify how the state machine should handle different types of errors and which states it should transition to next. In the last catcher of the Catch array, you can also use the special States.ALL error type as a catch-all.

AWS Limitations

Like other AWS services, Step Functions has a long list of limits. Here are a few important ones:

Maximum execution time: one year

Maximum execution history retention time: 90 days

When you start an execution, the execution name must be unique in your AWS account and region for 90 days.

There are regional limits on API calls to Step Functions, such as ListExecutions and ListStateMachines. These limits are generally very low and refill slowly, so be mindful of them when your system needs to make regular API calls to Step Functions.

Aside from these service limits, the biggest limitation with Step Functions is the fact that you can’t spawn concurrent Lambda invocations dynamically. Imagine a state machine that reads a CSV file in S3, and then, for each row, spawns a Lambda function to perform some processing.

This is currently not possible with the Parallel state in Step Functions because the number of parallel tasks has to be specified ahead of time.

Challenges to Monitoring and Debugging

Just like any serverless application based on AWS Lambda, Step Functions brings its own observability challenges.

Every state machine exposes a number of metrics in CloudWatch. They allow you to monitor the execution time and success rate of its executions and create alarms against failures.

In the Step Functions console, if you select one of your state machines, you can see the history of all recent executions and their statuses.

You can then drill into a particular execution to see what happened. The Visual Workflow pane shows the current progress (when the execution is still running) or outcome of the execution.

You can click a step to see the input, output, and exception details for that step.

Whilst there is a link to the CloudWatch Logs log group for the function, you still have to go and find the relevant log stream yourself.

The Execution event history pane displays a detailed history of all state transitions, including the timestamp (in UTC) and the relative elapsed time since the start of the execution. This is useful for identifying performance issues and slow running steps.


When a step executes multiple times in a single state machine execution, the Visual workflow pane only shows you what happened the last time that step was executed. The event history, on the other hand, shows you every invocation.

You can expand the TaskStateEntered and TaskStateExited events to see the input and output of the state.

When a Lambda function error occurs, you can also expand the LambdaFunctionFailed event to see the error details.

AWS Step Functions Is an Isolated Ecosystem

While Step Functions offers many tools to help you with monitoring and debugging, the problem is that they exist in an isolated ecosystem.

Modern applications are composed of many independently deployable services, all working together to make things happen. My state machines are one part of that application.

As an engineer, I need a unified tool for monitoring all of these different services. It’s not helpful for me to have to jump between different tools and AWS consoles to collect the information I need to understand the end-to-end flow of data.

When trying to understand and debug the end-to-end flow of data, you also need to know what happened OUTSIDE the state machine.

How was the execution started? Where did the data originate from?

It’s for these reasons that I really like what the Epsagon team is building. One of the nice features of their tool is the ability to link Step Functions executions with their upstream functions.

This enables you to see at a glance not only what happened inside the state machine execution, but also what happened before it.

Another option is to use Trace Search to look for specific events that include operations such as startExecution. You can experience it live in Epsagon’s live demo environment.

Passing Correlation IDs through Step Functions Executions

On my own blog, I have previously written about how you can capture and forward correlation IDs through various Lambda event sources such as API Gateway, SNS, and Kinesis data streams. As you might have noticed from earlier screenshots, we can apply the same technique with Step Functions.

Using a Middy middleware like this one, we can capture correlation IDs in the invocation input and include them in our logs. You can see more examples in this guide to error handling in AWS Lambda using wrappers.

If you don’t want to build your own mechanism for flowing correlation IDs through the Lambda functions in a Step Functions state machine, or you’re not sure how to implement such a mechanism yourself, Epsagon can help: it does this out of the box.

Another nice feature of the Epsagon product is that it shows you the logs for the relevant Lambda invocation, which is a lot better than just being taken to the CloudWatch Logs log group!

You might want to check the serverless beginner considerations when getting started with AWS Step Functions. In addition, these 5 best use cases are a great place to start. The Hitchhiker’s Guide to Serverless is another great resource.

Good luck!

Using Step Functions? Sign up for a free trial of Epsagon and troubleshoot them faster!
