All microservices I developed on a FaaS platform like AWS Lambda had one aspect in common: the time limit of a common HTTP request (about 30 seconds when using AWS API Gateway) was easily exceeded.

Therefore I had a design similar to this diagram:

1. Trigger: A short-lived AWS Lambda function with an API Gateway event to start off processing / calling transform asynchronously.
2. Transform: One or more functions that did the actual processing.
3. Status: Another short-lived AWS Lambda function with an API Gateway event to query the status of a process.
4. Status Database: A DynamoDB table or ElastiCache Redis instance to hold the state of all processes.
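
To make the four parts concrete, here is a minimal sketch of how they interact. It is not the original implementation: the in-memory `STATUS_DB` dictionary stands in for DynamoDB or Redis, the handler names and event fields are assumptions, and the "asynchronous" call is a plain function call so the example stays runnable on its own.

```python
import json
import uuid

# Hypothetical stand-in for the status database (DynamoDB or Redis in the text).
STATUS_DB = {}


def transform(process_id, payload):
    """Does the actual (long-running) processing and records the outcome."""
    STATUS_DB[process_id] = {"status": "RUNNING"}
    result = payload.upper()  # placeholder for the real work
    STATUS_DB[process_id] = {"status": "DONE", "result": result}


def trigger(event, invoke_async=transform):
    """Short-lived entry point: registers the process and hands off the work."""
    process_id = str(uuid.uuid4())
    STATUS_DB[process_id] = {"status": "PENDING"}
    # In a real deployment this would be an asynchronous Lambda invocation.
    invoke_async(process_id, event["body"])
    return {"statusCode": 202, "body": json.dumps({"id": process_id})}


def status(event):
    """Short-lived lookup of the process state by its id."""
    record = STATUS_DB.get(event["pathParameters"]["id"])
    if record is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown id"})}
    return {"statusCode": 200, "body": json.dumps(record)}
```

The client receives an id from `trigger` right away and polls `status` with it, so no single HTTP request has to outlive the processing itself.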

For the Status Database (4), the choice between Redis and DynamoDB was always quite clear: do you want to keep the states after completion? Use DynamoDB. If not, use Redis.

With DynamoDB, however, serving highly trafficked serverless functions can get quite complicated. On the other hand, configuring an AWS ElastiCache Redis instance with serverless requires a lot of boilerplate code. See this Gist.

Not to mention that neither solution is purely serverless when you look at the scalability and pricing AWS offers there.

Additional side concerns were:

When you have file-system-reliant code in AWS Lambda, you want to enforce some retry and backoff policies, since each function instance only gets 512 MB of /tmp disk space by default, and that space is shared between the invocations served by the same function instance.

If multiple functions make up the transform part, then error handling, parallelisation and synchronisation can introduce a lot of boilerplate code to each function.
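
To illustrate the kind of boilerplate meant here, this is a minimal sketch of a retry/backoff wrapper that each function would have to carry. The decorator name, its parameters and the choice of `OSError` as the retryable error are assumptions for illustration, not code from the original services.

```python
import functools
import time


def retry_with_backoff(max_attempts=2, interval_seconds=1.0, backoff_rate=2.0, sleep=time.sleep):
    """Retry the wrapped function, multiplying the wait after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = interval_seconds
            for attempt in range(max_attempts + 1):  # first call plus the retries
                try:
                    return func(*args, **kwargs)
                except OSError:  # e.g. "No space left on device"
                    if attempt == max_attempts:
                        raise  # retries exhausted, surface the error
                    sleep(wait)
                    wait *= backoff_rate
        return wrapper
    return decorator
```

Every function with file system contention would need to be wrapped like this, and the wait times would still count towards the function's own billed runtime.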

Introducing AWS Step Functions

To summarise, or in case you skipped the first section, these are the problems we face with asynchronous serverless microservices:

1. Persisting the state in DynamoDB or AWS ElastiCache / Redis is costly and/or complex.
2. Error handling, retry/backoff behaviour and flow control require a lot of boilerplate code on a per-function basis.

AWS sells Step Functions (SFN for short) as a tool for building distributed applications with visual workflows. This means you can define an execution flow between different types of AWS services, first and foremost AWS Lambda.

I like to explain technologies by example, so let’s take a microservice that transforms vectorised PDFs into transparent PNGs. Let’s say that, for whatever reason, we want to split the transformation into two steps, that is, two functions:

1. Convert: Converts the PDF to a PNG file.
2. Transform: Makes the white background of the PNG file transparent.

Additionally we would like to generate a thumbnail of the PNG while it is transforming. So we end up with a very simple flowchart:

After you have set up your serverless project and defined all your functions in the `serverless.yml` file, install two plugins:

1. Serverless Step Functions: To define the state machine in the serverless.yml.
2. Serverless Pseudo Parameters: Required by serverless-step-functions to refer to the defined functions.

    sls plugin install -n serverless-step-functions
    sls plugin install -n serverless-pseudo-parameters

These two commands should have added the following three lines at the bottom of your serverless.yml file:

    plugins:
      - serverless-step-functions
      - serverless-pseudo-parameters

Let’s assume the functions are defined in serverless.yml as follows:

    functions:
      convert:
        handler: handler.convert
      transform:
        handler: handler.transform
      thumbnail:
        handler: handler.thumbnail

To define the state machine as in our flowchart, we can apply the following YAML configuration after the functions section:

    stepFunctions:
      stateMachines:
        pdfTransform:
          name: PDFTransform
          description: "Takes vectorised PDFs and transforms them to PNGs with transparent background, also generates thumbnails for them."
          definition:
            StartAt: Convert
            States:
              Convert:
                Type: Task
                Next: Processing
                Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:my-service-${opt:stage}-convert
              Processing:
                Type: Parallel
                End: true
                Branches:
                  - StartAt: Transform
                    States:
                      Transform:
                        Type: Task
                        Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:my-service-${opt:stage}-transform
                        End: true
                  - StartAt: Thumbnail
                    States:
                      Thumbnail:
                        Type: Task
                        Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:my-service-${opt:stage}-thumbnail
                        End: true
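
To show how data moves through this state machine, here is a minimal sketch of the three handlers in Python. The handler bodies and the field names (`pdf_key`, `png_key` and so on) are assumptions for illustration; only the data flow reflects how Step Functions behaves: the output of a Task becomes the input of the next state, and a Parallel state passes the same input to every branch and returns the branch outputs as an array.

```python
def convert(event, context):
    """Convert the PDF to a PNG; the return value is passed on to Processing."""
    pdf_key = event["pdf_key"]  # hypothetical input field
    return {"png_key": pdf_key.rsplit(".", 1)[0] + ".png"}


def transform(event, context):
    """Make the white background transparent (placeholder logic)."""
    return {"transparent_key": "transparent/" + event["png_key"]}


def thumbnail(event, context):
    """Generate a thumbnail of the PNG (placeholder logic)."""
    return {"thumbnail_key": "thumbnails/" + event["png_key"]}


def run_locally(execution_input):
    """Local imitation of the state machine's data flow, for illustration only."""
    converted = convert(execution_input, None)
    # A Parallel state hands the same input to every branch and collects
    # the branch outputs in an array, in the order the branches are defined.
    return [transform(converted, None), thumbnail(converted, None)]
```

Calling `run_locally({"pdf_key": "docs/logo.pdf"})` mimics one execution without touching AWS, which can be handy for unit-testing the handlers before deploying.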

Deploy your function:

sls deploy

Then open your AWS Management Console, select the proper region, open the Step Functions Menu and click on PDFTransform. You can execute your state machine here for testing:

Very well, our functions are deployed to AWS Lambda and are orchestrated via Step Functions.
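
Besides the console, an execution can also be started from code, for example from the short-lived trigger function. The following is a sketch under assumptions: the function name and the input shape are hypothetical, and `sfn_client` is expected to be a boto3 Step Functions client, typically created with `boto3.client("stepfunctions")`.

```python
import json
import uuid


def start_pdf_transform(sfn_client, state_machine_arn, pdf_key):
    """Start one execution of the state machine and return its ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Execution names have to be unique, so a random suffix is used here.
        name="pdf-" + uuid.uuid4().hex,
        # The input must be a JSON string; this shape is an assumption.
        input=json.dumps({"pdf_key": pdf_key}),
    )
    return response["executionArn"]
```

Passing the client in as a parameter keeps the function easy to test with a stub and avoids creating a new client on every call.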

Next, we want to implement a retry/backoff policy for our Convert function. For functions that suffer from contended file system space on AWS Lambda, I usually allow two retries with a backoff rate of two, starting from the average runtime.

Let’s say the average runtime of Convert is 30 seconds. To configure this, we simply add a Retry clause to the task definition. The first retry waits IntervalSeconds, and each further retry multiplies the previous wait by BackoffRate:

    Convert:
      Type: Task
      Next: Processing
      Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:my-service-${opt:stage}-convert
      Retry:
        - ErrorEquals:
            - States.TaskFailed
          IntervalSeconds: 30
          MaxAttempts: 2
          BackoffRate: 2
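
As a worked example of what these three values mean, the following small helper (a hypothetical name, not part of any AWS SDK) computes the wait before each retry: the first wait is IntervalSeconds, and every further wait is the previous one multiplied by BackoffRate.

```python
def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Return the wait before each retry attempt, in seconds."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# With the values from the Retry clause above:
print(retry_waits(30, 2, 2))  # [30, 60]
```

So Convert runs at most three times in total: the initial attempt, a retry after 30 seconds and a final retry after another 60 seconds.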

Fine! Next we add error handling: when Transform or Thumbnail fails, I want to make sure the converted PNG file gets deleted and an error is reported.

First, we define a new AWS Lambda function, which we call Rollback:

    functions:
      # ...
      rollback:
        handler: handler.rollback

Second, define it as a new terminal state in our state machine:

    Rollback:
      Type: Task
      Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:my-service-${opt:stage}-rollback
      End: true

And last, define this state as the next step when our Processing parallel state fails:

    Processing:
      Type: Parallel
      End: true
      Catch:
        - ErrorEquals:
            - States.TaskFailed
          Next: Rollback
      Branches:
        # ...
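
What the Rollback function receives depends on the Catch clause. With a Catch like the one above, the input of the next state is the error output, an object with the string fields `Error` and `Cause`, rather than the original input. The sketch below is one possible handler under assumptions: the `png_key` field, the `error` key and the `delete_file` callback are hypothetical, and the original input is only available if the Catch clause sets a ResultPath (for instance `$.error`).

```python
def rollback(event, context, delete_file=print):
    """Clean up after a failed Processing state and report the error."""
    # With a plain Catch, the event is the error output itself:
    # {"Error": ..., "Cause": ...}. If the Catch clause sets ResultPath
    # (say "$.error"), the original input is kept and the error output
    # is nested under that key instead.
    failure = event.get("error", event)
    png_key = event.get("png_key")  # hypothetical field from the Convert output
    if png_key is not None:
        delete_file(png_key)  # stand-in for the real file deletion
    return {
        "deleted": png_key,
        "error": failure.get("Error"),
        "cause": failure.get("Cause"),
    }
```

Without a ResultPath the handler can still report the error, but it has no way of knowing which PNG to delete, so keeping the original input alongside the error is worth considering.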

After re-deploying and executing the state machine again: