AWS Step Functions is well suited to problems that follow a sequential workflow, and it integrates directly with several AWS services, such as ECS, Lambda, and AWS Batch.

Step Functions provides a reliable way to coordinate components and step through the functions of your application. Step Functions offers a graphical console to visualize the components of your application as a series of steps. It automatically triggers and tracks each step, and retries when there are errors, so your application executes in order and as expected, every time.

Problem Domain

One thing Step Functions currently lacks is the ability to fan out a particular task in the workflow. A couple of existing patterns work around this by using a DynamoDB table to keep track of the activities.

For our use case, we also faced a similar problem where we had a couple of tasks that needed to be fanned out while still maintaining their sequential order. We didn’t really want to build a custom solution to keep track of these independent tasks unless absolutely necessary.

Primary design goals

Fig 1: Problem Breakdown

Ability to fan out the steps within a single workflow execution.

Ability to dynamically control the fan-out based on some input parameter.

Make most of these components stateless (enabling horizontal scalability and easy replacement of instances) by externalizing all state to databases, queues, streams, etc.

Maximize the use of AWS technologies for orchestration, workflow and state management.

After a couple of hours of research and reading through AWS architecture blogs, I learned about AWS Batch. It seemed like the right fit for solving our problem.

AWS Batch is a fully managed service that allows you to run and schedule jobs, and it even has a concept of job dependencies. AWS Batch also has direct integration with Step Functions.

So far, it was checking off all the boxes.

Going through the AWS Batch blogs, I learned that the AWS team has been making a lot of improvements, and the service is now suitable for running tasks that last only a couple of seconds.

Unfortunately, there aren’t many well-documented use cases for AWS Batch, and it has been perceived as useful only for long-running, scheduled background tasks.

Solution Strategy

We decided to use AWS Batch in conjunction with AWS Step Functions. This combination is suitable because:

It supports the capability to fan out individual steps along with support for job dependency models. It achieves this by introducing Array Jobs, where each parent job can have up to 10,000 child jobs. On top of that, it also supports an N_TO_N job dependency type for array jobs, so that each index child of this job must wait for the corresponding index child of each dependency to complete before it can begin.

AWS Batch supports dynamic compute provisioning and scaling, so we could easily support 1K jobs or just 10 jobs without paying for idle time for the resources. This is achieved by configuring Compute Environments.

AWS Batch has a concept of Job Definitions, which allows us to configure the Batch jobs. While each job must reference a job definition, many of the parameters that are specified in the job definition can be overridden at runtime.

AWS Step Functions has direct, seamless integration with AWS Batch and supports both async and sync Batch job execution. This removes the need to create a custom solution to track the state of the jobs and their completion. We can simply wait for the final parent Batch job, which consists of n child jobs, to complete.

We can easily support the concept of Critical Jobs in the future by making use of AWS Batch’s support for Job Queue priorities.
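To make the array-job, dependency, and runtime-override points above more concrete, here is a minimal Python sketch that builds the request parameters for a dependent array job. The key names follow the AWS Batch SubmitJob API as exposed by boto3 (`arrayProperties`, `dependsOn`, `containerOverrides`). The `build_array_job` helper and the example names are hypothetical, and nothing is sent to AWS.

```python
from typing import Optional


def build_array_job(name: str, queue: str, definition: str, size: int,
                    depends_on_job_id: Optional[str] = None,
                    env: Optional[dict] = None) -> dict:
    """Build keyword arguments shaped like an AWS Batch SubmitJob request.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    # AWS Batch array jobs accept a size between 2 and 10,000.
    if not 2 <= size <= 10_000:
        raise ValueError("array size must be between 2 and 10,000")

    request = {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "arrayProperties": {"size": size},
    }
    if depends_on_job_id is not None:
        # N_TO_N: child i of this job waits for child i of the dependency.
        request["dependsOn"] = [{"jobId": depends_on_job_id, "type": "N_TO_N"}]
    if env:
        # Runtime override of values set in the job definition.
        # Environment values must be strings.
        request["containerOverrides"] = {
            "environment": [{"name": k, "value": str(v)} for k, v in env.items()]
        }
    return request
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("batch").submit_job(**request)`, and the `jobId` in the response would feed the next job's `dependsOn` entry. In our setup, Step Functions makes the equivalent calls for us.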

Solution Implementation

Step Function Implementation

Fig 2: Step Function State Machine

With the ArrayProperties parameter, we are able to submit n jobs, where n is dynamically configured by the Lambda function in Step 1.
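As a rough illustration of Step 1, the chunker Lambda could look like the sketch below. It is not our actual function: the `totalItems` and `chunkSize` input fields are hypothetical. The one thing it has to do is return a plain integer, which Step Functions stores at `$.numChunks` because of the state's `ResultPath`.

```python
import math

# AWS Batch array jobs accept a size between 2 and 10,000.
MIN_ARRAY_SIZE = 2
MAX_ARRAY_SIZE = 10_000


def lambda_handler(event, context):
    """Return the number of child jobs to fan out to.

    `totalItems` and `chunkSize` are hypothetical input fields.
    """
    total_items = int(event["totalItems"])
    chunk_size = int(event.get("chunkSize", 100))
    if chunk_size <= 0:
        raise ValueError("chunkSize must be positive")

    num_chunks = math.ceil(total_items / chunk_size)
    if num_chunks > MAX_ARRAY_SIZE:
        raise ValueError("too many chunks for one array job; use a larger chunkSize")

    # Array jobs need at least 2 children, so a child whose index has no
    # work assigned should simply exit successfully.
    return max(MIN_ARRAY_SIZE, num_chunks)
```

Keeping the return value a bare number (rather than an object) is what lets the later states reference it directly as `"Size.$": "$.numChunks"`.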

With Step Functions, you can submit an AWS Batch job for either synchronous or asynchronous execution. We utilize this to submit N asynchronous jobs in Steps 2 & 3 and N synchronous jobs in Step 4.
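The only difference between the two modes in the state machine definition is the `.sync` suffix on the task's resource ARN. The sketch below generates a Batch task state for either mode; `batch_task_state` is a hypothetical helper, and the paths (`$.jobname`, `$.numChunks`) are the ones used in our sample state machine.

```python
import json

SUBMIT_JOB_ARN = "arn:aws:states:::batch:submitJob"


def batch_task_state(job_definition, job_queue, next_state, wait_for_completion):
    """Build a Step Functions Task state that submits an AWS Batch array job."""
    # ".sync" makes Step Functions wait for the Batch job to finish;
    # without it, the state completes as soon as the job is submitted.
    resource = SUBMIT_JOB_ARN + (".sync" if wait_for_completion else "")
    return {
        "Type": "Task",
        "Resource": resource,
        "ResultPath": "$.taskresult.jobDefinition.jobBatchInfo",
        "Parameters": {
            "JobDefinition": job_definition,
            "JobQueue": job_queue,
            "JobName.$": "$.jobname",
            "ArrayProperties": {"Size.$": "$.numChunks"},
        },
        "Next": next_state,
    }


# Print an asynchronous Step 2 state as JSON.
print(json.dumps(
    batch_task_state("job-two", "job-queue-two", "AsyncStepThreeBatchJob", False),
    indent=2,
))
```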

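By setting the DependsOn type to N_TO_N, each child job in Step 3 waits for the corresponding child job in Step 2, and each child job in Step 4 waits for the corresponding child job in Step 3. Because Step 4 is submitted synchronously, the state machine moves on only once all of the jobs submitted in Steps 2, 3 and 4 have completed.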
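The N_TO_N rule itself is easy to model. The toy function below is not AWS code, just a sketch of the scheduling rule: a child of the dependent job may start as soon as the child with the same index in the upstream job has finished, without waiting for the rest of the upstream array.

```python
def ready_children(size, upstream_done, started):
    """Return child indexes allowed to start under an N_TO_N dependency.

    size          -- array size shared by both jobs
    upstream_done -- indexes of upstream children that have finished
    started       -- indexes of dependent children already started
    """
    return [i for i in range(size) if i in upstream_done and i not in started]


# Step 2 children 0 and 3 have finished; Step 3 child 0 has already started.
print(ready_children(5, upstream_done={0, 3}, started={0}))  # prints [3]
```

This is why the chunks stay in order individually while the workflow as a whole still runs in parallel.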

Sample State Machine

{
  "StartAt": "StepOneChunker",
  "States": {
    "StepOneChunker": {
      "Type": "Task",
      "Resource": "arn:aws:states:somelambda",
      "ResultPath": "$.numChunks",
      "Next": "AsyncStepTwoBatchJob"
    },
    "AsyncStepTwoBatchJob": {
      "Type": "Task",
      "Comment": "Passes numChunks to the container as an environment variable",
      "Resource": "arn:aws:states:::batch:submitJob",
      "ResultPath": "$.taskresult.jobDefinition.jobBatchInfo",
      "Parameters": {
        "JobDefinition": "job-two",
        "ArrayProperties": {
          "Size.$": "$.numChunks"
        },
        "JobName.$": "$.jobname",
        "JobQueue": "job-queue-two",
        "ContainerOverrides": {
          "Environment": [{
            "Name": "SOME_VALUE",
            "Value.$": "$.numChunks"
          }]
        }
      },
      "Next": "AsyncStepThreeBatchJob"
    },
    "AsyncStepThreeBatchJob": {
      "Type": "Task",
      "ResultPath": "$.taskresult.jobDefinition.jobBatchInfo",
      "Resource": "arn:aws:states:::batch:submitJob",
      "Parameters": {
        "JobDefinition": "job-three",
        "DependsOn": [{
          "JobId.$": "$.taskresult.jobDefinition.jobBatchInfo.JobId",
          "Type": "N_TO_N"
        }],
        "ArrayProperties": {
          "Size.$": "$.numChunks"
        },
        "JobName.$": "$.jobname",
        "JobQueue": "job-queue-3",
        "ContainerOverrides": {
          "Environment": [{
            "Name": "SOME_VAL",
            "Value.$": "$.type"
          }]
        }
      },
      "Next": "SyncStep4BatchJob"
    },
    "SyncStep4BatchJob": {
      "Type": "Task",
      "ResultPath": "$.taskresult.jobDefinition.jobBatchInfo",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "job-four",
        "DependsOn": [{
          "JobId.$": "$.taskresult.jobDefinition.jobBatchInfo.JobId",
          "Type": "N_TO_N"
        }],
        "ArrayProperties": {
          "Size.$": "$.numChunks"
        },
        "JobName.$": "$.jobname",
        "JobQueue": "job-queue-4",
        "ContainerOverrides": {
          "Environment": [{
            "Name": "SOME_VAL",
            "Value.$": "$.type"
          }]
        }
      },
      "TimeoutSeconds": 1200,
      "Next": "StepFiveFinalizer"
    },
    "StepFiveFinalizer": {
      "Type": "Task",
      "Resource": "arn:aws:states:us-east-1:anotherlambda",
      "ResultPath": "$",
      "End": true
    }
  }
}

AWS Batch Implementation

The AWS Batch Job Definitions, Compute Environment, and Job Queues were pretty basic and just needed to be referenced in the Step Functions state machine correctly.
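On the container side, each child of an array job can find out which chunk it owns from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable that AWS Batch sets for array job children. The sketch below is one possible way to turn that index into a slice of work, not our actual job code: `SOME_VALUE` is the variable the sample state machine fills with the number of chunks, while `TOTAL_ITEMS` and both function names are hypothetical.

```python
import os


def chunk_bounds(total_items, num_chunks, index):
    """Return the half-open range [start, end) owned by child `index`.

    Items are spread as evenly as possible; the first
    `total_items % num_chunks` children get one extra item.
    """
    base, extra = divmod(total_items, num_chunks)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


def main(environ=os.environ):
    """Work out this child's slice from its environment."""
    # Set by AWS Batch for every child of an array job (0-based).
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    # Passed in through ContainerOverrides in the sample state machine.
    num_chunks = int(environ["SOME_VALUE"])
    # Hypothetical variable holding the total amount of work.
    total_items = int(environ["TOTAL_ITEMS"])
    return chunk_bounds(total_items, num_chunks, index)
```

For example, with 10 items and 3 chunks, the children at indexes 0, 1, and 2 would get the ranges (0, 4), (4, 7), and (7, 10). A child that receives an empty range can simply exit successfully.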