At the heart of the engine is a state machine service aka Decider service. As the workflow events occur (e.g. task completion, failure etc.), Decider combines the workflow blueprint with the current state of the workflow, identifies the next state, and schedules appropriate tasks and/or updates the status of the workflow.

Decider works with a distributed queue to manage scheduled tasks. We have been using dyno-queues on top of Dynomite for managing distributed delayed queues. The queue recipe was open sourced earlier this year and here is the blog post.

Task Worker Implementation

Tasks, implemented by worker applications, communicate via the API layer. Workers achieve this by either implementing a REST endpoint that can be called by the orchestration engine or by implementing a polling loop that periodically checks for pending tasks. Workers are intended to be idempotent stateless functions. The polling model allows us to handle backpressure on the workers and provide auto-scalability based on the queue depth when possible. Conductor provides APIs to inspect the workload size for each worker that can be used to autoscale worker instances.

Worker communication with the engine

API Layer

The APIs are exposed over HTTP — using HTTP allows for ease of integration with different clients. However, adding another protocol (e.g. gRPC) should be possible and relatively straightforward.

Storage

We use Dynomite “as a storage engine” along with Elasticsearch for indexing the execution flows. The storage APIs are pluggable and can be adapted for various storage systems including traditional RDBMSs or Apache Cassandra like no-sql stores.

Key Concepts

Workflow Definition

Workflows are defined using a JSON based DSL. A workflow blueprint defines a series of tasks that needs be executed. Each of the tasks are either a control task (e.g. fork, join, decision, sub workflow, etc.) or a worker task. Workflow definitions are versioned providing flexibility in managing upgrades and migration.

An outline of a workflow definition:

{

"name": "workflow_name",

"description": "Description of workflow",

"version": 1,

"tasks": [

{

"name": "name_of_task",

"taskReferenceName": "ref_name_unique_within_blueprint",

"inputParameters": {

"movieId": "${workflow.input.movieId}",

"url": "${workflow.input.fileLocation}"

},

"type": "SIMPLE",

... (any other task specific parameters)

},

{}

...

],

"outputParameters": {

"encoded_url": "${encode.output.location}"

}

}

Task Definition

Each task’s behavior is controlled by its template known as task definition. A task definition provides control parameters for each task such as timeouts, retry policies etc. A task can be a worker task implemented by application or a system task that is executed by orchestration server. Conductor provides out of the box system tasks such as Decision, Fork, Join, Sub Workflows, and an SPI that allows plugging in custom system tasks. We have added support for HTTP tasks that facilitates making calls to REST services.

JSON snippet of a task definition:

{

"name": "encode_task",

"retryCount": 3,

"timeoutSeconds": 1200,

"inputKeys": [

"sourceRequestId",

"qcElementType"

],

"outputKeys": [

"state",

"skipped",

"result"

],

"timeoutPolicy": "TIME_OUT_WF",

"retryLogic": "FIXED",

"retryDelaySeconds": 600,

"responseTimeoutSeconds": 3600

}

Inputs / Outputs

Input to a task is a map with inputs coming as part of the workflow instantiation or output of some other task. Such configuration allows for routing inputs/outputs from workflow or other tasks as inputs to tasks that can then act upon it. For example, the output of an encoding task can be provided to a publish task as input to deploy to CDN.

JSON snippet for defining task inputs:

{

"name": "name_of_task",

"taskReferenceName": "ref_name_unique_within_blueprint",

"inputParameters": {

"movieId": "${workflow.input.movieId}",

"url": "${workflow.input.fileLocation}"

},

"type": "SIMPLE"

}

An Example

Let’s look at a very simple encode and deploy workflow:

There are a total of 3 worker tasks and a control task (Errors) involved:

Content Inspection: Checks the file at input location for correctness/completeness Encode: Generates a video encode Publish: Publishes to CDN

These three tasks are implemented by different workers which are polling for pending tasks using the task APIs. These are ideally idempotent tasks that operate on the input given to the task, performs work, and updates the status back.

As each task is completed, the Decider evaluates the state of the workflow instance against the blueprint (for the version corresponding to the workflow instance) and identifies the next set of tasks to be scheduled, or completes the workflow if all tasks are done.

UI

The UI is the primary mechanism of monitoring and troubleshooting workflow executions. The UI provides much needed visibility into the processes by allowing searches based on various parameters including input/output parameters, and provides a visual presentation of the blueprint, and paths it has taken, to better understand process flow execution. For each workflow instance, the UI provides details of each task execution with the following details:

Timestamps for when the task was scheduled, picked up by the worker and completed.

If the task has failed, the reason for failure.

Number of retry attempts

Host on which the task was executed.

Inputs provided to the task and output from the task upon completion.

Here’s a UI snippet from a kitchen sink workflow used to generate performance numbers: