The Order Management System (OMS) at Jet is responsible for a number of business functions and was originally developed as a collection of microservices orchestrating tasks. As the company grew, so did the challenges with this architecture, until the team decided to build a new workflow-based platform. In a blog post, James Novino, engineer at Jet, describes the challenges with the old system, gives an overview of the new platform, and shares the team's experiences after running it for just over a year.

The OMS originally ran on microservices using a combination of pub/sub, event sourcing, and other technologies. Each service was implemented using the same boilerplate with three sequential steps:

Decode. Reads a domain event from an incoming stream and transforms the event into an input type

Handle. Performs checks on the input and retrieves any data required

Interpret. Performs side effects
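The three steps above can be sketched roughly as follows. This is a minimal illustration, not Jet's actual code: the `ReserveRequest` type, the event fields, the stock lookup, and the in-memory outbox are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReserveRequest:  # hypothetical input type
    order_id: str
    sku: str
    quantity: int


def decode(event: dict) -> ReserveRequest:
    """Decode: turn a raw domain event into a typed input."""
    return ReserveRequest(event["order_id"], event["sku"], int(event["quantity"]))


def handle(request: ReserveRequest, stock: dict) -> list:
    """Handle: check the input against retrieved data and decide what to do.

    Returns a list of effect descriptions; nothing is performed here.
    """
    if request.quantity <= 0:
        return [("reject", request.order_id, "invalid quantity")]
    if stock.get(request.sku, 0) < request.quantity:
        return [("reject", request.order_id, "out of stock")]
    return [("reserve", request.order_id, request.sku)]


def interpret(effects: list, outbox: list) -> None:
    """Interpret: perform the side effects (here, append to an in-memory outbox)."""
    for effect in effects:
        outbox.append(effect)


outbox = []
event = {"order_id": "o-1", "sku": "sku-9", "quantity": "2"}
interpret(handle(decode(event), stock={"sku-9": 5}), outbox)
```

Keeping `handle` free of side effects means the decision logic can be tested without touching any external system. Only `interpret` talks to the outside world.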

As the company and the requirements grew, the complexity of the architecture increased, making the system harder to maintain. The number of services also increased, and since features were often distributed across multiple services, development cycles became longer. Novino sees building systems this way as an inherently complex process requiring a lot of boilerplate. He also points out that the complexity of building and maintaining this architecture negatively affected both the system and the team as the system grew.

They therefore started to create a new platform capable of handling all their business workflows in a more efficient way. They decided to design and build a workflow-based system, heavily inspired by Pat Helland and his paper: Life Beyond Distributed Transactions. The core design of the new system is based on two guarantees:

Idempotency, to avoid duplicate events

Consistency. Several backing stores are supported, but since the system must be able to read its own writes, every store must implement a strong consistency model
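How the two guarantees interact can be illustrated with a small sketch. It assumes a hypothetical `InMemoryStore` and a `process_once` helper, neither of which comes from the original post. The point is only that deduplication relies on reading back what was just written.

```python
class InMemoryStore:
    """Hypothetical backing store; a plain dict trivially gives read-your-writes."""

    def __init__(self) -> None:
        self._results = {}

    def get(self, key):
        return self._results.get(key)

    def put(self, key, value) -> None:
        self._results[key] = value


def process_once(store, event_id, work):
    """Run `work` at most once per event_id; duplicates return the stored result.

    Single-threaded sketch: a real store would need an atomic check-and-write,
    and a `work` that returns None would not be recognized as already done.
    """
    previous = store.get(event_id)
    if previous is not None:
        return previous
    result = work()
    store.put(event_id, result)
    return result


calls = []
store = InMemoryStore()


def charge():
    calls.append("charged")
    return "receipt-1"


first = process_once(store, "evt-42", charge)
second = process_once(store, "evt-42", charge)  # duplicate delivery of the same event
```

If the store could return a stale read after the first write, the duplicate delivery would not be detected and `charge` would run twice. This is one way to see why a strongly consistent store is needed.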

These guarantees led to a system with several capabilities, including:

Event sourcing. All state changes are stored in a journal

Simple implementation consisting of a workflow definition and corresponding steps. This helps developers to think about the business flow first and enforces a modularization of the system

Idempotency guarantees for a workflow

Workflow versioning, which makes it possible to deploy changes to a workflow without any concern about currently running executions

Scalability. Parallel execution of workflows is possible by using multiple instances of each service
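Several of the capabilities above (a journal of state changes, workflows defined as steps, idempotent re-execution, and versioning) can be sketched together. This is an illustrative sketch only: the `Journal` class, the `REGISTRY` mapping, and the step names are assumptions, not the platform's real types.

```python
from dataclasses import dataclass, field

# Hypothetical registry: (workflow name, version) -> ordered list of (step name, step function)
REGISTRY = {}


def register(name: str, version: int, steps: list) -> None:
    REGISTRY[(name, version)] = steps


@dataclass
class Journal:
    """Append-only record of one execution, pinned to the version it started with."""
    workflow: str
    version: int
    events: list = field(default_factory=list)  # (step name, state updates)

    def state(self) -> dict:
        # Current state is rebuilt by replaying every journaled change in order.
        state = {}
        for _, updates in self.events:
            state.update(updates)
        return state

    def completed(self) -> set:
        return {name for name, _ in self.events}


def run(journal: Journal) -> dict:
    """Execute the remaining steps; steps already in the journal are skipped."""
    for name, step in REGISTRY[(journal.workflow, journal.version)]:
        if name in journal.completed():
            continue
        journal.events.append((name, step(journal.state())))
    return journal.state()


register("order", 1, [
    ("reserve", lambda s: {"reserved": True}),
    ("charge", lambda s: {"charged": s["reserved"]}),
])
# Version 2 adds a step; executions started on version 1 are unaffected.
register("order", 2, REGISTRY[("order", 1)] + [("notify", lambda s: {"notified": True})])

old_run = Journal("order", 1)
new_run = Journal("order", 2)
run(old_run)
run(old_run)  # re-running is a no-op because both steps are already journaled
run(new_run)
```

Because each journal records the version it started with, deploying version 2 does not change the steps of an execution already in flight.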

Novino describes the new architecture as an abstracted version of their earlier decode -> handle -> interpret pipeline with clear service boundaries between each operation:

Workflow Trigger, corresponding to decode

Workflow Executor, for handling

Side Effect Executor, corresponding to interpret
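One way to picture the service boundaries is to place a queue between each pair of components. In the sketch below, the in-memory `deque` queues stand in for whatever transport the real services use, and the message shapes are invented for illustration.

```python
from collections import deque

# Hypothetical boundaries: each queue stands in for a stream between two services.
workflow_requests = deque()  # Workflow Trigger -> Workflow Executor
pending_effects = deque()    # Workflow Executor -> Side Effect Executor
performed = []               # what the Side Effect Executor has carried out


def workflow_trigger(event: dict) -> None:
    """Decode an incoming event into a request to start a workflow."""
    workflow_requests.append({"workflow": "order", "order_id": event["order_id"]})


def workflow_executor() -> None:
    """Handle each request and emit descriptions of the effects to perform."""
    while workflow_requests:
        request = workflow_requests.popleft()
        pending_effects.append(("send_confirmation", request["order_id"]))


def side_effect_executor() -> None:
    """Interpret: carry out each requested effect (here, just record it)."""
    while pending_effects:
        performed.append(pending_effects.popleft())


workflow_trigger({"order_id": "o-1"})
workflow_executor()
side_effect_executor()
```

Each function only reads from its inbound queue and writes to its outbound one, so in this sketch the three parts could be deployed and scaled separately.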

For defining workflows, they have created a Domain Specific Language (DSL) that defines the series of execution steps needed. They have also included a visualizer tool that can show both running and historical workflows.
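The post summarized here does not show the DSL itself. Purely as an illustration of the idea, a workflow definition and a text rendering of its progress might look like the sketch below. The `Workflow` class, its `then` and `render` methods, and the step names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical fluent definition: a name plus an ordered tuple of step names."""
    name: str
    steps: tuple = ()

    def then(self, step_name: str) -> "Workflow":
        # Definitions are immutable; each call returns a new, longer workflow.
        return Workflow(self.name, self.steps + (step_name,))

    def render(self, completed: int = 0) -> str:
        """Text stand-in for a visualizer: mark the first `completed` steps as done."""
        parts = []
        for index, step in enumerate(self.steps):
            mark = "x" if index < completed else " "
            parts.append(f"[{mark}] {step}")
        return " -> ".join(parts)


order_workflow = Workflow("order").then("validate").then("reserve").then("ship")
progress = order_workflow.render(completed=2)
```

A definition written this way reads as the business flow first, with the implementation of each step kept elsewhere.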

Novino notes that there are existing alternatives for workflow orchestration and design, but that they chose to build their own for a few reasons, including:

The ability to maintain separate data stores for workflow events

The ability to replay or visualize the state at any point in an execution

Extensibility and scalability

After running the platform in production for just over a year, the team has created about 22 million journals and completed around 93 million workflows.

Novino concludes by noting that the migration from a distributed microservices-based architecture to one based on workflows has had a dramatic effect on their overhead during design, development and support. He also notes that the ability to design workflows using a DSL and implement them as single-responsibility steps has increased their ability to build new and complex systems. In future posts he will describe the new design in more detail.