AWS Bread, recipe below

As developers we are pretty good at writing fast code because we put a lot of emphasis on that skill (especially in job interviews). Where we have a little more trouble is writing slow code, processes that don’t take milliseconds or seconds to run (e.g. a web request), but take minutes, hours, or days (e.g. data backup and migration) to complete.

As a process takes longer to complete, some qualities become much more important:

Reliability: understanding error conditions, implementing good retry logic to mitigate failure, in the case of unrecoverable failure ensuring we end in a good state.

Visibility: seeing progress to ensure it is working correctly, inspecting state of the running process, when it fails it reports why and where it failed so we can mitigate in the future.

Understandability: using abstractions to reason about the whole process without needing to understand details, knowing the potential paths of the process to know if it is acting correctly.

In this post I am going to explore some of the difficulties of writing reliable slow code by trying to bake some bread. I am also introducing step, a new Go framework that uses AWS Step Functions with Lambda to write reliable slow code.

Baking Bread

Baking bread takes time, and if you rush it you will fail, so I often follow this recipe:

1. mix 140g flour, 1/2 cup water, yeast
2. wait 12 hours
3. add 140g flour, 1/2 cup milk, salt
4. knead until the dough is cohesive
5. wait 2 hours
6. bake at 250C until golden brown (about 20 mins)
7. take out, cool, eat

The whole process takes about a day. Most of that time is spent waiting for an external process: the yeast to leaven. Also, variability means that we have to keep checking to see if the bread has finished baking. The entire process is slow and tedious, so let's automate it.

Code

We could bake bread by writing a straightforward script:

# Ruby-style pseudocode; 12.hours and 2.hours assume ActiveSupport
bread = Bread.new
bread.mix(flour, water, yeast)
sleep 12.hours
bread.mix(flour, water, salt)
bread.knead until bread.cohesive?
sleep 2.hours
bread.bake(250)
sleep 60 until bread.golden?
bread.remove.cool.eat

This code is easy to understand and with a few well placed logging statements it would be easy to see its progress. So it is understandable with good visibility, but is it very reliable?

If the process dies while the bread is baking, we could have some big problems, e.g. the oven is left on with no one watching it. It is difficult to add timeout logic, so if the oven breaks, the bread would never bake and we would starve in an infinite loop. Also, with no retry logic or error handling, a small interruption like a phone call could make us forget what state we are in and have to restart the entire process. This code is pretty fragile.
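To make the timeout problem concrete, here is a minimal Go sketch of the "until golden" loop with a deadline added. The name waitUntil, the millisecond parameters, and the golden stand-in are assumptions for illustration, not part of any library. It only bounds the wait; it does nothing about the process dying mid-bake.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition never becomes true in time.
var errTimeout = errors.New("timed out waiting for condition")

// waitUntil polls check every intervalMs milliseconds and gives up once
// timeoutMs milliseconds have passed, so a broken oven cannot trap the
// caller in an infinite loop.
func waitUntil(timeoutMs, intervalMs int, check func() bool) error {
	deadline := time.Now().Add(time.Duration(timeoutMs) * time.Millisecond)
	for {
		if check() {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimeout
		}
		time.Sleep(time.Duration(intervalMs) * time.Millisecond)
	}
}

func main() {
	checks := 0
	// Hypothetical stand-in for bread.golden?: true on the third check.
	golden := func() bool { checks++; return checks >= 3 }

	fmt.Println(waitUntil(1000, 10, golden))                     // <nil>
	fmt.Println(waitUntil(50, 10, func() bool { return false })) // timed out waiting for condition
}
```

Even with the deadline, the progress lives only in the memory of one process, which is the weakness the next section tries to address.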

Jobs

Let’s improve the script by using a job-running framework such as Sidekiq or cron:

class Job1
  def run
    bread = Bread.new
    bread.mix(flour, water, yeast)
    Job2.schedule(bread).in(12.hours)
  end
end

class Job2
  def run(bread)
    bread.add(more_stuff)
    bread.mix(flour, water, salt)
    bread.knead until bread.cohesive?
    Job3.schedule(bread).in(2.hours)
  end
end

class Job3
  def run(bread)
    bread.bake(250)
    sleep 60 until bread.golden?
    bread.remove.cool.eat
  end
end

Serializing the state of the bread after each job would achieve the following:

Add visibility to the process as now you can see the input and output of each job.

Let the framework schedule where each job is run, so physical machine (oven) failure or replacement is not an issue.

Allow the framework to retry any job as it can be rerun with its persisted input.
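The retry point can be sketched in Go. This is a minimal illustration, assuming a hypothetical Bread payload serialized as JSON and a hypothetical runWithRetry helper; it is not how Sidekiq or any particular job framework implements retries. Each attempt decodes the persisted input afresh, so a failed attempt leaves no half-finished state behind.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Bread is a hypothetical job payload; the fields are illustrative only.
type Bread struct {
	Ingredients []string `json:"ingredients"`
	Kneaded     bool     `json:"kneaded"`
}

// errInterrupted simulates a transient failure.
var errInterrupted = errors.New("phone call interrupted kneading")

// runWithRetry decodes the persisted input for every attempt and returns
// the serialized output of the first attempt that succeeds.
func runWithRetry(input []byte, attempts int, job func(*Bread) error) ([]byte, error) {
	lastErr := errors.New("no attempts were made")
	for i := 0; i < attempts; i++ {
		var bread Bread
		if err := json.Unmarshal(input, &bread); err != nil {
			return nil, err // corrupt input is not worth retrying
		}
		if lastErr = job(&bread); lastErr == nil {
			return json.Marshal(&bread) // persisted output feeds the next job
		}
	}
	return nil, fmt.Errorf("job failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	input := []byte(`{"ingredients":["flour","water","yeast"],"kneaded":false}`)
	calls := 0
	knead := func(b *Bread) error {
		calls++
		b.Ingredients = append(b.Ingredients, "salt")
		if calls == 1 {
			return errInterrupted // first attempt fails after mutating b
		}
		b.Kneaded = true
		return nil
	}
	out, err := runWithRetry(input, 3, knead)
	fmt.Println(string(out), err)
	// {"ingredients":["flour","water","yeast","salt"],"kneaded":true} <nil>
}
```

Note that the salt appears only once in the output even though the first attempt added it before failing, because the second attempt started again from the persisted input.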

The trade-off is that it is now harder to understand the whole process. To find out how the process got to Job(N), you must know the code and state of Job(N-1), because that is where the structure is defined. Starting a job this way is like an old-school GOTO statement: the programmer has full control of the program's structure, to their own detriment. As Dijkstra put it:

“our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed”

That is to say, in the future as you become a better baker and try more complex bread recipes with parallel tasks and conditional paths, you may not be able to understand how the overall process works.

A framework that provides “static relations” and gives holistic understanding would describe how each job relates to the whole, e.g.:

Job1:
  Next Job2
Job2:
  Next Job3
Job3:
  End true
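Because relations like these are plain data, they can be inspected before any job runs. The following is a minimal Go sketch under that assumption; the step type and the path function are hypothetical names for illustration, not part of any framework.

```go
package main

import "fmt"

// step describes where a job leads; End marks a terminal job.
type step struct {
	Next string
	End  bool
}

// path follows Next from start and returns every job visited. It reports
// unknown jobs and cycles, which in this choice-free table would mean
// the process never ends.
func path(table map[string]step, start string) ([]string, error) {
	var visited []string
	seen := map[string]bool{}
	for name := start; ; {
		s, ok := table[name]
		if !ok {
			return visited, fmt.Errorf("unknown job %q", name)
		}
		if seen[name] {
			return visited, fmt.Errorf("cycle detected at job %q", name)
		}
		seen[name] = true
		visited = append(visited, name)
		if s.End {
			return visited, nil
		}
		name = s.Next
	}
}

func main() {
	table := map[string]step{
		"Job1": {Next: "Job2"},
		"Job2": {Next: "Job3"},
		"Job3": {End: true},
	}
	fmt.Println(path(table, "Job1")) // [Job1 Job2 Job3] <nil>
}
```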

This makes the structure of the process and all the paths that it can take explicit. That kind of framework looks a lot like a state machine…

AWS Step Functions and State Machines

A state machine would be a great framework to write slow code in, especially if it is highly available, has good visibility, and good tooling. While looking for such a framework I came across AWS Step Functions, which are hosted state machines defined in JSON that can call external code in Lambda functions. This “serverless” choice of framework has some great advantages:

The state machine JSON is described in a well defined specification.

Step Functions can run for a year (that is very slow).

The entire history of a process, including all visited states, inputs, outputs and errors is accessible via the execution history API.

Billions of state transitions are run each day, so the underlying framework is incredibly reliable.

Retry, error handling, timeouts are all defined in the state machine, strongly separating structure and implementation.
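As an illustration of the last point, here is a hedged Go sketch that embeds a single Task state written in the Amazon States Language and computes the retry schedule it describes. The field names (TimeoutSeconds, Retry, Catch, ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate) come from the specification; the Lambda ARN, the OvenFailed state, the numeric values, and the Go helper names are placeholders invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bakeState is one Task state. The ARN, "OvenFailed", and all numbers
// are placeholders, not values from a real deployment.
const bakeState = `{
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:bake",
  "TimeoutSeconds": 1800,
  "Retry": [{
    "ErrorEquals": ["States.Timeout"],
    "IntervalSeconds": 60,
    "MaxAttempts": 3,
    "BackoffRate": 2.0
  }],
  "Catch": [{
    "ErrorEquals": ["States.ALL"],
    "Next": "OvenFailed"
  }],
  "Next": "WaitForGolden"
}`

type retrier struct {
	ErrorEquals     []string
	IntervalSeconds int
	MaxAttempts     int
	BackoffRate     float64
}

// taskState decodes only the fields this sketch inspects.
type taskState struct {
	Type           string
	TimeoutSeconds int
	Retry          []retrier
}

// parseTask decodes a Task state from its JSON definition.
func parseTask(raw string) (taskState, error) {
	var s taskState
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// retryDelays lists the wait in seconds before each retry: the first wait
// is IntervalSeconds and each later wait is multiplied by BackoffRate.
func retryDelays(r retrier) []float64 {
	delays := make([]float64, 0, r.MaxAttempts)
	d := float64(r.IntervalSeconds)
	for i := 0; i < r.MaxAttempts; i++ {
		delays = append(delays, d)
		d *= r.BackoffRate
	}
	return delays
}

func main() {
	s, err := parseTask(bakeState)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.TimeoutSeconds, retryDelays(s.Retry[0])) // 1800 [60 120 240]
}
```

None of this policy appears in the Lambda code itself: the function only bakes, while the state definition decides how long to wait, how often to retry, and where to go when retries run out.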

A (simplified) state machine that bakes bread looks like:

{
  "StartAt": "InitMix",
  "States": {
    "InitMix": { "Next": "WaitForLeaven" },
    "WaitForLeaven": {
      "Seconds": 43200,
      "Next": "Mix&Knead"
    },
    "Mix&Knead": { "Next": "WaitForRise" },
    "WaitForRise": {
      "Seconds": 7200,
      "Next": "Bake"
    },
    "Bake": { "Next": "WaitForGolden" },
    "WaitForGolden": {
      "Seconds": 300,
      "Next": "Golden?"
    },
    "Golden?": {
      "Choices": [
        {
          "Variable": "$.golden",
          "BooleanEquals": true,
          "Next": "RemoveCoolEat"
        }
      ],
      "Default": "Bake"
    },
    "RemoveCoolEat": { "End": true }
  }
}

AWS renders this state machine to look like:

Baking Bread State Machine
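To show how the loop between Bake and Golden? plays out, here is a minimal Go sketch of an interpreter for a trimmed version of the machine. It is an illustration only, not the step library or AWS's implementation: the wait states are left out, the machine and run names are hypothetical, "$.golden" is reduced to a lookup of the key "golden", and a step limit stands in for a real timeout.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// breadLoop is the baking part of the machine with the wait states removed.
const breadLoop = `{
  "StartAt": "Bake",
  "States": {
    "Bake": {"Next": "Golden?"},
    "Golden?": {
      "Choices": [{"Variable": "$.golden", "BooleanEquals": true, "Next": "RemoveCoolEat"}],
      "Default": "Bake"
    },
    "RemoveCoolEat": {"End": true}
  }
}`

// state covers only the fields used by the simplified machine.
type state struct {
	Next    string
	End     bool
	Choices []struct {
		Variable      string
		BooleanEquals bool
		Next          string
	}
	Default string
}

type machine struct {
	StartAt string
	States  map[string]state
}

// run walks the machine, calling the task registered for each state (if
// any) with the shared data, and returns the states visited. maxSteps
// guards against a loop that never exits.
func run(raw string, tasks map[string]func(map[string]bool), maxSteps int) ([]string, error) {
	var m machine
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return nil, err
	}
	data := map[string]bool{}
	var path []string
	name := m.StartAt
	for i := 0; i < maxSteps; i++ {
		s, ok := m.States[name]
		if !ok {
			return path, fmt.Errorf("unknown state %q", name)
		}
		path = append(path, name)
		if task := tasks[name]; task != nil {
			task(data)
		}
		switch {
		case s.End:
			return path, nil
		case len(s.Choices) > 0:
			name = s.Default // used when no choice matches
			for _, c := range s.Choices {
				if data[strings.TrimPrefix(c.Variable, "$.")] == c.BooleanEquals {
					name = c.Next
					break
				}
			}
		default:
			name = s.Next
		}
	}
	return path, fmt.Errorf("gave up after %d steps", maxSteps)
}

func main() {
	bakes := 0
	tasks := map[string]func(map[string]bool){
		// Hypothetical oven: the bread is golden after the second bake.
		"Bake": func(d map[string]bool) { bakes++; d["golden"] = bakes >= 2 },
	}
	fmt.Println(run(breadLoop, tasks, 20))
	// [Bake Golden? Bake Golden? RemoveCoolEat] <nil>
}
```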

We can clearly see all possible paths for the process to take, and we just need to implement the code for InitMix, Mix&Knead and Bake. However, this might be difficult because tools to build, test and deploy AWS Step Functions are absent. Until now…

step: the AWS Step Function Framework

By simplifying some aspects of a Step Function, like only having a single Lambda, and building a set of tools for local testing, our new framework step, written in Go, can be used to develop, test and deploy Step Functions. The three core components of step are:

Library: tools for building and deploying Step Functions in Go.

Implementation: of the AWS State Machine specification to test entire executions.

Deployer: to deploy Lambdas and Step Functions securely.

The code for the above state machine to bake bread looks like:

type Bread struct {…}

// InitMix creates the bread and does the first mix; the input is ignored.
func InitMix(_ context.Context, _ interface{}) (*Bread, error) {
    bread := &Bread{Flour, Water, Yeast}
    bread.Mix()
    return bread, nil
}

// MixAndKnead adds the remaining ingredients and kneads the dough.
func MixAndKnead(_ context.Context, bread *Bread) (*Bread, error) {
    bread.MixIn(Flour, Water, Salt)
    bread.Knead()
    return bread, nil
}

// Bake puts the bread in the oven at 250C.
func Bake(_ context.Context, bread *Bread) (*Bread, error) {
    bread.Bake(250)
    return bread, nil
}

Combining these functions with the above state machine and testing them together looks like:

import "github.com/coinbase/step/machine"

sm := machine.FromJSON(`<state machine JSON above>`)

// Attach Functions to States
sm.SetResourceFunction("InitMix", InitMix)
sm.SetResourceFunction("Mix&Knead", MixAndKnead)
sm.SetResourceFunction("Bake", Bake)

bread, err := sm.Execute(nil)
…
assert.Equal(t, []string{
    "InitMix",
    "WaitForLeaven",
    "Mix&Knead",
    "WaitForRise",
    "Bake",
    "WaitForGolden",
    "Golden?",
    "RemoveCoolEat",
}, sm.ExecutionPath())

bread.Eat()

This is a high-level view of how to use step. To see the nitty-gritty, have a look at the step-hello-world repo or the step code.

Deploying Deployers With Deployers

Once a state machine has been built, step provides a way to deploy it to AWS. step-deployer is a Step Function that can deploy Step Functions. This makes it a “recursive deployer” because it can deploy itself. step-deployer's state machine looks like:

step-deployer state machine

The core states of the step-deployer are:

Validate: validate the sent release bundle

Lock: grab a lock on the deployed project

ValidateResources: ensure the resources exist and are correct for this project

Deploy: update the Step Function and Lambda, then release the lock

ReleaseLockFailure: try to release the lock and fail

The end states are:

Success: everything deployed correctly

FailureClean: something went wrong but recovered to a good state

FailureDirty: something went wrong and left in a bad state.
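One way to read the three end states is as a classification of what happened around the lock. The Go sketch below is an assumption about that idea, not the step-deployer's actual logic: the deploy function, the outcome type, and the rule that a stuck lock counts as dirty are all invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

type outcome string

const (
	success      outcome = "Success"
	failureClean outcome = "FailureClean"
	failureDirty outcome = "FailureDirty"
)

// errFailed and the ok/fail helpers simulate steps that succeed or fail.
var errFailed = errors.New("step failed")

func ok() error   { return nil }
func fail() error { return errFailed }

// deploy runs lock, update, unlock in order and classifies the result.
// Assumption: a failed update leaves nothing half-applied, so the state
// is clean as long as the lock can still be released.
func deploy(lock, update, unlock func() error) outcome {
	if err := lock(); err != nil {
		return failureClean // nothing was changed
	}
	updateErr := update()
	if err := unlock(); err != nil {
		return failureDirty // the lock is stuck and needs manual cleanup
	}
	if updateErr != nil {
		return failureClean // failed, but the lock was released
	}
	return success
}

func main() {
	fmt.Println(deploy(ok, ok, ok))     // Success
	fmt.Println(deploy(ok, fail, ok))   // FailureClean
	fmt.Println(deploy(ok, fail, fail)) // FailureDirty
}
```

Making the dirty case an explicit end state means an operator can see from the execution result alone whether manual cleanup is needed.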

To deploy using step we can use step as a command line tool:

step deploy -lambda <lambda name> \
  -step <step-fn-name> \
  -states <state-machine-json>

For example, to deploy the step-deployer we run:

# Build and install step for your operating system
go build .
go install
# Build step for the Linux Lambda
GOOS=linux go build -o lambda
zip lambda.zip lambda
step deploy -lambda "coinbase-step-deployer" \
  -step "coinbase-step-deployer" \
  -states "$(step json)"