Winston Wolfe

Pulp Fiction. A classic. Harvey Keitel played this memorable character who was a cleaner. I needed a cleaner. So, I created the Wolf. The Wolf uses a few different AWS services operating together to manage the data we need to remove.

The core service used to do all of this is AWS Fargate. Fargate is (IMHO) the easiest way to orchestrate containers on AWS. As a long-time Amazon EC2 Container Service (ECS) user, I'm always amazed at how simple Fargate makes managing containers and deploying large numbers of them in a snap. I was able to use the new Fargate Spot capability to make running all of this super cost effective.

Let's drill into some details about how this all works. Alas, the solution contains some proprietary code so it lives in a private repo, but I'll include relevant code snippets where appropriate. The whole thing is essentially a couple of bash scripts using the AWS CLI, packaged into a single Docker container (Alpine Linux — lightweight!), with the entire deployment in a CloudFormation template.

Here's the diagram. There are a few moving parts, so let me explain them.

The 2 basic components are the dispatcher and the purger. These are packaged into a single Docker container that just has 2 different entry points. Two ECS Fargate task definitions are created to describe how to launch either the dispatcher or the purger(s).

In step 1, the dispatcher starts up as an ECS scheduled task and connects to our main account database for a list of accounts that should have their data removed. The dispatcher then starts up a new Fargate (spot) task to purge this data (the purger). There can be many purgers running concurrently depending on how many accounts we need to process. The dispatcher accepts a parameter for the maximum number of purgers that can be in flight at any one time due to limits on S3 lifecycle rules per bucket.
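For illustration, here's one simple way the dispatcher could count in-flight purgers before launching another; this is a sketch, not the actual dispatcher code, and the max_purgers variable standing in for that parameter is purely illustrative:

# Illustrative sketch: count purger tasks currently RUNNING in this region's cluster
running_purgers=$(aws ecs list-tasks \
  --cluster "${cluster_name}" \
  --family winston-wolfe-purger \
  --desired-status RUNNING \
  --region "${account_region}" \
  --query 'length(taskArns)' \
  --output text)

# max_purgers is a hypothetical stand-in for the "maximum in flight" parameter
if [ "${running_purgers}" -ge "${max_purgers}" ]; then
  echo "Purger limit reached, waiting before dispatching more..."
  sleep 30
fi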

A quick note on how Rewind data is segmented: we store data in the same region as the source we are backing up. For example, if a customer has a store in the EU, we have all of the infrastructure to store and process this in the eu-west-1 (Ireland) AWS region. As such, when we start the Fargate purger tasks, we make sure to start them in the region the data is stored in. This is the beauty of Fargate. With “ECS v1”, we'd have needed a cluster of instances running to run these containers. Most of the time, they'd be idle or highly underutilized. But with Fargate being totally serverless (and Fargate Spot being incredibly cheap), we need no capacity on the bench.

Here’s how the dispatcher starts a purger container for a specific account using the Fargate Spot capacity provider:

purger_task_arn=$(aws ecs run-task \
  --count 1 \
  --capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=1,base=100 \
  --cluster "${cluster_name}" \
  --task-definition winston-wolfe-purger \
  --propagate-tags TASK_DEFINITION \
  --enable-ecs-managed-tags \
  --overrides file:///tmp/purger_overrides.json \
  --network-configuration "awsvpcConfiguration={subnets=[${subnet_id}],securityGroups=[${winston_sg_id},${db_sg_id}]}" \
  --query 'tasks[*].taskArn' \
  --region "${account_region}" \
  --output text)

We pass the account ID and some other data into the task by making use of task overrides. Essentially, this allows us to specify different values for the environment variables that are passed to the Fargate containers.
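To give an idea of the shape of that overrides file, here's a simplified sketch of how it could be written before calling run-task; the container name and environment variable names are just illustrative, not the real ones from the private repo:

# Illustrative sketch: build the overrides file referenced by --overrides above
cat > /tmp/purger_overrides.json <<EOF
{
  "containerOverrides": [
    {
      "name": "purger",
      "environment": [
        { "name": "ACCOUNT_ID", "value": "${account_id}" },
        { "name": "S3_BUCKET", "value": "${s3_bucket}" }
      ]
    }
  ]
}
EOF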

I make no secret of the fact that I'm a huge fan of the AWS CLI and find that with its --query option, writing scripts around it to perform powerful actions is a snap.
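For example, a throwaway one-liner like this plucks just the status of a purger task out of an otherwise verbose API response (purely illustrative, not from the actual scripts):

# Illustrative example: extract only the lastStatus field from describe-tasks
aws ecs describe-tasks \
  --cluster "${cluster_name}" \
  --tasks "${purger_task_arn}" \
  --region "${account_region}" \
  --query 'tasks[0].lastStatus' \
  --output text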

Step 3: the purge. So we have a Fargate container running; what does it do? Primarily, it removes data from S3, but there are a few other data stores we clean at the same time. The S3 removal is accomplished by dynamically manipulating lifecycle rules on the bucket.

The lifecycle manipulation is a 2-pass solution:

Pass 1 adds a lifecycle rule for the account prefix to purge (data is organized in S3 by prefixes mapped to the account ID). It does not mark the account as purged.

Pass 2 revisits the same account ID and checks if the data purge is complete. If it is, the rule is removed and the account is marked as purged (a sketch of what that check might look like follows below).
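Here's a simplified sketch of the kind of "is the purge complete?" check pass 2 could do, assuming the account's objects all live under a single prefix; the account_prefix variable is illustrative:

# Illustrative sketch: see whether anything is left under the account's prefix
remaining=$(aws s3api list-objects-v2 \
  --bucket "${s3_bucket}" \
  --prefix "${account_prefix}" \
  --max-keys 1 \
  --region "${AWS_REGION}" \
  --query 'KeyCount' \
  --output text)

# Some CLI versions print nothing useful ("None") when the prefix is already empty
if [ "${remaining}" = "0" ] || [ "${remaining}" = "None" ]; then
  echo "Purge complete for account ${ACCOUNT_ID}"
fi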

Here’s where I ran into a bit of a snag. See, in the AWS console, the lifecycle rules management looks like this:

It looks like you can add and remove individual lifecycle rules, whereas in fact the call to get/set lifecycle rules gets/sets the entire set! This is a problem because we have multiple containers running in parallel operating on the same bucket lifecycle rules. Some kind of lock is required to prevent concurrent operation on a given bucket's lifecycle rules.

DynamoDB — The Wonder Service

A solution I had used before is to use a row in a DynamoDB table as a lock via a conditional update. Conditional updates in DynamoDB will fail if the attribute being updated does not match the specified condition. So the Winston CloudFormation template creates a DynamoDB table with just a single row per S3 bucket we need to manage in the region. Obtaining a lock for the bucket then just looks like this:

aws dynamodb update-item \
  --table-name "${ddb_lock_table}" \
  --key "{\"bucket\": {\"S\": \"${s3_bucket}\"}}" \
  --update-expression "SET lock_state = :new_state" \
  --condition-expression "lock_state = :unlocked_state OR attribute_not_exists(lock_state)" \
  --expression-attribute-values '{ ":new_state": { "S": "locked" }, ":unlocked_state": { "S": "unlocked" } }'

This is just run in a loop, checking the return code. If we were able to perform the update, we have “obtained the lock” and can carry on. If the update fails, we retry for some time until we can perform the update. Unlocking is just the reverse condition.
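For illustration, the locking loop can be as simple as wrapping the update above in an until loop and checking its exit status; the retry interval here is arbitrary:

# Illustrative sketch: spin until the conditional update succeeds, i.e. we hold the lock
until aws dynamodb update-item \
        --table-name "${ddb_lock_table}" \
        --key "{\"bucket\": {\"S\": \"${s3_bucket}\"}}" \
        --update-expression "SET lock_state = :new_state" \
        --condition-expression "lock_state = :unlocked_state OR attribute_not_exists(lock_state)" \
        --expression-attribute-values '{ ":new_state": { "S": "locked" }, ":unlocked_state": { "S": "unlocked" } }' \
        2>/dev/null; do
  echo "Bucket ${s3_bucket} is locked by another purger, retrying..."
  sleep 5
done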

Manipulating Lifecycle Rules

So, how do we then go about manipulating the rules? Using the lock, it's fairly straightforward with the CLI:

aws s3api put-bucket-lifecycle-configuration \
  --bucket "${s3_bucket}" \
  --region "${AWS_REGION}" \
  --lifecycle-configuration "file://${lifecycle_config_json_file}"
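Since each put replaces the whole configuration, the purger first has to read the current rule set. A simplified sketch of that fetch, reusing the file variable from the jq snippets below (note that this call errors if the bucket has no lifecycle configuration at all, which a real script would need to handle):

# Illustrative sketch: pull down the current rules so they can be edited with jq
aws s3api get-bucket-lifecycle-configuration \
  --bucket "${s3_bucket}" \
  --region "${AWS_REGION}" \
  > "${bucket_lifecycle_rules_file}"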

The lifecycle config JSON file always contains the full set of rules. I was able to manipulate this using jq. Removing a rule from the file is done as follows:

jq --arg aid "${ACCOUNT_ID}" \
  'del(.Rules[] | select(.ID == $aid))' "${bucket_lifecycle_rules_file}" \
  > "${new_bucket_lifecycle_rules_file}"

And adding a rule is done using:

jq \
  --arg aid "${ACCOUNT_ID}" \
  --arg pid "${PLATFORM_ID}" \
  --arg days "${purge_after_days}" \
  '.Rules += [{"Filter":{"Prefix":$pid},"Status":"Enabled","NoncurrentVersionExpiration":{"NoncurrentDays":$days|tonumber},"Expiration":{"Days":$days|tonumber},"ID":$aid}]' \
  "${bucket_lifecycle_rules_file}" > "${new_bucket_lifecycle_rules_file}"

Putting this all together, the flow when the purger starts is: