We occasionally need to run complex geospatial analyses against a set of large GeoJSON files. To minimize our EC2 costs, we developed a pipeline using S3, SQS, and ECS that runs this analysis in response to the intermittent demand. Deploying the application with Docker fits nicely with our existing dev tools.

The workflow described here can be adapted for any kind of batch file processing.

Overview

First, we upload the relevant GeoJSON files to an S3 bucket. The bucket has an event notification registered for .geojson files, so each uploaded file produces a message on an SQS queue. A CloudWatch alarm watches the size of this queue; when messages appear, it triggers an autoscaling group, which boots an ECS instance. When this instance comes up, it automatically attaches itself to the correct ECS cluster. An ECS service running on the cluster places a worker task on the new instance. The worker task runs a script that polls for SQS messages, then downloads and processes each file. When the SQS queue is empty, another CloudWatch alarm scales the autoscaling group back down, which removes the instance from ECS.

Build the Docker Image

Our ECS Docker image needs at least two things: the application that processes each file, and a script that polls SQS for new work and starts processing it. An example SQS poller script:

    #!/bin/sh
    # Listens to a queue of S3 events. Message format described here:
    # http://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html
    while true; do
      # Long-poll for up to 20 seconds; the body and receipt handle come back tab-separated
      msg=$(aws sqs receive-message \
        --queue-url "${QUEUE_URL}" \
        --wait-time-seconds 20 \
        --output text \
        --query 'Messages[0].[Body,ReceiptHandle]')

      if [ -z "${msg}" ] || [ "${msg}" = "None" ]; then
        echo "No messages left. Retry."
      else
        s3_message=$(echo "${msg}" | cut -f1)
        receipt_handle=$(echo "${msg}" | cut -f2)

        s3_bucket=$(echo "${s3_message}" | jq -r '.Records[0].s3.bucket.name')
        s3_key=$(echo "${s3_message}" | jq -r '.Records[0].s3.object.key')
        s3_path="s3://${s3_bucket}/${s3_key}"

        mkdir -p work
        cd work

        aws s3 cp "${s3_path}" ./input.geojson

        # process the file here...
        process_file input.geojson

        cd ..
        rm -rf work

        # Delete the message only after processing, so failed work is retried
        aws sqs delete-message \
          --queue-url "${QUEUE_URL}" \
          --receipt-handle "${receipt_handle}"

        sleep 2
      fi
    done
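One detail worth checking in the poller: S3 URL-encodes object keys in event notifications (a space arrives as `+`, for example), so keys containing spaces or special characters would need decoding before they are passed to `aws s3 cp`. Below is a minimal sketch in portable shell. The helper name `urldecode` is hypothetical and not part of the original script, and it assumes keys do not end in a newline.

```shell
# Decode a URL-encoded S3 object key (hypothetical helper, not in the original script).
# Note: POSIX sh has no local variables, so rest/out/hex/oct are global here.
urldecode() {
  # In S3 event notifications, '+' stands for a space
  rest=$(printf '%s' "$1" | tr '+' ' ')
  out=''
  while [ -n "$rest" ]; do
    case "$rest" in
      %[0-9A-Fa-f][0-9A-Fa-f]*)
        rest=${rest#?}                      # drop the '%'
        hex=${rest%"${rest#??}"}            # the two hex digits that follow
        rest=${rest#??}
        oct=$(printf '%03o' "$((0x$hex))")  # hex -> octal, for printf '%b'
        out="$out$(printf '%b' "\\0$oct")"
        ;;
      *)
        out="$out${rest%"${rest#?}"}"       # copy one character unchanged
        rest=${rest#?}
        ;;
    esac
  done
  printf '%s\n' "$out"
}

# Possible use inside the poller, after extracting s3_key with jq:
# s3_key=$(urldecode "${s3_key}")
```

If uploads are known to use only plain alphanumeric file names, this step can be skipped.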

Our Dockerfile specifies the SQS worker script as its default command:

    FROM alpine:3.4

    # Install jq, needed to parse the SQS messages
    RUN apk --no-cache add jq

    # Install AWS CLI, used to poll SQS and copy from S3
    RUN apk --no-cache add py-pip && pip install awscli==1.10.26

    ENV AWS_DEFAULT_REGION=us-east-1 \
        AWS_ACCESS_KEY_ID=... \
        AWS_SECRET_ACCESS_KEY=...

    COPY ./bin/process_file /usr/local/bin/process_file
    COPY ./bin/sqs_worker /usr/local/bin/sqs_worker

    CMD /usr/local/bin/sqs_worker

To deploy the image, we push to an ECR repository:

    aws ecr get-login --region us-east-1 | bash
    docker build -t 123.dkr.ecr.us-east-1.amazonaws.com/app .
    docker push 123.dkr.ecr.us-east-1.amazonaws.com/app

Configure ECS

Once we have our image in ECR, we need to configure our ECS service to use it. But first, some ECS terminology:

Task Definition: Specifies which Docker image we're using for a task, along with some parameters (resource limits, logging configuration, IAM role).

Task: A running Docker container based on a task definition.

ECS Instance: An EC2 instance running the ECS agent and attached to an ECS cluster. ECS instances run tasks.

Cluster: A named collection of ECS instances used to run tasks.

Service: A long-running, managed collection of tasks on a cluster. These can either be autoscaled or run a fixed number of tasks.

First, we'll need a task definition for our container (from the ECS console, click Task Definitions -> Create new Task Definition). Make sure to set an IAM Task Role on the task definition with the appropriate permissions (it needs access to SQS and S3). Next, add a container based on the image in ECR: click Add Container in the Task Definition UI and specify the fully qualified image name in the Image section. Alternatively, we can define the container using JSON as described here.
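For those who prefer scripting over the console, the same task definition could be registered with the AWS CLI. This is a rough sketch rather than our exact setup: the function name, the 512 MiB memory limit, and all argument values are assumptions. It passes the queue URL to the container as `QUEUE_URL`, the variable the poller script reads.

```shell
# Register a task definition for the worker container.
# Arguments (all supplied by you): family name, ECR image, task role ARN, SQS queue URL.
register_worker_task() {
  family=$1 image=$2 task_role_arn=$3 queue_url=$4
  aws ecs register-task-definition \
    --family "$family" \
    --task-role-arn "$task_role_arn" \
    --container-definitions "[{
      \"name\": \"$family\",
      \"image\": \"$image\",
      \"memory\": 512,
      \"essential\": true,
      \"environment\": [
        {\"name\": \"QUEUE_URL\",
         \"value\": \"$queue_url\"}
      ]
    }]"
}

# Example call (placeholder values):
# register_worker_task geojson-worker 123.dkr.ecr.us-east-1.amazonaws.com/app \
#   <task-role-arn> <queue-url>
```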

Next, we need a cluster and a service to orchestrate running tasks on it. Create a new cluster from the ECS console, then click on the cluster and hit Create under the Services tab. Here, we'll want to use the task definition we just created as the basis for the service, and set the Number of tasks to 1. At this point, ECS will want to fulfill the service requirement of 1 task, but cannot, since the cluster has no instances associated with it. This is the normal scaled down behavior. Once an instance is added, the service will automatically launch our task and start polling SQS for work.
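The cluster and service can likewise be created from the command line. A minimal sketch, assuming the task definition family registered earlier; the function name and the example names are placeholders:

```shell
# Create a cluster, then a service that keeps one worker task running on it.
# Arguments: cluster name, service name, task definition family.
create_worker_service() {
  cluster=$1 service=$2 task_family=$3
  aws ecs create-cluster --cluster-name "$cluster" &&
  aws ecs create-service \
    --cluster "$cluster" \
    --service-name "$service" \
    --task-definition "$task_family" \
    --desired-count 1
}

# Example call (placeholder names):
# create_worker_service myCluster geojson-worker geojson-worker
```

With no instances in the cluster yet, the service simply waits with its one desired task unplaced.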

Configure Auto Scaling and CloudWatch

Next, we need an auto scaling group to add an instance to the ECS cluster when there's demand, and remove it when there isn't.

When creating the launch configuration for the auto scaling group, choose one of the official ECS AMIs from Amazon. These will be pre-configured with the latest ECS agent. However, by default, the agent won't know which ECS cluster it should attach to. To remedy this, we can use a user data script as part of the launch configuration. This script will set the correct ECS cluster name on the instances. For example:

    #!/bin/bash
    echo ECS_CLUSTER=myCluster >> /etc/ecs/ecs.config
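As a rough sketch, the launch configuration and auto scaling group might be created like this with the AWS CLI. It assumes the user data script above is saved as `user-data.sh` in the current directory and that an instance profile for the ECS agent already exists. The function name, the `t2.medium` instance type, and the 0-to-1 sizing are illustrative choices, not values from our setup.

```shell
# Create a launch configuration and an auto scaling group that idles at zero instances.
# Arguments: a name for both resources, ECS AMI ID, instance profile name, subnet ID(s).
create_worker_asg() {
  name=$1 ami_id=$2 instance_profile=$3 subnet_ids=$4
  aws autoscaling create-launch-configuration \
    --launch-configuration-name "$name" \
    --image-id "$ami_id" \
    --instance-type t2.medium \
    --iam-instance-profile "$instance_profile" \
    --user-data file://user-data.sh &&
  aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name "$name" \
    --launch-configuration-name "$name" \
    --min-size 0 --max-size 1 --desired-capacity 0 \
    --vpc-zone-identifier "$subnet_ids"
}

# Example call (placeholder values):
# create_worker_asg worker-asg <ecs-ami-id> <instance-profile-name> <subnet-id>
```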

We can trigger the auto scaling group with CloudWatch alarms on our SQS queue's metrics.

We scale up whenever there's at least one visible message in the queue (the ApproximateNumberOfMessagesVisible metric). Since there exists no CloudWatch metric indicating when a queue is empty, we use the number of empty receive calls (NumberOfEmptyReceives) to scale down. According to the AWS docs, this is the "number of ReceiveMessage API calls that did not return a message." It follows that if this metric goes above 0, we know that our SQS poller script is getting empty responses and the queue is drained.
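The two alarms and their scaling policies could be scripted roughly as follows. This is a sketch under assumptions: the function names, the 300-second period, the `Sum` statistic, and the use of `ExactCapacity` policies (set the group to 1 or 0 instances) are illustrative choices rather than a record of our exact configuration.

```shell
# Create a simple scaling policy that sets the group to an exact size; prints the policy ARN.
# Arguments: auto scaling group name, policy name, target capacity.
put_capacity_policy() {
  asg=$1 policy=$2 capacity=$3
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$asg" \
    --policy-name "$policy" \
    --adjustment-type ExactCapacity \
    --scaling-adjustment "$capacity" \
    --query PolicyARN --output text
}

# Alarm when an SQS metric rises above 0 within a period, firing the given policy.
# Arguments: alarm name, queue name, SQS metric name, scaling policy ARN.
put_queue_alarm() {
  alarm=$1 queue=$2 metric=$3 policy_arn=$4
  aws cloudwatch put-metric-alarm \
    --alarm-name "$alarm" \
    --namespace AWS/SQS \
    --metric-name "$metric" \
    --dimensions "Name=QueueName,Value=$queue" \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$policy_arn"
}

# Example wiring (placeholder names):
# up=$(put_capacity_policy worker-asg scale-up 1)
# down=$(put_capacity_policy worker-asg scale-down 0)
# put_queue_alarm queue-has-work my-queue ApproximateNumberOfMessagesVisible "$up"
# put_queue_alarm queue-drained my-queue NumberOfEmptyReceives "$down"
```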

Configure S3

Finally, to make this whole thing work, we need to configure our S3 bucket event notification to emit ObjectCreated events.

This is as simple as clicking Add Notification under the events tab of the S3 bucket. Now, we can start uploading files and AWS will handle the rest.
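The same notification can also be set from the command line. A minimal sketch, assuming the queue's access policy already allows S3 to send messages to it; the function name and argument values are placeholders. The suffix filter matches the .geojson restriction described in the overview.

```shell
# Send a message to the SQS queue for every .geojson object created in the bucket.
# Arguments: bucket name, SQS queue ARN.
notify_queue_on_upload() {
  bucket=$1 queue_arn=$2
  aws s3api put-bucket-notification-configuration \
    --bucket "$bucket" \
    --notification-configuration "{
      \"QueueConfigurations\": [{
        \"QueueArn\": \"$queue_arn\",
        \"Events\": [\"s3:ObjectCreated:*\"],
        \"Filter\": {\"Key\": {\"FilterRules\": [
          {\"Name\": \"suffix\", \"Value\": \".geojson\"}
        ]}}
      }]
    }"
}

# Example call (placeholder values):
# notify_queue_on_upload <bucket-name> <queue-arn>
```

Note that this call replaces the bucket's whole notification configuration, so any existing notifications would need to be included in the JSON as well.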

Despite the complexity of the initial setup, once everything is in place, deploying new applications is as simple as a docker build & docker push. ECS will automatically use the latest image in ECR.

If you're interested in solving these sorts of problems, we'd love to hear from you. We're always hiring.