AWS recently introduced a new feature that lets you automatically pull messages from an SQS queue and have them processed by a Lambda function. I started experimenting with this feature to do ETL (“Extract, Transform, Load”) on an S3 bucket. I was curious to see how fast, and at what cost, I could process the data in my bucket. Let’s see how it went!

Note: all the code necessary to follow along can be found at https://github.com/PokaInc/lambda-sqs-etl
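For context, the wiring behind this feature boils down to creating an event source mapping between the queue and the function. Here’s a minimal sketch using boto3; the queue ARN and function name below are placeholders, not the ones from the repo:

    import boto3

    lambda_client = boto3.client("lambda")

    # Subscribe the Lambda function to the SQS queue.
    # Lambda will now poll the queue and invoke the function with batches of messages.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:sqs:us-east-1:123456789012:etl-queue",  # placeholder
        FunctionName="flatten-json",  # placeholder
        BatchSize=10,  # SQS event sources deliver at most 10 messages per invocation
    )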

The goal

Our objective here is to load JSON data from an S3 bucket (the “source” bucket), flatten the JSON, and store it in another bucket (the “destination” bucket). “Flattening” (sometimes called “relationalizing”) will transform the following JSON object:

{
    "a": 1,
    "b": {
        "c": 2,
        "d": 3,
        "e": {
            "f": 4
        }
    }
}

into

{
    "a": 1,
    "b.c": 2,
    "b.d": 3,
    "b.e.f": 4
}

Flattening JSON objects like this makes it easier, for example, to load the resulting data into Redshift or convert the files to CSV.
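Here’s one way to implement this kind of flattening in Python. This is a minimal sketch assuming a dot as the key separator; the actual implementation lives in the repo linked above:

    def flatten(obj, parent_key=""):
        # Recursively flatten a nested dict, joining nested keys with dots.
        items = {}
        for key, value in obj.items():
            new_key = f"{parent_key}.{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key))
            else:
                items[new_key] = value
        return items

    print(flatten({"a": 1, "b": {"c": 2, "d": 3, "e": {"f": 4}}}))
    # {'a': 1, 'b.c': 2, 'b.d': 3, 'b.e.f': 4}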

Now, here’s a look at the source bucket and the data we have to flatten.

Getting to know the data

Every file in the source bucket is a collection of un-flattened JSON objects: