Clean all the things (and fish)

I’ve written before about how easy DynamoDB makes it to manage and store data. Lots of data. Sometimes, too much data! Given how inexpensive storage is for DynamoDB, it’s tempting to just leave data in there forever. However, this eventually catches up with you: if you ever need to do table scans, they require an ever-increasing amount of provisioned throughput.

It’s possible to have a series of scripts, cron jobs, etc. to prune data from tables with simple deletes matching some criteria. But these consume write throughput, and it feels very “old school”. What if there were some way to prune data from tables “for free”? There is! Back in February 2017, AWS added a Time To Live (TTL) feature to DynamoDB, which we’ve just started using and, in the words of Kenny Bania, it’s “Gold, Jerry! Gold!”

Set it and…forget it!

I hate to use a tagline from a previous blog article but, like DynamoDB autoscaling, it’s another case of “turn it on, go to sleep, magic happens”. TTLs are really easy to start using as long as you have an attribute in the table that can be used as an expiry time. The basic workflow with TTLs is:

1. Designate an attribute in the table that contains the expiry time. This must be an epoch time (seconds since January 1, 1970 UTC), stored as a Number.
2. There is no step 2. DynamoDB prunes expired items in the background WITHOUT using any provisioned throughput.
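For the record, here’s a quick throwaway snippet to sanity-check what an epoch expiry value looks like (nothing in it is specific to our product):

import calendar
import datetime

# Epoch seconds for an expiry of 2018-01-01 00:00:00 UTC
expiry = calendar.timegm(datetime.datetime(2018, 1, 1).timetuple())
print(expiry)  # 1514764800

# ...and back again, to confirm what DynamoDB will see
print(datetime.datetime.utcfromtimestamp(expiry))  # 2018-01-01 00:00:00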

Our use case

One of our main products/services is Media Shuttle, a high-speed file transfer solution for very large files. When a user sends a file or collection of files to another user, the recipient receives an email informing them they can download it. We send these emails today with AWS Simple Email Service (SES), which is…uhh…simple. SES allows you to publish records of all deliveries and bounces to a Simple Notification Service (SNS) topic, where you can email them or process them further.

We’ve written with AWS in the past about how we process these delivery and bounce records with a Lambda function and write them to DynamoDB. These meta-records are used almost exclusively by our product support specialists when troubleshooting calls of the form “I did not get my email”. These calls usually come in shortly after a delivery was sent, so the useful lifespan of the delivery and bounce records is pretty short. The solution looks like this:

Processing SES bounces and deliveries with Lambda

We figured we’d just keep the records in DynamoDB forever since they are small and not queried every day. However, the deliveries table has now grown pretty large, and because we allow wildcard queries on it through our GUI tool, each of those queries means a table scan that consumes quite a large amount of provisioned throughput. What if we could trim this data automatically…if only there were TTLs.

Hammertime

To enable TTLs for this existing application, we had to do 3 things:

1. Modify the existing Lambda functions to add a new attribute whenever they insert a new record. This was as simple as adding a quick calculation to set a value 120 days in the future:

future = datetime.datetime.utcnow() + datetime.timedelta(days=120)
expiry_ttl = calendar.timegm(future.timetuple())
:
: # Add entry to DB for this recipient
Item={
    'recipientAddress': recipientEmailAddress,
    'sesMessageId': sesMessageId,
    'sesTimestamp': long(sesTimestamp_seconds),
    'deliveryTimestamp': long(deliveryTimestamp_seconds),
    'processingTime': long(processingTime),
    'reportingMTA': reportingMTA,
    'smtpResponse': smtpResponse,
    'sender': sender.lower(),
    'expiry': long(expiry_ttl)
}

2. Enable TTL on the table and tell it to use the new expiry attribute as the TTL:

Enabling TTL for table
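We flipped the switch in the console (above), but if you’d rather script it, the same thing can be done through the API. A minimal boto3 sketch, using our table and attribute names:

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# Tell DynamoDB to treat the 'expiry' attribute as the item TTL
client.update_time_to_live(
    TableName='SES_BOUNCES',
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'expiry'
    }
)

# The status moves from ENABLING to ENABLED shortly afterwards
print(client.describe_time_to_live(TableName='SES_BOUNCES'))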

3. Finally, we wanted to backfill all existing records so they too would eventually expire. Luckily, we already had an attribute in the database containing the timestamp each record was added, so we wrote a small script to give every record without an expiry one set 120 days after the record was added. It’s not the prettiest, but all it does is scan the table using the filter expression attribute_not_exists(expiry) and, for each record returned, pop in an expiry. Alas, there is no batch update functionality, so each record has to be processed sequentially. Here’s what we used:

import boto3
import datetime
import calendar

def get_expiry(added_time):
    future = datetime.datetime.fromtimestamp(float(added_time)) + datetime.timedelta(days=120)
    expiry_ttl = calendar.timegm(future.timetuple())
    return expiry_ttl

def update_item(item, expiry, client, table_name):
    recipient = item['recipientAddress']['S']
    message_id = item['sesMessageId']['S']
    print "updating item " + recipient + " / " + message_id + " with expiry " + str(expiry)
    response = client.update_item(
        TableName=table_name,
        Key={
            'recipientAddress': {
                'S': recipient,
            },
            'sesMessageId': {
                'S': message_id,
            },
        },
        ExpressionAttributeNames={
            '#ED': 'expiry'
        },
        ExpressionAttributeValues={
            ':ed': {
                'N': str(expiry),
            }
        },
        ReturnValues='ALL_NEW',
        UpdateExpression='SET #ED = :ed'
    )
    if 'ResponseMetadata' in response:
        if 'HTTPStatusCode' in response['ResponseMetadata']:
            if response['ResponseMetadata']['HTTPStatusCode'] != 200:
                print "** Error updating " + recipient + " / " + message_id
            else:
                print "+ Successfully updated " + recipient + " / " + message_id
        else:
            print "** No http status code in response - error"
    else:
        print "** No response metadata"

if __name__ == "__main__":
    table_name = 'SES_BOUNCES'
    region = 'us-east-1'

    session = boto3.session.Session(region_name=region)
    dynamodb_client = session.client('dynamodb')

    # This will give us all entries without an expiry set
    filter_expression = "attribute_not_exists(expiry)"

    response = dynamodb_client.scan(
        TableName=table_name,
        FilterExpression=filter_expression,
    )
    for item in response['Items']:
        create_time = item['sesTimestamp']['N']
        expiry_ttl = get_expiry(create_time)
        print "Item was added on " + str(create_time) + " and expires on " + str(expiry_ttl)
        update_item(item, expiry_ttl, dynamodb_client, table_name)

    while 'LastEvaluatedKey' in response:
        response = dynamodb_client.scan(
            TableName=table_name,
            FilterExpression=filter_expression,
            ExclusiveStartKey=response['LastEvaluatedKey']
        )
        for item in response['Items']:
            create_time = item['sesTimestamp']['N']
            expiry_ttl = get_expiry(create_time)
            print "Item was added on " + str(create_time) + " and expires on " + str(expiry_ttl)
            update_item(item, expiry_ttl, dynamodb_client, table_name)
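One caveat: unlike the TTL deletes themselves, this backfill is not free. The scan consumes read capacity and every update_item call consumes write capacity, so plan on bumping provisioned throughput (or throttling the script) while it runs.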

Results

The backfill took a LONG time to run as there are millions of rows in the table. However, once it was complete, I came into the office the next day to see that 1 million rows had been “expired”. Free. No provisioned throughput used. No Spartans were harmed in this expiry.

One other case we looked into is saving off the data removed by the TTL process. It turns out that you can identify records removed by TTL in DynamoDB Streams, so we could have a Lambda trigger that looks for these and writes them to S3, Redshift, etc. That’s not a requirement for this specific use case, but it’s nice to know the option is there, as we have other cases where it almost certainly will be.
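For reference, TTL deletions show up in the stream as REMOVE records whose userIdentity has a principalId of dynamodb.amazonaws.com, which makes them easy to pick out. A minimal sketch of such a trigger, assuming the stream is configured to include old images (archive_item is a hypothetical helper, not something we’ve built):

def archive_item(old_image):
    # Hypothetical placeholder: write the expired item to S3, Redshift, etc.
    print(old_image)

def lambda_handler(event, context):
    for record in event['Records']:
        # TTL deletions are REMOVE events performed by the DynamoDB service itself
        if (record['eventName'] == 'REMOVE' and
                record.get('userIdentity', {}).get('principalId') == 'dynamodb.amazonaws.com'):
            # OldImage is the item as it looked just before it expired
            archive_item(record['dynamodb']['OldImage'])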