try {
  // Write updates to the daily rollup table
  await documentClient.update(params).promise();
} catch (err) {
  // Swallow any errors: log them, but note we don't actually fail the
  // Lambda function here by calling back with the error, e.g. callback(err)
  console.error(err);
  callback(null, `Swallowed the error ${JSON.stringify(err)}`);
  return;
}

callback(null, 'Successfully processed the batch');

Here we are swallowing any errors that occur in our function rather than triggering the callback with an error (i.e. we never call callback(err)). This is because your Lambda will be triggered with a batch of events in a single invocation (the batch size is configurable via the BatchSize property of the Lambda DynamoDB stream event source), and you generally don't want to fail the entire batch. This is discussed in more detail below.

Gotchas and Lessons Learned

Kinesis, Batch Size, Error Handling, and Partial Failure

Under the hood, DynamoDB uses Kinesis to stream the database events to your consumer. By its nature, Kinesis just stores a log of events and doesn’t track how its consumers are reading those events. It simply provides an interface to fetch a number of events from a given point in time. It’s up to the consumer to track which events it has received and processed, and then request the next batch of events from where it left off (luckily AWS hides this complexity from you when you choose to connect the event stream to a Lambda function). This is a different paradigm than SQS, for example, which ensures that only one consumer can process a given message, or set of messages, at a given time. In SQS you can then delete a single message from the queue so it does not get processed again. In Kinesis there is no concept of deleting an event from the log.

The inability to control the set of events coming from the stream introduces some challenges when dealing with errors in the Lambda function. There is no concept of a partial success, i.e. you can't send information back to the stream saying: "I processed these 50 events successfully, and these 50 failed, so please retry the 50 that failed". If you fail your entire Lambda function, the DynamoDB stream will resend the entire set of data again in the future. This is problematic if you have already written part of your data to the aggregate table. How do you prevent duplicate records from being written? And how do you handle incoming events that will never succeed, such as invalid data that causes your business logic to fail? There is no silver-bullet solution for this case, but here are some ideas:

Do some data sanitization of the source events. If you can identify problem records and throw them away before you process them, you can avoid failures down the line. Log the failures and possibly set up some CloudWatch Alarms to notify you of these unexpected cases. We used CloudWatch Metric Filters to translate our log errors into metrics we could observe and trigger alarms for.
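As a sketch, sanitization can be as simple as a predicate applied to each record before processing. The field names checked here (screen_name, count) are hypothetical examples; a real check depends on your schema.

```javascript
// Drop records that can't possibly be aggregated, and log them with a
// distinctive marker that a CloudWatch Metric Filter pattern can match on.
// The required fields (screen_name, count) are hypothetical examples.
function isValidRecord(record) {
  const item = record.dynamodb && record.dynamodb.NewImage;
  return Boolean(item && item.screen_name && item.count);
}

function sanitize(records) {
  const valid = [];
  for (const record of records) {
    if (isValidRecord(record)) {
      valid.push(record);
    } else {
      console.error('INVALID_RECORD', JSON.stringify(record));
    }
  }
  return valid;
}
```

A Metric Filter matching the INVALID_RECORD marker then turns these log lines into a metric you can alarm on.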

Perform retries and backoffs when you encounter network or throughput exceptions writing to the aggregate table. This provides you more opportunity to succeed when you are approaching your throughput limits. If you are using an AWS SDK you get this for free.
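The AWS SDK for JavaScript retries throttled requests for you (and the DocumentClient accepts maxRetries and retryDelayOptions settings to tune this), but a hand-rolled version of the same idea looks roughly like this:

```javascript
// Sketch: retry an async operation with exponential backoff. The AWS SDK
// already does this for throttled requests; this just shows the idea.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries(operation, maxAttempts = 5, baseDelayMs = 50) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts, give up
      await sleep(baseDelayMs * 2 ** attempt);    // 50ms, 100ms, 200ms, ...
    }
  }
}
```

Usage would be along the lines of `withRetries(() => documentClient.update(params).promise())`.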

If all else fails, write the event you are currently processing to some secondary storage. You cannot throw away this data if you want your destination table to be an accurate aggregate of the source table. Writing the event to an SQS queue, or S3, or even another table, gives you a second chance to process the event at a later time, ideally after you have adjusted your throughput, or during a period of lighter usage. We implemented an SQS queue for this purpose.
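A sketch of the SQS approach: serialize the failed record and the error, and send them to a holding queue. The queue URL here is a hypothetical placeholder, and the SQS client is injected so anything with a compatible sendMessage method works.

```javascript
// Park a failed stream record in SQS so it can be reprocessed later.
// The queue URL is a hypothetical placeholder; `sqs` is an injected
// AWS.SQS client (or any object with a compatible sendMessage method).
const FAILED_EVENTS_QUEUE_URL =
  'https://sqs.us-east-1.amazonaws.com/123456789012/failed-rollup-events';

function buildFailureMessage(record, err) {
  return {
    QueueUrl: FAILED_EVENTS_QUEUE_URL,
    MessageBody: JSON.stringify({ record, error: String(err) }),
  };
}

async function parkFailedRecord(sqs, record, err) {
  await sqs.sendMessage(buildFailureMessage(record, err)).promise();
}
```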

Set your BatchSize to 1. I wouldn’t generally recommend this, as the ability to process and aggregate a number of events at once is a huge performance benefit, but it would work to ensure you aren’t losing data on failure. Simply trigger the Lambda callback with an error, and the failed event will be sent again on the next invocation. With this approach you have to ensure that you can handle events quickly enough that you don’t fall too far behind in processing the stream. A DynamoDB stream will only persist events for 24 hours and then you will start to lose data. You can monitor the IteratorAge metrics of your Lambda function to determine how far behind you might be.

DynamoDB Throughput, Concurrency, Partitions, and Batch Writes

Although DynamoDB is mostly hands-off operationally, one thing you do have to manage is your read and write throughput limits. Setting these to the correct values is an inexact science. Set them too low and you start getting throughput exceptions when trying to read or write to the table. Set them too high and you will be paying for throughput you aren’t using. Auto-scaling can help, but won’t work well if you tend to read or write in bursts, and there’s still no guarantee you will never exceed your throughput limit.

In our scenario we specifically care about the write throughput on our aggregate table. We want to allow our Lambda function to successfully write to the aggregate rows without encountering a throughput exception. The logical answer would be to set the write throughput on the aggregate table to the same values as on the source table. After all, a single write to the source table should equate to a single update on the aggregate table, right? Unfortunately, the answer is a little more complicated than that.

First, you have to consider the number of Lambda functions that could be running in parallel. For example, if two instances of your Lambda are running in parallel, you will need double the throughput you would need for a single instance. The potential number of Lambdas that could be triggered in parallel for a given source table is based on the number of database partitions for that table: there is one stream shard per partition. Unfortunately, there is no concrete way of knowing the exact number of partitions into which your table will be split. It is a factor of the total provisioned throughput on the table and the amount of data stored in the table, which roughly works out to something like

Total partitions = MAX(Total partitions for desired performance, Total partitions for desired capacity)
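As a back-of-the-envelope calculation, the DynamoDB documentation of the time put the per-partition limits at roughly 3,000 read capacity units, 1,000 write capacity units, and 10 GB of data; actual partitioning is internal to DynamoDB and may differ, so treat this strictly as an estimate.

```javascript
// Back-of-the-envelope partition estimate. The per-partition limits
// (3000 RCU, 1000 WCU, 10 GB) come from older DynamoDB documentation;
// the real partitioning is internal to DynamoDB and may differ.
function estimatePartitions(readCapacityUnits, writeCapacityUnits, tableSizeGB) {
  const forPerformance = Math.ceil(readCapacityUnits / 3000 + writeCapacityUnits / 1000);
  const forCapacity = Math.ceil(tableSizeGB / 10);
  return Math.max(forPerformance, forCapacity);
}
```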

See this article for a deeper dive into DynamoDB partitions. You can get a rough idea of how many Lambda functions are running in parallel by looking at the number of separate CloudWatch logs your function is generating at any given time. There should be about one per partition assuming you are writing enough data to trigger the streams across all partitions.

You can also manually control the maximum concurrency of your Lambda function. For example, if you tend to write a lot of data in bursts, you could set the maximum concurrency to a lower value to ensure a more predictable write throughput on your aggregate table. Again, you have to be careful that you aren’t falling too far behind in processing the stream, otherwise you will start to lose data.

Secondly, if you are writing to the source table in batches using the batch write functionality, you have to consider how this will affect the number of updates to your aggregate table. For example, a single batch write call can write up to 25 records at a time to the source table. This translates into 25 separate INSERT events on your stream, and since updating an item with an update expression cannot be done in batches, you will need up to 25x the write throughput on the destination table to handle this case.

There is opportunity for optimization, such as combining the batch of events in memory in the Lambda function, where possible, before writing to the aggregate table. In practice, we found that having the write throughput on the aggregate table set to twice that of the source comfortably ensures we will not exceed our limits, but I would encourage you to monitor your usage patterns to find the number that works for your case.
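A sketch of that optimization: collapse the batch into one delta per aggregate key before touching DynamoDB. The record shape assumed here (a date attribute on the new image) is a hypothetical example.

```javascript
// Collapse a batch of INSERT events into one increment per daily bucket,
// so 25 stream records become a handful of update calls instead of 25.
// The record shape (a `date` attribute on NewImage) is a hypothetical example.
function combineByDay(records) {
  const deltas = {}; // day -> number of new rows counted for that day
  for (const record of records) {
    if (record.eventName !== 'INSERT') continue;
    const day = record.dynamodb.NewImage.date.S; // e.g. '2017-03-01'
    deltas[day] = (deltas[day] || 0) + 1;
  }
  return deltas;
}
```

Each entry in the result can then be applied with a single ADD update expression against the corresponding aggregate row.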

In conclusion

Using the power of DynamoDB Streams and Lambda functions provides an easy-to-implement, scalable solution for generating real-time data aggregations. As a bonus, there is little to no operational overhead. The pattern can easily be adapted to perform aggregations on different bucket sizes (monthly or yearly aggregations), with different properties, or with your own conditional logic. You could even configure a separate stream on the aggregated daily table and chain together multiple event streams that start from a single source.

There are a few things to be careful about when using Lambda to consume the event stream, especially when handling errors. Understanding the underlying technology behind DynamoDB and Kinesis will help you to make the right decisions and ensure you have a fault-tolerant system that provides you with accurate results.

Happy streaming!