It takes 7 steps to deliver collected and normalized data to our final data storage — Elasticsearch:

1. CloudWatch Event triggers Lambda
2. Lambda gets the RSS feed URLs from DynamoDB
3. The same Lambda sends all those RSS URLs to SQS
4. Each RSS endpoint from SQS is picked up by a Lambda that reads and normalizes the feed
5. News article data is inserted into DynamoDB
6. DynamoDB streams collect up to X newly inserted records and send them to Lambda
7. Lambda inserts the new records into Elasticsearch

Step 1. CloudWatch Event triggers the “orchestration” lambda

AWS Lambda is a serverless Function-as-a-Service tool that runs your code in response to an event. You do not have to maintain servers, and you pay only for the time your function runs. A CloudWatch Events rule is AWS's implementation of a cron job.

You can set up a CloudWatch Events rule to trigger your Lambda every X minutes, hours, days, etc.
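For example, the scheduled rule and its Lambda target can be wired up with boto3. This is a minimal sketch, not our exact configuration: the function name, ARN, and 15-minute rate are placeholders.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical names; adjust to your own setup.
RULE_NAME = "trigger-orchestration-lambda"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:orchestration"

# Run the "orchestration" Lambda every 15 minutes (cron-like schedule).
events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="rate(15 minutes)",
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the function, then attach it as a target.
lambda_client.add_permission(
    FunctionName="orchestration",
    StatementId="cloudwatch-events-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
)
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "1", "Arn": LAMBDA_ARN}])
```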

Step 2. “Orchestration” lambda gets the RSS feeds URLs from DynamoDB

A DynamoDB table contains all the RSS feeds that we use to update our news database.

DynamoDB is a fully managed NoSQL database from AWS. As with AWS Lambda, you do not have to manage the hardware or software behind it. You use it as an out-of-the-box solution to store your data.
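Reading the feed list from the "orchestration" Lambda can be as simple as a paginated scan. A sketch, assuming a hypothetical rss-feeds table with a url attribute:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
feeds_table = dynamodb.Table("rss-feeds")  # hypothetical table name

def get_feed_urls():
    """Return every RSS feed URL stored in the feeds table."""
    response = feeds_table.scan()
    items = response["Items"]
    # DynamoDB paginates scan results; keep going until we've seen every item.
    while "LastEvaluatedKey" in response:
        response = feeds_table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return [item["url"] for item in items]
```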

Step 3. “Orchestration” lambda sends all those RSS URLs to SQS

Now you have a list of all the RSS feeds that have to be processed.

There are thousands of them, so processing each one in a loop is not an option.

Let’s assume we have another Lambda function that can read and normalize an RSS feed. AWS Lambda allows hundreds of parallel invocations. Hence, you might think of calling many other Lambdas from your current Lambda.

Such an approach would work; however, it makes your process complex and prone to failure.

So you need something in the middle.

Simple Queue Service (SQS) is a fully managed queue service from AWS.

Instead of triggering the RSS-processing Lambdas from our “orchestration” Lambda, we will send all the RSS endpoints to SQS as messages. Then, new messages in SQS can trigger the RSS-processing Lambda.

By adding the SQS layer we made it much easier to track the success of each RSS endpoint's processing. If one of the Lambdas fails, it does not disrupt the other Lambdas processing other RSS feeds.
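Sending the URLs from the "orchestration" Lambda is straightforward. A minimal sketch, with a placeholder queue URL:

```python
import boto3

sqs = boto3.client("sqs")
# Placeholder queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/rss-feeds"

def enqueue_feeds(feed_urls):
    """Send each RSS URL to SQS as its own message, 10 at a time."""
    for i in range(0, len(feed_urls), 10):  # 10 is the SQS batch limit
        batch = feed_urls[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": url}
                for n, url in enumerate(batch)
            ],
        )
```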

Step 4. Each RSS endpoint from SQS is picked up by a Lambda that reads and normalizes the feed

My favorite feature of AWS Lambda is that you do not pay more for executing many Lambdas simultaneously. That makes Lambda a useful tool when you do not know your workload ahead of time, or when you want to be sure that your system will not crash when it receives more load.

This Lambda is a Python function that takes an RSS URL as input and returns structured data (title, published datetime, authors, article URL, etc.).
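A minimal sketch of such a function, assuming the feedparser package (one reasonable choice of RSS library; the output field names here are illustrative):

```python
import feedparser  # common RSS/Atom parsing library; an assumed choice here

def parse_feed(feed_url):
    """Read one RSS feed and return a list of normalized article records."""
    parsed = feedparser.parse(feed_url)
    articles = []
    for entry in parsed.entries:
        articles.append({
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "published": entry.get("published", ""),
            "authors": [a.get("name", "") for a in entry.get("authors", [])],
        })
    return articles
```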

Now that we have extracted and normalized the news article data from each RSS feed, it has to be stored.

Step 5. News article data is inserted into DynamoDB

Before sending data to the Elasticsearch cluster we put it into DynamoDB. The main reason is that Elasticsearch does fail. Elasticsearch is great for many things (which we will discuss later), but Elasticsearch cannot be your main data storage.

Deduplication. Each RSS feed is updated once in a while, adding some more articles to the feed. For example, say the RSS feed of news website X always contains 100 articles and is updated once every hour. If during this hour there were 10 new articles, they will appear at the top, while the oldest 10 articles are removed. The other 90 articles will be the same as an hour ago. Hence, all those 90 articles are duplicates in our database.

And we have thousands of such feeds.

So it would be nice if we could verify whether each ID (whatever it is) already exists in our database.

I got neat advice from Reddit:

Thanks to Vladimir Budilov from AWS for this and much other help.

The trick is to make sure that we have a consistent ID key for each news article. We MD5 the title+url of each article. Then, when we insert this data into DynamoDB, only new IDs are allowed (see the attribute_not_exists condition expression in DynamoDB).

Hence, we do not have to deduplicate the data ourselves.
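A minimal sketch of this conditional-insert pattern, assuming a hypothetical news-articles table whose primary key is the article's id:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
# Hypothetical table name; its primary key is "id".
articles_table = dynamodb.Table("news-articles")

def insert_article(article):
    """Insert an article only if its ID has never been seen before."""
    # The consistent ID: MD5 of title+url.
    article["id"] = hashlib.md5(
        (article["title"] + article["url"]).encode("utf-8")
    ).hexdigest()
    try:
        articles_table.put_item(
            Item=article,
            # Reject the write if an item with this ID already exists.
            ConditionExpression="attribute_not_exists(id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate; skip it silently
        raise
```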

Step 6. DynamoDB streams collect up to X newly inserted records and send them to Lambda

Now, we have to send new data from DynamoDB to the Elasticsearch cluster.

We set up a trigger on DynamoDB inserts. Each time data is inserted into DynamoDB, it is picked up by a Lambda that sends this new data to the Elasticsearch cluster.

By loading data into DynamoDB first, we significantly reduce the load on ES, because it receives a much smaller amount of new data. Plus, DynamoDB remains our main source of data.

Step 7. Lambda inserts new records into Elasticsearch

DynamoDB does all the heavy lifting of making sure we do not have duplicates in our database. Only newly inserted records go to the Elasticsearch cluster.

You can set your Lambda trigger to wait for X new records or Y minutes and take multiple records at once. This way you will have fewer transactions, plus you can take advantage of Elasticsearch bulk inserts.
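A minimal sketch of that stream-triggered Lambda, assuming the elasticsearch Python client and a hypothetical news index; the endpoint URL is a placeholder:

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder endpoint; in practice this points at the hosted cluster.
es = Elasticsearch(["https://search-news-demo.us-east-1.es.amazonaws.com"])

def handler(event, context):
    """Bulk-index newly inserted DynamoDB stream records into Elasticsearch."""
    actions = []
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # ignore MODIFY/REMOVE events
        image = record["dynamodb"]["NewImage"]
        # Stream images use DynamoDB's typed format, e.g. {"title": {"S": "..."}};
        # this naive unwrap is enough for flat, string-valued items.
        doc = {key: list(value.values())[0] for key, value in image.items()}
        actions.append({"_index": "news", "_id": doc["id"], "_source": doc})
    if actions:
        helpers.bulk(es, actions)  # one bulk request instead of N single inserts
```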

Why Elasticsearch?

Because it is the best way to work with full-text search.

Elasticsearch is complex: you can tweak almost anything. Knowing the purpose and use of your data well allows you to optimize your cluster for the best performance.

Nevertheless, for the beta, we used the default settings almost everywhere.

Data Delivery/API

Up to this point, we made sure our Elasticsearch cluster gets updated with new data. Now, we have to provide users with a tool to interact with our data. We have chosen to make it a RESTful API.

My favorite combination is to do it with API Gateway + Lambda.

From the AWS page:

API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, CORS support, authorization and access control, throttling, monitoring, and API version management.

So API Gateway is responsible for managing API requests. We still have to implement the logic layer. Each API call is processed by a Lambda function.

The Lambda itself is written with Flask, a micro web framework for Python.

The task of this Lambda is to parse the parameters that users pass (such as the phrase they want to find articles about). Then, it queries our Elasticsearch cluster. Finally, it composes a clean JSON response object that is sent back to the user.

Apart from Flask, we used the elasticsearch-dsl-py package, which helps with writing and running queries against Elasticsearch.
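A stripped-down sketch of such an endpoint; the index name, endpoint URL, and query field are placeholders rather than our production values:

```python
from flask import Flask, jsonify, request
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

app = Flask(__name__)
# Placeholder endpoint and index name.
es = Elasticsearch(["https://search-news-demo.us-east-1.es.amazonaws.com"])

@app.route("/search")
def search():
    """Find articles matching the user's phrase and return clean JSON."""
    phrase = request.args.get("q", "")
    # Full-text match on the title field, limited to the first 10 hits.
    s = Search(using=es, index="news").query("match", title=phrase)[:10]
    return jsonify([hit.to_dict() for hit in s.execute()])
```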

I recommend deploying API Lambdas with Zappa:

Zappa makes it super easy to build and deploy server-less, event-driven Python applications (including, but not limited to, WSGI web apps) on AWS Lambda + API Gateway. Think of it as “serverless” web hosting for your Python apps. That means infinite scaling, zero downtime, zero maintenance — and at a fraction of the cost of your current deployments!

Deploying your API with Lambda is most convenient when you do not know how many calls you will have to serve.

On top of that, if you have 0 calls, you pay $0.