What are Amazon SQS and Lambda, and why should I care?

Amazon Simple Queue Service (Amazon SQS) is a distributed, fully managed message queueing service and one of the first AWS services ever released. It lets you decouple your application into components that communicate through asynchronous messages. Through a simple programmatic API you poll for messages, which can be sent from many different sources. It acts as a buffer for your workers and greatly reduces the time a user spends waiting on a synchronous call: you can send a response straight away and do the work later.

In November 2014 Amazon released AWS Lambda, one of the most recognisable services in cloud computing and, in my opinion, the best available implementation of the serverless paradigm. It runs code in response to events, e.g. a file uploaded to S3 or a plain HTTP request. You don’t need to provision any compute resources.

But what if you want to connect these two services and make SQS messages trigger Lambda functions? We had been waiting for this feature for a very long time and were tired of building custom containers with pollers or using SNS as a poor substitute.

In Nordcloud R&D, we are partial to serverless and event-driven paradigms. However, sometimes our Lambda functions call each other asynchronously and the number of invocations becomes huge, rapidly exceeding concurrency limits and throwing exceptions all over the place. Using SQS to trigger Lambda functions puts a buffer in between. Lambda has a maximum execution time (5 minutes at the time of writing), so we can rely on all the good things that come with SQS: visibility timeouts, at-least-once delivery, dead letter queues and so on. We no longer have to provision any containers or EC2 instances (just serverless code) and can let Amazon handle everything for us.

But before you start using SQS as your event source for Lambda functions, you should know how it’s implemented and what to expect.

How is it implemented?

When working with SQS directly, you wait for messages to arrive, process them and delete them from the queue. If you don’t delete a message, it comes back after the specified VisibilityTimeout, because SQS assumes the processing failed and makes the message available for consumption again, so you won’t lose any messages. None of this applies when SQS is an event source for Lambda, because you don’t touch the SQS part at all!

Lambda polls for messages internally, then calls your function and, if it completes successfully, deletes the message on your behalf. Make sure your code throws an exception if you want the message to be processed again. Equally important, make sure it returns successfully once the work is done, or you will end up in an endless loop of duplicated messages. Remember that you are billed for every API call made by the internal poller.
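To make the success and failure contract concrete, here is a minimal sketch of a Python handler. The `process` function and the `order_id` field are hypothetical placeholders for your own logic. The only parts taken from the SQS event format are the `Records` list and the `body` key on each record.

```python
import json


def process(body):
    """Hypothetical business logic; raises ValueError on a bad payload."""
    payload = json.loads(body)  # invalid JSON also raises a ValueError subclass
    if "order_id" not in payload:
        raise ValueError("missing order_id")
    return payload["order_id"]


def handler(event, context):
    """Entry point for an SQS-triggered Lambda.

    Returning normally signals success, so Lambda deletes the messages.
    Letting an exception propagate signals failure, so the messages stay
    on the queue and reappear after the visibility timeout.
    """
    processed = []
    for record in event["Records"]:
        # Each record carries the raw message text under "body".
        processed.append(process(record["body"]))
    return {"processed": processed}
```

Note that in this sketch the exception is deliberately not caught: swallowing it would make the invocation look successful and the message would be deleted.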

Another thing to note is that Lambda is invoked synchronously. There are no retries on the Lambda side, and the Dead Letter Queue configured on the Lambda function is not used. Everything is handled by Amazon SQS, so find the optimal settings for VisibilityTimeout and maxReceiveCount, and definitely configure a DLQ policy on the queue. Even though it shouldn’t be a problem, please refrain from setting the VisibilityTimeout equal to the function timeout: the polling mechanism consumes some additional time, during which the message still counts as being processed.
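As an illustration, the queue-side settings could be assembled like this. The helper name `queue_attributes`, the 30-second margin, the default `maxReceiveCount` of 5 and the example ARN are all assumptions made for this sketch, not recommendations from our tests. SQS expects every attribute value as a string and the redrive policy as a JSON document.

```python
import json


def queue_attributes(function_timeout_s, dlq_arn, max_receive_count=5, margin_s=30):
    """Build SQS queue attributes for a Lambda-backed queue.

    The visibility timeout is kept strictly above the function timeout
    (margin_s is an arbitrary, assumed safety margin).
    """
    return {
        "VisibilityTimeout": str(function_timeout_s + margin_s),
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        ),
    }


# Example with a made-up dead letter queue ARN.
attrs = queue_attributes(300, "arn:aws:sqs:eu-west-1:123456789012:my-dlq")
print(attrs["VisibilityTimeout"])  # 330
```

A dictionary of this shape is what boto3’s `create_queue` and `set_queue_attributes` take in their `Attributes` parameter.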

You are also limited by the function-level concurrent execution limit, which by default draws on a shared pool of unreserved concurrency (1000 per region). You can lower that by setting the function’s reserved concurrent executions to a subset of your account’s limit. However, that number is subtracted from the shared pool, which may affect other functions! Plus, if your Lambda is VPC-enabled, Amazon EC2 limits apply as well (think ENI).
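The effect of reservations on the shared pool is simple arithmetic, sketched below. The function names and reservation sizes are made up for the example. Only the default limit of 1000 comes from the text above.

```python
def unreserved_pool(account_limit, reserved):
    """Concurrency left for all functions that have no reservation."""
    remaining = account_limit - sum(reserved.values())
    if remaining < 0:
        raise ValueError("reservations exceed the account limit")
    return remaining


# Two hypothetical functions with reserved concurrency.
pool = unreserved_pool(1000, {"sqs-worker": 50, "api-handler": 200})
print(pool)  # 750
```

Every function without its own reservation now competes for those 750 executions instead of the full 1000.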

If you like pushing Amazon SQS to its limits as we do, you’ll notice that the number of messages in flight begins to rise. That’s your Lambda gradually scaling out in response to the queue size, eventually hitting the concurrency limit. Beyond that point messages are still consumed by the poller, but the synchronous invocation fails with an exception. That’s when your Amazon SQS retry policy comes in handy. Although it is not confirmed anywhere, this behaviour may lead to starvation of certain messages, but you should already be prepared for that!

One more thing from our R&D division. What happens if you add one queue as an event source for two different functions? That’s right, it will act as a load balancer.

Does it really work?

We ran some tests with the following assumptions:

all messages were available in the queue before enabling the Lambda trigger

SQS visibility timeout is set to 1h

all test cases are run in separate environments and at different times

the Lambda does nothing except sleep for a specified amount of time
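A test function of this kind can be as small as the sketch below. The `SLEEP_SECONDS` environment variable is an assumption introduced here so that one handler can cover the different test cases. It is not necessarily how the original test function was parameterised.

```python
import os
import time


def handler(event, context):
    """Sleep for a configurable time, then report what was received."""
    # SLEEP_SECONDS is a hypothetical setting; it defaults to 3 seconds.
    delay = float(os.environ.get("SLEEP_SECONDS", "3"))
    time.sleep(delay)
    return {"slept": delay, "records": len(event.get("Records", []))}
```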

This is what we got:

Normal use case

1000 messages, sleep for 3 seconds. Nothing really interesting: it works as well as we expected and consumed our messages pretty quickly. CloudWatch didn’t even register the scaling process.

Normal use case, heavy load

Again a 3-second sleep, but with 10000 messages. This is over our concurrency limit, but scaling out took longer than executing the first Lambdas, so nothing was throttled. It just took a little longer to consume all of our messages.

Long-running Lambdas

Let’s get back to 1000 messages, but with 240 seconds of sleep. Now AWS has to handle the scale-out process for its internal workers. We managed to get about 550 concurrent Lambdas running. Good news!

Hitting the concurrency limit

Again, 240 seconds of sleep but let’s push it to the limit: 10000 messages, concurrency limit set to 1000.

What happened? Again, AWS reacts to the number of messages available in Amazon SQS, so it scales its internal workers out until the concurrency limit is reached. Of course, in the world of distributed computing and eventual consistency, there is no way it can predict how many Lambdas it can run, so we finally see throttling. Throttled Lambdas return exceptions to the AWS workers. That’s the signal to stop, but the workers keep trying, because this may not be our global limit and other functions may just be taking up the pool. What is important is that AWS won’t retry the function execution: the message comes back to the queue after the defined VisibilityTimeout. That is why you’ll see some invocations after 23:30 (yes, we can’t sleep).

The same thing happens when you set your own reserved concurrency pool. We ran the same test for a maximum of 50 concurrent executions. Based on the throttling, it was too low.
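A quick back-of-the-envelope calculation puts these numbers in perspective. The sketch below assumes one message per invocation (the batch size is not stated above) and ignores scale-out time and throttling, so it gives a best-case lower bound only. `min_drain_seconds` is a hypothetical helper.

```python
import math


def min_drain_seconds(messages, concurrency, duration_s, batch_size=1):
    """Best-case time to empty the queue at a fixed concurrency."""
    invocations = math.ceil(messages / batch_size)
    # Invocations run in "waves" of at most `concurrency` at a time.
    waves = math.ceil(invocations / concurrency)
    return waves * duration_s


print(min_drain_seconds(10000, 1000, 240))  # 2400
print(min_drain_seconds(10000, 50, 240))    # 48000
```

Under these assumptions, a limit of 1000 needs at least 40 minutes for 10000 messages, while a limit of 50 needs more than 13 hours, far longer than the 1h visibility timeout used in the tests.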

Multiple Lambda workers

This is simply awesome! Amazon SQS lets you subscribe multiple functions to one queue! We sent 10000 messages to a queue set as an event source for 4 different functions, and every Lambda was executed about 2500 times. In other words, this setup behaves like a load balancer. However, it’s not possible to subscribe Lambdas from different regions and create a global load balancer.