AWS Lambda error handling is a challenge for every new Lambda user. The Lambda retry mechanism sometimes makes it difficult to follow what’s going on in your serverless application.

In this post you will understand:

- How AWS Lambda errors and Lambda retries work, and the idea behind them.
- What consequences they have for your code.
- How to build your system using AWS Step Functions to control AWS Lambda error handling.

You will also get a useful resource for doing that.

Anyone familiar with serverless knows that it does not just mean running your monolithic code on a Lambda function. It is a different architecture for your whole system. In this architecture, the system is composed of distributed nodes activated by asynchronous events. Each node must be designed as an independent component with its own API (a “black box”), even when that API is not exposed to the outside world.

So how can we know how to define these nodes accurately? It turns out that it has a lot to do with correct Lambda error handling and, of course, with dealing correctly with the AWS Lambda retry behavior.

Lambda Retry Behavior

Lambda functions can fail in three cases:

- An unhandled exception is raised: because of invalid input, a failing external API, or simply a programming bug.
- Timeout: a Lambda that runs longer than the configured timeout is forcibly terminated with a ‘Task timed out after … seconds’ message. The default timeout is 3 seconds (6 seconds when deploying with the Serverless Framework), and the maximum is 15 minutes.
- Out of memory: in this case, the Lambda usually terminates with ‘Process exited before completing request’, and the log shows ‘Max Memory Used’ equal to ‘Memory Size’.

When that happens (and be sure that it will), you will probably see your Lambda retry according to the following behavior:

1. Synchronous events

In event sources such as API Gateway or synchronous invocation using the SDK, the invoking application is responsible for making retries according to the response it gets from the Lambda. This is the least interesting case because it’s kind of like the regular monolithic error handling.
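For the synchronous case, the retry loop lives in the caller. Here is a minimal sketch of such a loop with exponential backoff. The `invoke` callable and the `InvocationError` exception are hypothetical stand-ins for whatever your client uses to call the Lambda and to report a failed response; they are not part of any AWS SDK.

```python
import time


class InvocationError(Exception):
    """Hypothetical error raised by `invoke` when the Lambda reports a failure."""


def invoke_with_retries(invoke, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `invoke(payload)`, retrying on InvocationError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(payload)
        except InvocationError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the loop can be exercised in tests without actually waiting.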

2. Asynchronous events

For most event sources, the Lambda invocation happens asynchronously. That means there is no calling application to respond to the failure, so AWS handles it itself: it triggers the Lambda again with the same event, usually twice within the following ~3 minutes (though in rare cases it may take up to six hours, and a different number of retries may occur). If all retries fail, you often need the event to be recorded rather than just thrown away. For that purpose, the important DLQ feature lets you configure a Dead Letter Queue on Amazon SQS that receives such events.
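To make the asynchronous flow concrete, here is a small simulation of it in plain Python. It is only a model of the behavior described above, not AWS code: `deliver_async` and the list used as a dead letter queue are made up for illustration, and the real service waits between attempts.

```python
def deliver_async(handler, event, dead_letter_queue, max_retries=2):
    """Model of asynchronous delivery: one attempt plus `max_retries` retries.

    Returns True if some attempt succeeded; otherwise the event is appended
    to `dead_letter_queue` and False is returned.
    """
    last_error = None
    for _ in range(1 + max_retries):
        try:
            handler(event)  # the same event is passed on every attempt
            return True
        except Exception as error:  # any unhandled exception is a failed invocation
            last_error = error
    dead_letter_queue.append({"event": event, "error": repr(last_error)})
    return False
```

Note that the handler sees exactly the same event on every attempt, which is why the next section cares so much about idempotency.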

3. Stream-based events

Currently, the only event sources of this type are Amazon Kinesis Data Streams and DynamoDB streams. AWS will trigger a failing Lambda function again and again until the data expires or is processed successfully. Unlike with asynchronous events, AWS blocks the event source until that point.

In this post, I’ll refer mostly to the most common and problematic case of asynchronous events, though some of the given advice is relevant to the other cases as well. For a detailed explanation of retry behavior, check the AWS docs.

AWS Lambda Retry Behavior Consequences

So each Lambda might be executed several times with the same input, while the “caller” neither intended it nor even knows about it. To safely execute the same operation multiple times, the Lambda must be what’s called idempotent, meaning that no additional effect takes place when it runs more than once with the same input.

Serverless functions are not the only example of using this term. A classic example is a network API: when a request does not get a response, the same request is sent again.

In serverless architectures a similar case may happen when, for example, a Lambda times out before receiving such a response. Even if that is highly unexpected, in some cases incorrect retry handling may cause severe problems, such as violating the DB structure.

Idempotency

“Idempotence is the property of certain operations in mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application” (Wikipedia).
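As a tiny illustration of the definition, compare two ways of storing an order. The function names and the shape of the order are invented for this sketch.

```python
def append_order(orders, order):
    """Not idempotent: every call adds another copy of the order."""
    orders.append(order)
    return orders


def put_order(orders_by_id, order):
    """Idempotent: writing the same order again leaves the store unchanged."""
    orders_by_id[order["id"]] = order
    return orders_by_id


order = {"id": "A-17", "item": "book"}

as_list = []
append_order(as_list, order)
append_order(as_list, order)   # a retry duplicates the record

as_dict = {}
put_order(as_dict, order)
put_order(as_dict, order)      # a retry changes nothing

print(len(as_list), len(as_dict))
```

A retried Lambda built like `append_order` silently creates duplicates, while one built like `put_order` can be retried freely.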

But wait – what if we need to execute the same operation twice when it’s not a retry? For example, let’s say that the Lambda receives a user operation log as input and is responsible for recording it in a database. In that case, we need to differentiate between a retry and a case where the trigger input is simply the same because the user did the same operation again.

A good solution is to treat the Lambda’s request ID as if it were part of the input itself. Only when there is a Lambda retry will you get the same ID. To extract it, use context.awsRequestId in Node.js (or the corresponding field in other languages). This is actually the general approach to detecting retry executions.

Using the request ID to be genuinely idempotent is not always convenient. In the previous example, the ID would have to be saved in the DB as well, so that later invocations could determine whether to add a new record. Another solution may be to use an in-memory data store (such as Redis), but again, that adds quite significant overhead.
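In Python the request ID is available as `context.aws_request_id`. The following is a minimal sketch of the request-ID approach for the operation log example. The two module-level collections are only stand-ins for real database tables (module state in a Lambda does not reliably survive between invocations), and a real implementation would need an atomic "insert if absent" write instead of the check-then-add shown here.

```python
processed_request_ids = set()  # stand-in for a DB table keyed by request ID
operation_log = []             # stand-in for the user operation log table


def handler(event, context):
    """Record the operation once per request ID, so retries add nothing."""
    request_id = context.aws_request_id
    if request_id in processed_request_ids:
        # Same request ID means this is a retry of an invocation we already handled.
        return {"status": "duplicate", "request_id": request_id}
    operation_log.append(event)
    processed_request_ids.add(request_id)
    return {"status": "recorded", "request_id": request_id}
```

Two invocations with identical events but different request IDs are both recorded, which is exactly the distinction between "the user did it again" and "AWS retried".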

Step Functions to the Rescue

Error handling in AWS Lambda can be achieved in various ways, such as using wrappers. However, it turns out that AWS Step Functions is a beneficial feature when building a serverless application that deals with errors and retries properly, and it can even become a crucial one. The Hitchhiker’s Guide to Step Functions provides a good tutorial overview.

Motivation

Let’s say that in response to an event, the application has to perform several operations. If you combine all of them in the same Lambda, the code usually has to check, for each operation, whether it should be redone so that the whole Lambda remains idempotent. It could be a real pain.

It is important to understand the difference between our example and monolithic applications. In a monolith, the application itself could be responsible for making retries since it can wait between them, and that’s not possible in serverless.

On the other hand, with Step Functions, we can run each operation on a different Lambda. We can also define the transitions between them as suits the specific case. Moreover, we can control the retry behavior – the number of retries and the delay between them. That way, we can make it the most suitable for our use case. We can even disable it when that’s the right thing to do. From my experience, creating a state machine even for a single Lambda is the easiest workaround to disable unwanted retry behavior.
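To show what controlling the number of retries and their delay looks like, here is a sketch that builds a two-step state machine definition as a Python dict and prints it as JSON. The `Retry` fields (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) come from the Amazon States Language; the helper functions and the ARNs are hypothetical and exist only for this example.

```python
import json


def retry_policy(max_attempts, interval_seconds=2, backoff_rate=2.0):
    """Build one Retry entry that applies to every error (States.ALL)."""
    return {
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": interval_seconds,
        "MaxAttempts": max_attempts,
        "BackoffRate": backoff_rate,
    }


def lambda_task(resource_arn, retry=None, next_state=None):
    """Build a Task state; without `retry`, the state has no Retry field at all."""
    state = {"Type": "Task", "Resource": resource_arn}
    if retry is not None:
        state["Retry"] = [retry]
    if next_state is not None:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


# Hypothetical ARNs, for illustration only.
FIRST_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:first_step"
SECOND_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:second_step"

definition = {
    "StartAt": "firstStep",
    "States": {
        # Retried up to 3 times, waiting 5, 10 and then 20 seconds.
        "firstStep": lambda_task(
            FIRST_ARN,
            retry=retry_policy(3, interval_seconds=5),
            next_state="secondStep",
        ),
        # No Retry field: a failure here is not retried.
        "secondStep": lambda_task(SECOND_ARN),
    },
}

print(json.dumps(definition, indent=2))
```

Leaving out `Retry` entirely, as in `secondStep`, is the "single Lambda with retries disabled" workaround mentioned above.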

Implementation

You may know that, unfortunately, the available triggers for AWS Step Functions are rather limited. The only available triggers are API Gateway and a manual execution using the SDK. Because of that, we have created a template for a Python Lambda. You can use it as glue code to execute a state machine asynchronously in response to any event. In short, it is just:

    import os
    import json
    import boto3

    client = boto3.client('stepfunctions')

    def run(event, context):
        # Name the execution after the request ID so a retried
        # invocation maps to the same execution.
        client.start_execution(
            stateMachineArn=os.environ['CF_MyStateMachine'],
            name=str(context.aws_request_id),
            input=json.dumps(event)
        )

A complete ready-to-use template is available on a public repository. To deploy this Lambda you should use the Serverless framework, with the awesome serverless-resources-env plugin in order to pass the state machine ARN easily. Make sure also to use serverless-step-functions and serverless-pseudo-parameters to define the state machine easily, as in the following example:

    service: state-machine-invoking-example

    provider:
      name: aws
      region: eu-west-1
      runtime: python3.6
      # Specific Role for the Lambda and machine is better
      iamRoleStatements:
        - Effect: "Allow"
          Action:
            - "states:StartExecution"
          Resource:
            - "*"

    functions:
      first_step:
        handler: simple_lambda.run
      second_step:
        handler: another_lambda.run
        timeout: 5
      machine_invoker:
        handler: state_machine_invoker.run
        events:
          - sns: 'arn:aws:sns:eu-west-1:xxxxxxxx:sns_name'
        custom:
          env-resources:
            - MyStateMachine

    stepFunctions:
      stateMachines:
        exampleMachine:
          name: myStateMachine
          definition:
            StartAt: firstStep
            States:
              firstStep:
                Type: Task
                Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-first_step
                TimeoutSeconds: 6
                Next: secondStep
              secondStep:
                Type: Task
                Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:${self:service}-${opt:stage}-second_step
                TimeoutSeconds: 5
                End: true

    plugins:
      - serverless-step-functions
      - serverless-pseudo-parameters
      - serverless-resources-env

We arbitrarily chose an SNS event to trigger the state machine. The event is accessible to the initial step Lambda as its input.

Because we name the state machine execution after the invoker Lambda’s request ID, everything becomes idempotent. If the invoker Lambda is retried, AWS gives it the same request ID. Afterward, AWS also won’t execute the state machine again since the execution is named the same. Theoretically speaking, the execution name of the state machine is also a part of its input.

While this solution is useful in many situations, keep in mind that it also adds some complexity overhead. It affects the debugging and overall observability of the system.

Things to Notice

It’s important to understand the error handling mechanism of Step Functions, which is different from Lambda’s. For every Task state, a timeout duration can be set, so that if the Task does not finish in time a States.Timeout error is generated. This timeout is basically unlimited. However, for the typical case of a Task executing a Lambda, things are different. The Lambda’s actual timeout duration is determined only by its own configured value, so it cannot be extended this way. Therefore, make sure to configure the Task timeout to be equal to the Lambda’s timeout.

The retry behavior of a Task is disabled by default and can be configured explicitly (unlike Lambda’s built-in retry behavior).
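One more thing to notice concerns the invoker Lambda itself. As far as I understand the StartExecution API, starting an execution with a name that was already used can be rejected with an `ExecutionAlreadyExists` error, for example when the earlier execution has already finished. If the invoker lets that error propagate, the retry is reported as yet another failure. The sketch below treats it as "already handled" instead. `start_once` is a hypothetical helper, the client is passed in as a parameter (in the Lambda it would be `boto3.client('stepfunctions')`), and you should verify the exact behavior for your workflow type against the AWS docs.

```python
import json


def start_once(client, state_machine_arn, request_id, event):
    """Start one execution per request ID; a duplicate name counts as already handled.

    Returns the execution ARN, or None when an execution with this name
    already exists.
    """
    try:
        response = client.start_execution(
            stateMachineArn=state_machine_arn,
            name=str(request_id),
            input=json.dumps(event),
        )
        return response["executionArn"]
    except client.exceptions.ExecutionAlreadyExists:
        # A retried invoker reuses the request ID, so the work is already under way.
        return None
```

In the invoker from the Implementation section, `run` would then call `start_once(client, os.environ['CF_MyStateMachine'], context.aws_request_id, event)`.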