marbot forwards alerts from AWS to your DevOps team via Slack. marbot was one of the winners of the AWS Serverless Chatbot Competition in 2016. Today I want to show you how marbot works and what we have learned so far.

Let’s start with the architecture diagram.



The diagram was created with Cloudcraft.

Architecture

The marbot API is provided by an API Gateway. We get most of our requests from:

The API Gateway forwards HTTP requests to one of our Lambda functions. All of them are implemented in Node.js and store their state in DynamoDB tables.

One special case is the Slack Button API. When you press a button in a Slack message, marbot has 3 seconds to respond to this message. To respond to a button press, marbot may need to make a bunch of calls to the Slack API.

Learnings

Decoupling the process

By looking at our CloudWatch data, we learned that we missed the 3-second timeout very often. To avoid this, we now only put a record containing all relevant data into a Kinesis stream before we respond to the API request. Writing to Kinesis is a quick operation, and we haven’t seen timeouts since we switched to Kinesis streams.
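The "queue first, respond immediately" idea can be sketched as follows. This is a minimal illustration, not marbot's actual code: `createButtonHandler` and `putRecord` are hypothetical names, and the shape of the incoming event (a raw form-encoded body carrying Slack's `payload` field) is an assumption about how the API Gateway is configured.

```javascript
// Sketch only: `createButtonHandler` and `putRecord` are hypothetical names.
// `putRecord` is injected so the handler runs without AWS credentials; in
// production it could wrap the Kinesis client's PutRecord call.
function createButtonHandler(putRecord, streamName) {
  return async function handler(event) {
    // Assumption: API Gateway passes the raw form-encoded body, in which
    // Slack sends the button press as `payload=<url-encoded JSON>`.
    const payload = new URLSearchParams(event.body).get('payload');
    const teamId = JSON.parse(payload).team.id;

    // Queue everything the later processing step needs.
    await putRecord({
      StreamName: streamName,
      PartitionKey: teamId, // assumption: keeps one team's records in order
      Data: payload
    });

    // Acknowledge immediately; the real work happens in the stream consumer.
    return { statusCode: 200, body: '' };
  };
}
```

Because the handler does nothing except one write, its response time no longer depends on how many Slack API calls the button press eventually triggers.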

We read the Kinesis stream as soon as possible and process the records within a Lambda function. Kinesis comes with its challenges. If you fail to process a record, the Lambda Kinesis integration will retry this record until it expires from the stream. None of the newer records will be processed until the failed record expires or you fix the bug!
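A consumer for such a stream might look roughly like this. The function names and the injected `processPayload` callback are hypothetical; the only thing taken from the Lambda Kinesis integration itself is that each record's data arrives base64-encoded under `record.kinesis.data`. The sketch assumes the records contain JSON.

```javascript
// Decode the base64-encoded data of every record in a Lambda Kinesis event.
// Assumption: the producer wrote JSON strings to the stream.
function decodeRecords(event) {
  return event.Records.map((record) =>
    JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8'))
  );
}

// `processPayload` is a hypothetical async function that does the slow work,
// for example the calls to the Slack API.
function createStreamHandler(processPayload) {
  return async function handler(event) {
    for (const payload of decodeRecords(event)) {
      // A rejection here fails the whole invocation. Lambda then retries the
      // same batch, and newer records wait behind it.
      await processPayload(payload);
    }
  };
}
```

The loop deliberately does not catch errors: swallowing them would silently drop a button press, while letting them propagate produces exactly the blocking behavior described above.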

We also thought about using SQS, but:

there is no native SQS Lambda integration

we cannot build one on our own that is serverless and responds within a second

So we decided to use Kinesis knowing that an error can stop our whole processing pipeline.

Resilient remote calls

HTTP requests are hard. A lot of things can go wrong. Two things that we learned early when talking to the Slack API:

Set timeouts: We use 3 seconds at the moment and are thinking about reducing this to 2 seconds.

Retry on failures such as timeouts or 5XX responses.

Our Node.js implementation of Slack API calls relies on the requestretry package:

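const requestretry = require('requestretry');
const AWSXRay = require('aws-xray-sdk');

// Calls a Slack Web API method and retries failed attempts.
function invokeSlack(method, qs, cb) {
  requestretry({
    method: 'GET',
    url: `https://slack.com/api/${method}`,
    qs: qs,
    json: true,
    maxAttempts: 3, // at most 3 attempts in total
    retryDelay: 100, // wait 100 ms between attempts
    timeout: 3000, // 3-second timeout per attempt
    httpModules: {
      // Wrap the HTTP modules so every attempt shows up in the X-Ray trace.
      'http:': AWSXRay.captureHTTPs(require('http')),
      'https:': AWSXRay.captureHTTPs(require('https'))
    }
  }, function (err, res, body) {
    // Minimal handling (the original leaves this part out): pass the result on.
    if (err) return cb(err);
    cb(null, body);
  });
}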
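The rule "retry on timeouts or 5XX responses" can also be written down as a small predicate. The sketch below is an illustration, not part of marbot: `shouldRetry` and `TIMEOUT_CODES` are hypothetical names, and the error codes are the ones the underlying `request` library reports for connection and read timeouts. A function of this shape could be passed as the `retryStrategy` option of requestretry, which is called with the error and the response.

```javascript
// Error codes reported for timeouts (assumption: these two cover the
// connection timeout and the read timeout).
const TIMEOUT_CODES = ['ETIMEDOUT', 'ESOCKETTIMEDOUT'];

// Returns true if the attempt should be repeated.
function shouldRetry(err, response) {
  if (err) {
    // Network-level failure: only retry when it was a timeout.
    return TIMEOUT_CODES.includes(err.code);
  }
  // HTTP-level failure: retry on server errors, not on client errors.
  return Boolean(response) && response.statusCode >= 500 && response.statusCode < 600;
}
```

Keeping 4XX responses out of the retry rule matters here: a malformed request will fail the same way on every attempt and only burns time against the timeout budget.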



The following screenshot shows an X-Ray trace in which the code retried Slack API calls because of the 3-second timeout.