The architecture diagram shows our final design: for each application, we provision an SQS queue that is polled by Lambdas, backed by a single shared dead letter queue and a single shared metrics queue. This gives us per-application management, such as updating and deleting in-flight messages, and lets us rate limit throughput by setting the maximum number of concurrent Lambda executions. The dead letter queue (DLQ) is an AWS feature for handling messages that cannot be processed for any reason; they are parked in this separate queue for diagnosis. The shared metrics queue holds informational metrics such as status codes, timestamps, and delivery results.
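As a rough illustration of how a per-application queue can point at the shared DLQ, the sketch below builds the RedrivePolicy attribute that SQS uses for dead letter queues. The helper name, the example ARN, and the receive count of 5 are assumptions made for illustration and do not come from our actual configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redrivePolicy mirrors the JSON document SQS expects in a queue's
// RedrivePolicy attribute.
type redrivePolicy struct {
	DeadLetterTargetArn string `json:"deadLetterTargetArn"`
	MaxReceiveCount     int    `json:"maxReceiveCount"`
}

// queueAttributes (hypothetical helper) builds the attribute map for one
// application's queue so that messages received more than maxReceives
// times are moved to the shared DLQ.
func queueAttributes(sharedDLQArn string, maxReceives int) (map[string]string, error) {
	policy, err := json.Marshal(redrivePolicy{
		DeadLetterTargetArn: sharedDLQArn,
		MaxReceiveCount:     maxReceives,
	})
	if err != nil {
		return nil, err
	}
	return map[string]string{"RedrivePolicy": string(policy)}, nil
}

func main() {
	// The ARN and the limit of 5 receives are placeholder values.
	attrs, err := queueAttributes("arn:aws:sqs:us-west-2:123456789012:example-shared-dlq", 5)
	if err != nil {
		fmt.Println("building attributes failed:", err)
		return
	}
	fmt.Println(attrs["RedrivePolicy"])
}
```

The resulting map is the kind of value that would be passed as queue attributes when creating or updating each application's queue.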

Traffic flow is straightforward in this new system. Once the message bus has computed the event and the subscriber, it forwards the information to AWS, where it is enqueued in an SQS queue. A Lambda running our custom logic pulls from the queue automatically and attempts to send the event to the designated subscriber. If the send succeeds, the message is marked as completed and deleted from the system. If the send fails, the message is requeued into SQS with a configuration that sets the appropriate retry backoff time.

The beauty of the system lies in its simplicity. Scaling, rate limiting, error handling, and basic monitoring are all done through configuration using existing tools provided by Amazon. For extra data points, we implemented additional custom metrics that are created in the Lambdas and forwarded to the metrics queue to be pulled at regular intervals for monitoring, alerting, and analysis.

Deploying Lambdas

We believe deployment should follow the best practice set by our existing workflows. This means new code is committed to a staging environment first and builds are thoroughly verified to be production ready by automated (and when appropriate, manual) acceptance tests. Access rights are enforced through internal systems and AWS’ IAM roles.

We built a powerful set of tools on top of the AWS SDK to assist local development, troubleshooting, rollouts, and remediation. For example, updating Lambdas involves uploading the code to AWS S3, calling the Lambda ListFunctions API to obtain all the functions, and then calling the UpdateFunctionCode API for each of them.
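The upload, list, and update steps described above can be sketched as a single rollout routine. This is an illustrative outline rather than our internal tooling: FunctionAPI is a hypothetical wrapper interface, and with the AWS SDK its methods would roughly correspond to S3's PutObject and Lambda's ListFunctions and UpdateFunctionCode operations.

```go
package main

import "fmt"

// FunctionAPI is a hypothetical wrapper around the AWS calls a rollout
// needs; it is not part of the AWS SDK.
type FunctionAPI interface {
	UploadArtifact(bucket, key string, zip []byte) error
	ListFunctionNames() ([]string, error)
	UpdateFunctionCode(name, bucket, key string) error
}

// rollout uploads the build once, then points every function at it.
// It returns the names updated so far, even when a later update fails.
func rollout(api FunctionAPI, bucket, key string, zip []byte) ([]string, error) {
	if err := api.UploadArtifact(bucket, key, zip); err != nil {
		return nil, fmt.Errorf("upload artifact: %w", err)
	}
	names, err := api.ListFunctionNames()
	if err != nil {
		return nil, fmt.Errorf("list functions: %w", err)
	}
	var updated []string
	for _, name := range names {
		if err := api.UpdateFunctionCode(name, bucket, key); err != nil {
			return updated, fmt.Errorf("update %s: %w", name, err)
		}
		updated = append(updated, name)
	}
	return updated, nil
}

// fakeAPI records calls in memory so the flow can be tried locally.
type fakeAPI struct {
	functions []string
	calls     []string
}

func (f *fakeAPI) UploadArtifact(bucket, key string, _ []byte) error {
	f.calls = append(f.calls, "upload "+bucket+"/"+key)
	return nil
}

func (f *fakeAPI) ListFunctionNames() ([]string, error) {
	f.calls = append(f.calls, "list")
	return f.functions, nil
}

func (f *fakeAPI) UpdateFunctionCode(name, _, _ string) error {
	f.calls = append(f.calls, "update "+name)
	return nil
}

func main() {
	// Function, bucket, and key names are placeholders.
	api := &fakeAPI{functions: []string{"example-app-a", "example-app-b"}}
	updated, err := rollout(api, "example-bucket", "builds/lambda.zip", []byte("zip bytes"))
	if err != nil {
		fmt.Println("rollout failed:", err)
		return
	}
	fmt.Println("updated:", updated)
	fmt.Println("calls:", api.calls)
}
```

Returning the partial list of updated functions on failure makes it easier to see where a rollout stopped and what needs remediation.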

These tools enable the following typical workflow:

Independently test new AWS dependencies.

Create a new feature change locally.

Provision a subsystem in AWS to verify end-to-end scenarios manually.

Write tests.

Open a PR and, once approved, commit the changes.

Deploy the changes to the staging environment. Acceptance tests have to be green for the deployment to complete.

Deploy the changes to the production environment.

Monitoring and Testing

Monitoring is one of the core tenets of ensuring a successful service in production. We collect quantifiable, anonymous data to ensure our services are optimal and operational.

We have the following means of monitoring:

CloudWatch Logs. CloudWatch is an AWS service that provides active monitoring. Logging statements are printed by both the system and Lambda logic. We usually print informational and error messages for debugging purposes.

SignalFx Dashboards and Alerts. Square uses SignalFx, a SaaS-based monitoring and analytics platform, for internal monitoring, and its coverage is comprehensive: all API calls to AWS are monitored, and all traffic in AWS is monitored through either SignalFx’s direct integration with CloudWatch or the data we pull from the metrics queue. All the dashboards are backed by alerts that trigger once certain conditions are met.

Unit tests are written for either individual functions or a small group of functions. They are located alongside production code following the Go testing conventions. They must pass before code can be pushed into master.

Integration and CI tests are similar to unit tests, just with an expanded scope. They test multiple systems or end-to-end systems and are gatekeepers for builds going into staging.