ESLint is a great way to keep your project neat and tidy. We were targeting Node 8.10, as that’s what Lambda supported at the time, with 2016 support. I recommend enabling the additional ‘experimentalObjectRestSpread’ parser feature here. Object spread is supported by Node 8.10, but ESLint will yell at you for using it otherwise. (The spread operator is amazing.)
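As a rough sketch of what that could look like in a `.eslintrc.js`, assuming an ESLint 4-era setup: everything other than the spread flag (the `env`, the `extends` value, reading "2016 support" as `ecmaVersion: 2016`) is an assumption, not the project's actual config. From ESLint 5 onward the flag is deprecated in favour of `ecmaVersion: 2018`.

```javascript
// .eslintrc.js (hypothetical; only the spread flag comes from the text above)
const eslintConfig = {
  env: { node: true, es6: true },
  extends: 'eslint:recommended',
  parserOptions: {
    ecmaVersion: 2016, // assumption: one reading of "2016 support"
    ecmaFeatures: {
      // Lets ESLint 4.x parse object rest/spread, which Node 8.10 runs natively.
      experimentalObjectRestSpread: true,
    },
  },
};

module.exports = eslintConfig;
```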

Function lambdas that weren’t APIs were a great way of consuming and emitting events throughout the system. Small processes such as sending an email because x happened or completing a workflow because y notification came in were written and deployed in no time. In almost all cases it was just a handler function and a state machine with one purpose.

SNS & SQS

Another tenet we had was a strong focus on event driven architecture. All communication between services and actions was to be driven by events. We ended up using the pub/sub model of SNS and SQS combined, with events being published through SNS and passed onto lambdas using SQS. I’ve spoken about this before and I’m a big fan. For our use case this model worked well and without other systems to keep in sync we were fine with how the flow worked. Replay-ability wasn’t a big priority for us so an eventstore wasn’t needed.

Services publish to an SNS topic and then that topic is subscribed to by one or more SQS queues with a Lambda attached. The lambda then consumes the event and does one of three things:

It consumes and actions the event, and nothing more needs to happen.

The event isn’t ready to be consumed, so it’s placed back on the queue. (In this case the 3rd party service could take several minutes to complete its action.)

The event errors and is then reprocessed. We used a dead-letter threshold of 10 before events were placed into a separate dead-letter queue, causing an alarm to go off.
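Those three outcomes can be sketched as a single consumer function. This is a minimal illustration, not the project's code: `consumeEvent`, the `deps` object and its `isReady`, `requeue` and `handle` members are hypothetical names, and the 60 second delay is an assumption. The dependencies are injected so the sketch runs without the AWS SDK.

```javascript
// Hypothetical consumer for one SQS-delivered event. In a real Lambda the
// members of `deps` would wrap SDK calls and business logic.
async function consumeEvent(event, deps) {
  // Outcome 2: the upstream action hasn't finished yet, so put the event back
  // on the queue with a delay and treat this delivery as done.
  if (!(await deps.isReady(event))) {
    await deps.requeue(event, { delaySeconds: 60 }); // delay is an assumption
    return 'requeued';
  }

  // Outcome 1: act on the event. Returning normally lets the message be deleted.
  // Outcome 3: if this throws, the error propagates, the message becomes
  // visible again, and the queue's redrive policy moves it to the dead-letter
  // queue once the receive threshold (10 in the text above) is exceeded.
  await deps.handle(event);
  return 'processed';
}
```

Note that the "not ready" path deliberately does not throw. If it did, every postponed delivery would count towards the dead-letter threshold.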

We had no issues with this method of eventing and it actually really helped to create the distinctions between services.

The shared libraries point comes across strongly with these events as well. Having a well defined event contract is key for adding services that consume them. We did feel a slight bit of pain when it came to keeping these events in sync and having a shared place for these to be defined would have alleviated a lot of that.
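One possible shape for such a shared place is a small contract module that both publishers and consumers import. This is only a sketch under assumptions: the event types, field names and the `validateEvent` helper are all illustrative, not taken from the project.

```javascript
// Hypothetical shared contract module; event names and fields are illustrative.
const contracts = {
  'user.registered': { required: ['userId', 'registeredAt'] },
  'device.activated': { required: ['deviceId', 'userId'] },
};

// Returns a list of problems; an empty list means the event matches its contract.
function validateEvent(event) {
  if (!Object.prototype.hasOwnProperty.call(contracts, event.type)) {
    return [`unknown event type: ${event.type}`];
  }
  const payload = event.payload || {};
  return contracts[event.type].required
    .filter((field) => payload[field] === undefined)
    .map((field) => `missing field: ${field}`);
}
```

A publisher could call `validateEvent` before publishing and a consumer on receipt, so both sides fail loudly when the contract drifts.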

Cognito

As part of the ‘all services must be AWS native’ tenet we used Cognito for our security. This was my first look at Cognito so there was a lot to learn. We ended up using both Federated Identities (Identity Pools) and User Pools to manage our authentication and login. We had two types of users: one managed by our system and one managed by a 3rd party.

This worked fairly well, although the fact that you don’t get a verifiable bearer token from Identity Pools was frustrating. We ended up using IAM roles to restrict our API endpoints, which isn’t ideal and isn’t standard API behaviour. A fair bit of education was needed to get the teams consuming the endpoints across how AWS request signing works. User Pools do actually issue a verifiable JWT, but as we wanted the same endpoints to cater to both types of users we couldn’t lean on this functionality.

In hindsight, a better approach than using IAM roles directly would have been to encode the IAM credentials into a JWT and then use a Lambda Authorizer to verify and pass on the credentials. This would have saved a lot of time and kept all of our security using the same method.

The authentication workflow for our users was pretty cool though:

Initial users use an unrestricted API endpoint to generate an unauthenticated Cognito user. IAM credentials restricted to the next API are returned to the consumer. This becomes the user’s session.

Use those credentials to access the API responsible for registration. Once registration is complete, escalate the Cognito user to be ‘developer authenticated’ and return new IAM credentials for that user to the consumer. This is the user’s new session.

Use the newly authenticated credentials to access the rest of the API suite.
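The three steps above amount to a small state machine over session stages. The sketch below models only that progression. The stage names, API group names and helpers are hypothetical, and in the real system the restriction is enforced by the IAM credentials Cognito hands out, not by application code like this.

```javascript
// Hypothetical model of the session stages described above.
const SESSION_STAGES = {
  // No credentials yet: only the unrestricted endpoint is reachable.
  none: { allowedApis: ['create-session'], next: 'unauthenticated' },
  // Unauthenticated Cognito identity: credentials scoped to registration only.
  unauthenticated: { allowedApis: ['registration'], next: 'developerAuthenticated' },
  // Developer-authenticated identity: the rest of the API suite.
  developerAuthenticated: { allowedApis: ['*'], next: null },
};

// Whether a session at `stage` may call the named API group.
function canCall(stage, api) {
  const { allowedApis } = SESSION_STAGES[stage];
  return allowedApis.includes('*') || allowedApis.includes(api);
}

// Move a session to the next stage, e.g. once registration completes.
function escalate(stage) {
  const { next } = SESSION_STAGES[stage];
  if (next === null) throw new Error(`${stage} is already the final stage`);
  return next;
}
```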

The workflow was a bit annoying to put together. A challenge we had that necessitated this was a 3rd party integration that didn’t have an immediate response, meaning the registration could take several minutes. Ideally there wouldn’t have been a registration session at all.

There are some downsides to using IAM credentials for user sessions. One is that, as far as I know, you can’t actually invalidate the IAM credentials without deleting the user, meaning that every time you use a federated identity you need to create a new Cognito user. IAM credentials also set off every alarm a penetration tester has once they come through, even if they are restricted to only act on the APIs they are intended for.

DynamoDB

I’ve used DynamoDB several times before this project and it’s always an interesting point. We played to its strengths: small transactional rows that leverage DynamoDB’s powerful event streams and speed. The overall database design did lean towards being a sort of unlinked but still relational DB, with several sub-tables for different relationships between objects. There were tables used purely for processing and events, which was great. One thing to be wary of here is adding unnecessary tables where existing rows could instead be extended, since extending a row doesn’t require a data migration.

As usual, using DynamoDB and Lambda together is seamless and a pleasure to implement.

AWS and Terraform

All infrastructure was defined as code. By the end of the project we had ~240 assets, which is a lot. The time to deploy was around 4 minutes, which doesn’t seem like a long time, but when you have several people blocking up the build pipeline it can seem like forever. This left a pretty large window for states to be corrupted or incomplete, especially if something went askew on the build machine. The initial deployment time was actually upwards of 12 minutes, before optimisations to the project and architecture brought it down.

Each feature branch had its own stack built when commits were made, meaning each day it was common to have ~30 uniquely deployed stacks within AWS. This number doubled when the feature branch went into pull request and another ‘merged’ branch was deployed. You might be thinking at this point, isn’t that a LOT of infrastructure to have deployed? Doesn’t AWS have soft limits? Doesn’t AWS have hard limits too? Yes, yes and yes.

A couple of weeks into the project, those limits were reached. In most cases the limit could be increased, but in others AWS had a hard limit. Parts that reached their limits:

S3 Buckets

API Gateways

Lambda Functions

IAM Roles + Policies

User Pools

Teardowns of the infrastructure were a massive problem across all teams. The API gateway design was suboptimal, with one gateway per API rather than all the APIs sitting under routes within the same gateway. We had a total of 7 API Gateway instances per stack. Multiply that by the number of stacks we had at once and you could exceed 200 API Gateways daily. This is a problem because an AWS account is restricted to deleting 1 API every 30 seconds. That’s a lot of time to tear down just one of the resource types we had.
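To put numbers on that, here is the arithmetic as a quick sketch, using the figures quoted in this post (7 gateways per stack, ~30 stacks a day, one deletion every 30 seconds). The function name is just for illustration.

```javascript
// Minutes needed to delete every API Gateway when deletions are serialised.
function gatewayTeardownMinutes(gatewaysPerStack, stacks, secondsPerDeletion) {
  const gateways = gatewaysPerStack * stacks;
  return (gateways * secondsPerDeletion) / 60;
}

// 7 gateways x 30 stacks = 210 gateways; 210 x 30s = 6300s = 105 minutes.
console.log(gatewayTeardownMinutes(7, 30, 30)); // 105
```

With the stack count doubled by the merged PR builds, the same calculation for 60 stacks gives 210 minutes.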

Trying to terraform destroy the whole stack

This problem coupled with the fact that S3 Buckets have to be empty to be deleted caused havoc, with a huge portion of time dedicated to cleaning up the AWS environment as efficiently as possible. With a small team of developers this wouldn’t have been too bad, but with the velocity of the project and the mono-repo design, it was unsustainable.

A new, more concise design for the API gateway with sub routes for each API was developed towards the end of the project which addressed the issue but it wasn’t implemented before MVP.

Repository: Git

Mono. Repo. Two very nasty words. The entire project was within the same repository: The cloud system, the frontend and the QA testing suite. I am still strongly against this and believe that a separation of these systems goes a long way. Issues that arose between teams because of this:

The frontend branches also had their own redundant AWS resource stacks (Where they actually just developed against the ‘develop’ branch backend). This meant the frontend was heavily affected by build issues that were completely unrelated to their development.

Merging was a nightmare.

At one point coding standards were set at the root of the project, with ESLint and other tooling that conflicted heavily between the frontend and backend.

The project size exploded when a remote branch had binaries added to it, causing massive delays in Git actions.

The build exceeded 15 minutes per commit. This covered the repository cloning, Terraform, frontend deployment, QA testing and Docker deployments.

There were a few git rules active during the start of the project that needed to be rescinded:

Forcing the feature branch to be up to date with the latest develop commit before it was actually merged into develop. This meant that if someone committed their code right as your 15 minute build went green, you were forced to merge your branch again and wait for the build, crossing your fingers that nobody would commit before you. This is a bad idea; don’t ever do this on high-velocity projects.

Actual footage of merging just before builds finish running

PR and branch builds both had to be green. Unfortunately, when you have 15 developers all committing code at the same time, hitting the sweet spot where your 15 minute build and branch were both in sync was insane. This caused a lot of velocity issues. Eventually we settled on just requiring the PR build to be green. This build is automatically merged with develop, so we could ensure that the merge would be OK.

The biggest issue we had with GitHub itself was an outage on their side, which set us back an entire day.

All things considered, the Pull Request system within GitHub worked very well as a code review tool, once all the rules were in the right place.

Travis

The build/CI platform we used was Travis CI. I’m no stranger to the tooling but it was new for the project. It was used with bash scripts running the whole process. The first issue was that the bash environment in Travis has a very old version of npm installed. Users who were unfamiliar with the environment were a bit stumped when the Node builds were taking decades. npm can be manually updated but it’s not immediately obvious.

While Travis has a very simple UI it can still be somewhat confusing for a lot of users all on the same repository. There is also a huge issue with congestion (particularly around 4:30pm). At points there can be 20+ builds queued as everyone commits for the end of day. Utilising something like a git tagging system to manually trigger builds or just relying on PR builds rather than per-commit may have been a better option in this case.

SwaggerHub

Before I joined this project, as part of speccing, all the API endpoints were defined within SwaggerHub. We did run into some issues where the SwaggerHub documentation could have used a bit more polish before development and integration were kicked off. There were a couple of instances where certain responses were still templated or left over from the initial setup, and were incorrect.

IoT

There isn’t a lot I can mention about the device itself, or its purpose really. But I can talk about the systems behind it.

The general gist is that the cloud system releases events to the IoT device, which then receives them and does what it needs to. The devices themselves can also raise events that require a response, but the interactions are very limited.

The IoT communication was conducted using AWS IoT. There was a process to register devices that involved a serial number provided by a supplier. Once the device was received, the user would register the device and the platform would mark it as ‘active’.

When messages between the cloud platform and the IoT devices were sent, they used unique IDs per device that were linked during the 3rd party registration. Getting the eventing between AWS IoT and the cloud platform running smoothly was not as difficult as anticipated; it was just another integration point using lambdas and SNS/SQS.

3rd Parties

For this project we had a couple of third parties with specific roles. One acted as a user registration and login provider for Type 1 users. There were some restrictions that needed to be put in place such as only being contactable through a VPN. We ended up using a lambda to proxy HTTP requests! This was the only lambda to be directly invoked by another. This wasn’t too painful and with the way the integration was separated, any issues with this provider only affected a small part of the system as a whole.

Other 3rd party integrations mostly involved notifications and small updates into our cloud platform. Pretty small, very specific integration points which were modular enough to swap in and out.

This is probably where using eventing shows one of its greatest strengths. A small outage of one integration is insignificant when it comes to the rest of the system. For example, a 3rd party that we send a bit of info to after every x action simply has its events back up, ready to be sent through once it is back online. This makes the system a lot more robust and much less prone to total collapse.

Mock-API

As part of our testing we needed to have automated end-to-end testing of vertical slices. This included 3rd parties. Unfortunately we couldn’t hammer 3rd party sandboxes 200 times a day with junk data. Our solution to this was to add a lambda API that took requests, first looking for a file in S3 that reflected the route: ‘google/profile/getUserData.json’. This file would have a JSON payload that could be updated to return any response you wanted to mock. Using this system we could return 200 OK and an expected payload, a 400 Bad Request, or even a 200 with an unexpected payload.

Secondly, it would try to load a static response from SwaggerHub’s VirtServer. We had files for each 3rd party endpoint, and this would be the ‘default’ response.

This was an issue when a lot of the tests called SwaggerHub at the same time (a test suite running 10 times concurrently). SwaggerHub’s VirtServer actually has a rate limit of 10 per minute, which unfortunately isn’t very stable and may serve 10 or 40 requests per minute depending on load. This caused intermittent failures during our testing before we figured out what was going on, which was infuriating as the initial thought process is:

How is my code SOMETIMES throwing errors, where did I go so wrong
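The lookup order described in this section (S3 override first, then the default static response) can be sketched as follows. This is a simplified illustration, not the project's code: `resolveMock`, `store` and `fetchDefault` are hypothetical names, and the `{ statusCode, body }` file shape is an assumption. `store.get` stands in for an S3 read and `fetchDefault` for the call to the default response, both injected so the sketch runs without network access.

```javascript
// Hypothetical resolver for a mocked 3rd party route.
async function resolveMock(route, store, fetchDefault) {
  const key = `${route}.json`; // e.g. google/profile/getUserData.json
  const override = await store.get(key); // undefined when no override exists
  if (override !== undefined) {
    // A test has placed a file for this route: return exactly what it says.
    return JSON.parse(override);
  }
  // No override: fall back to the default static response.
  return fetchDefault(route);
}
```

Caching the default responses alongside the overrides would be one way to avoid leaning on a rate-limited upstream during concurrent test runs, though that is a suggestion rather than something the project did.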

In our development environments all URLs for 3rd parties were swapped to the mock API at build and prefixed with their name. This worked really well.
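A build-time rewrite along those lines might look like the sketch below. The function name, the example hosts and the exact URL layout are assumptions; the only detail taken from the text is that the provider's name is used as a prefix.

```javascript
// Hypothetical build-time rewrite: point a 3rd party URL at the mock API,
// prefixing the path with the provider's name.
// (On Node 8, `URL` comes from `require('url')`; it is a global from Node 10.)
function toMockUrl(provider, originalUrl, mockBaseUrl) {
  const { pathname, search } = new URL(originalUrl);
  const base = mockBaseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return `${base}/${provider}${pathname}${search}`;
}

console.log(
  toMockUrl('google', 'https://api.example.com/profile/getUserData', 'https://mock.example.com')
); // https://mock.example.com/google/profile/getUserData
```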

Conclusion

The velocity of this project was mind-blowing. The deadline was tight and the pressure was on. It was an incredible experience to be able to work in that sort of environment while also producing top-quality code, although I don’t think the pace would have been sustainable over long periods of time. Anything longer than 3 months would be grueling.

The most challenging part of the project was the build pipeline: the time wasted waiting for builds, and the issues that stopped builds from completing. The learning curve for Cognito was steep, but in general most pieces within AWS fit together so well that major blockers were rare. This was the largest project I’ve worked on that was 100% AWS backed. I learned a LOT and I eagerly await the next one.
