As Plaid grows, so does the scale of our infrastructure. We currently run over 20 internal services and deploy over 50 code commits per day across our core services. Minimizing deployment time is therefore of vital importance to maximizing our iteration velocity. A fast deployment process allows us to rapidly ship bug fixes and run a smooth continuously deployed system.

A couple of months ago, we noticed that slow deploys of our bank integration service were affecting our team's ability to ship code. Engineers would spend at least 30 minutes building, deploying, and monitoring their changes through multiple staging and production environments, which consumed a lot of valuable engineering time. This became increasingly unacceptable as the team grew larger and we shipped more code daily.

While we had plans to implement long-term improvements like moving our Amazon ECS-based service infrastructure onto Kubernetes, a fix was warranted to increase our iteration speed in the short term. We set out to score a quick win by implementing a custom "fast deployments" mechanism.

High latency in Amazon ECS deploys

Our bank integration service consists of 4,000 Node.js processes running in dedicated Docker containers managed and deployed on ECS, Amazon’s container orchestration service. After profiling our deployment process, we narrowed down the increased deployment latencies to three distinct components:

Starting up tasks incurs latency. In addition to application startup time, there is also latency from the ECS health check, which determines when containers are ready to start handling traffic. The three parameters that control this process are interval, retries, and startPeriod. Without careful health check tuning, containers can be stuck in the "starting" state even after they're ready to serve traffic.

Shutting down tasks incurs latency. When we run an ECS Service Update, a SIGTERM signal is sent to all our running containers. To handle this, we have some logic in our application code to drain any extant resources before completely shutting down the service.

The rate at which we can start tasks restricts the parallelism of our deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS start-task API call has a hard limit of 10 tasks per call, and it is rate-limited. We need to call it 400 times to place all our containers in production.
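To make the third bottleneck concrete, here is a back-of-the-envelope sketch of how the per-call limit translates into API calls. The 10-task limit and the 4,000-process count come from the text above; the calls-per-second figure is a purely hypothetical placeholder, since the actual rate limit is not stated here.

```javascript
// Rough model of how many start-task calls a full deploy needs.
// TASKS_PER_CALL and TOTAL_TASKS come from the article; the rate limit passed
// to minimumPlacementSeconds is a hypothetical value used only for illustration.
const TASKS_PER_CALL = 10;
const TOTAL_TASKS = 4000;

function startTaskCalls(totalTasks, tasksPerCall) {
  return Math.ceil(totalTasks / tasksPerCall);
}

// Lower bound on placement time if calls are throttled to callsPerSecond.
function minimumPlacementSeconds(totalTasks, tasksPerCall, callsPerSecond) {
  return startTaskCalls(totalTasks, tasksPerCall) / callsPerSecond;
}

const calls = startTaskCalls(TOTAL_TASKS, TASKS_PER_CALL);
console.log(`start-task calls needed: ${calls}`); // 400
console.log(
  `seconds at a hypothetical 1 call/s: ${minimumPlacementSeconds(TOTAL_TASKS, TASKS_PER_CALL, 1)}`
); // 400
```

Whatever the real rate limit is, the number of calls grows linearly with the container count, which is why adding parallelism elsewhere in the deploy does not remove this floor.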

Approaches explored

We considered and experimented with a few different potential solutions to chip away at our overall deployment time:

Reduce the total number of containers running in production. This was certainly feasible, but it involved a significant overhaul of our service architecture in order for it to handle the same request throughput, and more research needed to be done before such a change could be made.

Tweak our ECS configuration by modifying the health check parameters. We experimented with tightening the health check by reducing the interval and startPeriod values, but ECS would then erroneously mark healthy containers as unhealthy when they started, causing our service to never fully stabilize at 100% health. Iterating on these parameters was a slow and arduous process due to the root issue, slow ECS deployments.

Spin up more instances in the ECS cluster so that we can start more tasks simultaneously during a deployment. This worked to reduce deploy times, but not by very much. It also isn’t cost-effective in the long run.

Optimize service restart time by refactoring initialization and shutdown logic. We were able to shave around 5 seconds per container with a few minor changes.
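The health check trade-off in the second approach can be illustrated with a simplified model. The sketch below assumes that checks fire every interval seconds from container start, that a single passing check marks the container healthy, and that failures during startPeriod do not count toward retries. Real ECS behavior has more moving parts (a per-check timeout, for instance), so treat the numbers as illustrative rather than measured; the function and field names are hypothetical.

```javascript
// Simplified model of a container health check. All times are in seconds.
// Assumption: the app starts passing checks at exactly appReadySeconds.
function simulateHealthCheck({ appReadySeconds, interval, retries, startPeriod }) {
  let failures = 0;
  for (let t = interval; ; t += interval) {
    if (t >= appReadySeconds) {
      // First passing check: time spent "starting" while already ready is wasted.
      return { status: 'healthy', atSeconds: t, wastedSeconds: t - appReadySeconds };
    }
    if (t > startPeriod) {
      // Failures only count once the start period has elapsed.
      failures += 1;
      if (failures >= retries) {
        return { status: 'unhealthy', atSeconds: t };
      }
    }
  }
}

// Loose settings: healthy, but 10 seconds are spent waiting for the next check.
console.log(simulateHealthCheck({ appReadySeconds: 20, interval: 30, retries: 3, startPeriod: 0 }));
// Tight interval without a start period: marked unhealthy before the app is ready.
console.log(simulateHealthCheck({ appReadySeconds: 20, interval: 5, retries: 3, startPeriod: 0 }));
// Tight interval with a start period covering startup: healthy with no wasted time.
console.log(simulateHealthCheck({ appReadySeconds: 20, interval: 5, retries: 3, startPeriod: 20 }));
```

In this model the tight settings only work when startPeriod covers the slowest realistic startup, which is hard to guarantee across thousands of containers.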

Although these changes improved the overall deploy time by a few minutes, we still needed to improve the timing by at least an order of magnitude for us to consider the problem solved. This would require a fundamentally different solution.

Preliminary solution: utilizing the node require cache to “hot reload” application code

The Node require cache is a JavaScript object that caches modules when they are required. This means that executing require('foo') or import * as foo from 'foo' multiple times will only load the foo module from disk the first time. Magically, deleting an entry in the require cache (which we can access using the global require.cache object) will force Node to re-read the module from disk when it’s next imported.
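The behavior is easy to observe in isolation. The following self-contained sketch writes a throwaway module to a temporary directory (the file name and contents are made up for the demonstration), requires it, changes it on disk, and then deletes its cache entry.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write a throwaway module to a temporary directory so the example is self-contained.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'require-cache-demo-'));
const modulePath = path.join(dir, 'greeting.js');
fs.writeFileSync(modulePath, "module.exports = 'v1';");

const first = require(modulePath); // reads the file from disk

// Change the file on disk; the cached export is still returned.
fs.writeFileSync(modulePath, "module.exports = 'v2';");
const cached = require(modulePath);

// Delete the cache entry (keyed by resolved filename) to force a fresh read.
delete require.cache[require.resolve(modulePath)];
const reloaded = require(modulePath);

console.log(first, cached, reloaded); // v1 v1 v2
```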

To circumvent the ECS deployment process, we experimented with utilizing Node’s require cache to perform a “hot reload” of application code at runtime. On receiving an external trigger — we implemented this as a gRPC endpoint on the bank integration service — the application would download new code to replace the existing build, clear the require cache, and thereby force all relevant modules to be re-imported. With this approach, we were able to eliminate much of the latency present in ECS deploys and fine-tune our entire deployment process.

Over Plaiderdays, our internal hackathon, a group of engineers across various teams got together to implement an end-to-end proof of concept for what we termed "Fast Deploys". As we hacked a prototype together, one thing seemed amiss: if the Node code that downloaded new builds also tried to invalidate the cache, it wasn’t clear how the downloader code itself would be reloaded. (There is a way around this with the Node EventEmitter, but it would add considerable complexity to the code.) More importantly, there was also some risk of running versions of code that were not in sync, which could cause our application to fail unexpectedly.
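The out-of-sync risk can be reproduced in a few lines. This is a minimal sketch with made-up module names, not the prototype itself: one module captures a dependency at load time, then only the dependency's cache entry is invalidated.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'stale-module-demo-'));
const versionPath = path.join(dir, 'version.js');
const handlerPath = path.join(dir, 'handler.js');

// handler.js captures version.js once, at load time.
fs.writeFileSync(versionPath, "module.exports = 'build-1';");
fs.writeFileSync(
  handlerPath,
  "const version = require('./version');\nmodule.exports = () => version;"
);

const handler = require(handlerPath);

// "Deploy" a new build, but only invalidate version.js.
fs.writeFileSync(versionPath, "module.exports = 'build-2';");
delete require.cache[require.resolve(versionPath)];

const freshVersion = require(versionPath); // re-read from disk
const staleVersion = handler(); // still the value captured before the reload

console.log({ freshVersion, staleVersion });
```

Any module, timer, or in-flight request holding a reference obtained before the invalidation keeps running old code alongside the new, which is the kind of mixed state that is hard to rule out in a large service.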

As we weren’t willing to compromise on the reliability of our bank integration service, this complication warranted rethinking our “hot reloading” approach.

Final solution: reloading the process

In the past, in order to run a series of uniform initialization tasks across all our services, we wrote our own process wrapper, which is aptly named Bootloader. At its core, Bootloader contains logic to set up logging pipes, forward signals, and read ECS metadata. Every service is started by passing the application executable’s path to Bootloader, along with a series of flags, which Bootloader then executes as a subprocess after performing the initialization steps.

Instead of clearing Node's require cache, we updated our service to call process.exit with a special exit code after downloading the intended deployment build. We also implemented custom logic in Bootloader to trigger a process reload of any child process that exits with this code. Similar to the “hot reload” approach, this enables us to bypass the cost of ECS deploys and quickly bootstrap new code, while avoiding the pitfalls of “hot reloading”. Furthermore, having this "Fast Deploy" logic at the Bootloader layer allows us to generalize it to any other service we run at Plaid.
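As a rough illustration of the wrapper side, here is a minimal sketch of a supervisor that restarts its child on a designated exit code. It is not Bootloader's actual implementation: the function name, the exit code value 75, and the choice of forwarded signals are all assumptions made for the example, and the real wrapper also handles logging pipes and ECS metadata.

```javascript
const { spawn } = require('child_process');

// Hypothetical value; the article does not say which exit code is used.
const RELOAD_EXIT_CODE = 75;
const FORWARDED_SIGNALS = ['SIGTERM', 'SIGINT'];

// Runs `command` as a child process, restarting it whenever it exits with
// reloadExitCode. Resolves once the child exits for any other reason.
function superviseWithReload(command, args, reloadExitCode = RELOAD_EXIT_CODE) {
  return new Promise((resolve, reject) => {
    let restarts = 0;

    const start = () => {
      const child = spawn(command, args, { stdio: 'inherit' });

      // Forward termination signals so the child can drain and shut itself down.
      const handlers = FORWARDED_SIGNALS.map((signal) => {
        const handler = () => child.kill(signal);
        process.on(signal, handler);
        return [signal, handler];
      });
      const cleanup = () =>
        handlers.forEach(([signal, handler]) => process.off(signal, handler));

      child.once('error', (err) => {
        cleanup();
        reject(err);
      });
      child.once('exit', (code, signal) => {
        cleanup();
        if (code === reloadExitCode) {
          restarts += 1;
          start(); // the new build is already on disk; just start it again
        } else {
          resolve({ code, signal, restarts });
        }
      });
    };

    start();
  });
}
```

A wrapper entry point could call superviseWithReload('node', ['server.js']) and exit with the resolved code. Because the container itself never stops, ECS sees no task churn during a reload.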

Here's what the final approach looks like:

Our Jenkins deployments pipeline sends an RPC request to all instances of our bank integration service, instructing them to "Fast Deploy" a specific commit hash.

The application receives a gRPC request for a fast deployment and downloads a tarball of the build from Amazon S3, keyed on the received commit hash. It then replaces the existing build on the file system and exits with the special exit code that Bootloader recognizes.

Bootloader sees that the application exited with this special "Reload" exit code, and restarts the application.

Lo and behold, the service now runs new code!
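The application's part of the steps above can be sketched as follows. The handler name, the injected helper functions, the hash validation, and the exit code value are all hypothetical; the S3 download and the gRPC plumbing are deliberately left behind those helpers rather than guessed at.

```javascript
// Hypothetical value; it must match whatever code the process wrapper treats as "reload".
const RELOAD_EXIT_CODE = 75;

// Handles a fast-deploy request for `commitHash`.
// `deps` supplies the side effects so the flow can be read (and tested) in isolation:
//   downloadBuild(commitHash) -> Promise<string>  path to the downloaded tarball
//   installBuild(tarballPath) -> Promise<void>    replaces the build on disk
//   exit(code)                                    normally process.exit
async function handleFastDeploy(commitHash, deps) {
  // Illustrative guard: accept only abbreviated or full hexadecimal Git hashes.
  if (!/^[0-9a-f]{7,40}$/.test(commitHash)) {
    throw new Error(`invalid commit hash: ${commitHash}`);
  }
  const tarballPath = await deps.downloadBuild(commitHash);
  await deps.installBuild(tarballPath);
  // Exit only after the new build is fully in place, so the wrapper never
  // restarts the service into a half-written build.
  deps.exit(RELOAD_EXIT_CODE);
}
```

If the download or install step throws, the function never reaches the exit call, so in this sketch the process simply keeps serving the old build.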

Here’s a very simplified diagram of what happens during this process.

Results

We were able to ship this “Fast Deployments” project within 3 weeks and reduce our deployment times from more than 30 minutes to 1.5 minutes across 90% of our containers in production.

The graph above shows the number of deployed containers for our bank integration service, color-coded by their commits. If you focus on the yellow line graph, you can observe a leveling off in the increase at around 12:15, which represents the long tail of our containers that are still draining their resources.

This project has greatly increased the velocity of Plaid's integrations work, allowing us to ship features and bug fixes more quickly, and minimize engineering time wasted context switching and monitoring dashboards. It is also a testament to our engineering culture of shipping materially impactful projects, embodied by ideas that come out of hackathons.

Want to work on impactful projects like Fast Deployments? We're hiring!