Máté Huszárik, Engineer at RisingStack, interested in JS, Golang, .NET Core. Peter Czibik, Senior Engineer at RisingStack.

In this article we share the story of how we fixed a nasty bottleneck in the large-scale microservices infrastructure of one of our clients. As a result of our work, the product became able to serve thousands of requests per second.

This post starts with some general information about the consulting project, then dives deep into a case study of how we solved a nasty scaling issue and tweaked the infrastructure around a Node.js application that has to serve millions of users in real time.

This case study provides insight into how we tackle problems at RisingStack, and you can also learn about investigating performance issues and scaling a microservices architecture.

Early-Day Microservices Adoption

The concept of “Microservices” started to gain traction in the tech community in 2015, but the broad adoption was yet to arrive. Our client dared to embark on a journey that was frowned upon by many at the time. The tooling around Microservices was immensely weak, and the available reference material wasn’t perfect either.

Microservice architecture, or simply microservices, is a unique software development methodology that has gained many followers during the last couple of years. In fact, even though there isn’t a whole lot out there on what it is and how to do it, for many developers, it has become a preferred way of creating enterprise applications.

Thanks to its scalability, the microservices method is considered ideal for serving high load with high availability.

While there is no standard, formal definition of microservices, there are specific characteristics that help us identify the style. Essentially, microservice architecture is a method of developing software applications as a suite of independently deployable, small, modular services in which each service runs a unique process and communicates through a distinct, lightweight mechanism to serve a business goal.

The Scope of Collaboration

We started actively contributing to the development of our client's microservice architecture in Q2 of 2015. Our job involved developing and maintaining a small number of services for them. The first phase of our cooperation lasted for about a year and resulted in 3-4 new microservice apps built with Node.js.

In September 2017, our client reached out to us again with plans for long-term cooperation involving one of their promising products, which enables marketers to deliver messages to mobile applications and can be plugged into our client's already existing marketing automation tools.

Serving Millions of Devices in Real-Time with Node.js

The already mature, although ever-growing smartphone market demands high-performance applications to serve millions of devices all over the world. The product we worked on required real-time integration with already existing apps that have a massive number of users, who are in need of a highly responsive application experience.

What's the challenge? The stack involved a few constraints.

The application itself was hosted on Heroku, which provides an easy “plug and play” experience for developers looking to deploy their applications, but it is also difficult to work with when it comes to obtaining real performance numbers during benchmarking, utilizing hardware and in a few other areas too.

The original problem our client had was that as they grew, they acquired new customers that put a higher load on the application than it could handle in its state at the time.

They set the numbers; we made sure they could deliver them.

Our task was to increase the performance of the back-end to serve millions of devices in real-time without a glitch.

Solving a Nasty Scaling Issue

The first step in solving any scaling issue is to gather all the symptoms. First of all, one of the customers reported receiving 5XX response status codes from the servers, which indicates a server-side failure. After the initial load testing done by the QA team, we knew that the hard limit was around 60-80 requests per second per Heroku dyno.

Finding the Faulty App Layer

When looking for an issue like ours, it is advised to analyze every independent layer of the application from top to bottom. This way, the source of the problem can be pinned to a specific layer of the application.

We started by investigating the application's response time on Heroku with one dyno. The server was repeatedly stress tested with JMeter to measure the average time it took to serve one HTTP request. We relied on JMeter’s output and Heroku’s p95 and p99 metrics to see how the endpoint performed under various levels of load.

The measurements confirmed that while launching a high number of requests (100-400 RPS), the execution time of one API call exceeded 1000 ms.
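For readers less familiar with the p95 and p99 figures mentioned above, here is a minimal sketch of how such percentiles can be computed from raw response times. It uses the nearest-rank method, which is an assumption on our part (Heroku and JMeter may calculate theirs differently), and the sample values are made up for illustration.

```javascript
// Nearest-rank percentile over a list of response times in milliseconds.
function percentile (samples, p) {
  if (samples.length === 0) {
    throw new Error('percentile() needs at least one sample')
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b)
  // Nearest rank: the smallest value with at least p% of the samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank - 1, 0)]
}

// Hypothetical response times (ms) from one load test run.
const responseTimes = [120, 95, 110, 1040, 130, 105, 98, 2300, 115, 102]

console.log('p50:', percentile(responseTimes, 50))
console.log('p95:', percentile(responseTimes, 95))
console.log('p99:', percentile(responseTimes, 99))
```

A median that looks healthy next to a very high p95 or p99 is the typical picture when a minority of requests get stuck waiting behind others.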

We saw the following:

Heroku’s router could not keep up with the incoming HTTP requests because the one dyno was busy processing the previous ones.

The load of the dyno increased to 4-6 times the optimal value.

Memory consumption seemed to be quite normal.

The API endpoint executed three PostgreSQL queries before returning a JSON result.

The next step was to measure the round trips from the application to the database server.

The database server was hosted on a different provider, so the network latency was high (70-90 ms), but it still didn’t explain the enormous response time and the high CPU load. We also examined the queries with the Postgres EXPLAIN ANALYZE command to see whether their execution could be optimized.

We could not improve the query execution itself, but we noted that two of the queries were identical and returned the same result every time.

All in all, the query execution could not be the source of the problem.

Creating an Instrumentation Tool

Moving on, we created an instrumentation tool to gain insight into the rest of the application. It was a plain logger extension that identified parts of the application and wrote every piece of information to the standard output.

The output was visualized on Librato, so we were able to analyze the different execution times on area charts. The charts represented connection pool statistics, SQL query execution time and added all parts of the business logic as well.
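The original tool is not shown here, so the following is only a rough sketch of what such a logger extension could look like. The `startTimer` helper and the metric names are hypothetical, and the `measure#name=value` line format is an assumption about how the log drain was set up to pick up metrics from standard output.

```javascript
// Starts a timer for a named part of the application and returns a function
// that, when called, writes the elapsed time to the given output.
function startTimer (name, write = console.log) {
  const start = process.hrtime.bigint()
  return function stop () {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
    // Assumed log format; adjust it to whatever the metrics service expects.
    const line = `measure#${name}=${elapsedMs.toFixed(2)}ms`
    write(line)
    return line
  }
}

// Hypothetical usage: time two stages of handling a request.
async function handleRequest () {
  const stopAuth = startTimer('middleware.authorize')
  await new Promise((resolve) => setTimeout(resolve, 20)) // stand-in for real work
  stopAuth()

  const stopQuery = startTimer('db.query')
  await new Promise((resolve) => setTimeout(resolve, 5)) // stand-in for a query
  stopQuery()
}

handleRequest()
```

Because every stage reports under its own name, the time spent in each part of a request can be charted side by side, which is how one stage can stand out from the rest.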

Thanks to this information we found out that the authorization middleware of the application could take up 60% of the request time under extreme load. Following this trail, we started to capture CPU profiles of the application server while sending a large number of sample requests. For this purpose we used the remote JS CPU profiler in Chrome DevTools, attached to the running instance on Heroku.

The outcome was odd because there was a line called Program that took a long time to run. The application server under investigation spawned child processes using the throng library that could not be recognized by the profiler.

After removing this library and running the application as a single process, we got a clear picture of what was actually happening under the hood.

Finding the Real Cause

The results confirmed our suspicion about the pbkdf2 key derivation in the authentication middleware. The operation is so CPU heavy that, over time, it takes up all the free time of the processing unit. Heroku’s dynos on shared machines couldn’t keep up with running the derivation on every request, and the router could not pass on incoming requests until the previous ones had been processed by one of the dynos.
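To get a feel for how expensive pbkdf2 is, here is a small sketch that times it with Node's built-in crypto module. The iteration count, key length and digest are assumptions chosen for illustration, since the values used by the original middleware are not stated.

```javascript
const { pbkdf2Sync } = require('crypto')

// Hypothetical parameters: the real values used by the middleware are unknown.
const ITERATIONS = 100000
const KEY_LENGTH = 64
const DIGEST = 'sha512'

// Derives a key synchronously, blocking the event loop while it runs.
function deriveKey (secret, salt) {
  return pbkdf2Sync(secret, salt, ITERATIONS, KEY_LENGTH, DIGEST)
}

// Returns the average time of one derivation in milliseconds.
function timeDerivations (count) {
  const start = process.hrtime.bigint()
  for (let i = 0; i < count; i++) {
    deriveKey('example-secret', 'example-salt')
  }
  const totalMs = Number(process.hrtime.bigint() - start) / 1e6
  return totalMs / count
}

const averageMs = timeDerivations(5)
console.log(`average pbkdf2 time: ${averageMs.toFixed(1)} ms`)
console.log(`rough ceiling on one core: ${Math.floor(1000 / averageMs)} derivations per second`)
```

If every request has to pay this cost, the number of derivations a core can perform per second becomes a ceiling on the request rate, no matter how fast the rest of the handler is.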

The requests stay in the router’s queue until they have been processed or they get rejected after 30 seconds waiting for the dyno. Heroku returns HTTP 503 H13 - Connection closed without a response - which is the original symptom of the issue we were hired to fix.

We shared this information with the maintainers of the application so they could evaluate the result. As we had previous experience with the application stack, we knew that this service only received authenticated requests, so the middleware seemed to be redundant. This hypothesis was confirmed by the client, and the pbkdf2 step was removed.

The subsequent results showed an improvement in both RPS and dyno load. The application running on one dyno could stably serve 80-100 RPS while the load stayed between 0.8 and 1.4. It was still slightly above the optimal value because of the many JSON.parse operations, but this was unavoidable because the data column stored JSON arrays.

This was the point where we could start scaling the application horizontally.

Scaling the Microservices Application Horizontally

First, we instrumented the database connection library so the number of database connections could be visualized. This way, the application could be spawned on multiple dynos while we were able to monitor the load they put on the database.

Initially, the server had a fixed maximum number of database connections, but we replaced the library with knex.js so that connection limits could be configured. This allowed us to experiment with various connection counts on one dyno to see how many connections the server could utilize.
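As an illustration of what a configurable limit can look like, the sketch below builds a knex configuration object with an explicit pool size. The `DB_POOL_MAX` environment variable, the default of 10 and the connection string are hypothetical, and the project's actual configuration may well have looked different.

```javascript
// Builds a knex configuration object with an explicit connection pool limit.
// DB_POOL_MAX is a hypothetical environment variable used for illustration.
function buildKnexConfig (env) {
  const poolMax = Number.parseInt(env.DB_POOL_MAX || '10', 10)
  if (!Number.isInteger(poolMax) || poolMax < 1) {
    throw new Error(`Invalid DB_POOL_MAX: ${env.DB_POOL_MAX}`)
  }
  return {
    client: 'pg',
    connection: env.DATABASE_URL,
    // min: 0 lets idle dynos release connections they are not using.
    pool: { min: 0, max: poolMax }
  }
}

const config = buildKnexConfig({
  DATABASE_URL: 'postgres://user:password@localhost:5432/example',
  DB_POOL_MAX: '30'
})
console.log(config.pool)

// With knex installed, the object would be passed to its factory function:
// const db = require('knex')(config)
```

Keeping the limit in an environment variable makes it possible to try different pool sizes between load test runs without redeploying code.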

In the meantime, we figured out that 2 out of the 3 queries were identical and returned the same result every time, which caused unnecessary database round trips. We also introduced a simple in-memory cache to cut these round trips and the latency they added.
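The cache used in the project is not shown, so the following is a minimal sketch of the idea under our own assumptions: results are kept in a Map with a time-to-live, and the promise itself is stored so that concurrent requests for the same key share one round trip. The names `createCache` and `fetchSettings` are hypothetical.

```javascript
// Minimal in-memory cache with a time-to-live. The clock is injectable,
// which makes expiry easy to exercise without waiting.
function createCache (ttlMs, now = Date.now) {
  const entries = new Map()
  return function cached (key, load) {
    const hit = entries.get(key)
    if (hit && hit.expiresAt > now()) {
      return hit.promise
    }
    // Store the promise so concurrent callers share a single load.
    const promise = (async () => load())()
    entries.set(key, { promise, expiresAt: now() + ttlMs })
    // Do not keep failed loads around; the next caller should retry.
    promise.catch(() => {
      const current = entries.get(key)
      if (current && current.promise === promise) {
        entries.delete(key)
      }
    })
    return promise
  }
}

// Hypothetical stand-in for a database query that returns the same result every time.
let queryCount = 0
async function fetchSettings (appId) {
  queryCount++
  return { appId, features: ['push'] }
}

const cached = createCache(60 * 1000)

async function main () {
  await cached('settings:42', () => fetchSettings(42))
  await cached('settings:42', () => fetchSettings(42))
  console.log('database queries executed:', queryCount)
}

main()
```

An in-memory cache lives per process, so each dyno keeps its own copy. That is acceptable for data that rarely changes, but the time-to-live should reflect how stale a result is allowed to be.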

These steps increased the RPS to 200-240 with one dyno.

ApacheBench served us well up to 800-1000 RPS, but beyond that we needed multiple machines to stress test the endpoint on Heroku. For this reason, we set up several JMeter slaves on DigitalOcean that could send a high load of requests in parallel.

The application was tested with different connection pool sizes and numbers of dynos to find the optimal setup for serving a high number of requests without errors. The following table summarizes the measurements:

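| Number of web dynos | Max connection pool / dyno | Average RPS | Max RPS |
|---------------------|----------------------------|-------------|---------|
| 1                   | 200                        | 292         | 350     |
| 2                   | 200                        | 482         | 595     |
| 4                   | 100                        | 565         | 1074    |
| 4                   | 100                        | 837         | 1297    |
| 8                   | 50                         | 1145        | 1403    |
| 8                   | 50                         | 1302        | 1908    |
| 16                  | 30                         | 1413        | 1841    |
| 16                  | 30                         | 1843        | 2559    |
| 16                  | 30                         | 2562        | 3717    |
| 20                  | 25                         | 2094        | 3160    |
| 24                  | 20                         | 2192        | 2895    |
| 24                  | 20                         | 2889        | 3533    |
| 30                  | 16                         | 2279        | 2924    |
| 36                  | 14                         | 2008        | 3070    |
| 36                  | 14                         | 3296        | 4014    |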

The first goal was to achieve at least 2000 RPS, which the load tests surpassed. It could be achieved with 16 dynos, each with up to 30 database connections.

We went further to see where the limits of scaling were and what the best result was that we could get with the current setup. It turned out that the next bottleneck was the number of available database connections. The Postgres database we used provided at most 500 connections, and above a load of 2500-2800 RPS the execution time of the queries increased from 6-7 ms to 12-15 ms.
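The relationship between dyno count, pool size and the 500-connection limit is simple arithmetic, sketched below. Both helper functions are hypothetical and only look at the worst case, where every dyno opens its full pool.

```javascript
// Worst-case number of database connections if every dyno fills its pool,
// and how much room is left under the database's connection limit.
function connectionBudget (dynos, poolMaxPerDyno, connectionLimit) {
  const worstCase = dynos * poolMaxPerDyno
  return { worstCase, headroom: connectionLimit - worstCase }
}

// Largest per-dyno pool size that keeps the worst case within the limit.
function maxPoolPerDyno (connectionLimit, dynos) {
  return Math.floor(connectionLimit / dynos)
}

const CONNECTION_LIMIT = 500 // the limit of the Postgres plan mentioned above

// The setup that reached the first goal: 16 dynos with up to 30 connections each.
console.log(connectionBudget(16, 30, CONNECTION_LIMIT))

// How the per-dyno pool has to shrink as more dynos are added.
for (const dynos of [8, 16, 24, 36]) {
  console.log(`${dynos} dynos -> at most ${maxPoolPerDyno(CONNECTION_LIMIT, dynos)} connections each`)
}
```

Since the product of the two numbers is capped, adding dynos beyond a certain point only splits the same connections into smaller pools, which is why the database limit became the next bottleneck.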

The initial goal was achieved, and we also noted that the number of connections to the database had to be increased for further improvement.

What We Achieved

By the end of the project, we managed to speed up a single user-facing system fivefold. Our client can now support customers who serve thousands of concurrent users with their products.

We at RisingStack gained a deeper understanding of the performance issues that can arise in any other Node.js service, be it a web server, a worker or something else, and we can apply these design principles to our next customer’s codebase with more confidence and agility.

Final Thoughts on Building Apps with Node.js

The most important takeaway from this case study is to understand the limitations of our software. As sailors say, the most important thing is to know when it is time to sail and when it is not.

Node.js as a platform has a few limitations of its own that we have to accept. However, with proper logging, monitoring, and an in-depth understanding of platforms and tooling, you can scale and serve millions of customers in real time.

We have already invested time and effort into research and development on bleeding-edge software to avoid such problems in the future. We at RisingStack, with years of Node.js expertise behind us, have learned these lessons the hard way, so our future customers won’t have to.