We had blamed startup latency for latency that was actually caused by concurrent requests waiting on each other for use of the CPU. From our performance metrics, time spent waiting to execute because of other running requests is indistinguishable from time spent executing the request. This also means that increased latency from concurrency would appear the same as increased latency from new code paths or features—things that actually increase the cost of any individual request.

It was also becoming more apparent that the BadRequestError: Request aborted error wasn't readily explained by general slow start performance. The error comes from the body parser and specifically happens in the case that the client has aborted the request before the server has been able to fully read the request body. The client gives up and closes the connection, taking away the precious data that we needed to proceed with processing the request. It's much more likely that this would happen because we started handling a request, then had our event loop blocked by another request's rendering, and then went back to finish from where we had been interrupted only to find that the client had gone away. The request payloads for Hypernova are also quite large at a few hundred kilobytes on average, making nothing better.

We decided to solve this by using two off-the-shelf components with which we had a large amount of existing operational experience: a reverse proxy (nginx) and a load balancer (haproxy).

Reverse Proxying and Load Balancing

To take advantage of the multiple CPU cores present on our hypernova instances, we run multiple processes of hypernova via the built in Node.js cluster module. As these are independent processes, we are able to handle concurrent requests in parallel.

The problem here is that each node process is effectively occupied for the entire duration of the request including reading the request body from the client (monorail). While we can read multiple requests in parallel in a single process, this leads to interleaving of compute operations when it comes time to do the render. The utilization of the node processes becomes coupled to the speed of the client and network.

The solution² is to use a buffering reverse proxy to handle communication with the clients. For this, we use nginx. Nginx reads the request from the client into a buffer and passes the full request to the node server only after it has been completely read. This transfer happens locally on the machine over loopback or unix domain sockets, which are faster and more reliable than communication between machines.

With nginx handling reading the requests, we are able to achieve higher utilization of the node processes.

We also use nginx to handle some requests without ever having to go to the Node.js processes. Our service discovery and routing layer uses low cost requests to /ping to check connectivity between hosts. Handling this entirely in nginx eliminates a substantial source of (albeit cheap) throughput to the Node.js processes.

The next piece is load balancing. We need to make smart decisions about which Node.js processes should receive which requests. The cluster module distributes requests via round-robin³, where each process is given a request⁴ in turn. Round robin is great when there is low variance in request latency like in:

It is less great when there are different types of requests that can take radically different amounts of time to handle. Later requests on a process have to wait for all previous requests to finish, even if there is another process that has capacity to handle them.

A better distribution of these requests would look like this:

As this minimizes waiting and allows responses to be returned sooner.

This can be achieved by holding requests in a queue and only assigning them to a process once the process is no longer occupied with another request. We use haproxy for this.

When we put this in place for Hypernova, we entirely eliminated the timeout spikes on deploy as well as the BadRequestError s. Concurrent requests were also a major driver of high-percentile latency during normal operation, so this also reduced that latency. One of the consequences of this is that we went from a baseline timeout rate of 5% to a timeout rate of 2% with the same configured timeout. Going from 40% failures during deploy to 2% feels like a win. Today, users are presented with a blank loading screen far less often. Tomorrow, stability through deploys will be critical for our new renderer, which does not have Hypernova’s error fallback.

Details and Configuration

To set this up required configuring nginx, haproxy, and our node application. I have prepared a sample node app with nginx and haproxy configurations that can be used to understand this setup. These configurations are based on what we run in production, but have been simplified and modified to run in the foreground as an unprivileged user. In production, everything should be configured with your process supervisor of choice (we use runit or, increasingly, kubernetes).

The nginx config is pretty standard, with a server listening on port 9000 configured to proxy requests to the haproxy listening on port 9001 (in our setup, we use Unix Domain Sockets). It also intercepts the /ping endpoint to serve connectivity checks directly. A departure from our internal standard nginx config is that we've reduced the worker_processes to 1, as a single nginx process is more than sufficient to saturate our single haproxy process and node application. We are also using large request and response buffers as the props for our components for hypernova can be quite large (hundreds of kilobytes). You should size your buffers based on your own request/response sizes.

Node's cluster module handles both load balancing and process spawning. To switch to HAProxy for load balancing, we had to create a replacement for the process management parts of cluster . This came together as pool-hall, which is slightly more opinionated about maintaining a pool of worker processes than cluster , but is entirely out of the load balancing game. The example app demonstrates using pool-hall to start four worker processes, each listening on a different port.

The HAProxy config configures a proxy listening on port 9001 that routes traffic to the four workers listening on ports 9002 to 9005. The most important setting is maxconn 1 on each of the workers. This limits each worker to handling one request at a time. This can be seen in the HAProxy stats page (which is configured to run on port 8999)

HAProxy tracks how many connections are currently open between it and each worker. It has a limit configured via maxconn . Routing is set to static-rr (static round robin), so ordinarily each worker is given a request in turn. With the limit set, the routing proceeds with round robin, but skips over any workers that are currently at their limit of requests. If no workers are below their connection limit, the request is queued and will be dispatched to whichever worker becomes available first. This is the behavior we want.



This configuration is probably pretty close to what you wouldd want to use. There are other interesting settings (as well as the requisite copypasta) in there. As part of preparing this config, we conducted a number of tests under normal and anomalous conditions and derived configuration values from them. This gets into the weeds and is not strictly necessary to understand to use this set up, but it is presented in the next section.

HAProxy Deep Dive

There was a lot riding on the HAProxy configuration working exactly as we wanted it to. It would not do us much good if it did not handle concurrent request limiting or queueing the way we expected. It was also important to understand how various types of failures were handled (or not handled). We needed to develop confidence that this was a suitable replacement for the existing cluster setup. To verify this, we performed a series of tests.

The general shape of the tests was to use ab (Apache Benchmark) to run 10,000 requests at various levels of concurrency like

ab -l -c <CONCURRENCY> -n 10000 http://<HOSTNAME>:9000/render

Our configuration used 15 workers instead of the 4 in the example app, and we ran ab on a separate instance from the instance running the app to avoid interference between the benchmark and the system under test. We ran tests at low load (concurrency=5), high load (concurrency=13), and queueing load (concurrency=20). Queueing load ensured that haproxy was always running a queue.

The first set of tests were just normal operation, with no funny business. The next set of tests were under graceful restart of all processes as would occur during a deploy. The final set of these tests featured me killing a subset of the processes at random as would occur if uncaught exceptions were to crash a process.

Separately, infinite loops in application code had been problem, so I ran single requests against an endpoint with an infinite loop.

These tests helped shape our configuration as well as our understanding of how it would all work.

In normal operation, maxconn 1 worked exactly as hoped, restricting each process to handling one request at a time. We don't configure HTTP or TCP health checks on backends as we found that this caused more confusion than it was worth. It seems that health checks do not respect maxconn , though I have not verified this in the code. Our expected behavior is that a process is either healthy and able to serve or is not listening and will immediately give a connection error (there's one major exception to this). We found these health checks to not be sufficiently controllable to be useful for our case, and elected to avoid the unpredictability of overlapping health checking regimes.

Connection errors are something that we can work with. We set option redispatch and retries 3 , which allows requests that receive a connection error to be bounced to a different backend that will hopefully be more forthcoming. Connection refused comes immediately, allowing us to get along with our business.

This only applies to connections that are refused because we happen to not be listening. The connect timeout is not particularly useful as we are dealing with local network. We initially expected to be able to set a low connect timeout to guard against workers that have been caught in an infinite loop. We set a timeout of 100ms and were surprised when our requests timed out after 10 seconds, which was the client/server timeout set at the time, even though control never returned to the event loop to accept the connection. This is explained because the kernel handles getting the connection to established from the client's perspective before the server calls accept .

As an interesting aside, even setting a backlog does not result in a connection not being established as the backlog length is evaluated after the server responds SYN-ACK (and is actually implemented/handled by the server dropping the ACK response back from the client). One important consequence of this is that requests that have had their connections established cannot be redispatched/retried as we have no way to tell whether the backend processed the request or not.

Another interesting result from our tests on processes caught in infinite compute loops was that the client/server timeouts allow for some unexpected behavior. When a request has been sent to a process that causes it to enter into an infinite loop, the backend's connection count is set to 1. With maxconn , this does what we want and prevents any other requests from falling into the tar pit. The connection count is decremented back to 0 after the client/server timeout expires, which allows our one-in-one-out guarantee to be violated and also dooms the poor next request to failure. When a client closes the connection because of timeout or caprice, the connection count is not effected and our routing still continues to work. Setting abortonclose causes the connection count to decrement as soon as the client closes. Given this, the best course of action is to set a high value for these timeouts and keep abortonclose off. Tighter timeouts can be set on the client or nginx side.

We also found a rather ugly attractor that applies in cases of high load. If a worker process crashes (which should be very rare) while the server has a steady queue, requests will be tried on that backend, but will fail to connect as there is no process listening. HAProxy will then redispatch to the next backend with an open connection slot, which will be only the backend that previously failed (as all other backends are busy actually working). This will quickly burn through the retries and result in a failed request because connection errors are way faster than rendering HTML. The process will repeat on the rest of the requests until the queue is completely drained. This is bad, but it is mitigated by the rareness of process crashes, the rareness of running sustained queues (if you are constantly queuing, you are under-provisioned), and, in our specific case, the fact that this failure attractor would also attract our service discovery health checks, quickly marking the entire instance as unhealthy and ineligible for new requests. This is not great, but it minimizes the danger. Future work could handle this through deeper HAProxy integration where the supervisor process observes the process's exit and marks it as MAINT via the haproxy stats socket.

Another change worth noting is that server.close in node waits for existing requests to complete, but anything in the HAProxy queue will fail as the server does not know to wait for requests it has not yet received. Ensuring sufficient drain time between when an instance stops receiving requests and when it starts the server restart process should cover this in most cases.

We also found that setting balance first , which directs most traffic to the first available worker in order (basically saturating worker1 ) reduced latency on our app by 15% under both synthetic and production load vs. balance static-rr . This effect was long lived and beyond what would be easily explained by warm up. This lasted for hours after deploy. Performance degraded over longer periods of time (12 hours), likely because of memory leaks on the hot process. This was also less resilient to spikes in traffic as the cold processes were extremely cold. We still do not have a great explanation for this.

Finally, Node's server.maxConnections setting seemed like it would be helpful here (it certainly did to me), but we found that it did not really provide much utility and did cause errors at times. This setting prevents the server from accepting more than maxConnections by closing any new handles after seeing that it's above the limit. This check is applied in JavaScript, so it does not guard against the infinite loop case (we will correctly abort the request once we return to the event loop…wait). We also saw connection errors caused by it under normal operations, even as we saw no other evidence of multiple requests running. We suspect this is a slight timing issue or difference of opinion between haproxy and Node about when a connection begins and ends. The desire to have an application side guarantee of mutual exclusion is a good one as this allows developers to safely use singletons or other global state. This can be handled by implementing a per-process queue as an express middleware.

Conclusion

Server side rendering represents a different workload from the canonical, mostly I/O workload that node excels at. Understanding the cause of anomalous behavior allowed us to address it with off the shelf components with which we had existing operational experience.