At VIP, we run a highly available Node service that powers much of our platform. One of the biggest challenges we see teams face is how to scale a highly available API.

That’s a broad problem to solve, but let’s assume we already have adequate test coverage and everything in front of the API taken care of for us. We only care about things we can change about the Node app itself.

Our typical answer looks something like this:

- Use Node’s cluster module to fully take advantage of multiple CPUs
- Gracefully reload worker processes for deploys and uncaught exceptions

Node Cluster

Node’s cluster module uses child_process.fork() to create each worker as a new process. The main process and the worker communicate over an IPC channel, which is a Unix domain socket on Unix-like systems.

In a worker, the net module’s server.listen() function hands off most of the work to the main process, allowing child processes to act like they’re all listening on the same port.
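How the main process distributes incoming connections depends on the cluster scheduling policy, which Node exposes as cluster.schedulingPolicy. According to the Node documentation, round-robin is the default everywhere except Windows. The sketch below only reports which policy is active; the describePolicy helper is a name made up for illustration.

```javascript
const cluster = require( 'cluster' )

// Hypothetical helper: map a cluster scheduling policy constant to a label.
const describePolicy = policy => {
  if ( policy === cluster.SCHED_RR ) return 'round-robin'
  if ( policy === cluster.SCHED_NONE ) return 'operating system'
  return 'unknown'
}

// Run in the main process, where the policy constants are defined.
console.log( 'Connections are distributed by: %s', describePolicy( cluster.schedulingPolicy ) )
```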

HTTP Server Example

Let’s take a simple HTTP server as an example. Here we have a server that listens on port 3000 by default and responds with "Hello World!". It also throws an uncaught exception 0.001% of the time to simulate a bug we haven’t accounted for.

    const { createServer } = require( 'http' )

    module.exports = createServer( ( req, res ) => {
      if ( Math.random() > 0.99999 ) {
        throw Error( '0.001% error' )
      }
      res.end( 'Hello World!\n' )
    } ).listen( process.env.port || 3000 )

Obviously a real server would be much more complex, but this toy server is enough for our purposes. We could run it with node server.js and we’d have an HTTP server listening on our machine.
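Waiting for a 0.001% failure is impractical when you want to see the crash path on demand. One option is to pull the request handler out of createServer and make the random source a parameter. This is a sketch rather than part of the original server.js: the handler name and the random parameter are assumptions made for illustration.

```javascript
const { createServer } = require( 'http' )

// Hypothetical refactor: same behavior as the toy server, but the random
// source is a parameter so the failure path can be triggered deliberately.
const handler = ( req, res, random = Math.random ) => {
  if ( random() > 0.99999 ) {
    throw Error( '0.001% error' )
  }
  res.end( 'Hello World!\n' )
}

// createServer only passes ( req, res ), so random falls back to Math.random.
// The server is created but not started here; call .listen() to serve.
const server = createServer( handler )
```

Calling handler( {}, fakeRes, () => 1 ) with a stand-in response object forces the exception, and () => 0 forces the normal response.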

The first thing we’ll do is use Node’s cluster module to start one copy of the server per CPU. The cluster module then automatically load balances incoming connections between them.

    const cluster = require( 'cluster' )

    const WORKERS = process.env.WORKERS || require( 'os' ).cpus().length

    if ( cluster.isMaster ) {
      for ( let i = 0; i < WORKERS; i++ ) {
        cluster.fork()
      }

      cluster.on( 'listening', ( worker, address ) => {
        console.log(
          'Worker %d (pid %d) listening on http://%s:%d',
          worker.id,
          worker.process.pid,
          address.address || '127.0.0.1',
          address.port
        )
      } )
    } else {
      const server = require( './server' )
    }

This will start one copy of the server for each CPU in our system. The operating system will take care of scheduling these processes across the CPUs.
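Environment variables are always strings when set, so WORKERS above only works as a loop bound because the comparison coerces it to a number. A minimal sketch of a stricter alternative follows; resolveWorkerCount is a hypothetical helper, not something from the original example.

```javascript
const os = require( 'os' )

// Hypothetical helper: turn the WORKERS environment variable into a positive
// integer, falling back to the CPU count when it is missing or invalid.
const resolveWorkerCount = ( value, fallback = os.cpus().length ) => {
  const parsed = Number.parseInt( value, 10 )
  return Number.isInteger( parsed ) && parsed > 0 ? parsed : fallback
}

const WORKERS = resolveWorkerCount( process.env.WORKERS )
```

With this in place, a typo such as WORKERS=abc falls back to one worker per CPU instead of silently starting none.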

Graceful Reload

Now that we have multiple processes, we can gracefully reload these in case of errors and for deploys.

Errors

In case of errors, we terminate the worker process and spawn a new one. This is important because an uncaught exception means the process is now in an inconsistent state. In other words, an exception occurred that was not accounted for and we’re not sure what side effects that will have.

First, we’ll ensure that worker processes are restarted if any exit unexpectedly. In the isMaster branch:

    cluster.on( 'exit', ( worker, code, signal ) => {
      if ( ! worker.exitedAfterDisconnect ) {
        console.log(
          'Worker %d (pid %d) died with code %d and signal %s, restarting',
          worker.id,
          worker.process.pid,
          code,
          signal
        )
        cluster.fork()
      }
    } )

Here worker.exitedAfterDisconnect would be true if we call worker.disconnect() or worker.kill(), but false if the worker itself calls process.exit(). That becomes important in the next step, where we automatically terminate the worker process in the case of an uncaught exception.
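To see the restart rule in isolation, it can be dry-run against a stand-in for the cluster module. In the sketch below, restartOnCrash and fakeCluster are illustrative names; the only real API relied on is the exitedAfterDisconnect flag and the 'exit' event described above.

```javascript
const { EventEmitter } = require( 'events' )

// Attach the restart rule to anything that emits 'exit' and can fork().
// In the real main process this would be the cluster module itself.
const restartOnCrash = clusterLike => {
  clusterLike.on( 'exit', worker => {
    if ( ! worker.exitedAfterDisconnect ) {
      clusterLike.fork()
    }
  } )
}

// Stand-in for cluster (an assumption for illustration): it only counts forks.
const fakeCluster = Object.assign( new EventEmitter(), {
  forks: 0,
  fork() { this.forks++ }
} )

restartOnCrash( fakeCluster )
fakeCluster.emit( 'exit', { exitedAfterDisconnect: false } ) // crash: restarted
fakeCluster.emit( 'exit', { exitedAfterDisconnect: true } ) // planned exit: ignored
console.log( fakeCluster.forks ) // 1
```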

    const SHUTDOWN_TIMEOUT = process.env.SHUTDOWN_TIMEOUT || 5000

    process.on( 'uncaughtException', error => {
      console.log( error.stack )
      server.close( () => process.exit( 1 ) )
      setTimeout( () => {
        process.exit( 1 )
      }, SHUTDOWN_TIMEOUT )
    } )

We stop accepting new connections with server.close() and terminate the process with process.exit( 1 ) when all existing connections are closed. Since we want to ensure this worker is stopped within a reasonable timeframe, we force it to exit after 5 seconds.
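The close-then-force-exit pattern can also be factored into a small helper. The following is a sketch under assumptions: shutdownWithTimeout is a hypothetical name, and the exit function is a parameter purely so the helper can be exercised without terminating the current process.

```javascript
// Hypothetical helper: close the server, then call exit( code ) exactly once,
// either when close() finishes or when the timeout elapses, whichever is first.
const shutdownWithTimeout = ( server, timeoutMs, exit = code => process.exit( code ), code = 1 ) => {
  let done = false
  const finish = () => {
    if ( done ) return
    done = true
    clearTimeout( timer ) // no stray timer once the clean close has won
    exit( code )
  }
  const timer = setTimeout( finish, timeoutMs )
  server.close( finish )
}

// Dry run with a stand-in server whose close() completes immediately.
const exitCodes = []
shutdownWithTimeout( { close: callback => callback() }, 5000, code => exitCodes.push( code ) )
console.log( exitCodes ) // [ 1 ]
```

Inside the uncaughtException handler this would be called as shutdownWithTimeout( server, SHUTDOWN_TIMEOUT ), leaving the exit parameter at its default.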

Deploys

For deploys, we gracefully reload all the worker processes one at a time to avoid any downtime in the process.

In the worker, we listen for a message from the main process that simply says “shutdown”. This again calls server.close() to stop accepting new connections and terminates the process when all active connections have closed.

    const server = require( './server' )

    process.on( 'message', message => {
      switch ( message ) {
        case 'shutdown':
          server.close( () => process.exit( 0 ) )
          return
      }
    } )

Upon SIGHUP, we create one new worker for each active worker and gracefully shut down the old worker when the new one is ready to accept connections.

    process.on( 'SIGHUP', () => {
      console.log( 'Caught SIGHUP, reloading workers' )
      for ( const id in cluster.workers ) {
        cluster.fork().on( 'listening', () => {
          gracefulShutdown( cluster.workers[ id ] )
        } )
      }
    } )
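A deploy script then only has to send SIGHUP to the main process, which from a shell is kill -HUP followed by the pid. A minimal Node sketch of the same thing is shown here. The requestReload name is hypothetical, the kill parameter is injectable only so the helper can be tried without signalling a real process, and how the pid is obtained (a pid file, a process manager) is left open.

```javascript
// Hypothetical deploy helper: ask a running main process to reload its workers.
const requestReload = ( pid, kill = ( target, signal ) => process.kill( target, signal ) ) =>
  kill( pid, 'SIGHUP' )

// Dry run with a stand-in for process.kill that records what would be sent.
const sent = []
requestReload( 1234, ( target, signal ) => sent.push( [ target, signal ] ) )
console.log( sent ) // [ [ 1234, 'SIGHUP' ] ]
```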

Gracefully shutting down a worker involves a few steps.

First, we send the shutdown message that the worker is listening for and disconnect. As mentioned before, when all the connections are closed, the worker process will terminate itself. Again, since we want to ensure this worker is stopped within a reasonable timeframe, we force it to exit with worker.process.kill() after 5 seconds.

    const SHUTDOWN_TIMEOUT = process.env.SHUTDOWN_TIMEOUT || 5000

    const gracefulShutdown = worker => {
      worker.send( 'shutdown' )
      worker.disconnect()
      const shutdown = setTimeout( () => {
        worker.process.kill()
      }, SHUTDOWN_TIMEOUT )
      worker.on( 'exit', () => clearTimeout( shutdown ) )
    }
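The SIGHUP handler starts all replacement workers at once and retires each old worker as soon as its replacement is listening. If you would rather replace workers strictly one at a time, the flow could look like the sketch below. It is an assumption-laden variant, not the original implementation: reloadSequentially is a hypothetical name, and fork and shutdown are passed in so the flow can be dry-run with stand-ins. In the main process they would be cluster.fork and gracefulShutdown.

```javascript
const { EventEmitter } = require( 'events' )

// Replace workers one at a time: fork a replacement, wait until it listens,
// shut down the old worker, wait for it to exit, then move to the next one.
const reloadSequentially = ( oldWorkers, fork, shutdown, done = () => {} ) => {
  const queue = [ ...oldWorkers ] // snapshot, so new forks are never reloaded
  const next = () => {
    const old = queue.shift()
    if ( ! old ) return done()
    fork().once( 'listening', () => {
      old.once( 'exit', next ) // continue only after the old worker is gone
      shutdown( old )
    } )
  }
  next()
}

// Dry run with EventEmitter stand-ins for workers.
const forked = []
const stopped = []
const oldA = new EventEmitter()
const oldB = new EventEmitter()
reloadSequentially(
  [ oldA, oldB ],
  () => { const worker = new EventEmitter(); forked.push( worker ); return worker },
  worker => stopped.push( worker )
)
forked[ 0 ].emit( 'listening' ) // first replacement is ready
oldA.emit( 'exit' ) // first old worker has exited
console.log( forked.length, stopped.length ) // 2 1
```

The trade-off is a slower reload in exchange for never running more than one extra worker. Note that this sketch stalls if an old worker has already exited before its replacement starts listening, so a real version would need to handle that case.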

Upon SIGINT or ^C , we’ll perform a similar graceful shutdown routine. The only difference is that we don’t need to restart each worker this time.

    process.on( 'SIGINT', () => {
      console.log( 'Caught SIGINT, initiating graceful shutdown' )
      for ( const id in cluster.workers ) {
        gracefulShutdown( cluster.workers[ id ] )
      }
    } )

To prevent the initial SIGINT from propagating to worker processes and immediately terminating them, we’ll handle the signal separately in the workers. The first one is ignored, but if you press ^C or otherwise send SIGINT twice, all worker processes exit immediately, bypassing the graceful shutdown.

    process.on( 'SIGINT', () => {
      process.on( 'SIGINT', () => {
        process.exit( 1 )
      } )
    } )
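The ignore-once-then-exit rule is easy to check against a stand-in for the process object. In this sketch, exitOnSecondInterrupt, fakeProcess and the injected exit function are all assumptions made for illustration; the nesting mirrors the handler above but uses once() so each listener runs a single time.

```javascript
const { EventEmitter } = require( 'events' )

// Same rule as the nested handlers: ignore the first SIGINT, exit on the second.
const exitOnSecondInterrupt = ( proc, exit ) => {
  proc.once( 'SIGINT', () => {
    proc.once( 'SIGINT', () => exit( 1 ) )
  } )
}

// Stand-in for process, so nothing is actually terminated.
const fakeProcess = new EventEmitter()
const exitCodes = []
exitOnSecondInterrupt( fakeProcess, code => exitCodes.push( code ) )

fakeProcess.emit( 'SIGINT' )
console.log( exitCodes ) // []
fakeProcess.emit( 'SIGINT' )
console.log( exitCodes ) // [ 1 ]
```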

I hope this was helpful. You can see the full example on GitHub.