One of the big reasons people are drawn to Node.js for running web servers is the simplicity you gain from a single-threaded paradigm, versus having to deal with the challenges of threaded programming – race conditions, deadlocks, etc.

However, one drawback of running a web server on a single thread is that it also runs on a single core. With nearly all production servers now operating on multiple cores, this can mean a big waste of available resources. Node.js comes prepackaged with a clustering API, the cluster module, that is very easy to use and can help take advantage of those extra cores.

When you turn your Node.js server into a cluster and spread it over multiple processes, you also gain some added reliability. With only a single process running, it is common practice to use a utility like forever to restart processes that have exited for unexpected reasons (uncaught errors, for example). While the application is crashing and restarting, your server is unable to serve requests. Clustering helps solve this problem: the “master” process knows when one of its child processes has died, and automatically routes new requests to the surviving children while the failed process is restarted.

Adding Clustering to Your Existing Node.js Server

Fortunately, it is easy to add clustering to an existing Node.js server application. Let’s start with a very basic single process Express.js application:

```javascript
var express = require('express');
var app = express();

app.get('/', function(req, res) {
  res.send('Hello World!');
});

var server = app.listen(3000, function() {
  console.log('Server started on port 3000');
});
```

The following commands can be used to start the server, assuming the above is named server.js:

```shell
npm install express
node server
```

Note: the npm install express command only needs to be run once. Control+C stops the server. On *nix machines, sudo is only necessary if you bind to a privileged port (below 1024); port 3000 does not require it.

This very basic sample starts a web server on port 3000 and responds with “Hello World!” upon visiting http://localhost:3000.

The idea behind clustering in Node.js is pretty simple. When the application is first started, it considers itself the master process. The master process then creates one or more child processes. Typically, the number of child processes started is equal to the number of CPUs installed in the system, but this is entirely up to you.

The master process itself ideally should not be a server – it should only be responsible for creating and maintaining the child processes. This is because if the master process fails, all of its children are also shut down.

In the following example, our original application is modified to start the same number of web servers as there are CPUs (no additional external npm packages are necessary for the rest of the examples):

```javascript
var cluster = require('cluster');

if (cluster.isMaster) {
  var numCPUs = require('os').cpus().length;
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  var express = require('express');
  var app = express();

  app.get('/', function(req, res) {
    res.send('Hello World!');
  });

  var server = app.listen(3000, function() {
    console.log('Server started on port 3000');
  });
}
```

Upon restarting your application, you should notice the console now prints “Server started on port 3000” once for each CPU in your system. The cluster module automatically distributes incoming connections among the worker processes. Neat!

Handling Unexpected Errors and Other Bad Things

As much as we may try to handle every error scenario, in a large, real-world application something is eventually going to explode. Let’s add a new route to our application that causes an uncaught exception to occur:

```javascript
var cluster = require('cluster');

if (cluster.isMaster) {
  var numCPUs = require('os').cpus().length;
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  var express = require('express');
  var app = express();

  app.get('/', function(req, res) {
    res.send('Hello World!');
  });

  app.get('/explode', function(req, res) {
    setTimeout(function() {
      res.send(this.wont.go.over.well);
    }, 1);
  });

  var server = app.listen(3000, function() {
    console.log('Server started on port 3000');
  });
}
```

Now, visiting the page at http://localhost:3000/explode will cause one of the child processes to error, and you will see a stack trace in your console. Your application is still running, but if you load the explode page too many times, you will have killed all of the child processes, and the application will exit.

Note: If you’re wondering why the error-causing code is inside a setTimeout, it’s because Express.js catches errors thrown synchronously inside route handlers and recovers gracefully. It cannot catch errors thrown from asynchronous callbacks, because by the time the callback runs, the route handler’s call stack (and Express’s try/catch around it) has already unwound.
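You can see the same limitation in plain Node, without Express: a try/catch around the code that schedules a timer never sees an error thrown from the timer’s callback. A small sketch (the process-level handler is only there to keep the demo process alive):

```javascript
var caughtByTry = false;

// Swallow the asynchronous error so the demo process survives.
process.once('uncaughtException', function(err) {
  console.log('reached the process-level handler: ' + err.message);
});

try {
  setTimeout(function() {
    // Thrown after the surrounding try/catch has already returned.
    throw new Error('thrown asynchronously');
  }, 1);
} catch (err) {
  caughtByTry = true; // never runs
}
```

Express’s internal handler wrapper is in exactly the same position as this try/catch, which is why the /explode route crashes the worker instead of producing a 500.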

It is very simple to detect failed processes and to restart them. To add this functionality, we only have to add a small snippet of code:

```javascript
var cluster = require('cluster');

if (cluster.isMaster) {
  var numCPUs = require('os').cpus().length;
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', function() {
    console.log('A worker process died, restarting...');
    cluster.fork();
  });
} else {
  var express = require('express');
  var app = express();

  app.get('/', function(req, res) {
    res.send('Hello World!');
  });

  app.get('/explode', function(req, res) {
    setTimeout(function() {
      res.send(this.wont.go.over.well);
    }, 1);
  });

  var server = app.listen(3000, function() {
    console.log('Server started on port 3000');
  });
}
```

Now, no matter how many times you explode the application, it continues to run. Also neat!

Measuring Performance Gains

There are many tools out there for measuring server performance, but one very simple and free tool that I like to use is called siege (builds are available for Windows as well).

While testing performance, the most obvious thing to test is server response time, which is what I’ll test here. However, during a critical deployment, it is probably also a good idea to monitor the increased demands on memory, file I/O, etc. that will surely come with adding clustering to your solution.

Going back to the very first sample running only a single thread, on my i7 with eight logical cores, running the command

```shell
siege -b -t20s http://localhost:3000
```

results in:

```
Transactions:              44194 hits
Availability:             100.00 %
Elapsed time:              19.21 secs
Data transferred:           0.51 MB
Response time:              0.01 secs
Transaction rate:        2300.33 trans/sec
Throughput:                 0.03 MB/sec
Concurrency:               14.45
Successful transactions:   44194
Failed transactions:           0
Longest transaction:        0.11
Shortest transaction:       0.00
```

Running the same command on our second example with clustering added in gives us this:

```
Transactions:              48046 hits
Availability:             100.00 %
Elapsed time:              19.37 secs
Data transferred:           0.55 MB
Response time:              0.00 secs
Transaction rate:        2480.82 trans/sec
Throughput:                 0.03 MB/sec
Concurrency:                5.73
Successful transactions:   48059
Failed transactions:           0
Longest transaction:        0.03
Shortest transaction:       0.00
```

A roughly 7.8% gain in performance (2480.82 vs. 2300.33 trans/sec). Not too impressive considering we jumped from one process to eight, but the application is so simple that this is a very poor test case. The more CPU-bound our application becomes, the greater the benefits appear.

Let’s change our Hello World! route to contain some wasteful CPU work:

```javascript
app.get('/', function(req, res) {
  for (var a = 0; a < 999999; a++) {
    // this is pretty wasteful
  }
  res.send('Hello World!');
});
```

Our siege command on a single process now results in:

```
Transactions:              15932 hits
Availability:             100.00 %
Elapsed time:              19.41 secs
Data transferred:           0.18 MB
Response time:              0.02 secs
Transaction rate:         820.64 trans/sec
Throughput:                 0.01 MB/sec
Concurrency:               14.79
Successful transactions:   15932
Failed transactions:           0
Longest transaction:        0.03
Shortest transaction:       0.01
```

And run again on our clustered example:

```
Transactions:              34479 hits
Availability:             100.00 %
Elapsed time:              19.38 secs
Data transferred:           0.39 MB
Response time:              0.00 secs
Transaction rate:        1779.38 trans/sec
Throughput:                 0.02 MB/sec
Concurrency:                7.93
Successful transactions:   34489
Failed transactions:           0
Longest transaction:        0.07
Shortest transaction:       0.00
```

A 116% increase in performance (1779.38 vs. 820.64 trans/sec)! Node.js is not an ideal choice for CPU-bound work (so this is not a very realistic test either), and real-world results should fall somewhere between these two extremes. If you have a lot of CPU-intensive work to do, a better solution might be to proxy it over to another physical machine via HTTP or Redis pub/sub, but that’s a whole different topic.

One other note: when testing a real world application, it is probably wise to run the siege command on multiple client machines against the server, instead of on the same single machine running the server as I have done.

Additional Information

The Node.js documentation on clustering is pretty thorough and rather easy to read. I would suggest reading through it if you are interested in clustering, as you will find that there are many more features available than I have presented. For example, it is possible for processes to communicate with each other via events provided by the cluster module.

One shortcoming of the examples I have provided is that while the processes do restart, they do not manage to send back an HTTP 500 response before exiting. One way to fix this would be to handle errors via Node’s global uncaughtException handler; however, this is a discouraged practice. The recommended approach is instead to use another core module, domain.

Even though domains can be used as a catch-all, it is still recommended to exit your child process upon unexpected errors. In the domain documentation, I would strongly encourage reading and understanding the section entitled “Warning: Don’t Ignore Errors!”, regardless of whether you choose to use domains.

One Last Thing…

One thing I had to try, and whose result pleasantly surprised me, was whether changes to your server code would be picked up by later calls to cluster.fork(). It turns out they are: each fork re-executes the main module, so new workers run whatever code is on disk at the time they start.

What this means is that if you want to upgrade your server’s code while maintaining 100% uptime, you could introduce some mechanism into your master process to shut down workers and bring them back up one by one with the new code. I haven’t actually implemented this myself, but it sounds cool in theory.

Whether or not you think this is as cool as I do, it’s an important behavior to remember, because it can also cause problems if you aren’t careful. For example, suppose you have started deploying new code in anticipation of an update, and one of the worker processes then fails and restarts, picking up the new code before you intended it to go live. That could be a very bad thing.

Thanks for reading!

— Dave Elton, [email protected]