illustration Adrien Griveau

Of unpredictable bugs

You cannot write software code without introducing bugs. However, regardless of how badly some of them might break your app, the bug that everyone hates the most is the one that only happens randomly. The one that stubbornly refuses to manifest itself while you’re debugging and that leaves you staring blankly at your code while the rest of the team wonders how such a blatant issue can elude you.

Some bugs are indeed relatively well-behaved and come with a recipe: repeatable, if tedious, steps that will get them fixed. These bugs can be put in a project roadmap, and I call them predictably fixable. Rather than striving for raw performance or the most concise code, I try to minimize the number of unpredictable bugs I will encounter. To some extent, the technology you pick will define that.

When I first heard about node.js, what I found the most groundbreaking was the promise of making real-time web apps behave predictably. Network code traditionally relies on threads: while the OS can give you the illusion that two parts of your code are executing simultaneously (talking to two users), it is actually constantly switching between them. This makes it impossible to know exactly in which order your code runs, and of course introduces a whole class of problems that are extremely hard to track down.

On the other hand, node.js provides you with a non-blocking network API, making it possible to write your app while using only one thread. The great thing is that the state of your app never, ever changes while your code is running.
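As a minimal sketch of what that guarantee buys you (the `accounts` object and the `withdraw` function are made up for illustration), a check-then-update sequence needs no lock, because no other callback can run between the check and the update:

```javascript
// Hypothetical in-memory state shared by every request handler.
var accounts = { alice: 100 };

// Check-then-update with no lock: on a single thread, nothing else can
// touch `accounts` between the balance check and the debit.
function withdraw(name, amount) {
  if (accounts[name] < amount) {
    return false; // insufficient funds, state untouched
  }
  accounts[name] -= amount;
  return true;
}

// Two "simultaneous" requests are really two callbacks run one after the other.
setImmediate(function () { console.log('first:', withdraw('alice', 80)); });
setImmediate(function () { console.log('second:', withdraw('alice', 80)); });
```

The second withdrawal is refused because the first one has fully completed by the time it runs. Note that this only holds within one synchronous run: if `withdraw` waited for I/O between the check and the debit, other callbacks could run in between.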

Don’t get me wrong, async I/O has been around for a long time, and node.js is not the only way to write real-time apps.

On the lack of context in node.js

Now, of course, node.js brings new problems. Some people will complain about callback hell, but what I found to be really problematic was the lack of context. Let’s take an example:

function foo() {
  console.log('foo');
  setTimeout(bar, 2000);
}

function bar() {
  console.log('bar');
  setTimeout(oops, 3000);
}

function oops() {
  console.log('oops');
  throw new Error('oops');
}

foo();

Nothing interesting, really: foo sets a timer, which calls bar after 2s. Then bar does the same and finally oops crashes. You might expect that node.js will give you a nice stack trace explaining how oops was called, or at least mentioning foo and bar. What you get instead is this:

foo
bar
oops

oops.js:13
  throw new Error('oops');
        ^
Error: oops
    at oops [as _onTimeout] (oops.js:13:9)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

It can get worse. There seem to be hordes of node.js devs roaming Stack Overflow, wondering why their app dies randomly because of this:

events.js:72
        throw er;
              ^
Error: read ECONNRESET
    at errnoException (net.js:900:11)
    at TCP.onread (net.js:555:19)

Again, no helpful stack trace to be seen. Of course, the explanation is quite simple. To be able to open millions of connections, node.js carries as little context as possible. It’s a necessary tradeoff, not the result of a huge engineering oversight.
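The `events.js` frame at the top of that trace suggests an 'error' event that was emitted while nobody was listening for it. Here is a minimal sketch of that mechanism, using a plain `EventEmitter` as a stand-in for a real socket (no network is involved, and `simulateReset` is a made-up helper):

```javascript
var EventEmitter = require('events').EventEmitter;

// Stand-in for a socket: in node.js, sockets are event emitters.
var socket = new EventEmitter();

function simulateReset(emitter) {
  var err = new Error('read ECONNRESET');
  err.code = 'ECONNRESET';
  // With no 'error' listener registered, emit() throws `err` itself.
  emitter.emit('error', err);
}

try {
  simulateReset(socket);
} catch (e) {
  console.log('unhandled, would have crashed the process:', e.message);
}

// Once a listener is attached, the same event is delivered instead of thrown.
socket.on('error', function (e) {
  console.log('handled:', e.code);
});
simulateReset(socket);
```

In a real app the throw happens inside a callback invoked by the loop, so there is no surrounding `try` to catch it, and the trace says nothing about which connection was involved.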

Event loop redux

With a classic approach, every time you access the filesystem, send a network request or resolve a domain name, your code will block: pause for millions of CPU cycles, an extremely long time that your server would waste doing nothing if it could not switch to other tasks.

The reason people call node.js async is that it offers a non-blocking alternative to all blocking calls. In my example, setTimeout() replaces sleep(). Async calls expect a callback, a function that is called once the task is finished, and otherwise return immediately. It’s easy to understand if you add more code:

function foo() {
  console.log('entering foo');
  setTimeout(bar, 2000);
  console.log('exiting foo');
}

function bar() {
  console.log('bar');
}

console.log('before foo');
foo();
console.log('after foo');

/*
Result:

before foo
entering foo
exiting foo
after foo
bar
*/

So, foo returns immediately, and bar is still called some time later. Because you never get the opportunity to block, your code will always return quickly (in most cases anyway). node.js has an internal data structure to track all the pending tasks and their associated callbacks: the famed event loop.

After you give a job to node.js (for example, send this HTTP request, and call this when you get the reply), you yield back control to the loop. Most of the time, your code will not be running at all, it will just be expecting callbacks to be called. They will in turn probably schedule more jobs and give more callbacks to the loop.

When the loop calls you, it starts from a fresh stack. That’s why in node.js, stack traces only go back to the most recent async call, which is usually not very useful.
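To see the fresh stack for yourself, here is a small sketch (the function names `scheduleWork` and `onTimer` are made up) that captures a stack trace from inside a timer callback and checks which names appear in it:

```javascript
// Hypothetical function names; the point is which ones show up in the stack.
function scheduleWork(done) {
  setTimeout(function onTimer() {
    // Capture the stack as seen from inside the callback.
    done(new Error('where am I?').stack);
  }, 10);
}

scheduleWork(function (stack) {
  console.log(stack);
  console.log('mentions onTimer:', stack.indexOf('onTimer') !== -1);
  console.log('mentions scheduleWork:', stack.indexOf('scheduleWork') !== -1);
});
```

The callback itself is in the trace, followed by node.js timer internals. The function that scheduled it is not, since it returned long before the timer fired.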

Introducing domains

The thing is, when a problem happens, you don’t necessarily need a stack trace. Rather, what you need is a way to get some context about what led you to where you are. In most technologies, this takes the form of a stack trace - in most technologies, O.K…

A typical node.js app is doing a lot of things at a given moment. If it’s an Express app, it might be busy responding to dozens of requests at a given time. Since your app is idling most of the time (waiting for the disk or a database to finish replying), the only thing the loop knows about each request is its next callback. It actually doesn’t know anything about individual requests, it just has a socket that is tied in some way to a callback.

If your app writes something to the filesystem and then fetches something from a database, here’s what’s happening from the point of view of the event loop:

While you can see that the app is receiving requests, doing some I/O and sending replies, the loop cannot tie an individual job to a specific request. That is, unless you introduce domains.
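As a toy model of that view (this is not how node.js implements its loop; `pending`, `schedule`, `handleRequest` and `runLoop` are all invented for the sketch), picture a queue of jobs where each entry carries nothing but a job label and its next callback:

```javascript
// Toy model only: the real event loop is far more involved.
var pending = []; // each entry: a label for the I/O job and the callback to run after it

function schedule(job, callback) {
  pending.push({ job: job, callback: callback });
}

// A hypothetical request handler: write to disk, query a database, then reply.
function handleRequest(id) {
  schedule('fs write', function () {
    schedule('db query', function () {
      schedule('socket write', function () {
        console.log('request ' + id + ' done');
      });
    });
  });
}

// Run jobs in arrival order and record what the loop itself can see.
function runLoop() {
  var seen = [];
  while (pending.length > 0) {
    var entry = pending.shift();
    seen.push(entry.job); // no request id is available here
    entry.callback();
  }
  return seen;
}

handleRequest(1);
handleRequest(2);
console.log(runLoop());
```

The request id only lives in the closures. From where `runLoop` stands, there are two file writes, two queries and two socket writes, with nothing to say which belongs to which request.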

node.js domains were introduced with node.js v0.8 in June 2012:

Domains provide a way to handle multiple different IO operations as a single group.

The way I see it, a domain is some kind of tag that you can attach to an async operation. The domain then propagates to all subsequent operations started by this initial op. If you were to attach a domain to the first request, then the tasks needed to build the reply are also tied to that domain.
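Here is a minimal sketch of that propagation, assuming the `domain` module is available in your node.js version (later releases mark it as deprecated). The `name` property and the `step` helper are my own additions, used only for labelling:

```javascript
var domain = require('domain');

var d = domain.create();
d.name = 'request-1'; // custom property, used here only as a label

function step(label) {
  var current = process.domain;
  console.log(label, 'runs in domain:', current ? current.name : 'none');
}

// Everything started inside run() is implicitly bound to d...
d.run(function () {
  step('first callback');
  setTimeout(function () {
    step('timer');
    // ...and so is anything started from those callbacks.
    setImmediate(function () {
      step('nested immediate');
    });
  }, 10);
});

// Code outside of run() is not attached to any domain.
step('top level');
```

None of the callbacks receives `d` as an argument. They find it through `process.domain`, which is what makes the tag usable from anywhere in the code.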

The most common use-case of domains is to catch uncaught exceptions:

var domain = require('domain');

function foo() {
  console.log('foo');
  var d = domain.create();

  // not needed, but useful to debug domains
  d.name = 'foo';

  d.on('error', function (e) {
    console.error('foo() triggered an error', e.message);
    process.exit(-1);
  });

  d.enter(); // we are now in domain d

  // all async ops will also be attached to d.
  setTimeout(bar, 2000);

  // stop attaching async ops to d.
  d.exit();
}

function bar() {
  console.log('bar');
  setTimeout(oops, 3000);
}

function oops() {
  // you can always retrieve the current domain
  var d = process.domain;
  console.log('I am in domain:', d.name);
  console.log('oops');
  throw new Error('oops');
}

foo();

Another very common use-case is to attach a domain to all requests with a middleware. If your server is handling dozens of concurrent connections and one of them causes an exception, domains will help identify the one responsible for the error and let the others finish before respawning gracefully.

var domain = require('domain');

// retrieveContext() and log are app-specific helpers that are not shown here.
module.exports = function () {
  return function (req, res, next) {
    var d = domain.create();
    d.path = req.path;

    // errors emitted by req and res are routed to d
    d.add(req);
    d.add(res);

    d.on('error', function (err) {
      // retrieve a context (user info, timestamps, etc.)
      // to help understand what went wrong.
      var context = retrieveContext(d);
      log.crit('[crash]', context, err);

      // a better solution would be to let other requests complete
      // as explained in the node.js doc
      process.exit(-1);
    });

    d.run(next);
  };
};
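The middleware relies on a `retrieveContext` helper that is not shown. Here is one hypothetical shape it could take, together with a way to try the `d.add()` routing without a server: a plain `EventEmitter` stands in for the request, and the `startedAt` property is my own addition.

```javascript
var domain = require('domain');
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: collect whatever was stored on the domain.
function retrieveContext(d) {
  return { path: d.path, startedAt: d.startedAt };
}

// Stand-in for a request object; real ones are also event emitters.
var req = new EventEmitter();
req.path = '/users/42';

var d = domain.create();
d.path = req.path;
d.startedAt = Date.now();
d.add(req); // errors emitted by req are now routed to d

d.on('error', function (err) {
  console.error('[crash]', retrieveContext(d), err.message);
});

// No 'error' listener on req itself, so the domain receives the error.
req.emit('error', new Error('socket hang up'));
```

Anything you store on the domain when the request comes in is available again in the error handler, which is where the context in the crash log comes from.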

You might be tempted to just log the error and continue, but because your code crashed, it’s in an unpredictable state. For example, it might have been prevented from returning a database connection back to its pool. You can make this relatively painless by having several instances of your app run in a cluster.

The domain doc explains really well how this works.

Domains and predictable bugs

The reason I find domains really powerful is that they give you the ability to add context at will. Rather than being forced to make educated guesses about how your code is behaving, you have repeatable steps to track down issues.

Performance

For example, if some requests are slow and you don’t know why, here’s what you can do:

// assumes this runs inside the middleware above, where res is in scope
var d = domain.create();

d.timestamps = [];
d.start = Date.now();

d.tick = function (name) {
  var elapsed = Date.now() - d.start;
  d.timestamps.push({name: name, elapsed: elapsed});
  res.setHeader('X-Timestamps', JSON.stringify(d.timestamps));
};

Now, you can add process.domain.tick('before op xyz'); anywhere in your code and timestamps will appear in the headers of your response.
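To try this outside of a web server, here is a self-contained sketch. `createTimedDomain`, `onTick` and `loadUser` are hypothetical names, and the `onTick` callback stands in for the `res.setHeader()` call:

```javascript
var domain = require('domain');

// Hypothetical factory: `onTick` stands in for res.setHeader(...) in a real app.
function createTimedDomain(onTick) {
  var d = domain.create();
  d.timestamps = [];
  d.start = Date.now();
  d.tick = function (name) {
    d.timestamps.push({ name: name, elapsed: Date.now() - d.start });
    onTick(d.timestamps);
  };
  return d;
}

// Deep inside the app, with no reference to the request or the domain.
function loadUser(callback) {
  process.domain.tick('before loadUser');
  setTimeout(function () {
    process.domain.tick('after loadUser');
    callback();
  }, 20);
}

var d = createTimedDomain(function (timestamps) {
  console.log('X-Timestamps:', JSON.stringify(timestamps));
});

d.run(function () {
  loadUser(function () {
    process.domain.tick('reply sent');
  });
});
```

In a real response you can only set headers until they have been sent, so ticks recorded after that point would have to go to a log instead.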

Memory leaks

Now that each logical task in your code is linked with a domain, you can track the lifecycle of your tasks. When a task is supposed to be done (the response has been sent), you can mark its domain as closed. Then, if you try to access a resource while in a closed domain, you know that your code is leaking.
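A rough sketch of that idea follows. The `closed` flag and the `getConnection` guard are assumptions of mine, not part of the domain API:

```javascript
var domain = require('domain');

// Hypothetical guarded resource: every access checks the current domain.
function getConnection() {
  var d = process.domain;
  if (d && d.closed) {
    throw new Error('leak: resource used after "' + d.name + '" was closed');
  }
  return { query: function () { return 'rows'; } };
}

var d = domain.create();
d.name = 'GET /report';
d.closed = false; // custom flag, not part of the domain API

d.run(function () {
  getConnection(); // fine: the task is still in progress

  // A callback that outlives the request.
  setTimeout(function () {
    try {
      getConnection();
    } catch (e) {
      console.log(e.message);
    }
  }, 20);

  // The response has been sent: mark the task as done.
  d.closed = true;
});
```

The late timer still runs in the request's domain, so the guard can name the exact task that is holding on to resources after it was supposed to be finished.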