Perhaps it is better to say that there were a number of opportunities to learn from the lessons of nanomsg, as well as from lessons we learned while building NNG itself.

Now to be fair, Martin Sustrik had the best intentions when he created the state machine model around which nanomsg is built. But in my experience this is one of the densest and most unapproachable parts of nanomsg, in spite of the fact that Martin's goal was precisely the opposite. I consider this a "failed experiment" — but hey, failed experiments are the basis of all great science.

The state machines also make fairly linear flow really difficult to follow. For example, there is a state machine to read the header information. The header may arrive a byte at a time, and the state machine has to accumulate the bytes, check for completion, and possibly change state, even if it is just reading a single 32-bit word. This is a lot more complex than what most programmers are used to, such as read(fd, &val, 4).

There is another problem too — the inproc code that moves messages between one socket and another was incredibly racy. This is because the two sockets have different locks, and so dealing with the different contexts was tricky (and consequently buggy). (I’ve since, I think, fixed the worst of the bugs here, but only after many hours of pulling out hair.)

Worse, these state machines are designed to be run from a single worker thread. This means that a given socket is entirely single-threaded: you could in theory have dozens, hundreds, or even thousands of connections open, but they would be serviced by only a single thread. (Admittedly, non-blocking I/O is used to let the OS kernel calls run asynchronously, perhaps on multiple cores, but nanomsg itself runs all socket code on a single worker thread.)

What I ran into in nanomsg, when attempting to improve it, was a challenging mess of state machines. nanomsg has dozens of state machines, many of which feed into others, such that tracking flow through the state machines is incredibly painful.

While nanomsg is mostly single threaded internally, I decided to try to emulate the simple architecture of mangos using system threads. (mangos benefits greatly from Go's excellent coroutine facility.) Having been well and truly spoiled by illumos threading (and especially illumos kernel threads), I thought this would be a reasonable architecture.

Sadly, this initial effort, while it worked, scaled incredibly poorly — even so-called "modern" operating systems like macOS 10.12 and Windows 8.1 simply melted or failed entirely when creating any non-trivial number of threads. (To me, creating 100 threads should be a no-brainer, especially if one limits the stack size appropriately. I'm used to being able to create thousands of threads without concern. As I said, I've been spoiled. If your system falls over at a mere 200 threads, I consider it a toy implementation of threading. Unfortunately, most of the mainstream operating systems are therefore toy implementations.)

Most of the underlying I/O in nanomsg is built around file descriptors and its internal usock structure, which is also state machine driven. This means that implementing new transports, which might need something other than a file descriptor, is really non-trivial. This stymied my first attempt to add OpenSSL support to get TLS — OpenSSL has its own struct BIO for this, and I could not see an easy way to convert nanomsg's usock machinery to accommodate the struct BIO.

In retrospect, OpenSSL wasn't the ideal choice for an SSL/TLS library, and we have since chosen another (mbed TLS). Still, we needed an abstraction model for I/O that was better than just file descriptors.

Poll

In order to support use in event-driven programming, asynchronous situations, and so on, nanomsg offers non-blocking I/O. To make this work for end users, a notification mechanism is required, and nanomsg, in the spirit of following POSIX, offers a notification method based on poll(2) or select(2).

In order for this to work, it offers up a selectable file descriptor for send and another one for receive. When events occur, these are written to, and the user application "clears" these by reading from them. (This is done on behalf of the application by nanomsg's API calls.)

This means that, in addition to the context switch overhead, no fewer than two extra system calls are executed per message sent or received, and on a mostly idle system as many as three. So to send a message from one process to another, you may have to execute up to six extra system calls beyond the two required to actually send and receive the message.

It's even more hideous to support this on Windows, where there is no pipe(2) system call, so we have to cobble together a loopback TCP connection just for this event notification, in addition to the system call explosion.

There are cases where this file descriptor logic makes it easier for existing applications to integrate into their event loops (e.g., they already have a thread blocked in poll()).

But in many cases this is unnecessary. A simple callback mechanism would be far better, with the file descriptors available only as an option for code that needs them. This is the approach that we have taken with NNG.