This document was an attempt at understanding how best to port Node.js to Windows. The result of the port was the library libuv, which (among other things) provides a unified interface for asynchronous networking on the three big operating systems: Linux, OSX, and Windows.

Asynchronous I/O in Windows for Unix Programmers

This document assumes you are familiar with how non-blocking socket I/O is done in Unix.

The select syscall is available on Windows, but its processing cost is O(n) in the number of file descriptors, unlike modern constant-time multiplexers such as epoll. This makes select unacceptable for high-concurrency servers. This document describes how high-concurrency programs are designed on Windows.
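To make the O(n) cost concrete, here is a minimal POSIX sketch of a single select wait. The helper name first_readable and its parameters are illustrative and not part of any library. Both building the fd_set and scanning it afterwards touch every descriptor, whether or not anything happened on it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for any of fds[0..n-1] to become
 * readable. Returns the index of the first readable descriptor, or -1 on
 * timeout or error. Note the two loops over all n descriptors on every call. */
int first_readable(const int* fds, int n, int timeout_ms) {
  fd_set readable;
  struct timeval tv;
  int i, maxfd = -1;

  FD_ZERO(&readable);
  for (i = 0; i < n; i++) {            /* O(n): register every descriptor */
    assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
    FD_SET(fds[i], &readable);
    if (fds[i] > maxfd) maxfd = fds[i];
  }

  tv.tv_sec = timeout_ms / 1000;
  tv.tv_usec = (timeout_ms % 1000) * 1000;
  if (select(maxfd + 1, &readable, NULL, NULL, &tv) <= 0) {
    return -1;                         /* timeout or error */
  }

  for (i = 0; i < n; i++) {            /* O(n): find out which one fired */
    if (FD_ISSET(fds[i], &readable)) return i;
  }
  return -1;
}
```

A server with thousands of mostly idle connections pays for both loops on every wakeup, which is the cost that epoll, kqueue, and IOCPs avoid.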

Instead of epoll or kqueue, Windows has its own I/O multiplexer called I/O completion ports (IOCPs). IOCPs are the objects used to poll overlapped I/O for completion. IOCP polling is constant time (REF?).

The fundamental difference is that on Unix you generally ask the kernel to wait for a state change in a file descriptor's readability or writability. With overlapped I/O and IOCPs the programmer waits for asynchronous function calls to complete. For example, instead of waiting for a socket to become writable and then using send(2) on it, as you commonly would on Unix, with overlapped I/O you would instead call WSASend() on the data and then wait for it to have been sent.
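The Unix half of that comparison can be sketched in a few lines of POSIX C. The helper name send_when_writable is hypothetical; the point is only the ordering: wait for readiness first, then do the I/O yourself.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper showing the Unix readiness model: first wait until the
 * kernel reports the socket as writable, then perform the send(2) ourselves.
 * Returns the number of bytes sent, or -1 on timeout or error. */
ssize_t send_when_writable(int fd, const void* data, size_t len,
                           int timeout_ms) {
  struct pollfd p;

  assert(data != NULL);
  p.fd = fd;
  p.events = POLLOUT;
  p.revents = 0;

  /* Step 1: wait for a state change (writability). No data has moved yet. */
  if (poll(&p, 1, timeout_ms) <= 0 || !(p.revents & POLLOUT)) {
    return -1;
  }

  /* Step 2: the socket is ready, so this call is not expected to block. */
  return send(fd, data, len, 0);
}
```

With overlapped I/O the order is reversed: the send is issued first, and the program then waits for the notification that it has completed.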

Unix non-blocking I/O is not beautiful. A principal abstraction in Unix is the unified treatment of many things as files (or more precisely as file descriptors). write(2), read(2), and close(2) work with TCP sockets just as they do on regular files. Well, kind of. Synchronous operations work similarly on different types of file descriptors, but once demands on performance drive you to the world of O_NONBLOCK, various types of file descriptors can act quite differently for even the most basic operations. In particular, regular file system files do not support non-blocking operations. (Disturbingly no man page mentions this rather important fact.) For example, one cannot poll on a regular file FD for readability expecting it to indicate when it is safe to do a non-blocking read. Regular files are always readable, and read(2) calls always have the possibility of blocking the calling thread for an unknown amount of time.
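The behavior of regular files can be checked with a short POSIX sketch. The helper names poll_says_readable and file_polls_readable are made up for this illustration. An empty pipe is reported as not readable, while a regular file is reported as readable immediately, even though a later read(2) on it may still block on the disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if poll(2) reports fd as readable within
 * timeout_ms, 0 otherwise. */
int poll_says_readable(int fd, int timeout_ms) {
  struct pollfd p;

  assert(fd >= 0);
  p.fd = fd;
  p.events = POLLIN;
  p.revents = 0;

  return poll(&p, 1, timeout_ms) == 1 && (p.revents & POLLIN) != 0;
}

/* Hypothetical demo: open path read-only and ask poll(2) about it.
 * Returns 1 or 0 as above, or -1 if the file cannot be opened. */
int file_polls_readable(const char* path) {
  int fd = open(path, O_RDONLY);
  int r;

  if (fd < 0) return -1;
  r = poll_says_readable(fd, 0);  /* zero timeout: do not wait at all */
  close(fd);
  return r;
}
```

Because the answer for a regular file is always "readable", it carries no information about whether the data is already in memory, which is why event loops hand file reads to a thread pool instead.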

POSIX has defined an asynchronous interface for some operations, but implementations on many Unixes have unclear status. On Linux the aio_* routines are implemented in userland in GNU libc using pthreads. io_submit(2) does not have a GNU libc wrapper and has been reported to be very slow and possibly blocking. Solaris has real kernel AIO, but it's unclear what its performance characteristics are for socket I/O as opposed to disk I/O. Contemporary high-performance Unix socket programs use non-blocking file descriptors with an I/O multiplexer, not POSIX AIO. The common practice for accessing the disk asynchronously is still to use custom userland thread pools, not POSIX AIO.

Windows IOCPs support both sockets and regular file I/O, which greatly simplifies the handling of disks. For example, ReadFileEx() operates on both. As a first example let's look at how ReadFile() works.

    typedef void* HANDLE;

    BOOL ReadFile(HANDLE file,
                  void* buffer,
                  DWORD numberOfBytesToRead,
                  DWORD* numberOfBytesRead,
                  OVERLAPPED* overlapped);

The function can execute the read either synchronously or asynchronously. An asynchronous operation is indicated by ReadFile() returning 0 and GetLastError() returning ERROR_IO_PENDING (the Winsock equivalents are WSAGetLastError() and WSA_IO_PENDING). When ReadFile() operates asynchronously, the user-supplied OVERLAPPED* is a handle to the incomplete operation.

    typedef struct {
      ULONG_PTR Internal;
      ULONG_PTR InternalHigh;
      union {
        struct {
          DWORD Offset;
          DWORD OffsetHigh;
        };
        void* Pointer;
      };
      HANDLE hEvent;
    } OVERLAPPED;
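For file handles, the Offset and OffsetHigh fields together carry the 64-bit file position at which the operation should start, split into its low and high 32 bits. The sketch below shows that split in portable C. The overlapped_offset struct is a stand-in defined here so the code compiles without windows.h, and set_offset and get_offset are hypothetical helper names, not Windows APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the two offset fields of OVERLAPPED (an assumption made so
 * this sketch builds anywhere). On Windows both fields are DWORDs. */
typedef struct {
  uint32_t Offset;      /* low-order 32 bits of the file position */
  uint32_t OffsetHigh;  /* high-order 32 bits of the file position */
} overlapped_offset;

/* Hypothetical helper: split a 64-bit file position into the two halves. */
void set_offset(overlapped_offset* ov, uint64_t position) {
  assert(ov != NULL);
  ov->Offset = (uint32_t) (position & 0xFFFFFFFFu);
  ov->OffsetHigh = (uint32_t) (position >> 32);
}

/* Hypothetical helper: reassemble the 64-bit file position. */
uint64_t get_offset(const overlapped_offset* ov) {
  assert(ov != NULL);
  return ((uint64_t) ov->OffsetHigh << 32) | ov->Offset;
}
```

Since every overlapped read or write names its own position this way, several operations can be in flight on one file handle without sharing a seek pointer.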

overlapped->hEvent

GetQueuedCompletionStatus()

Simple TCP Connection Example

To demonstrate the use of GetQueuedCompletionStatus() an example of connecting to localhost at port 8000 is presented.

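    /* Assumes `sock` is an already connected, overlapped-capable SOCKET. */
    char buffer[200];
    WSABUF b = { sizeof(buffer), buffer };  /* WSABUF is { len, buf } */
    DWORD bytes_recvd, flags = 0;
    ULONG_PTR key;
    OVERLAPPED overlapped, *done;
    HANDLE port;
    int r;

    /* Create a new completion port. */
    port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    if (!port) {
      goto error;
    }

    /* Associate the socket with the port before starting any I/O on it. */
    if (!CreateIoCompletionPort((HANDLE) sock, port, 0, 0)) {
      goto error;
    }

    memset(&overlapped, 0, sizeof(overlapped));
    r = WSARecv(sock, &b, 1, &bytes_recvd, &flags, &overlapped, NULL);

    if (r == 0) {
      /* Synchronous; by default a completion packet is still queued. */
      printf("read %lu bytes from socket\n", bytes_recvd);
    } else if (WSAGetLastError() == WSA_IO_PENDING) {
      /* Asynchronous: wait up to one second for the completion packet. */
      if (GetQueuedCompletionStatus(port, &bytes_recvd, &key, &done, 1000)) {
        printf("read %lu bytes from socket\n", bytes_recvd);
      } else if (done == NULL && GetLastError() == WAIT_TIMEOUT) {
        printf("Timeout\n");
      } else {
        printf("Error %lu\n", GetLastError());
      }
    } else {
      /* Error */
      printf("Error %d\n", WSAGetLastError());
    }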

Previous Work

Writing code that can take advantage of the best of both worlds across Unix operating systems and Windows is very difficult, requiring one to understand intricate APIs and undocumented details from many different operating systems. There are several projects which have made attempts to provide an abstraction layer, but in the author's opinion none are completely satisfactory.

Marc Lehmann's libev and libeio. libev is the perfect minimal abstraction of the Unix I/O multiplexers. It includes several helpful tools like ev_async, which is for asynchronous notification, but the main piece is ev_io, which informs the user about the state of file descriptors. As mentioned before, in general it is not possible to get state changes for regular files, and even if it were, the write(2) and read(2) calls do not guarantee that they won't block. Therefore libeio is provided for calling various disk-related syscalls in a managed thread pool. Unfortunately the abstraction layer which libev targets is not appropriate for IOCPs: libev works strictly with file descriptors and does not have the concept of a socket. Furthermore users on Unix will be using libeio for file I/O, which is not ideal for porting to Windows. On Windows libev currently uses select(), which is limited to 64 file descriptors per thread.

libevent. Somewhat bulkier than libev with code for RPC, DNS, and HTTP included. Does not support file I/O. libev was created after Lehmann evaluated libevent and rejected it—it's interesting to read his reasons why. A major rewrite was done for version 2 to support Windows IOCPs but anecdotal evidence suggests that it is still not working correctly.

Boost ASIO. It basically does what you want on Windows and Unix for sockets. That is, epoll on Linux, kqueue on Macintosh, IOCPs on Windows. It does not support file I/O. In the author's opinion it is too large for a not extremely difficult problem (~300 files, ~12000 semicolons).

File Types

Almost every socket operation that you're familiar with has an overlapped counterpart. The following section tries to pair Windows overlapped I/O calls with their non-blocking Unix equivalents.

TCP Sockets

_setmaxstdio()