View source on Github

This is the third article for the series of "Improving the performance of Warp". As I wrote in "Sending header and body at once", I will explain how to avoid the open() / close() system calls in this article.

simple-sendfile basis

I designed the simple-sendfile to use as few system calls as possible, since "system calls are evil for network programming in Haskell". To make the story simple, I will only talk about the sendfile system call on Linux.

The type of the sendfile function in Network.Sendfile is as follows:

sendfile :: Socket -> FilePath -> FileRange -> IO () -> IO ()

FileRange has two constructors: EntireFile and PartOfFile .

Let's consider the case where we send the entire file by specifying EntireFile to the sendfile function. Unfortunately, the function has to call the stat() system call to know the size of the file because the sendfile() system call on Linux requires the caller to specify how many bytes to be sent ( sendfile() on FreeBSD/MacOS has magic number '0' which indicates the end of file).

If WAI applications know the file size, they can specify PartOfFile to avoid the stat() system call. It is easy for WAI applications to cache file information such as size and modification time. If cache timeout is fast enough (say 10 seconds), the risk of cache inconsistency problem is not serious. And because we can safely clean up the cache, we don't have to worry about leakage.

open() and close()

If PartOfFile is specified, the sendfile function calls open() , sendfile() repeatedly if necessary, and close() . When I implemented the simple-sendfile package, I noticed that open() and close() should also be eliminated. For this, we should cache file descriptors.

Caching file descriptors works as follows: If a client requests that a file be sent, a file descriptor is opened. And if another client requests the same file shortly thereafter, the file descriptor is reused. At a later time, the file descriptor is closed if no Haskell thread uses it.

Sound easy? Unfortunately, I had no idea on how to safely cache file descriptors. It seems to me that it is difficult to ensure that no Haskell thread uses a file descriptor when closing it.

A typical tactic for this case is reference counter. But I was not sure that I could implement a robust mechanism for a reference counter. What happens if a Haskell thread is killed for unexpected reasons? If we fail to decrement its reference counter, the file descriptor leaks.

Andreas motivated me to consider this issue again by pointing out that the performance bottleneck of Warp is open() / close() . I thought this over for a month and all the necessary pieces came together suddenly.

The timeout manager of Warp

I implemented a cache mechanism for file descriptor based on Warp's timeout system. So, let me explain how Warp's timeouts work first. For security reasons, Warp kills a Haskell thread, which communicates with a client, if the client does not send a significant amount of data for a specified period (30 seconds by default). I think that the heart of Warp's timeout system is the following two points:

Double IORef s

s Safe swap and merge algorithm

Suppose that status of connections is described as Active and Inactive . To clean up inactive connections, a dedicated Haskell thread, called the timeout manager, repeatedly inspects the status of each connection. If status is Active , the timeout manager turns it to Inactive . If Inactive , the timeout manager kills its associated Haskell thread.

Each status is refereed by an IORef . To update status through this IORef , atomicity is not necessary because status is just overwritten. In addition to the timeout manager, each Haskell thread repeatedly turns its status to Active through its own IORef if its connection actively continues.

To check all statuses easily, the timeout manager uses a list of the IORef to status. A Haskell thread spawned for a new connection tries to 'cons' its new IORef for an Active status to the list. So, the list is a critical section and we need atomicity to keep the list consistent.

A standard way to keep consistency in Haskell is MVar . But as Michael Snoyman pointed out in "Warp: A Haskell Web Server", MVar (in threaded RTS) is slow. This is because each MVar is protected with a home-brewed spin lock. So, he used another IORef to refer the list and atomicModifyIORef to manipulate it. atomicModifyIORef is implemented via CAS (Compare-and-Swap), which is much faster than spin locks.

The following is the outline of the safe swap and merge algorithm:

do xs <- atomicModifyIORef ref (\ys -> ([], ys)) -- swap with [] xs' <- manipulates_status xs atomicModifyIORef ref (\ys -> (merge xs' ys, ()))

The timeout manager atomically swaps the list with an empty list. Then it manipulates the list by turning status and/or removing unnecessary status for killed Haskell threads. During this process, new connections may be created and their status are inserted with atomicModifyIORef by corresponding Haskell threads. Then, the timeout manager atomically merges the pruned list and the new list.

The algorithm to cache file descriptors

Warp's timeout approach is safe to reuse as a cache mechanism for file descriptors because it does not use reference counters. However, we cannot simply reuse Warp's timeout code for some reasons:

Each Haskell thread has its own status. So, status is not shared. But we would like to cache file descriptors to avoid open() and close() by sharing. So, we need to search a file descriptor for a requested file from cached ones. Since this look-up should be fast, we should not use a list. You may think Data.Map can be used. Yes, its look-up is O(log N) but there are two reasons why we cannot use it:

Data.Map is a finite map which cannot contain multiple values for a single key. Data.Map does not provide a fast pruning method.

Problem 1: because requests are received concurrently, two or more file descriptors for the same file may be opened. So, we need to store multiple file descriptors for a single file name. We can solve this by re-implementing Data.Map to hold a non-empty list. This is technically called a "multimap".

Problem 2: Data.Map is based on a binary search tree called "weight balanced tree". To the best of my best knowledge, there is no way to prune the tree directly. You may also think that we can convert the tree to a list ( toList ), then prune it, and convert the list back to a new tree ( fromList ). The cost of the first two operations is O(N) but that of the last one is O(N log N) unfortunately.

One day, I remembered Exercise 3.9 of "Purely Functional Data Structure" - to implement fromOrdList which constructs a red-black tree from an ordered list in O(N). My friends and I have a study meeting on this book every month. To solve this problem, one guy found a paper by Ralf Hinze, "Constructing Red-Black Trees". If you want to know its concrete algorithm, please read this paper.

Since red-black trees are binary search trees, we can implement multimap by combining it and non-empty lists. Fortunately, the list created with toList is sorted. So, we can use fromOrdList to convert the sorted list to a new red-black tree. Now we have a multimap whose look-up is O(log N) and pruning is O(N).

The cache mechanism has already been merged into the master branch of Warp, and is awaiting release.

UPDATE: Discussion here is based on my ignorance. Please read also comments.

New functions in simple-sendfile

I explained the sendfile function and the sendfileWithHeader function in this article and the previous one, respectively:

sendfile :: Socket -> FilePath -> FileRange -> IO () -> IO () sendfileWithHeader :: Socket -> FilePath -> FileRange -> IO () -> [ByteString] -> IO ()

To avoid the open() / close() system call, I added two more functions to the simple-sendfile package:

sendfileFd :: Socket -> Fd -> FileRange -> IO () -> IO () sendfileFdWithHeader :: Socket -> Fd -> FileRange -> IO () -> [ByteString] -> IO ()

Of course, the master branch of Warp uses the last one.

That's all. Thank you for reading this long article.