This article accompanies Go issue #10948 (net: use splice for TCPConn.ReadFrom on Linux) and CL 107715, and aims to provide context, insight into implementation decisions, performance measurements, and ideas for future work.

Introduction

splice(2) moves data between file descriptors without copying it between kernel and user address space. From a bird's-eye view, splice is akin to “read from a file descriptor to a kernel buffer” or “write from a kernel buffer to a file descriptor”. The buffer is controlled by the user, and is, in fact, a plain UNIX pipe. At least one of the file descriptors passed to splice must refer to a pipe.
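To make the two halves concrete, here is a minimal, Linux-only sketch that moves a string from one temporary file to another through a pipe, using syscall.Splice. The helper name spliceThroughPipe and the temporary file prefixes are made up for illustration, and this is not how the standard library uses splice internally.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"syscall"
)

// spliceThroughPipe (hypothetical helper) copies s from one temporary file
// to another via a pipe, without the bytes passing through a Go buffer.
func spliceThroughPipe(s string) (string, error) {
	src, err := os.CreateTemp("", "splice-src")
	if err != nil {
		return "", err
	}
	defer os.Remove(src.Name())
	defer src.Close()
	dst, err := os.CreateTemp("", "splice-dst")
	if err != nil {
		return "", err
	}
	defer os.Remove(dst.Name())
	defer dst.Close()

	if _, err := src.WriteString(s); err != nil {
		return "", err
	}
	if _, err := src.Seek(0, io.SeekStart); err != nil {
		return "", err
	}

	// The pipe is the kernel buffer that sits between the two files.
	pr, pw, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer pr.Close()
	defer pw.Close()

	srcFD, dstFD := int(src.Fd()), int(dst.Fd())
	prFD, pwFD := int(pr.Fd()), int(pw.Fd())

	for remaining := len(s); remaining > 0; {
		// "Read" from the source file into the pipe.
		n, err := syscall.Splice(srcFD, nil, pwFD, nil, remaining, 0)
		if err != nil {
			return "", err
		}
		if n == 0 {
			return "", io.ErrUnexpectedEOF
		}
		// "Write" from the pipe to the destination until the chunk is out.
		for left := int(n); left > 0; {
			m, err := syscall.Splice(prFD, nil, dstFD, nil, left, 0)
			if err != nil {
				return "", err
			}
			if m == 0 {
				return "", errors.New("splice: pipe drained early")
			}
			left -= int(m)
		}
		remaining -= int(n)
	}

	out, err := os.ReadFile(dst.Name())
	return string(out), err
}

func main() {
	out, err := spliceThroughPipe("hello, splice")
	fmt.Println(out, err)
}
```

Every splice call in the sketch has the pipe on one side, which is the constraint mentioned above.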

One might ask “Why is the pipe important? Why can’t splice transfer data between arbitrary file descriptors directly?”. The short answer to this question is that having an intermediary place to store the data is very useful, for various reasons. We will not pursue this line of questioning further in this article, because there is much to talk about. Instead, the interested reader is encouraged to read an e-mail thread where Linus Torvalds himself provides excellent insight on the matter, and explains the general concepts behind splice and its friend tee(2) very clearly.

If splice is indeed a general data transfer function, then why does issue #10948 only mention net and TCPConn? The next section answers this question.

Implementation details

Per rsc’s comment in the original issue, the objective is for the standard library support for splice to be transparent, like the existing sendfile optimization. The change must not introduce any new API, and callers – who typically use io.Copy to move data around – should benefit from it transparently. Package os and package net should remain as portable as possible, without introducing new API for OS-specific optimizations.

As with sendfile, using splice requires help from the Go network poller. The poller is implemented in the runtime, and is exposed to standard library code through package internal/poll. All standard library types which wrap a file descriptor hold a *poll.FD. Code outside the standard library cannot make use of the poller, at least not directly. While splice is a very general data transfer function, it operates on raw file descriptors, which are not exposed by the Go standard library directly.

The typical function used by code that wants to transfer data from an io.Reader to an io.Writer without looking at it is io.Copy. io.Copy investigates its io.Writer and io.Reader arguments for io.ReaderFrom and io.WriterTo specializations respectively, and uses the specialized code, if possible. Otherwise, it falls back to a generic read-write loop.
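The dispatch can be observed with a small sketch. The recordingWriter type and the copyViaReadFrom helper below are hypothetical names used only for illustration. The source is wrapped in a struct so that its own WriteTo method is hidden and io.Copy has to fall through to the destination's ReadFrom.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// recordingWriter is a hypothetical io.Writer that also implements
// io.ReaderFrom and records whether io.Copy picked the specialized path.
type recordingWriter struct {
	buf          strings.Builder
	usedReadFrom bool
}

func (w *recordingWriter) Write(p []byte) (int, error) { return w.buf.Write(p) }

// ReadFrom is what io.Copy calls when the source has no WriteTo method.
func (w *recordingWriter) ReadFrom(r io.Reader) (int64, error) {
	w.usedReadFrom = true
	b, err := io.ReadAll(r)
	w.buf.Write(b)
	return int64(len(b)), err
}

// copyViaReadFrom copies s with io.Copy and reports what arrived and
// whether the ReadFrom specialization was used.
func copyViaReadFrom(s string) (string, bool, error) {
	// Embedding only io.Reader hides strings.Reader's WriteTo method,
	// which io.Copy would otherwise prefer.
	src := struct{ io.Reader }{strings.NewReader(s)}
	dst := &recordingWriter{}
	_, err := io.Copy(dst, src)
	return dst.buf.String(), dst.usedReadFrom, err
}

func main() {
	out, specialized, err := copyViaReadFrom("hello")
	fmt.Println(out, specialized, err)
}
```

The caller only ever writes io.Copy(dst, src), which is why an optimization placed behind ReadFrom needs no new API.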

For example, if conn is a *net.TCPConn (perhaps packaged in a net.Conn), and f is an *os.File, a call to io.Copy(conn, f) results in a call to conn.ReadFrom(f). ReadFrom checks if the io.Reader argument is an *os.File and uses sendfile through poll.SendFile if it is. Usage of splice in the standard library must follow a similar pattern.

Similarly to how io.Copy takes the liberty to allocate a buffer for the data transfer if necessary, uses of splice on code paths known to be optimizable could take the liberty to create a pipe scoped to the data transfer.

Let’s investigate candidates for file descriptors we could splice to, under the condition that no new API may be introduced:

*os.File doesn’t have a ReadFrom, and both files and pipes are represented by *os.File, so splicing to a file or a pipe is not possible.

*net.UnixConn has a ReadFrom, but that ReadFrom is the PacketConn ReadFrom. The signature doesn’t match, and cannot be changed.

*net.TCPConn has the right ReadFrom and is thus the only real candidate for the write side.

The entry point for all splice optimizations must be (*net.TCPConn).ReadFrom. We have our destination file descriptor. Next, we investigate possible source file descriptors to represent the Reader passed to ReadFrom.

If the reader is an *os.File, we are on the sendfile code path… but not quite. The file could represent the read half of a pipe, which we could splice to the connection directly. The sendfile code calls Fd() on the file to get at the file descriptor, which is set to blocking mode. sendfile can get away with this, because disk files are always ready from the perspective of epoll. The splice code is not so lucky, because pipes must be polled for readiness. After calling Fd() on the file, attempting to register the returned file descriptor with the network poller is not possible. There is no way package net can get at the *poll.FD of the *os.File. Only very intrusive solutions which involve duping the pipe file descriptor come to mind. Unfortunate.

*net.UnixConn seems like a good candidate, and the initial implementation enabled splice for this case, but benchmarks proved inconclusive. It turns out UNIX sockets are quite fast as-is.

*net.TCPConn remains.

We are left with only one clear candidate: TCP socket to TCP socket transfers, using a temporary pipe. To perform a copy from one TCP connection to another, package net takes care to honor io.LimitedReader, unwraps the source and destination connections to get at their *poll.FD values, then transfers control to the new poll.Splice function, which performs most of the work.
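From the caller's side, the TCP-to-TCP case is an ordinary io.Copy between two connections. The sketch below relays one connection over the loopback interface. The names proxyOnce and roundTrip are hypothetical, and whether splice is actually used underneath depends on the operating system and Go version.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// proxyOnce (hypothetical helper) accepts one connection on ln and relays
// everything it sends to the upstream address.
func proxyOnce(ln net.Listener, upstream string) error {
	client, err := ln.Accept()
	if err != nil {
		return err
	}
	defer client.Close()
	up, err := net.Dial("tcp", upstream)
	if err != nil {
		return err
	}
	defer up.Close()
	// Both ends are TCP connections, so io.Copy can hand the transfer
	// to (*net.TCPConn).ReadFrom instead of looping in userspace.
	_, err = io.Copy(up, client)
	return err
}

// roundTrip sends payload through the proxy to a sink server and returns
// what the sink received.
func roundTrip(payload string) (string, error) {
	sink, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer sink.Close()
	front, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer front.Close()

	proxyErr := make(chan error, 1)
	go func() { proxyErr <- proxyOnce(front, sink.Addr().String()) }()

	type result struct {
		data []byte
		err  error
	}
	got := make(chan result, 1)
	go func() {
		c, err := sink.Accept()
		if err != nil {
			got <- result{nil, err}
			return
		}
		defer c.Close()
		b, err := io.ReadAll(c)
		got <- result{b, err}
	}()

	c, err := net.Dial("tcp", front.Addr().String())
	if err != nil {
		return "", err
	}
	if _, err := io.WriteString(c, payload); err != nil {
		c.Close()
		return "", err
	}
	c.Close() // EOF on the client side ends the proxy's io.Copy.

	if err := <-proxyErr; err != nil {
		return "", err
	}
	r := <-got
	return string(r.data), r.err
}

func main() {
	got, err := roundTrip("ping")
	fmt.Println(got, err)
}
```

Nothing in proxyOnce mentions splice, which is the point: the proxy code stays portable and picks up the optimization where it is available.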

The reader is encouraged to refer to the source code for poll.Splice for the remainder of this section. poll.Splice creates a temporary pipe, locks the source and destination file descriptors, and prepares the file descriptors for a new round of polling. Then, it alternates between splicing from the source socket to the pipe, and from the pipe to the destination socket. This alternation deserves some extra attention.

To move data from the source socket to the pipe, Splice calls spliceDrain. This is the equivalent of the Read part of an io.Copy in userspace. Conversely, splicePump is the equivalent of the Write part. Note how spliceDrain only attempts a single splice into the pipe, whereas splicePump loops until it has spliced all the data out of the pipe.

This asymmetry is intentional. If Splice simply alternated reads and writes, the pipe could fill up, if the source socket outpaced the destination socket. This would be problematic for multiple reasons.

First, if Splice saw EAGAIN, it would need to determine if the cause was the source socket not being ready for reading, or the pipe being full. This would complicate the implementation slightly.

Second, consider the following situation: At some point in the data transfer, a splice from the pipe to the destination socket is short, and leaves some data in the pipe. The next attempted splice is from the source socket to the pipe. The source socket times out, but it takes 30 seconds for that to happen. In the meantime, the old data is still stuck in the pipe, and the destination socket can only receive it when the source socket eventually times out. To mitigate this, Splice would need to add timeouts to the poller events it waits for, and implement some form of flow control.

All of this would complicate Splice, for no tangible gain. Therefore, the Splice implementation remains simple, and mirrors the behavior of io.Copy by ensuring that all the data read from the source socket is written to the destination socket, before new data is read again.
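The drain-then-pump shape can be illustrated with a userspace analogy. The names copyDrainPump, shortWriter and pumpDemo are hypothetical and this is not the code from internal/poll. shortWriter deliberately accepts only a few bytes per call, bending the io.Writer contract, to imitate short splices out of the pipe.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// copyDrainPump performs one Read per round (the "drain"), then loops on
// Write until that chunk is fully written (the "pump") before reading again.
func copyDrainPump(dst io.Writer, src io.Reader, buf []byte) (int64, error) {
	var written int64
	for {
		n, rerr := src.Read(buf)
		pending := buf[:n]
		for len(pending) > 0 {
			m, werr := dst.Write(pending)
			written += int64(m)
			if werr != nil {
				return written, werr
			}
			if m == 0 {
				return written, io.ErrShortWrite
			}
			pending = pending[m:]
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}

// shortWriter accepts at most max bytes per call and reports no error,
// standing in for a destination that takes data in small pieces.
type shortWriter struct {
	buf   bytes.Buffer
	max   int
	calls int
}

func (w *shortWriter) Write(p []byte) (int, error) {
	w.calls++
	if len(p) > w.max {
		p = p[:w.max]
	}
	return w.buf.Write(p)
}

// pumpDemo copies s through a shortWriter and returns what arrived and
// how many Write calls were needed.
func pumpDemo(s string) (string, int, error) {
	dst := &shortWriter{max: 3}
	_, err := copyDrainPump(dst, strings.NewReader(s), make([]byte, 8))
	return dst.buf.String(), dst.calls, err
}

func main() {
	out, calls, err := pumpDemo("hello world")
	fmt.Println(out, calls, err)
}
```

Because the inner loop always empties the buffer before the next Read, the buffer is never partly full when a read blocks, which is the property the asymmetry is meant to preserve for the pipe.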

A final implementation detail is the intentional omission of the SPLICE_F_MORE flag from calls to splice. SPLICE_F_MORE acts like TCP_CORK and is not an opinion standard library code should impose upon callers, who should not wake up to 200ms of unexpected latency introduced by io.Copy when they upgrade Go versions. If the TCP_CORK behavior is desirable, callers can cork the TCP connection themselves before initiating the copy, using syscall.RawConn, or a specialized package like mikioh/tcp.
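For callers who do want corking, a minimal Linux-only sketch using syscall.RawConn might look like the following. The helper names setCork and corkDemo are hypothetical, and the loopback connection exists only so the example can run on its own.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// setCork (hypothetical helper) turns TCP_CORK on or off for c through
// the connection's syscall.RawConn. Linux only.
func setCork(c *net.TCPConn, on bool) error {
	rc, err := c.SyscallConn()
	if err != nil {
		return err
	}
	v := 0
	if on {
		v = 1
	}
	var serr error
	err = rc.Control(func(fd uintptr) {
		serr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, syscall.TCP_CORK, v)
	})
	if err != nil {
		return err
	}
	return serr
}

// corkDemo corks a loopback connection, writes to it, and uncorks it.
func corkDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer c.Close()
	tc, ok := c.(*net.TCPConn)
	if !ok {
		return errors.New("not a TCP connection")
	}
	if err := setCork(tc, true); err != nil {
		return err
	}
	// An io.Copy into tc would go here; a plain Write stands in for it.
	if _, err := tc.Write([]byte("corked")); err != nil {
		return err
	}
	// Uncorking lets any partial frame still held back go out.
	return setCork(tc, false)
}

func main() {
	fmt.Println(corkDemo())
}
```

Keeping this decision in the caller's hands means the latency trade-off is explicit at the call site rather than hidden inside io.Copy.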

We move on to a brief performance analysis of the new code.

Performance analysis

CL 107715 includes a set of benchmarks, but they are superficial at best: the code is probably only hitting the loopback interface, and it doesn’t quite measure the most important impact of the change. In an attempt to improve on this state of affairs, I conducted a more concrete performance test.

Three AWS m5.large nodes were used. One minute’s worth of network traffic was moved from one node to another, using the third node as a TCP proxy. The new splice-optimized code was tested against the old code. CPU profiles and execution traces were recorded for both runs.

Both proxy servers were able to saturate 10 Gbps links, but the splice-optimized server spent much less CPU time in doing so.

To begin with, basic time measurements:

real 1m1.444s
user 0m13.875s
sys 0m43.481s

for the default server, versus

real 1m1.211s
user 0m8.391s
sys 0m28.498s

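for the splice-optimized server. Summing user and sys time, that is roughly 57.4s of CPU time versus 36.9s, a reduction of about 36%.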

Both proxy servers moved ~70GiB of data in this time, but the execution traces tell an interesting tale. A 2ms window into the execution trace of the unoptimized server looks like this:

[Figure: execution trace of the unoptimized server, 2ms window]

On the other hand, for the splice-optimized server, a similar 2ms window looks like this:

[Figure: execution trace of the splice-optimized server, 2ms window]

Because the speed of the data transfer is bound by the speed of the network in both cases, roughly the same work is performed in both windows. However, the splice-optimized server spends significantly less CPU time doing so. There are significant gaps between sections of activity throughout the execution trace of the splice-optimized server. The non-optimized server trace also shows gaps, but they are significantly narrower: the server is busier overall.

Compared to the unoptimized server, the splice-enabled server seems to be able to move very large chunks of data at a time. The default buffer size for io.Copy is 32 KiB. Copies through userspace are strictly bound by the size of the buffer. On the other hand, the default buffer size for a Linux pipe is 64 KiB. To make the test fairer, perhaps io.CopyBuffer with a 64 KiB buffer could have been used, but even that fails to make the comparison more reasonable.
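For reference, a userspace copy with a 64 KiB buffer could be set up along these lines. One caveat is that io.CopyBuffer ignores the buffer when the source implements io.WriterTo or the destination implements io.ReaderFrom, so the sketch hides those methods behind small wrapper types. The names writerOnly, readerOnly, userspaceCopy, chunkRecorder and largestChunk are hypothetical, and this is not the code used for the measurements above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// writerOnly and readerOnly hide ReadFrom and WriteTo so that
// io.CopyBuffer really uses the supplied buffer.
type writerOnly struct{ io.Writer }
type readerOnly struct{ io.Reader }

// userspaceCopy forces a plain read-write loop with a buffer of the
// given size.
func userspaceCopy(dst io.Writer, src io.Reader, size int) (int64, error) {
	return io.CopyBuffer(writerOnly{dst}, readerOnly{src}, make([]byte, size))
}

// chunkRecorder remembers the largest single Write it has seen.
type chunkRecorder struct {
	largest int
}

func (c *chunkRecorder) Write(p []byte) (int, error) {
	if len(p) > c.largest {
		c.largest = len(p)
	}
	return len(p), nil
}

// largestChunk copies payload zero bytes and reports the biggest chunk
// that reached the writer.
func largestChunk(payload, bufSize int) (int, error) {
	rec := &chunkRecorder{}
	src := bytes.NewReader(make([]byte, payload))
	_, err := userspaceCopy(rec, src, bufSize)
	return rec.largest, err
}

func main() {
	n, err := largestChunk(200<<10, 64<<10)
	fmt.Println(n, err)
}
```

No chunk can exceed the buffer size in this loop, which is the bound on userspace copies mentioned above.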

For reasons I do not understand, the kernel was willing to move chunks larger than the presumed pipe buffer size in a single call to splice. Perhaps this stems from the fact that the data pages originated in the kernel, and it moved them to the pipe using a simple reference count increase, instead of copying data from userspace to pages allocated for the pipe. The exact process is still unclear to me, because I do not understand the inner workings of the Linux kernel very well. Clarifications would be appreciated.

CPU profiles (tip and 1.10) were also taken, but they don’t tell us anything we haven’t been able to discern from the execution traces already. That being said, one of them certainly looks cleaner than the other!

In conclusion, the optimization is not so much about raw speed as it is about CPU time. Proxy servers and load balancers in high-concurrency scenarios are almost certain to benefit greatly from it.

Future work

Given that the bar for adding new API to the standard library is pretty high, splice is unlikely to be used in many places other than where it is used now.