This article is a translation, by popular request, of Optimisations Nginx, bien comprendre sendfile, tcp_nodelay et tcp_nopush, which I wrote in French in January.

Most articles about optimizing Nginx performance recommend using the sendfile, tcp_nodelay and tcp_nopush options in the nginx.conf configuration file. Unfortunately, almost none of them explain how these options affect the Web server or how they actually work.

Everything started after Greg peer-reviewed my Nginx configuration. He challenged my optimizations, asking me if I really knew what I was doing. I started to dig into the basement of the TCP stack, as mixing sendfile, tcp_nodelay and tcp_nopush seemed about as logical as a pacifist joining the Navy SEALs (which have nothing to do with baby seals).

On tcp_nodelay

How can you force a socket to send the data in its buffer? One solution lies in the TCP_NODELAY option described in tcp(7). Activating TCP_NODELAY forces a socket to send the data in its buffer, whatever the packet size. The Nginx tcp_nodelay option sets the TCP_NODELAY option when opening a new socket.

To avoid network congestion, the TCP stack implements a mechanism that holds small packets back, waiting up to 0.2 seconds for more data, so it won’t send a packet that would be too small. This mechanism is Nagle’s algorithm, and 200 ms is the delay typically seen on UNIX implementations.

To understand Nagle’s purpose, you need to remember that the Internet is not only about sending Web pages and huge files. Imagine yourself back in the 90s, using telnet to connect to a remote machine over a 14,400 bps dial-up connection. When you press ctrl+c, you send a one-byte message to the telnet server. To that message, you need to add the IP header (20 bytes for IPv4, 40 bytes for IPv6) and the TCP header (20 bytes). When pressing ctrl+c, you actually send 41 bytes over the network with IPv4, or 61 bytes with IPv6. Nagle makes the stack wait in case you have something else to type before the data is sent.
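To put numbers on that overhead, here is a small sketch in Python. It only uses the header sizes quoted above and assumes no IP or TCP options; the function names are mine, for illustration.

```python
# Header sizes quoted in the text (assumption: no IP or TCP options).
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def bytes_on_wire(payload: int, ipv6: bool = False) -> int:
    """Size of one TCP segment carrying `payload` bytes, headers included."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return payload + ip_header + TCP_HEADER


def overhead_ratio(payload: int, ipv6: bool = False) -> float:
    """Fraction of the bytes on the wire that are headers rather than data."""
    total = bytes_on_wire(payload, ipv6)
    return (total - payload) / total


print(bytes_on_wire(1))             # 41
print(bytes_on_wire(1, ipv6=True))  # 61
print(round(overhead_ratio(1), 3))  # 0.976
```

For a single keystroke, almost 98% of what travels over the wire is headers, which is exactly the waste Nagle tries to avoid.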

That’s cool, but Nagle is not relevant to the modern Internet anymore. It is even counterproductive when you need to stream data over the network. The chances that your file exactly fills a whole number of full packets are close to zero, which means Nagle adds up to 0.2 seconds of latency on the client side for every file it downloads.

The TCP_NODELAY option lets you bypass Nagle and send the data as soon as it’s available.
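At the socket level, this is a single setsockopt() call. The sketch below uses Python's standard socket module; the helper name enable_nodelay is mine, and this is not how Nginx itself is written.

```python
import socket


def enable_nodelay(sock: socket.socket) -> bool:
    """Turn Nagle's algorithm off on `sock` and report whether it took effect."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # getsockopt() returns a non-zero integer when the option is set.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
default_state = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
nodelay_on = enable_nodelay(sock)
sock.close()

print(default_state)  # 0: Nagle is active on a fresh socket
print(nodelay_on)     # True
```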

Nginx uses TCP_NODELAY on HTTP keepalive connections. keepalive connections are sockets that stay open for a while after sending data. keepalive makes it possible to send more data without initiating a new connection and replaying a TCP three-way handshake for every HTTP request. This saves both time and sockets, as they don’t switch to FIN_WAIT after every data transfer. Connection: Keep-alive is optional in HTTP/1.0 and the default behavior in HTTP/1.1.
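To see a keepalive connection being reused, here is a rough sketch built only on Python's standard library. The tiny local server and its Handler class are stand-ins I made up for the demo; they have nothing to do with Nginx. The client sends two requests and records its local port each time: the same port twice means the same TCP connection served both.

```python
import http.client
import http.server
import threading


class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keepalive is the default in HTTP/1.1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
local_ports = []
for _ in range(2):
    conn.request("GET", "/")
    conn.getresponse().read()
    # Same local port on both requests: no new handshake took place.
    local_ports.append(conn.sock.getsockname()[1])
conn.close()
server.shutdown()
server.server_close()

print(local_ports[0] == local_ports[1])  # True
```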

When downloading a full Web page, TCP_NODELAY can save you up to 0.2 seconds on every HTTP request, which is nice. When it comes to online gaming or high-frequency trading, getting rid of latency is critical, even at the price of some network saturation.

On tcp_nopush

In Nginx, the tcp_nopush configuration option works as the opposite of tcp_nodelay. Instead of optimizing delays, it optimizes the amount of data sent at once.

To keep everything logical, Nginx tcp_nopush activates the TCP_CORK option in the Linux TCP stack since the TCP_NOPUSH one exists on FreeBSD only.

The well-named TCP_CORK blocks the data until the packet reaches the MSS, which equals the MTU minus the 40 bytes (IPv4) or 60 bytes (IPv6) of IP and TCP headers.
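Here is that arithmetic as a short Python sketch. It assumes a classic 1500-byte Ethernet MTU and headers without options; the function names and the 100,000-byte file are made up for the example.

```python
# Assumptions: no IP or TCP options, 1500-byte Ethernet MTU by default.
IP_HEADER = {"ipv4": 20, "ipv6": 40}
TCP_HEADER = 20


def mss(mtu: int = 1500, family: str = "ipv4") -> int:
    """Largest TCP payload that fits in one packet of size `mtu`."""
    return mtu - IP_HEADER[family] - TCP_HEADER


def split_into_segments(file_size: int, mtu: int = 1500, family: str = "ipv4"):
    """Return (number of full packets, bytes left for the last packet)."""
    return divmod(file_size, mss(mtu, family))


print(mss())                         # 1460
print(mss(family="ipv6"))            # 1440
print(split_into_segments(100_000))  # (68, 720)
```

The 100,000-byte file fills 68 packets and leaves 720 bytes over: that last, partially filled packet is the one a cork would hold back.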

Everything is well explained in the Linux kernel source code:

/* Return false, if packet can be sent now without violation Nagle's rules:
 * 1. It is full sized.
 * 2. Or it contains FIN. (already checked by caller)
 * 3. Or TCP_CORK is not set, and TCP_NODELAY is set.
 * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
 *    With Minshall's modification: all sent small packets are ACKed.
 */
static inline bool tcp_nagle_check(const struct tcp_sock *tp,
                                   const struct sk_buff *skb,
                                   unsigned int mss_now, int nonagle)
{
        return skb->len < mss_now &&
                ((nonagle & TCP_NAGLE_CORK) ||
                 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
}
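If C is not your thing, the same decision can be restated as a small Python model. This is my simplified reading of the kernel function quoted above, not kernel code: the two flag values mirror the kernel's TCP_NAGLE_OFF and TCP_NAGLE_CORK constants, and Minshall's check is reduced to a plain boolean argument (small_unacked) that I made up for the sketch.

```python
TCP_NAGLE_OFF = 1   # flag the kernel sets for TCP_NODELAY
TCP_NAGLE_CORK = 2  # flag the kernel sets for TCP_CORK


def nagle_holds_packet(length, mss_now, nonagle, packets_out, small_unacked):
    """Return True when the segment must wait, False when it can go now."""
    if length >= mss_now:
        return False                  # rule 1: full-sized packets always go
    if nonagle & TCP_NAGLE_CORK:
        return True                   # corked: small packets always wait
    # Plain Nagle: wait only while earlier small packets are still unACKed.
    return bool(nonagle == 0 and packets_out > 0 and small_unacked)


print(nagle_holds_packet(1460, 1460, 0, 1, True))              # False
print(nagle_holds_packet(100, 1460, TCP_NAGLE_OFF, 3, True))   # False
print(nagle_holds_packet(100, 1460, TCP_NAGLE_CORK, 0, False)) # True
print(nagle_holds_packet(100, 1460, TCP_NAGLE_OFF | TCP_NAGLE_CORK, 0, False))  # True
```

The last line shows the point that matters for the rest of this article: when both flags are set, the cork wins.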

TCP_CORK needs to be explicitly removed if you want to send half empty (or half full) packets.

The tcp(7) man page explains that TCP_NODELAY and TCP_CORK used to be mutually exclusive, but that they can be combined since Linux 2.5.71.
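Setting and removing the cork is again a matter of setsockopt(). The sketch below is Linux only (socket.TCP_CORK does not exist elsewhere), and the corked() context manager is a helper I wrote for the example, not something from Nginx or the standard library.

```python
import contextlib
import socket


@contextlib.contextmanager
def corked(sock):
    """Hold partial frames back while the block runs, release them on exit."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        yield sock
    finally:
        # Removing the cork is what lets the last, partial packet leave.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Both options can be set on the same socket on a modern Linux kernel.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
with corked(sock):
    cork_inside = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK)
cork_after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK)
sock.close()

print(cork_inside != 0)  # True
print(cork_after)        # 0
```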

In the Nginx configuration, tcp_nopush must be enabled together with sendfile, which is exactly where things get interesting.

On sendfile

Nginx’s initial fame came from its awesomeness at sending static files. This has a lot to do with the association of sendfile, tcp_nodelay and tcp_nopush in nginx.conf. The Nginx sendfile option enables the use of sendfile(2) for everything related to… sending files.

sendfile(2) transfers data from one file descriptor to another directly in kernel space. This saves lots of resources:

sendfile(2) does the whole copy inside kernel space in a single syscall, so the data never travels to user space and back, and you avoid costly context switches.

sendfile(2) replaces the combination of read(2) and write(2).

sendfile(2) allows zero copy, which means the kernel buffer is filled directly from the block device through DMA.

Unfortunately, sendfile(2) requires a source file descriptor that supports mmap(2) and friends, which excludes UNIX sockets, for example those used to talk to a local Rails backend without all the network latency. The sendfile(2) man page says:

The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket).
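Python exposes the syscall as os.sendfile(), which makes for a compact illustration. In this sketch, a temporary file and a local socket pair stand in for a static file and a client connection; those stand-ins and the helper name send_with_sendfile are assumptions of mine, and the example targets Linux.

```python
import os
import socket
import tempfile


def send_with_sendfile(path, out_sock):
    """Send a whole file through os.sendfile() and return the bytes sent."""
    sent = 0
    with open(path, "rb") as f:          # in_fd: a regular file, not a socket
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves the data from the file to the socket;
            # it never lands in a Python (user space) buffer.
            n = os.sendfile(out_sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break
            sent += n
    return sent


payload = b"hello sendfile\n" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

left, right = socket.socketpair()
sent = send_with_sendfile(tmp.name, left)
left.close()

received = b""
while True:
    chunk = right.recv(4096)
    if not chunk:
        break
    received += chunk
right.close()
os.unlink(tmp.name)

print(sent == len(payload), received == payload)  # True True
```

Compare this with the usual loop of read() into a buffer followed by write() to the socket: here there is no buffer to manage on the application side at all.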

Depending on your needs, sendfile can be either totally useless or completely essential.

If you’re serving locally stored static files, sendfile is totally essential to speed up your Web server. But if you use Nginx as a reverse proxy to serve pages from an application server, you can deactivate it, unless you start serving a micro-cache from a tmpfs. I’ve been doing it here, and I didn’t even notice the days I was featured on the HN homepage, Reddit or good old Slashdot.

Let’s mix everything together

Things get really interesting when you mix sendfile, tcp_nodelay and tcp_nopush together. I was wondering why anyone would mix two antithetic and mutually exclusive options. The answer lies deep inside a 2005 thread from the (Russian) Nginx mailing list.

Combined with sendfile, tcp_nopush ensures that the packets are full before being sent to the client. This greatly reduces network overhead and speeds up the way files are sent. Then, when it reaches the last (probably half-empty) packet, Nginx removes tcp_nopush. At that point, tcp_nodelay forces the socket to send the data, saving up to 0.2 seconds per file.
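The whole sequence can be sketched in a few lines of Python over a loopback TCP connection. To be clear, this is my illustration of the steps described above, not Nginx's actual code: serve_file, the hard-coded response header and the throwaway local server are all made up for the demo, and TCP_CORK makes it Linux only.

```python
import os
import socket
import tempfile
import threading


def serve_file(conn, path):
    """Cork, send headers with send(), send the body with sendfile(), uncork."""
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)  # fill the packets
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % size)
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
    # Removing the cork releases the last, partially filled packet.
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
    return offset


body = os.urandom(100_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(body)

listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]


def handle_one_client():
    conn, _ = listener.accept()
    with conn:
        serve_file(conn, tmp.name)


worker = threading.Thread(target=handle_one_client)
worker.start()

response = b""
with socket.create_connection(("127.0.0.1", port)) as client:
    while True:
        chunk = client.recv(65536)
        if not chunk:
            break
        response += chunk

worker.join()
listener.close()
os.unlink(tmp.name)

header, _, received = response.partition(b"\r\n\r\n")
print(received == body)  # True
```

The header written with sendall() and the first bytes of the file end up sharing packets instead of leaving separately, and nothing waits on Nagle once the cork is removed.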

This behavior is confirmed in a comment from the TCP stack source about TCP_CORK:

When set indicates to always queue non-full frames. Later the user clears this option and we transmit any pending partial frames in the queue. This is meant to be used alongside sendfile() to get properly filled frames when the user (for example) must write out headers with a write() call first and then use sendfile to send out the data parts. TCP_CORK can be set together with TCP_NODELAY and it is stronger than TCP_NODELAY.

Nice, isn’t it?

Here we are, I think we’re done. I deliberately did not mention writev(2) as an alternative to tcp_nopush, to avoid adding complexity. I hope you enjoyed reading this. Don’t hesitate to send me an email if you have something to add, I’ll publish it with pleasure.

Many thanks to Arthur, Bruno, Bsdsx and Ludovic for proofreading this article, and to Greg both for his deep knowledge and for kicking my ass until I came back to him with answers to his questions.

Original article published on Nginx Optimization: understanding sendfile, tcp_nodelay and tcp_nopush