

Author: “No Bugs” Hare. Job Title: Sarcastic Architect. Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek.

[[About Vol.2 of the upcoming “Development and Deployment of MMOG” book. There is no need to worry, I just need some time to prepare for publishing of Vol.1. “beta” chapters of Vol.2 are planned to start appearing next week. Stay tuned!]]

Discussion on the advantages of TCP vs UDP (and vice versa) has a history which is almost as long as the eternal Linux-vs-Windows debate. As I have long been a supporter of the point of view that both UDP and TCP have their own niches (see, for example, [NoBugs15]), here are my two cents on this subject.

Note for those who already know the basics of IP and TCP: please skip to the ‘Closing the Gap: Improving TCP Interactivity’ section, as you still may be able to find a thing or two of interest.

IP: just packets, nothing more

As both TCP and UDP run over IP, let’s see what the internet protocol (IP) really is. For our purposes, we can say that:

we have two hosts which need to communicate with each other

each of the hosts is assigned its own IP address

the internet protocol (and IP stack) provides a way to deliver data packets from host A to host B, using an IP address as an identifier

In practice, of course, it is much more complicated than that (with all kinds of stuff involved in the operation of the IP, from ICMP and ARP to OSPF and BGP), but for now we can more or less safely ignore the complications as implementation details.

What we do need to know, though, is that an IP packet looks as follows:

IP Header (20 to 24 bytes for IPv4) IP Payload

One very important feature of IP is that it does not guarantee packet delivery. Not at all. Any single packet can be lost, period. It means that any number of packets can also be lost.

IP works only statistically; this behaviour is by design; actually, it is the reason why backbone Internet routers are able to cope with enormous amounts of traffic. If there is a problem (which can range from link overload to sudden reboot), routers are allowed to drop packets.

Within the IP stack, it is the job of the hosts to provide delivery guarantees. Nothing is done in this regard en route.

UDP: datagrams ~= packets

Next, let’s discuss the simpler of our two protocols: UDP. UDP is a very basic protocol which runs on top of IP. In fact, it is so basic that when UDP datagrams run on top of IP packets, there is always a 1-to-1 correspondence between the two; all UDP does is add a very simple header (in addition to the IP header), consisting of 4 fields: source port, destination port, length, and checksum, making it 8 bytes in total.

So, a typical UDP packet will look as follows:

IP Header (20 to 24 bytes for IPv4) UDP Header (8 bytes) UDP Payload

The UDP ‘Datagram’ is pretty much the same as an IP ‘packet’, with the only difference between the two being the 8 bytes of UDP header; for the rest of the article we’ll use these two terms interchangeably.

As UDP datagrams simply run on top of IP packets, and IP packets can be lost, UDP datagrams can be lost too.
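The two UDP properties above – the fixed 8-byte header and the 1-to-1 datagram-to-packet mapping – can be illustrated with a minimal sketch (Python is used here purely for illustration, as its socket module mirrors the BSD socket API; the port numbers and payloads are made up):

```python
import socket
import struct

# The four UDP header fields -- source port, destination port, length,
# checksum -- are 16 bits each, 8 bytes in total:
header = struct.pack("!HHHH", 5000, 6000, 8 + 5, 0)
assert len(header) == 8

# Loopback exchange: each sendto() becomes exactly one datagram, delivered
# (or lost) as a unit -- recvfrom() never merges or splits datagrams.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))              # let the OS pick a free port
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hello", rx.getsockname())
tx.sendto(b"world", rx.getsockname())
first, _ = rx.recvfrom(65535)          # one datagram per recvfrom()
second, _ = rx.recvfrom(65535)
tx.close(); rx.close()
```

Over loopback nothing is lost, so both datagrams arrive; over a real network either `recvfrom()` could simply never see its datagram.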

TCP: stream != packets

In contrast with UDP, TCP is a very sophisticated protocol, which does guarantee reliable delivery.

The only relatively simple thing about TCP is its packet:

IP Header (20 to 24 bytes for IPv4) TCP Header (20 to 60 bytes) TCP Payload

Usually, the size of a TCP header is around 20 bytes, but in relatively rare cases it may reach up to 60 bytes.

As soon as we’re past the TCP packet, things become complicated. Here is an extremely brief and sketchy description of how TCP works:

TCP interprets all the data to be communicated between two hosts as two streams (one stream going from host A to host B, and another going in the opposite direction)

whenever the host calls the TCP function send(), the data is pushed into the stream

the TCP stack keeps a buffer (usually 2K–16K in size) on the sending side; all the data pushed to the stream goes to this buffer. If the buffer is full, send() won’t return until there is enough space in the buffer

data is not removed from the sending TCP buffer at the moment when the TCP packet is sent

After the receiving side gets the TCP packet, it sends a TCP packet in the opposite direction, with a TCP header with the ACK bit set – an indication that a certain portion of the data has been received. On receiving this ACK packet, the sender may remove the corresponding piece from its TCP sending buffer.

Data received goes to another buffer on the receiving side; again, its size is usually of the order of 2K–16K. This receiving buffer is where the data for the recv() function comes from.

If the sending side doesn’t receive an ACK in a predefined time, it will re-send the TCP packet. This is the primary mechanism by which TCP guarantees delivery in case of a packet being lost.
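The ‘stream, not packets’ view described above can be demonstrated over loopback (a sketch; Python is used for illustration). Two separate send() calls produce one undifferentiated byte stream on the receiving side – TCP preserves order and content, but not message boundaries:

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"hello")     # two separate "pushes" into the stream...
cli.sendall(b"world")
cli.close()               # close so the receiver sees end-of-stream

received = b""
while True:
    chunk = conn.recv(4096)   # chunk sizes here are arbitrary: they
    if not chunk:             # need not match the send() calls at all
        break
    received += chunk
conn.close(); srv.close()
```

The receiver is guaranteed to end up with `b"helloworld"`, but whether that arrives as one chunk or several is entirely up to the TCP stack.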

So far so good. However, there are some further caveats. First of all, when re-sending because of no-ACK timeouts, these timeouts are doubled for each subsequent re-send. The first re-send is usually sent after time T1, which is double the so-called RTT (round-trip-time, which is measured by the host as a time interval between the moment when the packet was sent and another moment when the ACK for the packet was received); the second re-send is sent after time T2=2*T1, the third one after T3=2*T2=4*T1, and so on. This feature (known as ‘exponential back-off’) is intended to avoid the Internet being congested due to too many retransmitted packets flying around, though the importance of exponential back-off in avoiding Internet congestion is currently being challenged [Mondal]. Whatever the reasoning, exponential back-off is present in every TCP stack out there (at least, I’ve never heard of any TCP stacks which don’t implement it), so we need to cope with it (we’ll see below when and why it is important).
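The doubling of no-ACK timeouts described above can be put into numbers with a back-of-the-envelope sketch (the function name is made up; real stacks add clamping, RTT variance terms, and other refinements):

```python
def backoff_schedule(rtt_ms, retries):
    """Per-retry timeouts in ms: T1 = 2*RTT, then T2 = 2*T1, T3 = 2*T2, ..."""
    t = 2 * rtt_ms
    schedule = []
    for _ in range(retries):
        schedule.append(t)
        t *= 2          # exponential back-off: each timeout doubles
    return schedule

# With a 50ms RTT: first re-send waits 100ms, then 200ms, 400ms, 800ms...
sched = backoff_schedule(50, 4)
```

This makes it easy to see why a connection with several losses in a row can appear ‘stuck’: after only a handful of retries the stack is already waiting most of a second between attempts.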

Another caveat related to interactivity is the so-called Nagle algorithm. Originally designed to avoid telnet sending 41-byte packets for each character pressed (which constitutes a 4000% overhead), it also allows the hiding of more packet-related details from a “TCP as stream” perspective (and eventually became a way to allow developers to be careless and push data to the stream in as small chunks as they like, in the hope that the ‘smart TCP stack will do all the packet assembly for us’). The Nagle algorithm avoids sending a new packet as long as there is (a) an unacknowledged outstanding packet and (b) there isn’t enough data in the buffer to fill the whole packet. As we will see below, it has significant implications on interactivity (but, fortunately for interactivity, usually Nagle can be disabled).
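The two-condition rule above can be sketched as a toy decision function (this is a simplification of the real algorithm, and the names and MSS figure are illustrative, not from any actual stack):

```python
def nagle_may_send(buffered_bytes, mss, unacked_in_flight):
    """Toy sketch of Nagle's decision rule.

    A full packet (>= MSS) always goes out; a small packet goes out
    only when there is no unacknowledged data already in flight.
    """
    if buffered_bytes >= mss:
        return True
    return not unacked_in_flight

# Telnet-style single byte, nothing in flight: sent immediately.
assert nagle_may_send(1, 1460, unacked_in_flight=False)
# Single byte while awaiting an ACK: held back, to be coalesced later.
assert not nagle_may_send(1, 1460, unacked_in_flight=True)
# A full MSS worth of data is sent regardless.
assert nagle_may_send(1460, 1460, unacked_in_flight=True)
```

The middle case is exactly what hurts interactivity: a small, urgent update can sit in the buffer for up to one RTT waiting for the previous packet’s ACK.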

TCP: Just the ticket? Not so fast 🙁

Some people may ask: if TCP is so much more sophisticated and more importantly, provides reliable data delivery, why not just use TCP for every network transfer under the sun?

Unfortunately, it is not that simple. Reliable delivery in TCP does have a price tag attached, and this price is all about loss of interactivity :-(.

Let’s imagine a first-person shooter game that sends updates, with each update containing only the position of the player. Let’s consider two implementations: Implementation U which sends the player position over UDP (a single UDP packet every 10ms, as the game is fast and the position is likely to change during this time anyway), and Implementation T which sends the player position over TCP.

First of all, with Implementation T, if your application is calling send() every 10 ms, but RTT is, say, 50ms, your data updates will be delayed (according to the Nagle algorithm, see above). Fortunately, the Nagle algorithm can be disabled using the TCP_NODELAY option (see the section ‘Closing the gap: improving TCP interactivity’ for details).

If the Nagle algorithm is disabled, and there are no packets lost on the way (and both hosts are fast enough to process the data) – there won’t be any difference between these implementations. But what will happen if some packets are lost?

With Implementation U, even if a packet is lost, the next packet will heal the situation, and the correct player position will be restored very quickly (at most in 10 ms). With Implementation T, however, we cannot control timeouts, so the packet won’t be re-sent until around 2*RTT; as RTT can easily reach 50ms even for a first-person shooter (and is at least 100–150ms across the Atlantic), the retransmit won’t happen until about 100ms, which represents a Big Degradation compared to Implementation U.

In addition, with Implementation T, if one packet is lost but the second one is delivered, this second packet (while present on the receiving host) won’t be delivered to the application until the second instance of the first packet (i.e. the first packet retransmitted on timeout) is received; this is an inevitable consequence of treating all the data as a stream (you cannot deliver the latter portion of the stream until the former one is delivered).

To make things even worse for Implementation T, if there is more than one packet lost in a row, then the second retransmit with Implementation T won’t come until about 200ms (assuming 50ms RTT), and so on. This, in turn, often leads to existing TCP connections being ‘stuck’ when new TCP connections will succeed and will work correctly. This can be addressed, but requires some effort (see ‘Closing the gap: improving TCP interactivity’ section below).

So, should we always go with UDP?

In Implementation U described above, UDP worked pretty well, but this was closely related to the specifics of the messages exchanged. In particular, we assumed that every packet has all the information necessary, so loss of any packet will be ‘healed’ by the next packet. If such an assumption doesn’t hold, using UDP becomes non-trivial.

Also, the whole schema relies on us sending packets every 10ms; this may easily result in sending too much traffic even if there is little activity; on the other hand, increasing this interval with Implementation U will lead to loss of interactivity.

What should we do then?

Basically, the rules of thumb are roughly as follows:

If characteristic times for your application are of the order of many hours (for example, you’re dealing with lengthy file transfers) – TCP will do just fine, though it is still advisable to use TCP built-in keep-alives (see below).

If characteristic times for your application are below ‘many hours’ but are over 5 seconds – it is more or less safe to go with TCP. However, to ensure interactivity consider implementing your ‘Own Keep-Alives’ as described below.

If characteristic times for your application are (very roughly) between 100ms and 5 seconds – this is pretty much a grey area.

If characteristic times for your application are below 100ms – it is very likely that you need UDP. See the ‘Closing the gap: reliable UDP’ section below on the ways of adding reliability to UDP.
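The rules of thumb above can be folded into one hypothetical helper (the thresholds are the article’s rough figures, not hard limits, and the ‘many hours’ boundary is arbitrarily approximated here as 10 hours):

```python
def suggested_protocol(characteristic_time_sec):
    """Rough protocol suggestion per the rules of thumb; purely illustrative."""
    if characteristic_time_sec < 0.1:
        return "UDP (consider a reliable-UDP layer)"
    if characteristic_time_sec < 5:
        return "grey area"
    if characteristic_time_sec < 10 * 3600:
        return "TCP (consider your own keep-alives)"
    return "TCP (built-in keep-alives will do)"
```

For example, a 50ms first-person-shooter update cycle lands squarely in UDP territory, while a multi-hour file transfer is a clear TCP case.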

Closing the gap: reliable UDP

In cases when you need to use UDP but also need to make it reliable, you can use one of the ‘reliable UDP’ libraries [Enet] [UDT] [RakNet]. However, these libraries cannot do any magic, so they’re essentially restricted to retransmits at some timeouts. Therefore, before using such a library, you will still need to understand very well how exactly it achieves reliable delivery, and how much interactivity it sacrifices in the process (and for what kind of messages).

It should be noted that when implementing reliable UDP, the more TCP features you implement, the more chance there is that you will end up with an inferior implementation of TCP. TCP is a very complex protocol (and most of its complexity is there for a good reason), so attempting to implement a ‘better TCP’ is extremely difficult. On the other hand, implementing ‘reliable UDP’ at the cost of dropping most of TCP’s functionality is possible.
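At its core, the ‘retransmits at some timeouts’ mechanism that every such library is restricted to looks roughly like the following deliberately tiny sketch (all names here are made up, and the network is simulated; real libraries such as ENet or RakNet add sequence windows, fragmentation, channels, flow control, and much more):

```python
def deliver_reliably(datagrams, lossy_send, max_attempts=10):
    """Number the datagrams; retransmit each until ACKed or give up.

    lossy_send(seq, payload) models one transmission attempt and
    returns True iff an ACK came back before the timeout.
    """
    for seq, payload in enumerate(datagrams):
        for _attempt in range(max_attempts):
            if lossy_send(seq, payload):
                break                    # ACKed: move on to next datagram
        else:
            raise TimeoutError(f"datagram {seq} never ACKed")

# Simulated network that drops every first transmission:
attempts = {}
def flaky(seq, payload):
    attempts[seq] = attempts.get(seq, 0) + 1
    return attempts[seq] >= 2            # succeeds only on the retransmit

deliver_reliably([b"a", b"b"], flaky)
```

Even this toy version makes the trade-off visible: every retransmit costs at least one extra timeout’s worth of latency, which is exactly the interactivity question you need to ask of any real library.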

EDIT: [Wikipedia.QUIC] by Google is an interesting effort in this regard, stay tuned for their further developments.

Closing the gap: improving TCP interactivity

There are several things which can be done to improve interactivity of TCP.

Keep-alives and ‘stuck’ connections

One of the most annoying problems when using TCP for interactive communication is ‘stuck’ TCP connections. When you see a browser page which is ‘stuck’ in the middle, then press ‘Refresh’ – and bingo! – here is your page, then chances are you have run into such a ‘stuck’ TCP connection.

One way to deal with ‘stuck’ TCP connections (and without your customer working as a freebie error handler) is to have some kind of ‘keep alive’ messages which the parties exchange every N seconds; if there are no messages on one of the sides for, say, 2*N time – you can assume that TCP connection is ‘stuck’, and try to re-establish it.

TCP itself includes a keep-alive mechanism (look for the SO_KEEPALIVE option for setsockopt()), but its timeout is usually of the order of 2 hours (and worse, at least under Windows it is not configurable other than via a global setting in the Registry, ouch).
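Enabling the built-in keep-alive looks as follows (a sketch in Python, whose socket API mirrors setsockopt(); the TCP_KEEPIDLE knob for shrinking the ~2-hour idle time per socket is Linux-specific, hence the guard, and the 60-second value is just an example):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Turn on TCP's own keep-alive probing for this socket:
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# On Linux only, the per-socket idle time before probing can be lowered:
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
```

Where per-socket tuning isn’t available, you are back to the application-level keep-alive described next.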

So, if you need to detect your ‘stuck’ TCP connection earlier than in two hours, and the operating systems on both sides of your TCP connection don’t support per-socket keep-alive timeouts, you need to create your own keep-alive over TCP, with the timeouts you need. It is usually not rocket science, but is quite a bit of work.

The basic way of implementing your own keep-alive usually goes as follows:

You’re splitting your TCP stream into messages (which is usually a good idea anyway); each message contains its type, size, and payload

One message type is MY_DATA, with a real payload. On receiving it, it is passed to the upper layer. Optionally, you may also reset a ‘connection is dead’ timer.

Another message type is MY_KEEPALIVE, without any payload. On receiving it, it is not passed to the upper layer, but the ‘connection is dead’ timer is reset.

MY_KEEPALIVE is sent whenever there are no other messages going over the TCP connection for N seconds

When the ‘connection is dead’ timer expires, the connection is declared dead and is re-established. While this is a fallacy from the traditional TCP point of view, it has been observed to help interactivity in a significant manner. As an optimization, you may want to keep the original connection alive while you’re establishing the new one; if the old connection receives something while you’re establishing the new one, you can resume communication over the old one, dropping the new one.
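The message framing above (type + size + payload, with MY_DATA and MY_KEEPALIVE) can be sketched as follows (the header layout and type values here are made up for illustration; any fixed-width, network-byte-order encoding will do):

```python
import struct

MY_DATA, MY_KEEPALIVE = 1, 2
HEADER = struct.Struct("!BH")       # 1-byte type, 2-byte payload size

def frame(msg_type, payload=b""):
    """Wrap a payload into one length-prefixed message."""
    return HEADER.pack(msg_type, len(payload)) + payload

def unframe(buf):
    """Yield (type, payload) for each complete message in buf.

    An incomplete trailing message is left for the next recv() to finish --
    remember, TCP gives you a stream, not message boundaries.
    """
    while len(buf) >= HEADER.size:
        msg_type, size = HEADER.unpack_from(buf)
        if len(buf) < HEADER.size + size:
            break
        yield msg_type, buf[HEADER.size:HEADER.size + size]
        buf = buf[HEADER.size + size:]

# A stream carrying one data message and one keep-alive:
stream = frame(MY_DATA, b"position=3,4") + frame(MY_KEEPALIVE)
msgs = list(unframe(stream))
```

On the receiving side, any MY_KEEPALIVE (or indeed any message at all) would reset the ‘connection is dead’ timer; only MY_DATA payloads go up to the application.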



TCP_NODELAY

One of the most popular ways to improve TCP interactivity is enabling TCP_NODELAY on your TCP socket (again, as a parameter of the setsockopt() function).

If TCP_NODELAY is enabled, then the Nagle algorithm is disabled.

However, TCP_NODELAY is not without its own caveats. Most importantly, with TCP_NODELAY you should always assemble the whole message-you-want-to-send before calling send(). Otherwise, each of your calls to send() will cause the TCP stack to send a packet (with the associated 40–84 bytes overhead, ouch).
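In code, the option and the assemble-before-send discipline look roughly like this (a sketch; the message contents are made up, and Python is used as a stand-in for the setsockopt() call):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable the Nagle algorithm for this socket:
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

# With Nagle off, assemble the complete message first...
message = b"".join([b"type=1;", b"len=5;", b"hello"])
# ...and then hand it to the stack in ONE send()/sendall() call;
# three separate send() calls here could mean three packets on the wire,
# each paying the full per-packet header overhead.
s.close()
```

The same discipline is a good habit even without TCP_NODELAY; with it, it is essential.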

Out-of-Band Data

TCP OOB (Out-of-Band Data) is a mechanism which is intended to break the stream and deliver some data with a higher priority. As OOB adds priority (in a sense that it bypasses both the TCP sending buffer and TCP receiving buffer), it may help to deal with interactivity. However, with TCP OOB being able to send only one byte (while you can call send(…,MSG_OOB) with more than one byte, only the last byte of the block will be interpreted as OOB), its usefulness is usually quite limited.

One scenario where MSG_OOB works pretty well (and which is used in protocols such as FTP) is to send an ‘abort’ command during a long file transfer; on receiving the OOB ‘abort’, the receiving side simply reads all the data from the stream, discarding it without processing, until the OOB marker (the place in the TCP stream where send(…,MSG_OOB) was called on the sending side) is reached. This way, all the TCP buffers are effectively discarded, and the communication can be resumed without dropping the TCP connection and re-establishing a new one. For more details on MSG_OOB see [Stevens] (with a relevant chapter available on [Masterraghu]).
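A minimal loopback sketch of this ‘abort’ scenario (Python over a POSIX-style stack; a real receiver would select() on the exceptional-condition set rather than sleep, and would loop discarding in-band data up to the OOB mark):

```python
import socket
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"...file data to be discarded...")
cli.send(b"!", socket.MSG_OOB)   # the 1-byte out-of-band 'abort' signal
time.sleep(0.2)                  # crude wait for both to arrive (sketch only)

urgent = conn.recv(1, socket.MSG_OOB)  # the urgent byte, read out of band
inband = conn.recv(4096)               # the ordinary stream data, which the
                                       # receiver would now discard up to the mark
for sock in (cli, conn, srv):
    sock.close()
```

Note how the urgent byte is readable independently of the queued in-band data – that bypassing of the buffers is precisely what makes OOB useful for an ‘abort’.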

Residual issues

Even with all the tricks above, TCP is still lacking interactivity-wise. In particular, out-of-order delivery of anything larger than one byte is still not an option, stale data will still be retransmitted even when it is no longer needed, and dealing with ‘stuck’ connections is just a way to mitigate the problem rather than to address it. On the other hand, if your application is relatively tolerant to delays, the ‘other considerations’ described below may easily be the deciding factor in your choice of protocol.

Other considerations

If you’re lucky enough and the requirements of your application can be satisfied by both TCP and UDP, other considerations may come into play. These considerations include (but are not limited to):

EDIT: pretty much whenever we can use TCP, we can also use so-called “Websockets”. They inherit most of the advantages and disadvantages of TCP, but are usually better suitable for web apps (and are generally even a bit more firewall-friendly than plain TCP)


TCP has flow control, UDP as such doesn’t.

TCP is generally more firewall- and NAT-friendly. This can be roughly translated as ‘if you want your user to be able to connect from a hotel room, or from work – TCP usually tends to work better, especially if going over port 80, or over port 443’. [QUIC] estimates the number of people who have outbound TCP connectivity but don’t have outbound UDP at 6–9%. Whether that is too much depends on your application, but I strongly suggest having a TCP fallback at least for those apps where it is not completely hopeless.

TCP is significantly simpler to program for. While TCP is not without caveats, not discussed here (see also [NoBugs15a]), dealing with UDP so it works without major problems generally takes significantly longer.

TCP generally has more overhead, especially during connection setup and connection termination. The overall difference in traffic will usually be small, but this might still be a valid concern.

Conclusions

The choice of TCP over UDP (or vice versa) might not always be obvious. In a sense, replacing TCP with UDP is trading off reliability for interactivity.

The most critical factor in selection of one over another one is usually related to acceptable delays; TCP is usually optimal for over-several-seconds times, and UDP for under-0.1-second times, with anything in between being a ‘grey area’. On the other hand, other considerations (partially described above) may play their own role, especially within the ‘grey area’.

Also, there are ways to improve TCP interactivity as well as UDP reliability (both briefly described above); this often allows us to close the gap between the two.

[ + ] Disclaimer as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04] ) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Acknowledgements

This article was originally published in Overload Journal #130 in December 2015 and is also available separately on the ACCU website. Re-posted here with the kind permission of Overload. The article has been re-formatted to fit your screen.

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.