How to achieve Gigabit speeds with Linux

Before you begin with the configuration, MAKE SURE YOU HAVE PERMISSION TO TRANSFER DATA ON THE ENTIRE NETWORK. Gbit/s transfers can do real damage on a network and are not (yet) appropriate for production environments.

Although this document is mainly about achieving data transfers using TCP, some parts are also useful if you wish to use other transport protocols. If possible, always try UDP data transfers first, and move on to TCP once everything works.

TCP will usually work well if the traffic is not competing with other flows. If any segment in the data path carries other traffic, TCP will perform worse. This is discussed in more detail in section 5.

1 - HARDWARE

1. 1 Gbit/s network cards
   - PCI 64-bit/66 MHz bus recommended (4 Gbit/s theoretical bus limit)
   - Pay attention to shared buses on the motherboard. For example, the SuperMicro motherboard for Intel Xeon processors splits the PCI bus into 3 segments: PCI slots 1-3; PCI slot 4 plus the on-board SCSI controller (if one exists); and PCI slots 5-6.

2. Intel 10 Gbit/s network cards
   a) PCI-X 133 MHz bus recommended (8.5 Gbit/s theoretical limit)
   b) Processor (and motherboard) with 533 MHz front-side bus
   c) PCI slot configured as bus master (improved stability)
   d) The Intel 10 Gbit/s card should be alone on its bus segment for optimal performance
   e) The PCI burst size should be increased to 512K
   f) The card driver should be configured with the following parameters:
      i. Interrupt coalescence
      ii. Jumbo frames
      iii. Gigabit Ethernet flow control (it should also be active on the connecting switch)
      iv. Increased network card packet buffers (default 1024, maximum 4096)
      v. RX and TX checksum offload enabled

2 - DRIVER ISSUES

Interrupt coalescence settings

When increasing the IC settings, there should be a sufficient number of descriptors in the ring buffers associated with the interface to hold the number of packets expected between consecutive interrupts.

As expected, increasing the IC settings increases latency: the difference in latency reflects the increased length of time packets spend in the NIC's memory before being processed by the kernel.

If TxInt is reduced to 0, throughput is significantly affected for all values of RxInt, due to increased PCI activity and insufficient CPU capacity to cope with the context switching on the sending PC.

If CPU power is important for your system (for example, a shared server machine), then it is recommended to use high interrupt coalescence in order to moderate CPU usage. If the machine is going to be dedicated to a single transfer, then interrupt coalescence should be off.
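On Linux, interrupt coalescence can usually be inspected and changed with ethtool, where the driver supports it. Below is a sketch, assuming the interface is eth2 and using illustrative timer/frame values; the exact parameters supported vary per driver.

```shell
# Show the current coalescence settings (support varies per driver)
/sbin/ethtool -c eth2

# Shared machine: wait up to 100 us or 64 frames before raising a
# receive interrupt, trading latency for lower CPU usage (illustrative values)
/sbin/ethtool -C eth2 rx-usecs 100 rx-frames 64

# Dedicated transfer machine: interrupt on (almost) every packet
/sbin/ethtool -C eth2 rx-usecs 0 rx-frames 1
```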

NAPI

Although NAPI is compatible with the old system, and so with old drivers, you need to use a NAPI-aware driver to enable this improvement on your machine. Such a driver exists, e.g., for the SysKonnect Gigabit card [LINK TO BE PROVIDED BY MATHIEU].

The NAPI network subsystem is a lot more efficient than the old system, especially in a high performance context. The pros are:

limitation of the interrupt rate (you can think of it as an adaptive interrupt coalescence mechanism);

not prone to receive livelock [3];

better data & instruction locality.

3 - KERNEL CONFIGURATION

3.1 Interface transmit queue length

These settings are especially important for TCP, as losses on local queues will cause TCP to fall into congestion control, which limits the TCP sending rate. Meanwhile, full queues will cause packet losses when transporting UDP packets.

There are two queues to consider: txqueuelen, which is related to the transmit queue size, and netdev_max_backlog, which determines the receive queue size.

txqueuelen sets the length of the transmit queue of the device. It is useful to set this to small values for slower devices with high latency (modem links, ISDN) to prevent fast bulk transfers from disturbing interactive traffic such as telnet too much.

Users can manually set this queue size using the ifconfig command on the required device, e.g.:

/sbin/ifconfig eth2 txqueuelen 2000

The default of 100 is inadequate for long-distance, high-throughput pipes. For example, on a network with an RTT of 120 ms and at Gigabit rates, a txqueuelen of at least 10000 is recommended.
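The recommendation above can be checked with a quick back-of-the-envelope calculation: the queue should hold roughly the number of MTU-sized packets in flight. The bandwidth, RTT and MTU below are illustrative assumptions (adjust the MTU if you use jumbo frames).

```shell
#!/bin/sh
# Sketch: estimate txqueuelen as the number of MTU-sized packets in flight
# on the path. All three values are illustrative assumptions.
BANDWIDTH_BPS=1000000000   # 1 Gbit/s
RTT_S=0.12                 # 120 ms round-trip time
MTU_BYTES=1500             # standard Ethernet MTU
awk -v bw="$BANDWIDTH_BPS" -v rtt="$RTT_S" -v mtu="$MTU_BYTES" \
    'BEGIN { printf "%d\n", bw * rtt / (mtu * 8) }'
```

For these values the script prints 10000, matching the recommendation of at least 10000.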

3.2 kernel receiver backlog

/sbin/sysctl -w net.core.netdev_max_backlog=2000

3.3 TCP cache parameter (Yee)

Linux caches TCP parameters (such as the slow-start threshold) per route, so a previous transfer that suffered losses can limit subsequent transfers to the same destination. In order to rectify this, one can flush all the TCP cache settings using the command:

/sbin/sysctl -w net.ipv4.route.flush=1

Note that this flushes all routes and is only temporary, i.e., one must run this command every time the cache is to be emptied.

3.4 SACKs and Nagle

At Gigabit rates over long distances, processing long SACK lists can load the CPU heavily, so it may be worth disabling SACKs:

/sbin/sysctl -w net.ipv4.tcp_sack=0

The Nagle algorithm should, however, be turned on. This is the default. You can check whether your program has Nagle switched off by looking for the TCP_NODELAY socket option; if it sets this option, comment that code out.

4 - SOCKET BUFFERS

If the buffers are too small, like they are when default values are used, the TCP congestion window will never fully open up. If the buffers are too large, the sender can overrun the receiver, and the TCP window will shut down.

4.1 Socket buffers and bandwidth delay product

socket buffer size = 2* bandwidth * delay

Estimating an approximate delay of the path is straightforward with a tool such as ping (see the Tools appendix below). Getting an idea of the available bandwidth is more difficult. Once again, you shouldn't attempt to transfer Gigabits per second of data unless you have at least minimal control over all the links in the path. Tools like pchar and pathchar can be used to get an idea of the bottleneck bandwidth on a path. Note that these tools are not very reliable, since estimating the available bandwidth on a path is still an open research issue.
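Plugging numbers into the formula above: for a hypothetical 1 Gbit/s path with a 120 ms delay measured with ping (both values are illustrative assumptions), the calculation looks like this.

```shell
#!/bin/sh
# Sketch: socket buffer size = 2 * bandwidth * delay, in bytes.
# Bandwidth and delay are illustrative assumptions for this example.
BANDWIDTH_BPS=1000000000   # 1 Gbit/s
DELAY_S=0.12               # 120 ms, as measured with ping
awk -v bw="$BANDWIDTH_BPS" -v d="$DELAY_S" \
    'BEGIN { printf "%d\n", 2 * (bw / 8) * d }'
```

For these values the script prints 30000000, i.e. a socket buffer of about 30 MB.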

Note that you should set the socket buffer size to the same value on both sender and receiver. To change the socket buffer size with iperf, use the -w option.

When you are building an application, use the appropriate "set socket option" system call. Here is an example in C (other languages use a similar construction):

int socket_descriptor, sndsize, err;
err = setsockopt(socket_descriptor, SOL_SOCKET, SO_SNDBUF, (char *)&sndsize, sizeof(sndsize));

and in the receiver

int socket_descriptor, rcvsize, err;
err = setsockopt(socket_descriptor, SOL_SOCKET, SO_RCVBUF, (char *)&rcvsize, sizeof(rcvsize));

To check what the buffer size is, you can use the "get socket option" system call:

int sockbufsize = 0, err;
socklen_t size = sizeof(sockbufsize);
err = getsockopt(socket_descriptor, SOL_SOCKET, SO_SNDBUF, (char *)&sockbufsize, &size);

4.2 Socket buffer memory queue limits: r/w mem (default and max)

/sbin/sysctl -w net.core.rmem_max= VALUE

where VALUE should be large enough for your socket buffer size.

You should also set the "write" value:

/sbin/sysctl -w net.core.wmem_max= VALUE

/sbin/sysctl -w net.ipv4.tcp_mem= MIN DEFAULT MAX

This should be done both in the sender and the receiver.
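As an illustration, the limits could be sized for the 1 Gbit/s, 120 ms example above, which called for roughly 30 MB socket buffers; these values are an assumption for that example, not a universal recommendation.

```shell
# Illustrative limits for a 1 Gbit/s, 120 ms RTT path (about 30 MB buffers);
# run on both sender and receiver
/sbin/sysctl -w net.core.rmem_max=30000000
/sbin/sysctl -w net.core.wmem_max=30000000
```

Note that net.ipv4.tcp_mem is measured in pages, not bytes, so its values must be sized differently from rmem_max and wmem_max.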

4.3 autotuning in 2.4 kernels

/sbin/sysctl -w net.ipv4.tcp_rmem= MIN DEFAULT MAX

/sbin/sysctl -w net.ipv4.tcp_wmem= MIN DEFAULT MAX

5 - OTHER METHODS (Miguel)

5.1 Using large block sizes

5.2 Parallel streams

bbcp and GridFTP (see the Tools appendix) are two file transfer tools that can also create parallel streams for data transfer. Be aware that performance is worse when disk access is involved; disks are frequently the bottleneck. A detailed analysis of disk performance is outside the scope of this document.
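Parallel streams can also be tried quickly with iperf before committing to a file transfer tool. A sketch, where the receiver host name is a placeholder and the stream count and buffer size are illustrative:

```shell
# Four parallel TCP streams (-P) with 2 MB socket buffers (-w) to a
# hypothetical receiver already running "iperf -s -w 2M"
iperf -c receiver.example.org -P 4 -w 2M
```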

5.3 New TCP stack

6 - NETWORK SUPPORT (Miguel)

Jumbo frames should be used if possible

Increase router queue sizes. In Cisco equipment, for example, the maximum (4096) should be used.

Gigabit Ethernet Flow Control should be ON

Avoid Fragmentation.

Watch out for IPv6 MTU advertisements. Some routers have IPv6 router advertisement on by default, which usually advertises a small value (typically 1500). End systems should turn off listening to these advertisements. This is easily configured in the /proc file system.
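A sketch of turning the advertisements off through /proc, assuming the interface in use is eth2 (the interface name is an assumption):

```shell
# Stop eth2 from accepting IPv6 router advertisements
echo 0 > /proc/sys/net/ipv6/conf/eth2/accept_ra
```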

A - Tools

traceroute - Lists all routers between two hosts. Usually available by default in Linux

tcpdump - dumps all TCP header information for a specified source/destination. Widely used and useful for network debugging.

pathchar - A tool to estimate the bandwidth available on all the links in a given path (not very reliable). http://www.caida.org/tools/utilities/others/pathchar/

iperf - currently the most widely used tool for traffic generation and measurement of end-to-end TCP/UDP performance.

bbftp - File Transfer Software, http://doc.in2p3.fr/bbftp/

GridFTP - File Transfer Software, http://www.globus.org/datagrid/gridftp.html

B - References