TL;DR

This blog post explains how computers running the Linux kernel send packets, as well as how to monitor and tune each component of the networking stack as packets flow from user programs to network hardware.

This post forms a pair with our previous post Monitoring and Tuning the Linux Networking Stack: Receiving Data.


It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.

This blog post will hopefully serve as a reference to anyone looking to do this.

General advice on monitoring and tuning the Linux networking stack

As mentioned in our previous article, the Linux network stack is complex and there is no one-size-fits-all solution for monitoring or tuning. If you truly want to tune the network stack, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the networking system interact.

Many of the example settings provided in this blog post are used solely for illustrative purposes and are not a recommendation for or against a certain configuration or default setting. Before adjusting any setting, you should develop a frame of reference around what you need to be monitoring to notice a meaningful change.

Adjusting networking settings while connected to the machine over a network is dangerous; you could very easily lock yourself out or completely take out your networking. Do not adjust these settings on production machines; instead, make adjustments on new machines and rotate them into production, if possible.

Overview

For reference, you may want to have a copy of the device data sheet handy. This post will examine the Intel I350 Ethernet controller, controlled by the igb device driver. You can find that data sheet (warning: LARGE PDF) here for your reference.

The high-level path network data takes from a user program to a network device is as follows:

1. Data is written using a system call (like sendto , sendmsg , et al.).
2. Data passes through the socket subsystem on to the socket’s protocol family’s system (in our case, AF_INET ).
3. The protocol family passes data through the protocol layers which (in many cases) arrange the data into packets.
4. The data passes through the routing layer, populating the destination and neighbour caches along the way (if they are cold). This can generate ARP traffic if an ethernet address needs to be looked up.
5. After passing through the protocol layers, packets reach the device agnostic layer.
6. The output queue is chosen using XPS (if enabled) or a hash function.
7. The device driver’s transmit function is called.
8. The data is then passed on to the queue discipline (qdisc) attached to the output device.
9. The qdisc will either transmit the data directly if it can, or queue it up to be sent during the NET_TX softirq.
10. Eventually the data is handed down to the driver from the qdisc.
11. The driver creates the needed DMA mappings so the device can read the data from RAM.
12. The driver signals the device that the data is ready to be transmitted.
13. The device fetches the data from RAM and transmits it.
14. Once transmission is complete, the device raises an interrupt to signal transmit completion.
15. The driver’s registered IRQ handler for transmit completion runs. For many devices, this handler simply triggers the NAPI poll loop to start running via the NET_RX softirq.
16. The poll function runs via a softIRQ and calls down into the driver to unmap DMA regions and free packet data.

This entire flow will be examined in detail in the following sections.

The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.

Detailed Look

This blog post will be examining the Linux kernel version 3.13.0 with links to code on GitHub and code snippets throughout this post, much like the companion post.

Let’s begin by examining how protocol families are registered in the kernel and used by the socket subsystem, then we can proceed to sending data.

Protocol family registration

What happens when you run a piece of code like this in a user program to create a UDP socket? sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)

In short, the Linux kernel looks up a set of functions exported by the UDP protocol stack that deal with many things, including sending and receiving network data. To understand exactly how this works, we have to look into the AF_INET address family code.
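The user-space side of that lookup is just the socket call itself. Here is a minimal sketch (the helper name and error handling are ours, not from the kernel source):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Ask the kernel for a UDP socket. AF_INET routes the request to the
 * inet_create function; SOCK_DGRAM + IPPROTO_UDP matches the inetsw
 * entry that links udp_prot and inet_dgram_ops to the new socket. */
int make_udp_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    return fd; /* -1 and errno set on failure (e.g. EPROTONOSUPPORT) */
}
```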

The Linux kernel executes the inet_init function early during kernel initialization. This function registers the AF_INET protocol family, the individual protocol stacks within that family (TCP, UDP, ICMP, and RAW), and calls initialization routines to get protocol stacks ready to process network data. You can find the code for inet_init in ./net/ipv4/af_inet.c.

The AF_INET protocol family exports a structure that has a create function. This function is called by the kernel when a socket is created from a user program: static const struct net_proto_family inet_family_ops = { .family = PF_INET, .create = inet_create, .owner = THIS_MODULE, };

The inet_create function takes the arguments passed to the socket system call and searches the registered protocols to find a set of operations to link to the socket. Take a look: /* Look for the requested type/protocol pair. */ lookup_protocol: err = -ESOCKTNOSUPPORT; rcu_read_lock(); list_for_each_entry_rcu(answer, &inetsw[sock->type], list) { err = 0; /* Check the non-wild match. */ if (protocol == answer->protocol) { if (protocol != IPPROTO_IP) break; } else { /* Check for the two wild cases. */ if (IPPROTO_IP == protocol) { protocol = answer->protocol; break; } if (IPPROTO_IP == answer->protocol) break; } err = -EPROTONOSUPPORT; }

Later, answer which holds a reference to a particular protocol stack has its ops fields copied into the socket structure: sock->ops = answer->ops;

You can find the structure definitions for all of the protocol stacks in af_inet.c . Let’s take a look at the TCP and UDP protocol structures: /* Upon startup we insert all the elements in inetsw_array[] into * the linked list inetsw. */ static struct inet_protosw inetsw_array[] = { { .type = SOCK_STREAM, .protocol = IPPROTO_TCP, .prot = &tcp_prot, .ops = &inet_stream_ops, .no_check = 0, .flags = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK, }, { .type = SOCK_DGRAM, .protocol = IPPROTO_UDP, .prot = &udp_prot, .ops = &inet_dgram_ops, .no_check = UDP_CSUM_DEFAULT, .flags = INET_PROTOSW_PERMANENT, }, /* .... more protocols ... */

In the case of IPPROTO_UDP , an ops structure is linked into place which contains functions for various things, including sending and receiving data: const struct proto_ops inet_dgram_ops = { .family = PF_INET, .owner = THIS_MODULE, /* ... */ .sendmsg = inet_sendmsg, .recvmsg = inet_recvmsg, /* ... */ }; EXPORT_SYMBOL(inet_dgram_ops);

and a protocol-specific structure prot , which contains function pointers to all the internal UDP protocol stack functions. For the UDP protocol, this structure is called udp_prot and is exported by ./net/ipv4/udp.c: struct proto udp_prot = { .name = "UDP", .owner = THIS_MODULE, /* ... */ .sendmsg = udp_sendmsg, .recvmsg = udp_recvmsg, /* ... */ }; EXPORT_SYMBOL(udp_prot);

Now, let’s turn to a user program that sends UDP data to see how udp_sendmsg is called in the kernel!


Sending network data via a socket

A user program wants to send UDP network data and so it uses the sendto system call, maybe like this: ret = sendto(socket, buffer, buflen, 0, &dest, sizeof(dest));

This system call passes through the Linux system call layer and lands in this function in ./net/socket.c : /* * Send a datagram to a given address. We move the address into kernel * space and check the user space data area is readable before invoking * the protocol. */ SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len, unsigned int, flags, struct sockaddr __user *, addr, int, addr_len) { /* ... code ... */ err = sock_sendmsg(sock, &msg, len); /* ... code ... */ }

The SYSCALL_DEFINE6 macro unfolds into a pile of macros, which in turn, set up the infrastructure needed to create a system call with 6 arguments (hence DEFINE6 ). One of the results of this is that inside the kernel, system call function names have sys_ prepended to them.

The system call code for sendto calls sock_sendmsg after arranging the data in a way that the lower layers will be able to handle. In particular, it takes the destination address passed into sendto and arranges it into a structure, let’s take a look: iov.iov_base = buff; iov.iov_len = len; msg.msg_name = NULL; msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_control = NULL; msg.msg_controllen = 0; msg.msg_namelen = 0; if (addr) { err = move_addr_to_kernel(addr, addr_len, &address); if (err < 0) goto out_put; msg.msg_name = (struct sockaddr *)&address; msg.msg_namelen = addr_len; }

This code copies addr , passed in from the user program, into the kernel data structure address , which is then embedded into a struct msghdr structure as msg_name . This is similar to what a userland program would do if it were calling sendmsg instead of sendto . The kernel performs this conversion because both sendto and sendmsg ultimately call down to sock_sendmsg .
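To make that equivalence concrete, here is a user-space sketch (our own helper, not from the post's code) that builds the same struct msghdr by hand and calls sendmsg; the kernel does essentially this arrangement on our behalf whenever we call sendto:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <unistd.h>

/* Equivalent of sendto(fd, buf, len, 0, dest, sizeof(*dest)): the
 * destination goes into msg_name, the payload into a single iovec. */
ssize_t send_one(int fd, const char *buf, size_t len,
                 const struct sockaddr_in *dest)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = (void *)dest;     /* destination address */
    msg.msg_namelen = sizeof(*dest);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return sendmsg(fd, &msg, 0);
}
```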

sock_sendmsg , __sock_sendmsg , and __sock_sendmsg_nosec

sock_sendmsg performs some error checking before calling __sock_sendmsg . __sock_sendmsg does its own error checking before calling __sock_sendmsg_nosec , which passes the data deeper into the socket subsystem: static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size) { struct sock_iocb *si = .... /* other code ... */ return sock->ops->sendmsg(iocb, sock, msg, size); }

As seen in the previous section explaining socket creation, the sendmsg function registered to this socket ops structure is inet_sendmsg .

inet_sendmsg

As you may have guessed from the name, this is a generic function provided by the AF_INET protocol family. This function starts by calling sock_rps_record_flow to record the last CPU that the flow was processed on; this is used by Receive Packet Steering. Next, this function looks up the sendmsg function on the socket’s internal protocol operations structure and calls it: int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size) { struct sock *sk = sock->sk; sock_rps_record_flow(sk); /* We may need to bind the socket. */ if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind && inet_autobind(sk)) return -EAGAIN; return sk->sk_prot->sendmsg(iocb, sk, msg, size); } EXPORT_SYMBOL(inet_sendmsg);

When dealing with UDP, sk->sk_prot->sendmsg above is udp_sendmsg as exported by the UDP protocol layer, via the udp_prot structure we saw earlier. This function call transitions from the generic AF_INET protocol family on to the UDP protocol stack.

UDP protocol layer

udp_sendmsg

The udp_sendmsg function can be found in ./net/ipv4/udp.c. The entire function is quite long, so we’ll examine pieces of it below. Follow the previous link if you’d like to read it in its entirety.

UDP corking

After variable declarations and some basic error checking, one of the first things udp_sendmsg does is check if the socket is “corked”. UDP corking is a feature that allows a user program to request that the kernel accumulate data from multiple calls to send into a single datagram before sending. There are two ways to enable this option in your user program:

1. Use the setsockopt system call and pass UDP_CORK as the socket option.
2. Pass MSG_MORE as one of the flags when calling send , sendto , or sendmsg from your program.

These options are documented in the UDP man page and the send / sendto / sendmsg man page, respectively.
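Both mechanisms can be exercised from user space. Here's a hedged sketch (the helper name is ours) that corks a connected UDP socket, writes twice, and uncorks, so the two writes leave as a single datagram:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>   /* UDP_CORK */
#include <unistd.h>

/* Cork the socket, make two send() calls, then uncork. While corked,
 * udp_sendmsg appends each write to the pending frame instead of
 * transmitting; clearing UDP_CORK pushes the pending data out as one
 * datagram. Returns total bytes queued, or -1 on error. */
ssize_t send_corked(int fd, const struct sockaddr_in *dest)
{
    int on = 1, off = 0;
    ssize_t n = 0, r;

    if (connect(fd, (const struct sockaddr *)dest, sizeof(*dest)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on)) < 0)
        return -1;
    if ((r = send(fd, "hello, ", 7, 0)) < 0)
        return -1;
    n += r;
    if ((r = send(fd, "world", 5, 0)) < 0)
        return -1;
    n += r;
    if (setsockopt(fd, IPPROTO_UDP, UDP_CORK, &off, sizeof(off)) < 0)
        return -1;
    return n;
}
```

A receiver will see one 12-byte datagram rather than two smaller ones.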

The code from udp_sendmsg checks up->pending to determine if the socket is currently corked, and if so, it proceeds directly to appending data. We’ll see how data is appended later. int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len) { /* variables and error checking ... */ fl4 = &inet->cork.fl.u.ip4; if (up->pending) { /* * There are pending frames. * The socket lock must be held while it's corked. */ lock_sock(sk); if (likely(up->pending)) { if (unlikely(up->pending != AF_INET)) { release_sock(sk); return -EINVAL; } goto do_append_data; } release_sock(sk); }

Get the UDP destination address and port

Next, the destination address and port are determined from one of two possible sources:

1. The socket itself has the destination address stored, because the socket was connected at some point.
2. The address is passed in via an auxiliary structure, as we saw in the kernel code for sendto .

Here’s how the kernel deals with this: /* * Get and verify the address. */ if (msg->msg_name) { struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name; if (msg->msg_namelen < sizeof(*usin)) return -EINVAL; if (usin->sin_family != AF_INET) { if (usin->sin_family != AF_UNSPEC) return -EAFNOSUPPORT; } daddr = usin->sin_addr.s_addr; dport = usin->sin_port; if (dport == 0) return -EINVAL; } else { if (sk->sk_state != TCP_ESTABLISHED) return -EDESTADDRREQ; daddr = inet->inet_daddr; dport = inet->inet_dport; /* Open fast path for connected socket. Route will not be used, if at least one option is set. */ connected = 1; }

Yes, that is a TCP_ESTABLISHED in the UDP protocol layer! The socket state names, for better or worse, reuse the TCP state descriptions.

Recall earlier that we saw how the kernel arranges a struct msghdr structure on behalf of the user when the user program calls sendto . The code above shows how the kernel parses that data back out in order to set daddr and dport .

If the udp_sendmsg function was reached by a kernel function which did not arrange a struct msghdr structure, the destination address and port are retrieved from the socket itself, and the socket is marked as “connected.”

In either case daddr and dport will be set to the destination address and port.
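From user space, the first case corresponds to calling connect() on the UDP socket up front. A small sketch of the connected fast path (our own helper name):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* connect() on a UDP socket stores daddr/dport in the socket, so each
 * later send() takes the "connected" path above: no per-call address
 * parsing, and the kernel may reuse a cached route for the flow. */
ssize_t connected_send(int fd, const struct sockaddr_in *dest,
                       const char *buf, size_t len)
{
    if (connect(fd, (const struct sockaddr *)dest, sizeof(*dest)) < 0)
        return -1;
    return send(fd, buf, len, 0);   /* no destination argument needed */
}
```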


Next, the source address, device index, and any timestamping options which were set on the socket (like SOCK_TIMESTAMPING_TX_HARDWARE , SOCK_TIMESTAMPING_TX_SOFTWARE , SOCK_WIFI_STATUS ) are retrieved and stored: ipc.addr = inet->inet_saddr; ipc.oif = sk->sk_bound_dev_if; sock_tx_timestamp(sk, &ipc.tx_flags);

Ancillary messages, via sendmsg

The sendmsg and recvmsg system calls allow the user to set or request ancillary data in addition to sending or receiving packets. User programs can make use of this ancillary data by crafting a struct msghdr with the request embedded in it. Many of the ancillary data types are documented in the man page for IP.

One popular example of ancillary data is IP_PKTINFO . In the case of sendmsg this data type allows the program to set a struct in_pktinfo to be used when sending data. The program can specify the source address to be used on the packet by filling in fields in the struct in_pktinfo structure. This is a useful option if the program is a server program listening on multiple IP addresses. In this case, the server program may want to reply to the client with the same IP address that the client used to contact the server. IP_PKTINFO enables precisely this use case.
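As a hedged sketch of how a program would pin the source address this way (helper name is ours; `_GNU_SOURCE` is needed for struct in_pktinfo on glibc):

```c
#define _GNU_SOURCE        /* struct in_pktinfo on glibc */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <unistd.h>

/* Attach an IP_PKTINFO control message that pins this packet's source
 * address; ipi_spec_dst is the field the kernel reads via ip_cmsg_send.
 * Leaving ipi_ifindex at 0 means "do not override the device index". */
ssize_t send_with_pktinfo(int fd, const struct sockaddr_in *dest,
                          const char *buf, size_t len, struct in_addr src)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    union {
        char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;   /* ensure alignment, see cmsg(3) */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    struct in_pktinfo *pi;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_name = (void *)dest;
    msg.msg_namelen = sizeof(*dest);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = IPPROTO_IP;
    cmsg->cmsg_type = IP_PKTINFO;
    cmsg->cmsg_len = CMSG_LEN(sizeof(*pi));
    pi = (struct in_pktinfo *)CMSG_DATA(cmsg);
    pi->ipi_spec_dst = src;   /* source address for this packet only */

    return sendmsg(fd, &msg, 0);
}
```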

Similarly, the IP_TTL and IP_TOS ancillary messages allow the user to set the IP packet TTL and TOS values on a per-packet basis, when passed with data to sendmsg from the user program. Note that both IP_TTL and IP_TOS may be set at the socket level for all outgoing packets by using setsockopt , instead of on a per-packet basis, if desired. The Linux kernel translates the TOS value specified to a priority using an array. The priority affects how and when a packet is transmitted from a queuing discipline. We’ll see more about what this means later.
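The socket-level form looks like this; a small sketch with our own helper names, using setsockopt rather than the per-packet ancillary message:

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Set the TOS byte for every packet sent on this socket; the kernel
 * maps it to an skb priority that queuing disciplines can act on.
 * 0x10 (IPTOS_LOWDELAY) is one of the classic values. */
int set_socket_tos(int fd, int tos)
{
    return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}

/* Read the current TOS value back, or -1 on error. */
int get_socket_tos(int fd)
{
    int tos = 0;
    socklen_t len = sizeof(tos);

    if (getsockopt(fd, IPPROTO_IP, IP_TOS, &tos, &len) < 0)
        return -1;
    return tos;
}
```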

We can see how the kernel handles ancillary messages for sendmsg on UDP sockets: if (msg->msg_controllen) { err = ip_cmsg_send(sock_net(sk), msg, &ipc, sk->sk_family == AF_INET6); if (err) return err; if (ipc.opt) free = 1; connected = 0; }

The internals of parsing the ancillary messages is handled by ip_cmsg_send from ./net/ipv4/ip_sockglue.c. Note that simply providing any ancillary data marks this socket as not connected.

Setting custom IP options

Next, sendmsg will check to see if the user specified any custom IP options with ancillary messages. If options were set, they will be used. If not, the options already in use by this socket will be used: if (!ipc.opt) { struct ip_options_rcu *inet_opt; rcu_read_lock(); inet_opt = rcu_dereference(inet->inet_opt); if (inet_opt) { memcpy(&opt_copy, inet_opt, sizeof(*inet_opt) + inet_opt->opt.optlen); ipc.opt = &opt_copy.opt; } rcu_read_unlock(); }

Next up, the function checks to see if the source record route (SRR) IP option is set. There are two types of source record routing: loose and strict source record routing. If this option was set, the first hop address is recorded and stored as faddr and the socket is marked as “not connected”. This will be used later: ipc.addr = faddr = daddr; if (ipc.opt && ipc.opt->opt.srr) { if (!daddr) return -EINVAL; faddr = ipc.opt->opt.faddr; connected = 0; }

After the SRR option is handled, the TOS IP flag is retrieved, either from the value the user set via an ancillary message or from the value currently in use by the socket. This is followed by a check to determine if any of the following are true:

1. SO_DONTROUTE was set on the socket (with setsockopt ), or
2. MSG_DONTROUTE was specified as a flag when calling sendto or sendmsg , or
3. is_strictroute was set, indicating that strict source record routing is desired

If so, RTO_ONLINK ( 0x1 ) is added to the tos bit set and the socket is considered not “connected”: tos = get_rttos(&ipc, inet); if (sock_flag(sk, SOCK_LOCALROUTE) || (msg->msg_flags & MSG_DONTROUTE) || (ipc.opt && ipc.opt->opt.is_strictroute)) { tos |= RTO_ONLINK; connected = 0; }

Multicast or unicast?

Next, the code attempts to deal with multicast. This is a bit tricky, as the user could specify an alternate source address or device index of where to send the packet from by sending an ancillary IP_PKTINFO message, as explained earlier.

If the destination address is a multicast address:

1. The device index the packet will be written out of will be set to the multicast device index, and
2. The source address on the packet will be set to the multicast source address,

unless the user has overridden them by sending an IP_PKTINFO ancillary message, as explained earlier. Let’s take a look: if (ipv4_is_multicast(daddr)) { if (!ipc.oif) ipc.oif = inet->mc_index; if (!saddr) saddr = inet->mc_addr; connected = 0; } else if (!ipc.oif) ipc.oif = inet->uc_index;

If the destination address is not a multicast address, the device index is set unless it was overridden by the user with IP_PKTINFO .

Routing

Now it’s time for routing!

The code in the UDP layer that deals with routing begins with a fast path. If the socket is connected try to get the routing structure: if (connected) rt = (struct rtable *)sk_dst_check(sk, 0);

If the socket was not connected, or if it was but the routing helper sk_dst_check decided the route was obsolete, the code moves into the slow path to generate a routing structure. This begins by calling flowi4_init_output to construct a structure describing this UDP flow: if (rt == NULL) { struct net *net = sock_net(sk); fl4 = &fl4_stack; flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE, sk->sk_protocol, inet_sk_flowi_flags(sk)|FLOWI_FLAG_CAN_SLEEP, faddr, saddr, dport, inet->inet_sport);

Once this flow structure has been constructed, the socket and its flow structure are passed along to the security subsystem so that systems like SELinux or SMACK can set a security id value on the flow structure. Next, ip_route_output_flow will call into the IP routing code to generate a routing structure for this flow: security_sk_classify_flow(sk, flowi4_to_flowi(fl4)); rt = ip_route_output_flow(net, fl4, sk);

If a routing structure could not be generated and the error was ENETUNREACH , the OUTNOROUTES statistic counter is incremented. if (IS_ERR(rt)) { err = PTR_ERR(rt); rt = NULL; if (err == -ENETUNREACH) IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES); goto out; }

The location of the file holding these statistics counters, along with the other available counters and their meanings, will be discussed below in the UDP monitoring section.

Next, if the route is for broadcast, but the socket option SOCK_BROADCAST was not set on the socket the code terminates. If the socket is considered “connected” (as described throughout this function), the routing structure is cached on the socket: err = -EACCES; if ((rt->rt_flags & RTCF_BROADCAST) && !sock_flag(sk, SOCK_BROADCAST)) goto out; if (connected) sk_dst_set(sk, dst_clone(&rt->dst));


Prevent the ARP cache from going stale with MSG_CONFIRM

If the user specified the MSG_CONFIRM flag when calling send , sendto , or sendmsg , the UDP protocol layer will now handle that: if (msg->msg_flags&MSG_CONFIRM) goto do_confirm; back_from_confirm:

This flag indicates to the system to confirm that the ARP cache entry is still valid and prevents it from being garbage collected. The dst_confirm function simply sets a flag on the destination cache entry which will be checked much later, when the neighbour cache has been queried and an entry has been found. We’ll see this again later. This feature is commonly used in UDP networking applications to reduce unnecessary ARP traffic. The do_confirm label is found near the end of this function, but it is straightforward: do_confirm: dst_confirm(&rt->dst); if (!(msg->msg_flags&MSG_PROBE) || len) goto back_from_confirm; err = 0; goto out;

This code confirms the cache entry and jumps back to back_from_confirm , if this was not a probe.
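From user space, triggering this path is just a matter of passing the flag. A hedged sketch (the wrapper is ours) of when you might use it:

```c
#define _GNU_SOURCE        /* MSG_CONFIRM on some libcs */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* MSG_CONFIRM tells the neighbour layer the destination is still
 * reachable, so its ARP entry is not re-probed. Only pass it after the
 * application has seen a reply from that exact peer; confirming a dead
 * peer keeps a stale neighbour entry alive. */
ssize_t send_and_confirm(int fd, const struct sockaddr_in *dest,
                         const char *buf, size_t len)
{
    return sendto(fd, buf, len, MSG_CONFIRM,
                  (const struct sockaddr *)dest, sizeof(*dest));
}
```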

Once the do_confirm code jumps back to back_from_confirm (or no jump happened to do_confirm in the first place), the code will attempt to deal with both the UDP cork and uncorked cases next.

Fast path for uncorked UDP sockets: Prepare data for transmit

If UDP corking is not requested, the data can be packed into a struct sk_buff and passed on to udp_send_skb to move down the stack and closer to the IP protocol layer. This is done by calling ip_make_skb . Note that the routing structure generated earlier by calling ip_route_output_flow is passed in as well. It will be affixed to the skb and used later in the IP protocol layer. /* Lockless fast path for the non-corking case. */ if (!corkreq) { skb = ip_make_skb(sk, fl4, getfrag, msg->msg_iov, ulen, sizeof(struct udphdr), &ipc, &rt, msg->msg_flags); err = PTR_ERR(skb); if (!IS_ERR_OR_NULL(skb)) err = udp_send_skb(skb, fl4); goto out; }

The ip_make_skb function will attempt to construct an skb taking into consideration a wide range of things, like:

The MTU.

UDP corking (if enabled).

UDP Fragmentation Offloading (UFO).

Fragmentation, if UFO is unsupported and the size of the data to transmit is larger than the MTU.

Most network device drivers do not support UFO because the network hardware itself does not support this feature. Let’s take a look through this code, keeping in mind that corking is disabled. We’ll look at the corking enabled path next.

ip_make_skb

The ip_make_skb function can be found in ./net/ipv4/ip_output.c. This function is a bit tricky. The lower level code that ip_make_skb needs to use in order to build an skb requires a corking structure and queue where the skb will be queued to be passed in. In the case where the socket is not corked, a faux corking structure and empty queue are passed in as dummies.

Let’s take a look at how the faux corking structure and queue are setup: struct sk_buff *ip_make_skb(struct sock *sk, /* more args */) { struct inet_cork cork; struct sk_buff_head queue; int err; if (flags & MSG_PROBE) return NULL; __skb_queue_head_init(&queue); cork.flags = 0; cork.addr = 0; cork.opt = NULL; err = ip_setup_cork(sk, &cork, /* more args */); if (err) return ERR_PTR(err);

As seen above, both the corking structure ( cork ) and the queue ( queue ) are stack-allocated; neither is needed by the time ip_make_skb has completed. The faux corking structure is set up with a call to ip_setup_cork , which allocates memory and initializes the structure. Next, __ip_append_data is called and the queue and corking structure are passed in: err = __ip_append_data(sk, fl4, &queue, &cork, &current->task_frag, getfrag, from, length, transhdrlen, flags);

We’ll see how this function works later, as it is used whether the socket is corked or not. For now, all we need to know is that __ip_append_data will create an skb, append data to it, and add that skb to the queue passed in. If appending the data failed, __ip_flush_pending_frames is called to drop the data on the floor and the error code is passed back upward: if (err) { __ip_flush_pending_frames(sk, &queue, &cork); return ERR_PTR(err); }

Finally, if no error occurred, __ip_make_skb will dequeue the queued skb, add the IP options, and return an skb that is ready to be passed on to lower layers for sending: return __ip_make_skb(sk, fl4, &queue, &cork);

Transmit the data!

If no errors occurred, the skb is handed to udp_send_skb which will pass the skb to the next layer of the networking stack, the IP protocol stack: err = PTR_ERR(skb); if (!IS_ERR_OR_NULL(skb)) err = udp_send_skb(skb, fl4); goto out;

If there was an error, it will be accounted later. See the “Error Accounting” section below the UDP corking case for more information.

Slow path for corked UDP sockets with no preexisting corked data

If UDP corking is being used, but no preexisting data is corked, the slow path commences:

1. Lock the socket.
2. Check for an application bug: a corked socket that is being “re-corked.”
3. Prepare the flow structure for this UDP flow for corking.
4. Append the data to be sent to the existing data.

You can see this in the next piece of code, continuing down udp_sendmsg : lock_sock(sk); if (unlikely(up->pending)) { /* The socket is already corked while preparing it. */ /* ... which is an evident application bug. --ANK */ release_sock(sk); LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("cork app bug 2\n")); err = -EINVAL; goto out; } /* * Now cork the socket to pend data. */ fl4 = &inet->cork.fl.u.ip4; fl4->daddr = daddr; fl4->saddr = saddr; fl4->fl4_dport = dport; fl4->fl4_sport = inet->inet_sport; up->pending = AF_INET; do_append_data: up->len += ulen; err = ip_append_data(sk, fl4, getfrag, msg->msg_iov, ulen, sizeof(struct udphdr), &ipc, &rt, corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);

ip_append_data

ip_append_data is a small wrapper function which does two major things prior to calling down to __ip_append_data :

1. Checks if the MSG_PROBE flag was passed in from the user. This flag indicates that the user does not actually want to send data; the path should only be probed (for example, to determine the PMTU).
2. Checks if the socket’s send queue is empty. If so, no corked data is pending, and ip_setup_cork is called to set up corking.

Once the above conditions are dealt with the __ip_append_data function is called which contains the bulk of the logic for processing data into packets.


__ip_append_data

This function is called either from ip_append_data , if the socket is corked, or from ip_make_skb , if the socket is not corked. In either case, it will either allocate a new buffer to store the data passed in or append the data to existing data.

The way this works centers around the socket’s send queue. Existing data waiting to be sent (for example, if the socket is corked) will have an entry in the queue where additional data can be appended.

This function is complex; it performs several rounds of calculations to determine how to construct the skb that will be passed to the lower level networking layers. Examining the buffer allocation process in detail is not strictly necessary for understanding how network data is transmitted.

The important highlights of this function include:

1. Handling UDP Fragmentation Offloading (UFO), if supported by the hardware. The vast majority of network hardware does not support UFO. If your network card’s driver does support it, it will set the feature flag NETIF_F_UFO .
2. Handling network cards that support scatter/gather IO. Many cards support this, and it is advertised with the NETIF_F_SG feature flag. The availability of this feature indicates that the network card can transmit a packet whose data has been split amongst a set of buffers; the kernel does not need to spend time coalescing multiple buffers into a single buffer. Avoiding this additional copy is desirable, and most network cards support this.
3. Tracking the size of the send queue via calls to sock_wmalloc . When a new skb is allocated, its size is charged to the socket which owns it and counted against the allocated bytes of the socket’s send queue. If there is not sufficient space in the send queue, the skb is not allocated; an error is returned and tracked. We’ll see how to set the socket send queue size in the tuning section below.
4. Incrementing error statistics. Any error in this function increments the “discard” counter. We’ll see how to read this value in the monitoring section below.
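The send queue limit charged by sock_wmalloc is the one controlled by SO_SNDBUF. A hedged sketch of adjusting it per-socket (our helper name; the doubling behavior is documented in socket(7)):

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Request a send queue size with SO_SNDBUF and return the effective
 * value. The kernel doubles the requested amount to account for
 * bookkeeping overhead and clamps it to net.core.wmem_max, so read the
 * value back rather than assuming the request was honored verbatim. */
int set_sndbuf(int fd, int bytes)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) < 0)
        return -1;
    return actual;
}
```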

Upon successful completion of this function, 0 is returned and the data to be transmitted will be assembled into an skb that is appropriate for the network device and is waiting on the send queue.

In the uncorked case, the queue holding the skb is passed to __ip_make_skb described above where it is dequeued and prepared to be sent to the lower layers via udp_send_skb .

In the corked case, the return value of __ip_append_data is passed upward. The data sits on the send queue until udp_sendmsg determines it is time to call udp_push_pending_frames which will finalize the skb and call udp_send_skb .

Flushing corked sockets

Now, udp_sendmsg will move on to check the return value ( err below) from __ip_append_data : if (err) udp_flush_pending_frames(sk); else if (!corkreq) err = udp_push_pending_frames(sk); else if (unlikely(skb_queue_empty(&sk->sk_write_queue))) up->pending = 0; release_sock(sk);

Let’s take a look at each of these cases:

1. If there is an error ( err is non-zero), udp_flush_pending_frames is called, which cancels corking and drops all data from the socket’s send queue.
2. If this data was sent without MSG_MORE specified, udp_push_pending_frames is called, which will attempt to deliver the data to the lower networking layers.
3. If the send queue is empty, the socket is marked as no longer corking.

If the append operation completed successfully and there is more data to cork coming, the code continues by cleaning up and returning the length of the data appended: ip_rt_put(rt); if (free) kfree(ipc.opt); if (!err) return len;

That is how the kernel deals with corked UDP sockets.

Error accounting

If:

1. The non-corking fast path failed to make an skb, or udp_send_skb reported an error, or
2. ip_append_data failed to append data to a corked UDP socket, or
3. udp_push_pending_frames returned an error it received from udp_send_skb when trying to transmit a corked skb,

the SNDBUFERRORS statistic will be incremented only if the error received was ENOBUFS (no kernel memory available) or the socket has SOCK_NOSPACE set (the send queue is full): /* * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting * ENOBUFS might not be good (it's not tunable per se), but otherwise * we don't have a good statistic (IpOutDiscards but it can be too many * things). We could add another new stat but at least for now that * seems like overkill. */ if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) { UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_SNDBUFERRORS, is_udplite); } return err;

We’ll see how to read these counters in the monitoring section below.

udp_send_skb

The udp_send_skb function is how udp_sendmsg will eventually push an skb down to the next layer of the networking stack, in this case the IP protocol layer. This function does a few important things:

Adds a UDP header to the skb.

Deals with checksums: software checksums, hardware checksums, or no checksum (if disabled).

Attempts to send the skb to the IP protocol layer by calling ip_send_skb .

Increments statistics counters for successful or failed transmissions.

Let’s take a look. First, a UDP header is created: static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4) { /* useful variables ... */ /* * Create a UDP header */ uh = udp_hdr(skb); uh->source = inet->inet_sport; uh->dest = fl4->fl4_dport; uh->len = htons(len); uh->check = 0;

Next, checksumming is handled. There are a few cases:

UDP-Lite checksums are handled first.

Next, if the socket is set to not generate checksums at all (via setsockopt with SO_NO_CHECK ), it will be marked as such.

Next, if the hardware supports UDP checksums, udp4_hwcsum will be called to set that up. Note that the kernel will generate checksums in software if the packet is fragmented. You can see this in the source for udp4_hwcsum .

Lastly, a software checksum is generated with a call to udp_csum .

if (is_udplite) /* UDP-Lite */ csum = udplite_csum(skb); else if (sk->sk_no_check == UDP_CSUM_NOXMIT) { /* UDP csum disabled */ skb->ip_summed = CHECKSUM_NONE; goto send; } else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */ udp4_hwcsum(skb, fl4->saddr, fl4->daddr); goto send; } else csum = udp_csum(skb);

Next, the pseudo header is added: uh->check = csum_tcpudp_magic(fl4->saddr, fl4->daddr, len, sk->sk_protocol, csum); if (uh->check == 0) uh->check = CSUM_MANGLED_0;

If the checksum is 0, the equivalent in one’s complement is set as the checksum, per RFC 768. Finally, the skb is passed to the IP protocol stack and statistics are incremented: send: err = ip_send_skb(sock_net(sk), skb); if (err) { if (err == -ENOBUFS && !inet->recverr) { UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_SNDBUFERRORS, is_udplite); err = 0; } } else UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_OUTDATAGRAMS, is_udplite); return err;
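The checksum arithmetic above can be sketched in userspace. This illustrative Python version (all addresses and ports are made up for the example) runs the one’s complement sum over the pseudo header (addresses, protocol, UDP length) plus the UDP header and payload, and maps a result of 0 to 0xFFFF, mirroring CSUM_MANGLED_0, since 0 on the wire means “no checksum”:

```python
import struct

def ones_complement_sum(data):
    """16-bit one's complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total

def udp_checksum(saddr, daddr, udp_segment):
    """RFC 768 checksum over the IPv4 pseudo header plus the UDP segment."""
    pseudo = saddr + daddr + struct.pack("!BBH", 0, 17, len(udp_segment))
    csum = (~ones_complement_sum(pseudo + udp_segment)) & 0xFFFF
    return csum or 0xFFFF  # CSUM_MANGLED_0: never transmit 0

saddr, daddr = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
segment = struct.pack("!HHHH", 1234, 5678, 12, 0) + b"ping"  # check field 0
print(hex(udp_checksum(saddr, daddr, segment)))  # 0xf202
```

A receiver verifies by summing the same data with the checksum filled in; the one’s complement sum comes out to 0xFFFF when the datagram is intact.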

If ip_send_skb completes successfully, the OUTDATAGRAMS statistic is incremented. If the IP protocol layer reports an error, SNDBUFERRORS is incremented, but only if the error is ENOBUFS (lack of kernel memory) and there is no error queue enabled.

Before moving on to the IP protocol layer, let’s take a look at how to monitor and tune the UDP protocol layer in the Linux kernel.

Monitoring: UDP protocol layer statistics


Two very useful files for getting UDP protocol statistics are:

/proc/net/snmp

/proc/net/udp

/proc/net/snmp

Monitor detailed UDP protocol statistics by reading /proc/net/snmp . $ cat /proc/net/snmp | grep Udp\: Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors Udp: 16314 0 0 17161 0 0

In order to understand precisely where these statistics are incremented, you will need to carefully read the kernel source. There are a few cases where some errors are counted in more than one statistic.

InDatagrams : Incremented when recvmsg was used by a userland program to read a datagram. Also incremented when a UDP packet is encapsulated and sent back for processing.

NoPorts : Incremented when UDP packets arrive destined for a port where no program is listening.

InErrors : Incremented in several cases: no memory in the receive queue, when a bad checksum is seen, and if sk_add_backlog fails to add the datagram.

OutDatagrams : Incremented when a UDP packet is handed down without error to the IP protocol layer to be sent.

RcvbufErrors : Incremented when sock_queue_rcv_skb reports that no memory is available; this happens if sk->sk_rmem_alloc is greater than or equal to sk->sk_rcvbuf .

SndbufErrors : Incremented if the IP protocol layer reported an error when trying to send the packet and no error queue has been setup. Also incremented if no send queue space or kernel memory are available.

InCsumErrors : Incremented when a UDP checksum failure is detected. Note that in all cases I could find, InCsumErrors is incremented at the same time as InErrors . Thus, InErrors - InCsumErrors should yield the count of memory related errors on the receive side.
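These counters are easy to collect programmatically. A small sketch that pairs the header and value rows for one protocol, applied here to the sample Udp output shown earlier; on a Linux box you can feed it open("/proc/net/snmp").read() instead:

```python
def parse_snmp_proto(text, proto):
    """Zip one protocol's name row and value row from /proc/net/snmp."""
    names, values = None, None
    for line in text.splitlines():
        if not line.startswith(proto + ":"):
            continue
        fields = line.split()[1:]
        if names is None:
            names = fields                      # first row: field names
        else:
            values = [int(f) for f in fields]   # second row: counters
    return dict(zip(names or [], values or []))

sample = ("Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
          "Udp: 16314 0 0 17161 0 0\n")
print(parse_snmp_proto(sample, "Udp"))
```

Polling this in a loop and diffing SndbufErrors over time is a simple way to notice send-side pressure.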

Note that some errors discovered by the UDP protocol layer are reported in the statistics files for other protocol layers. One example of this: routing errors. A routing error discovered by udp_sendmsg will cause an increment to the IP protocol layer’s OutNoRoutes statistic.

/proc/net/udp

Monitor UDP socket statistics by reading /proc/net/udp $ cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000 104 0 7518 2 0000000000000000 0 558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7408 2 0000000000000000 0 588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7511 2 0000000000000000 0 769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7673 2 0000000000000000 0 812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7407 2 0000000000000000 0

The first line describes each of the fields in the lines following:

sl : Kernel hash slot for the socket.

local_address : Hexadecimal local address of the socket and port number, separated by : .

rem_address : Hexadecimal remote address of the socket and port number, separated by : .

st : The state of the socket. Oddly enough, the UDP protocol layer seems to use some TCP socket states. In the example above, 7 is TCP_CLOSE .

tx_queue : The amount of memory allocated in the kernel for outgoing UDP datagrams.

rx_queue : The amount of memory allocated in the kernel for incoming UDP datagrams.

tr , tm->when , retrnsmt : These fields are unused by the UDP protocol layer.

uid : The effective user id of the user who created this socket.

timeout : Unused by the UDP protocol layer.

inode : The inode number corresponding to this socket. You can use this to help you determine which user process has this socket open. Check /proc/[pid]/fd , which will contain symlinks to socket[:inode] .

ref : The current reference count for the socket.

pointer : The memory address in the kernel of the struct sock .

drops : The number of datagram drops associated with this socket. Note that this does not include any drops related to sending datagrams (on corked UDP sockets or otherwise); this is only incremented in receive paths as of the kernel version examined by this blog post.

The code which outputs this can be found in net/ipv4/udp.c .
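Decoding those hex fields by hand is error prone. The kernel prints the IPv4 address as a host-byte-order hex dump of the on-the-wire __be32, so on the usual little-endian machines the bytes read back-to-front, while the port is plain hex. A small helper:

```python
import socket
import struct

def decode_endpoint(field):
    """Turn '0100007F:038F' from /proc/net/udp into ('127.0.0.1', 911).

    Assumes the file was produced on a little-endian host, where the
    address hex is the reversed byte order of the dotted quad."""
    addr_hex, port_hex = field.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)

print(decode_endpoint("0100007F:038F"))  # ('127.0.0.1', 911)
```

So line 588 in the sample above is a socket bound to 127.0.0.1 on port 911.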

Tuning: Socket send queue memory

The maximum size of the send queue (also called the write queue) can be adjusted by setting the net.core.wmem_max sysctl.

Increase the maximum send buffer size by setting a sysctl . $ sudo sysctl -w net.core.wmem_max=8388608

sk->sk_write_queue starts at the net.core.wmem_default value, which can also be adjusted by setting a sysctl, like so:

Adjust the default initial send buffer size by setting a sysctl . $ sudo sysctl -w net.core.wmem_default=8388608

You can also set the sk->sk_write_queue size by calling setsockopt from your application and passing SO_SNDBUF . The maximum you can set with setsockopt is net.core.wmem_max .

However, you can override the net.core.wmem_max limit by calling setsockopt and passing SO_SNDBUFFORCE , but the user running the application needs the CAP_NET_ADMIN capability.

sk->sk_wmem_alloc is incremented each time an skb is allocated by calls to __ip_append_data . As we’ll see, UDP datagrams are transmitted quickly and typically don’t spend much time in the send queue.
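From the application side, sizing the send buffer is a one-liner. A sketch: the requested value is capped at net.core.wmem_max (unless SO_SNDBUFFORCE is used by a process with CAP_NET_ADMIN), and note that Linux doubles the requested value internally to leave room for bookkeeping overhead, so getsockopt typically reports twice what you asked for.

```python
import socket

def set_send_buffer(sock, nbytes):
    """Request a send buffer size and return the size the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    granted = set_send_buffer(s, 16384)
    print(granted)  # on Linux, typically 32768: the request, doubled
    s.close()
```

Comparing the granted value against your request is a quick way to discover that wmem_max is clamping you.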

IP protocol layer

The UDP protocol layer hands skbs down to the IP protocol by simply calling ip_send_skb , so let’s start there and map out the IP protocol layer!

ip_send_skb

The ip_send_skb function is found in ./net/ipv4/ip_output.c and is very short. It simply calls down to ip_local_out and bumps an error statistic if ip_local_out returns an error of some sort. Let’s take a look: int ip_send_skb(struct net *net, struct sk_buff *skb) { int err; err = ip_local_out(skb); if (err) { if (err > 0) err = net_xmit_errno(err); if (err) IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS); } return err; }

As seen above, ip_local_out is called and the return value is dealt with after that. The call to net_xmit_errno helps to “translate” any errors from lower levels into an error that is understood by the IP and UDP protocol layers. If any error happens, the IP protocol statistic “OutDiscards” is incremented. We’ll see later which files to read to obtain this statistic. For now, let’s continue down the rabbit hole and see where ip_local_out takes us.

ip_local_out and __ip_local_out

Luckily for us, both ip_local_out and __ip_local_out are simple. ip_local_out simply calls down to __ip_local_out and based on the return value, will call into the routing layer to output the packet: int ip_local_out(struct sk_buff *skb) { int err; err = __ip_local_out(skb); if (likely(err == 1)) err = dst_output(skb); return err; }

We can see from the source to __ip_local_out that the function does two important things first:

Sets the length of the IP packet.

Calls ip_send_check to compute the checksum to be written in the IP packet header. The ip_send_check function will call a function named ip_fast_csum to compute the checksum. On the x86 and x86_64 architectures, this function is implemented in assembly. You can read the 64bit implementation here and the 32bit implementation here.
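The computation ip_send_check delegates to ip_fast_csum can be sketched in a few lines: the 16-bit one’s complement of the one’s complement sum over the header, with the checksum field itself zeroed. The example header below is the widely used test vector whose correct checksum is 0xb1e6.

```python
import struct

def ip_header_checksum(header):
    """One's complement checksum over an IPv4 header (checksum field zeroed)."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return (~total) & 0xFFFF

# 20-byte IPv4 header with the checksum field (bytes 10-11) zeroed:
hdr = bytes.fromhex("4500003c1c464000" "40060000" "ac100a63" "ac100a0c")
print(hex(ip_header_checksum(hdr)))  # 0xb1e6
```

Running the same function over a header with the checksum filled in yields 0, which is how receivers verify it.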

Next, the IP protocol layer will call down into netfilter by calling nf_hook . The return value of the nf_hook function will be passed back up to ip_local_out . If nf_hook returns 1 , this indicates that the packet was allowed to pass and that the caller should pass it along itself. As we saw above, this is precisely what happens: ip_local_out checks for the return value of 1 and passes the packet on by calling dst_output itself. Let’s take a look at the code for __ip_local_out : int __ip_local_out(struct sk_buff *skb) { struct iphdr *iph = ip_hdr(skb); iph->tot_len = htons(skb->len); ip_send_check(iph); return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL, skb_dst(skb)->dev, dst_output); }

netfilter and nf_hook

In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into netfilter, iptables, and conntrack. You can dive into the source for netfilter by starting here and here.

The short version is that nf_hook is a wrapper which calls nf_hook_thresh , which first checks if any filters are installed for the specified protocol family and hook type ( NFPROTO_IPV4 and NF_INET_LOCAL_OUT in this case, respectively) and, if not, returns execution back to the IP protocol layer, avoiding going deeper into netfilter and anything that hooks in below it, like iptables and conntrack.

Keep in mind: if you have numerous or very complex netfilter or iptables rules, those rules will be executed in the CPU context of the user process which initiated the original sendmsg call. If you have CPU pinning set up to restrict execution of this process to a particular CPU (or set of CPUs), be aware that the CPU will spend system time processing outbound iptables rules. Depending on your system’s workload, you may want to carefully pin processes to CPUs or reduce the complexity of your ruleset if you measure a performance regression here.

For the purposes of our discussion, let’s assume nf_hook returns 1 indicating that the caller (in this case, the IP protocol layer) should pass the packet along itself.

Destination cache


The dst code implements the protocol independent destination cache in the Linux kernel. To understand how dst entries are setup to proceed with the sending of UDP datagrams, we need to briefly examine how dst entries and routes are generated. The destination cache, routing, and neighbour subsystems can all be examined in extreme detail on their own. For our purposes, we can take a quick look to see how this all fits together.

The code we’ve seen above calls dst_output(skb) . This function simply looks up the dst entry attached to the skb and calls the output function. Let’s take a look: /* Output packet to network from transport. */ static inline int dst_output(struct sk_buff *skb) { return skb_dst(skb)->output(skb); }

Seems simple enough, but how does that output function get attached to the dst entry in the first place?

It’s important to understand that destination cache entries are added in many different ways. One way we’ve seen so far in the code path we’ve been following is with the call to ip_route_output_flow from udp_sendmsg . The ip_route_output_flow function calls __ip_route_output_key which calls __mkroute_output . The __mkroute_output function creates the route and the destination cache entry. When it does so, it determines which of the output functions is appropriate for this destination. Most of the time, this function is ip_output .

ip_output

So, dst_output executes the output function, which in the UDP IPv4 case is ip_output . The ip_output function is straightforward: int ip_output(struct sk_buff *skb) { struct net_device *dev = skb_dst(skb)->dev; IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len); skb->dev = dev; skb->protocol = htons(ETH_P_IP); return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev, ip_finish_output, !(IPCB(skb)->flags & IPSKB_REROUTED)); }

First, the IPSTATS_MIB_OUT statistics counters are updated. The IP_UPD_PO_STATS macro will increment both the number of bytes and the number of packets. We’ll see in a later section how to obtain the IP protocol layer statistics and what each of them means. Next, the device for this skb to be transmitted on is set, as is the protocol.

Finally, control is passed off to netfilter with a call to NF_HOOK_COND . Looking at the function prototype for NF_HOOK_COND will help make the explanation of how it works a bit clearer. From ./include/linux/netfilter.h: static inline int NF_HOOK_COND(uint8_t pf, unsigned int hook, struct sk_buff *skb, struct net_device *in, struct net_device *out, int (*okfn)(struct sk_buff *), bool cond)

NF_HOOK_COND works by checking the conditional, which is passed in. In this case, that conditional is !(IPCB(skb)->flags & IPSKB_REROUTED) . If this conditional is true, then the skb will be passed on to netfilter. If netfilter allows the packet to pass, the okfn is called. In this case, the okfn is ip_finish_output .

ip_finish_output

The ip_finish_output function is also short and clear. Let’s take a look: static int ip_finish_output(struct sk_buff *skb) { #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) /* Policy lookup after SNAT yielded a new policy */ if (skb_dst(skb)->xfrm != NULL) { IPCB(skb)->flags |= IPSKB_REROUTED; return dst_output(skb); } #endif if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); }

If netfilter and packet transformation are enabled in this kernel, the skb ’s flags are updated and it is sent back through dst_output . The two more common cases are:

If the packet’s length is larger than the MTU and the packet’s segmentation will not be offloaded to the device, ip_fragment is called to fragment the packet prior to transmission.

Otherwise, the packet is passed straight through to ip_finish_output2 .

Let’s take a short detour to talk about Path MTU Discovery before continuing our way through the kernel.

Path MTU Discovery

Linux provides a feature I’ve avoided mentioning until now: Path MTU Discovery. This feature allows the kernel to automatically determine the largest MTU for a particular route. Determining this value and sending packets that are less than or equal to the MTU for the route means that IP fragmentation can be avoided. This is the preferred setting because fragmenting packets consumes system resources and is seemingly easy to avoid: simply send small enough packets and fragmentation is unnecessary.

You can adjust the Path MTU Discovery settings on a per-socket basis by calling setsockopt in your application with the SOL_IP level and IP_MTU_DISCOVER optname. The optval can be one of the several values described in the IP protocol man page. The value you’ll likely want to set is: IP_PMTUDISC_DO which means “Always do Path MTU Discovery.” More advanced network applications or diagnostic tools may choose to implement RFC 4821 themselves to determine the PMTU at application start for a particular route or routes. In this case, you can use the IP_PMTUDISC_PROBE option which tells the kernel to set the “Don’t Fragment” bit, but allows you to send data larger than the PMTU.

Your application can retrieve the PMTU by calling getsockopt , with the SOL_IP and IP_MTU optname. You can use this to help guide the size of the UDP datagrams your application will construct prior to attempting transmissions.

If you have enabled PMTU discovery, any attempt to send UDP data larger than the PMTU will result in the application receiving the error code EMSGSIZE . The application can then retry, but with less data.
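The getsockopt / setsockopt calls described above can be sketched as follows, assuming Linux. The numeric constants are the Linux values from the kernel headers, since Python does not always expose them by name; connecting the UDP socket sends no packets, it just binds the socket to a route so IP_MTU has something to report.

```python
import errno
import socket

# Linux constants, with a fallback if the socket module doesn't expose them:
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)

def path_mtu(dest):
    """Enable strict PMTU discovery and return the kernel's PMTU for dest."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect(dest)  # binds the socket to a route; no packets sent
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()

if __name__ == "__main__":
    print(path_mtu(("127.0.0.1", 9)))
    # An oversized datagram is refused with EMSGSIZE rather than fragmented:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("127.0.0.1", 9))
    try:
        s.send(b"\x00" * 70000)  # exceeds even the maximum UDP payload
    except OSError as e:
        print(errno.errorcode[e.errno])
    s.close()
```

An application sizing its datagrams at startup would call path_mtu once per destination and subtract the IP and UDP header overhead.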

Enabling PMTU discovery is strongly encouraged, so I’ll avoid describing the IP fragmentation code path in detail. When we take a look at the IP protocol layer statistics, I’ll explain all the statistics including the fragmentation related statistics. Many of them are incremented in ip_fragment . In both the fragment and non-fragment cases, ip_finish_output2 is called, so let’s continue there.

ip_finish_output2

The ip_finish_output2 is called after IP fragmentation and also directly from ip_finish_output . This function handles bumping various statistics counters prior to handing the packet down to the neighbour cache. Let’s see how this works: static inline int ip_finish_output2(struct sk_buff *skb) { /* variable declarations */ if (rt->rt_type == RTN_MULTICAST) { IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTMCAST, skb->len); } else if (rt->rt_type == RTN_BROADCAST) IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTBCAST, skb->len); /* Be paranoid, rather than too clever. */ if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { struct sk_buff *skb2; skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev)); if (skb2 == NULL) { kfree_skb(skb); return -ENOMEM; } if (skb->sk) skb_set_owner_w(skb2, skb->sk); consume_skb(skb); skb = skb2; }

If the routing structure associated with this packet is of type multicast, both the OutMcastPkts and OutMcastOctets counters are bumped by using the IP_UPD_PO_STATS macro. Otherwise, if the route type is broadcast the OutBcastPkts and OutBcastOctets counters are bumped.

Next, a check is performed to ensure that the skb structure has enough room for any link layer headers that need to be added. If not, additional room is allocated with a call to skb_realloc_headroom and the cost of the new skb is charged to the associated socket. rcu_read_lock_bh(); nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr); neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

Continuing on, we can see that the next hop is computed by querying the routing layer followed by a lookup against the neighbour cache. If the neighbour is not found, one is created by calling __neigh_create . This could be the case, for example, the first time data is sent to another host. Note that this function is called with arp_tbl (defined in ./net/ipv4/arp.c) to create the neighbour entry in the ARP table. Other systems (like IPv6 or DECnet) maintain their own ARP tables and would pass a different structure into __neigh_create . This post does not aim to cover the neighbour cache in full detail, but it is worth nothing that if the neighbour has to be created it is possible that this creation can cause the cache to grow. This post will cover some more details about the neighbour cache in the sections below. At any rate, the neighbour cache exports its own set of statistics so that this growth can be measured. See the monitoring sections below for more information. if (!IS_ERR(neigh)) { int res = dst_neigh_output(dst, neigh, skb); rcu_read_unlock_bh(); return res; } rcu_read_unlock_bh(); net_dbg_ratelimited("%s: No header cache and no neighbour!

", __func__); kfree_skb(skb); return -EINVAL; }

Finally, if no error is returned, dst_neigh_output is called to pass the skb along on its journey to be output. Otherwise, the skb is freed and EINVAL is returned. An error here will ripple back and cause OutDiscards to be incremented way back up in ip_send_skb . Let’s continue on in dst_neigh_output and continue approaching the Linux kernel’s netdevice subsystem.

dst_neigh_output


The dst_neigh_output function does two important things for us. First, recall from earlier in this blog post we saw that if a user specified MSG_CONFIRM when calling sendmsg , a flag is flipped to indicate that the destination cache entry for the remote host is still valid and should not be garbage collected. That check happens here and the confirmed field on the neighbour is set to the current jiffies count. static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n, struct sk_buff *skb) { const struct hh_cache *hh; if (dst->pending_confirm) { unsigned long now = jiffies; dst->pending_confirm = 0; /* avoid dirtying neighbour */ if (n->confirmed != now) n->confirmed = now; }

Second, the neighbour’s state is checked and the appropriate output function is called. Let’s take a look at the conditional and try to understand what’s going on: hh = &n->hh; if ((n->nud_state & NUD_CONNECTED) && hh->hh_len) return neigh_hh_output(hh, skb); else return n->output(n, skb); }

If a neighbour is considered NUD_CONNECTED , meaning it is one or more of:

NUD_PERMANENT : A static route.

NUD_NOARP : Does not require an ARP request (for example, the destination is a multicast or broadcast address, or a loopback device).

NUD_REACHABLE : The neighbour is “reachable.” A destination is marked as reachable whenever an ARP request for it is successfully processed.

and the “hardware header” ( hh ) is cached (because we’ve sent data before and have previously generated it), neigh_hh_output is called. Otherwise, the output function is called. Both code paths end with dev_queue_xmit , which passes the skb down to the Linux net device subsystem where it will be processed a bit more before hitting the device driver layer. Let’s follow both the neigh_hh_output and n->output code paths until we reach dev_queue_xmit .
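The dispatch above is just a flags-and-length check, which a toy model (not kernel code) makes easy to see. The flag values mirror include/net/neighbour.h:

```python
# Toy model of the conditional in dst_neigh_output: the cached-header fast
# path is used only when the neighbour is in one of the NUD_CONNECTED states
# AND a hardware header has been cached (hh_len non-zero).
NUD_REACHABLE, NUD_STALE, NUD_NOARP, NUD_PERMANENT = 0x02, 0x04, 0x40, 0x80
NUD_CONNECTED = NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE

def choose_output_path(nud_state, hh_len):
    if (nud_state & NUD_CONNECTED) and hh_len:
        return "neigh_hh_output"  # copy cached header, then dev_queue_xmit
    return "n->output"            # e.g. neigh_resolve_output

print(choose_output_path(NUD_REACHABLE, 14))  # neigh_hh_output
print(choose_output_path(NUD_STALE, 14))      # n->output
```

A reachable neighbour with a cached 14-byte ethernet header takes the fast path; a stale one falls back to the resolver even though its header is cached.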

neigh_hh_output

If the destination is NUD_CONNECTED and the hardware header has been cached, neigh_hh_output will be called, which does a small bit of processing before handing the skb over to dev_queue_xmit . Let’s take a look, from ./include/net/neighbour.h: static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb) { unsigned int seq; int hh_len; do { seq = read_seqbegin(&hh->hh_lock); hh_len = hh->hh_len; if (likely(hh_len <= HH_DATA_MOD)) { /* this is inlined by gcc */ memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD); } else { int hh_alen = HH_DATA_ALIGN(hh_len); memcpy(skb->data - hh_alen, hh->hh_data, hh_alen); } } while (read_seqretry(&hh->hh_lock, seq)); skb_push(skb, hh_len); return dev_queue_xmit(skb); }

This function is a bit tricky to understand, partially due to the locking primitive used to synchronize reading/writing of the cached hardware header. This code uses something called a seqlock. You can think of the do { } while() loop above as a simple retry mechanism which will attempt to perform the operations in the loop until they can be performed successfully.

The loop body determines whether the hardware header’s length needs to be aligned prior to being copied. This is required because some hardware headers (like the IEEE 802.11 header) are larger than HH_DATA_MOD (16 bytes).

Once the data is copied to the skb and the skb’s internal pointers tracking the data are updated with skb_push , the skb is passed to dev_queue_xmit to enter the Linux net device subsystem.

n->output

If the destination is not NUD_CONNECTED or the hardware header has not been cached, the code proceeds down the n->output path. What is attached to the output function pointer on the neighbour structure? Well, it depends. To understand how this is set up, we’ll need to understand a bit more about how the neighbour cache works.

A struct neighbour contains several important fields: the nud_state field we saw above, an output function, and an ops structure. Recall how earlier we saw that __neigh_create is called from ip_finish_output2 if no existing entry was found in the cache. When __neigh_create is called, a neighbour is allocated with its output function initially set to neigh_blackhole . As the __neigh_create code progresses, it will adjust the value of output to point to the appropriate output function based on the state of the neighbour.

For example, neigh_connect will be used to set the output pointer to neigh->ops->connected_output when the code determines the neighbour to be connected. Alternatively, neigh_suspect will be used to set the output pointer to neigh->ops->output when the code suspects that the neighbour may be down (for example, if it has been more than /proc/sys/net/ipv4/neigh/default/delay_first_probe_time seconds since a probe was sent).

In other words: neigh->output is set to another pointer, either neigh->ops->connected_output or neigh->ops->output , depending on its state. Where does neigh->ops come from?

After the neighbour is allocated, arp_constructor (from ./net/ipv4/arp.c) is called to set some of the fields of the struct neighbour . In particular, this function checks the device associated with this neighbour and if the device exposes a header_ops structure that contains a cache function (ethernet devices do) neigh->ops is set to the following structure defined in ./net/ipv4/arp.c: static const struct neigh_ops arp_hh_ops = { .family = AF_INET, .solicit = arp_solicit, .error_report = arp_error_report, .output = neigh_resolve_output, .connected_output = neigh_resolve_output, };

So, regardless of whether or not the neighbour is considered “connected” or “suspect” by the neighbour cache code, the neigh_resolve_output function will be attached to neigh->output and will be called when n->output is called above.

neigh_resolve_output

This function’s purpose is to attempt to resolve a neighbour that is not connected or one which is connected, but has no cached hardware header. Let’s take a look at how this function works: /* Slow and careful. */ int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb) { struct dst_entry *dst = skb_dst(skb); int rc = 0; if (!dst) goto discard; if (!neigh_event_send(neigh, skb)) { int err; struct net_device *dev = neigh->dev; unsigned int seq;

The code starts by doing some basic checks and proceeds to calling neigh_event_send . The neigh_event_send function is a short wrapper around __neigh_event_send which does the heavy lifting to resolve the neighbour. You can read the source for __neigh_event_send in ./net/core/neighbour.c, but the high-level takeaway from the code is that there are three cases users will be most interested in:

Neighbours in state NUD_NONE (the default state when allocated) will cause an immediate ARP request to be sent, assuming the values set in /proc/sys/net/ipv4/neigh/default/app_solicit and /proc/sys/net/ipv4/neigh/default/mcast_solicit allow probes to be sent (if not, the state is marked as NUD_FAILED ). The neighbour state will be updated and set to NUD_INCOMPLETE .

Neighbours in state NUD_STALE will be updated to NUD_DELAYED and a timer will be set to probe them later (later is the time now + /proc/sys/net/ipv4/neigh/default/delay_first_probe_time seconds).

Any neighbours in NUD_INCOMPLETE (including things from case 1 above) will be checked to ensure that the number of queued packets for an unresolved neighbour is less than or equal to /proc/sys/net/ipv4/neigh/default/unres_qlen . If there are more, packets are dequeued and dropped until the length is below or equal to the value in proc. A statistics counter in the neighbour cache stats is bumped for all occurrences of this.
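A small helper sketch for inspecting the tunables mentioned above. It reads integer-valued files from a neigh sysctl directory such as /proc/sys/net/ipv4/neigh/default, skipping anything missing, so it can be pointed at any directory of name-to-integer files:

```python
from pathlib import Path

def read_neigh_tunables(directory, names=("mcast_solicit", "app_solicit",
                                          "delay_first_probe_time",
                                          "unres_qlen")):
    """Read the listed tunables from a neigh sysctl directory, if present."""
    out = {}
    for name in names:
        path = Path(directory) / name
        if path.is_file():
            out[name] = int(path.read_text())
    return out

if __name__ == "__main__":
    print(read_neigh_tunables("/proc/sys/net/ipv4/neigh/default"))
```

Per-interface directories (e.g. /proc/sys/net/ipv4/neigh/eth0) can be read the same way to see values that override the defaults.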

If an immediate ARP probe is needed it will be sent. __neigh_event_send will return either 0 indicating that the neighbour is considered “connected” or “delayed” or 1 otherwise. The return value of 0 allows neigh_resolve_output to continue: if (dev->header_ops->cache && !neigh->hh.hh_len) neigh_hh_init(neigh, dst);

If the device’s protocol implementation (ethernet in our case) associated with the neighbour supports caching the hardware header and it is currently not cached, the call to neigh_hh_init will cache it. do { __skb_pull(skb, skb_network_offset(skb)); seq = read_seqbegin(&neigh->ha_lock); err = dev_hard_header(skb, dev, ntohs(skb->protocol), neigh->ha, NULL, skb->len); } while (read_seqretry(&neigh->ha_lock, seq));

Next, a seqlock is used to synchronize access to the neighbour structure’s hardware address which will be read by dev_hard_header when attempting to create the ethernet header for the skb. Once the seqlock has allowed execution to continue, error checking takes place: if (err >= 0) rc = dev_queue_xmit(skb); else goto out_kfree_skb; }

If the ethernet header was written without returning an error, the skb is handed down to dev_queue_xmit to pass through the Linux network device subsystem for transmit. If there was an error, a goto will drop the skb, set the return code, and return the error: out: return rc; discard: neigh_dbg(1, "%s: dst=%p neigh=%p\n", __func__, dst, neigh); out_kfree_skb: rc = -EINVAL; kfree_skb(skb); goto out; } EXPORT_SYMBOL(neigh_resolve_output);

Before we proceed into the Linux network device subsystem, let’s take a look at some files for monitoring and tuning the IP protocol layer.


Monitoring: IP protocol layer

/proc/net/snmp

Monitor detailed IP protocol statistics by reading /proc/net/snmp . $ cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51 1 10129840 2196520 1 0 0 0 ...

This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space-separated names for each of the corresponding values in the next line.
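Because the file is just pairs of name and value lines, it is straightforward to parse programmatically. A minimal Python sketch (the truncated sample below is illustrative, not real output):

```python
def parse_proc_net_snmp(text):
    # Parse the header/value line pairs from /proc/net/snmp into
    # {protocol: {field: value}}. Each protocol contributes a line of
    # field names followed by a line of the corresponding values.
    stats = {}
    lines = text.strip().splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        proto = header.split(":")[0]
        fields = header.split()[1:]
        nums = [int(v) for v in values.split()[1:]]
        stats[proto] = dict(zip(fields, nums))
    return stats

# Hypothetical, truncated sample in the same two-line format.
sample = (
    "Ip: Forwarding DefaultTTL InReceives OutRequests\n"
    "Ip: 1 64 25922988125 22789396404\n"
)
print(parse_proc_net_snmp(sample)["Ip"]["OutRequests"])  # → 22789396404
```

In practice you would pass the contents of /proc/net/snmp itself instead of the sample string.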

In the IP protocol layer, you will find statistics counters being bumped. Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in /proc/net/snmp can be found in include/uapi/linux/snmp.h: enum { IPSTATS_MIB_NUM = 0, /* frequently written fields in fast path, kept in same cache line */ IPSTATS_MIB_INPKTS, /* InReceives */ IPSTATS_MIB_INOCTETS, /* InOctets */ IPSTATS_MIB_INDELIVERS, /* InDelivers */ IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */ IPSTATS_MIB_OUTPKTS, /* OutRequests */ IPSTATS_MIB_OUTOCTETS, /* OutOctets */ /* ... */

Some interesting statistics:

OutRequests : Incremented each time an IP packet is attempted to be sent. It appears that this is incremented for every send, successful or not.

OutDiscards : Incremented each time an IP packet is discarded. This can happen if appending data to the skb (for corked sockets) fails, or if the layers below IP return an error.

OutNoRoutes : Incremented in several places, for example in the UDP protocol layer ( udp_sendmsg ) if no route can be generated for a given destination. Also incremented when an application calls “connect” on a UDP socket but no route can be found.

FragOKs : Incremented once per packet that is fragmented. For example, a packet split into 3 fragments will cause this counter to be incremented once.

FragCreates : Incremented once per fragment that is created. For example, a packet split into 3 fragments will cause this counter to be incremented three times.

FragFails : Incremented if fragmentation was attempted, but is not permitted (because the “Don’t Fragment” bit is set). Also incremented if outputting the fragment fails.

Other statistics are documented in the receive side blog post.

/proc/net/netstat

Monitor extended IP protocol statistics by reading /proc/net/netstat . $ cat /proc/net/netstat | grep IpExt IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pkts InCEPkts IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0

The format is similar to /proc/net/snmp , except the lines are prefixed with IpExt .

Some interesting statistics:

OutMcastPkts : Incremented each time a packet destined for a multicast address is sent.

OutBcastPkts : Incremented each time a packet destined for a broadcast address is sent.

OutOctets : The number of packet bytes output.

OutMcastOctets : The number of multicast packet bytes output.

OutBcastOctets : The number of broadcast packet bytes output.

Other statistics are documented in the receive side blog post.

Note that each of these is incremented in very specific locations in the IP layer. Code gets moved around from time to time, and double counting errors or other accounting bugs can sneak in. If these statistics are important to you, you are strongly encouraged to read the IP protocol layer source code so you understand exactly when the metrics you care about are (and are not) being incremented.
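Also remember that these counters are cumulative. When monitoring, you typically sample them periodically and look at the change between snapshots rather than the absolute values. A minimal sketch of that idea (the snapshot values below are made up):

```python
def counter_deltas(before, after):
    # Given two snapshots of {field: cumulative counter}, return the
    # per-field change between the two samples.
    return {name: after[name] - before[name] for name in after if name in before}

# Hypothetical snapshots of two IP counters, taken some interval apart.
t0 = {"OutRequests": 1000, "OutDiscards": 3}
t1 = {"OutRequests": 1500, "OutDiscards": 5}
print(counter_deltas(t0, t1))  # → {'OutRequests': 500, 'OutDiscards': 2}
```

A monitoring agent would feed real samples from /proc/net/snmp or /proc/net/netstat into this and alert on unexpected delta rates (for example, a rising OutDiscards).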

Linux netdevice subsystem

Before we pick up on the packet transmit path with dev_queue_xmit , let’s take a moment to talk about some important concepts which will appear in the coming sections.

Linux traffic control

Linux supports a feature called traffic control. This feature allows system administrators to control how packets are transmitted from a machine. This blog post will not dive into the details of every aspect of Linux traffic control. This document provides a great in-depth examination of the system, its control, and its features. There are a few concepts worth mentioning to make the code seen next easier to understand.

The traffic control system contains several different sets of queuing systems that provide different features for controlling traffic flow. Individual queuing systems are commonly called qdiscs, short for queuing disciplines. You can think of qdiscs as schedulers; qdiscs decide when and how packets are transmitted.

On Linux every interface has a default qdisc associated with it. For network hardware that supports only a single transmit queue, the default qdisc pfifo_fast is used. Network hardware that supports multiple transmit queues uses the default qdisc of mq . You can check your system by running tc qdisc .

It is also important to note that some devices support traffic control in hardware which can allow an administrator to offload traffic control to the network hardware and conserve CPU resources on the system.

Now that those ideas have been introduced, let’s proceed down dev_queue_xmit from ./net/core/dev.c.

dev_queue_xmit and __dev_queue_xmit

dev_queue_xmit is a simple wrapper around __dev_queue_xmit : int dev_queue_xmit(struct sk_buff *skb) { return __dev_queue_xmit(skb, NULL); } EXPORT_SYMBOL(dev_queue_xmit);

Following that, __dev_queue_xmit is where the heavy lifting gets done. Let’s take a look and step through this code piece by piece. Follow along: static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv) { struct net_device *dev = skb->dev; struct netdev_queue *txq; struct Qdisc *q; int rc = -ENOMEM; skb_reset_mac_header(skb); /* Disable soft irqs for various locks below. Also * stops preemption for RCU. */ rcu_read_lock_bh(); skb_update_prio(skb);

The code above starts out by:

1. Declaring variables.
2. Preparing the skb to be processed by calling skb_reset_mac_header . This resets the skb’s internal pointers so that the ethernet header can be accessed.
3. Calling rcu_read_lock_bh to prepare for reading RCU protected data structures in the code below. Read more about safely using RCU.
4. Calling skb_update_prio to set the skb’s priority, if the network priority cgroup is being used.

Now, we’ll get to the more complicated parts of transmitting data ;) txq = netdev_pick_tx(dev, skb, accel_priv);

Here the code attempts to determine which transmit queue to use. As you’ll see later in this post, some network devices expose multiple transmit queues for transmitting data. Let’s see how this works in detail.


netdev_pick_tx

The netdev_pick_tx code lives in ./net/core/flow_dissector.c. Let’s take a look: struct netdev_queue *netdev_pick_tx(struct net_device *dev, struct sk_buff *skb, void *accel_priv) { int queue_index = 0; if (dev->real_num_tx_queues != 1) { const struct net_device_ops *ops = dev->netdev_ops; if (ops->ndo_select_queue) queue_index = ops->ndo_select_queue(dev, skb, accel_priv); else queue_index = __netdev_pick_tx(dev, skb); if (!accel_priv) queue_index = dev_cap_txqueue(dev, queue_index); } skb_set_queue_mapping(skb, queue_index); return netdev_get_tx_queue(dev, queue_index); }

As you can see above, if the network device supports only a single TX queue, the more complex code is skipped and that single TX queue is returned. Most devices used on higher end servers will have multiple TX queues. There are two cases for devices with multiple TX queues:

1. The driver implements ndo_select_queue , which can be used to choose a TX queue more intelligently in a hardware- or feature-specific way, or
2. The driver does not implement ndo_select_queue , so the kernel should pick the queue itself.

As of the 3.13 kernel, not many drivers implement ndo_select_queue . The bnx2x and ixgbe drivers implement this function, but it is only used for fibre channel over ethernet (FCoE). In light of this, let’s assume that the network device does not implement ndo_select_queue and/or that FCoE is not being used. In that case, the kernel will choose the tx queue with __netdev_pick_tx .

Once __netdev_pick_tx determines the queue index, skb_set_queue_mapping will cache that value (it will be used later in the traffic control code) and netdev_get_tx_queue will look up and return a pointer to that queue. Let’s take a look at how __netdev_pick_tx works before going back up to __dev_queue_xmit .

__netdev_pick_tx

Let’s take a look at how the kernel chooses the TX queue to use for transmitting data. From ./net/core/flow_dissector.c: u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb) { struct sock *sk = skb->sk; int queue_index = sk_tx_queue_get(sk); if (queue_index < 0 || skb->ooo_okay || queue_index >= dev->real_num_tx_queues) { int new_index = get_xps_queue(dev, skb); if (new_index < 0) new_index = skb_tx_hash(dev, skb); if (queue_index != new_index && sk && rcu_access_pointer(sk->sk_dst_cache)) sk_tx_queue_set(sk, new_index); queue_index = new_index; } return queue_index; }

The code begins by checking if a transmit queue has already been cached on the socket by calling sk_tx_queue_get . If it hasn’t been cached, sk_tx_queue_get returns -1.

The next if-statement checks if any of the following are true:

The queue_index is < 0. This will happen if the queue hasn’t been set yet.

The ooo_okay flag is set. If this flag is set, it means that out of order packets are allowed now. The protocol layers must set this flag appropriately. The TCP protocol layer sets this flag when all outstanding packets for a flow have been acknowledged. When this happens, the kernel can choose a different TX queue for this packet. The UDP protocol layer does not set this flag, so UDP packets will never have ooo_okay set to a non-zero value.

The queue index is larger than the number of queues. This can happen if the user has recently changed the queue count on the device via ethtool . More on this later.

In any of those cases, the code descends into the slow path to get the transmit queue. This begins with get_xps_queue which attempts to use a user-configured map linking transmit queues to CPUs. This is called “Transmit Packet Steering.” We’ll look more closely at what Transmit Packet Steering (XPS) is and how it works shortly.

If get_xps_queue returns -1 (because this kernel does not support XPS, or XPS was not configured by the system administrator, or the configured mapping refers to an invalid queue), the code will continue on to call skb_tx_hash .

Once the queue is selected by either XPS or by the kernel automatically with skb_tx_hash , the queue is cached on the socket object with sk_tx_queue_set and returned. Let’s see how XPS and skb_tx_hash work before continuing through dev_queue_xmit .

Transmit Packet Steering (XPS)

Transmit Packet Steering (XPS) is a feature that allows the system administrator to determine which CPUs can process transmit operations for each available transmit queue supported by the device. The aim of this feature is mainly to avoid lock contention when processing transmit requests. Other benefits like reducing cache evictions and avoiding remote memory access on NUMA machines are also expected when using XPS.

You can read more about how XPS works by checking the kernel documentation for XPS. We’ll examine how to tune XPS for your system below, but for now, all you need to know is that to configure XPS the system administrator can define a bitmap mapping transmit queues to CPUs.

The function call in the code above to get_xps_queue will consult this user-specified map in order to determine which transmit queue should be used. If get_xps_queue returns -1 , skb_tx_hash will be used instead.

skb_tx_hash

If XPS is not included in the kernel, or is not configured, or suggests a queue that is not available (because perhaps the user adjusted the queue count) skb_tx_hash takes over to determine which queue the data should be sent on. Understanding precisely how skb_tx_hash works is important depending on your transmit workload. Note that this code has been adjusted over time, so if you are using a different kernel version than this document, you should consult your kernel source directly.

Let’s take a look at how it works, from ./include/linux/netdevice.h: /* * Returns a Tx hash for the given packet when dev->real_num_tx_queues is used * as a distribution range limit for the returned value. */ static inline u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb) { return __skb_tx_hash(dev, skb, dev->real_num_tx_queues); }

The code simply calls down to __skb_tx_hash , from ./net/core/flow_dissector.c. There’s some interesting code in this function, so let’s take a look: /* * Returns a Tx hash based on the given packet descriptor a Tx queues' number * to be used as a distribution range. */ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb, unsigned int num_tx_queues) { u32 hash; u16 qoffset = 0; u16 qcount = num_tx_queues; if (skb_rx_queue_recorded(skb)) { hash = skb_get_rx_queue(skb); while (unlikely(hash >= num_tx_queues)) hash -= num_tx_queues; return hash; }

The first if stanza in this function is an interesting short circuit. The function name skb_rx_queue_recorded is a bit misleading. An skb has a queue_mapping field that is used both for rx and tx. At any rate, this if statement can be true if your system is receiving packets and forwarding them elsewhere. If that isn’t the case, the code continues. if (dev->num_tc) { u8 tc = netdev_get_prio_tc_map(dev, skb->priority); qoffset = dev->tc_to_txq[tc].offset; qcount = dev->tc_to_txq[tc].count; }

To understand this piece of code, it is important to mention that a program can set the priority of data sent on a socket. This can be done by using setsockopt with the SOL_SOCKET and SO_PRIORITY level and optname, respectively. See the socket(7) man page for more information about SO_PRIORITY .

Note that if you have used the setsockopt option IP_TOS to set the TOS flags on the IP packets sent on a particular socket (or on a per-packet basis if passed as an ancillary message to sendmsg ) in your application, the kernel will translate the TOS options set by you to a priority, which ends up in skb->priority .
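As a concrete sketch of both options (this assumes Linux; priority values 0 through 6 can be set without CAP_NET_ADMIN, per the socket(7) man page):

```python
import socket

# Sketch: influence skb->priority for packets sent on a UDP socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Option 1: set the IP TOS byte (0x10 = low delay); the kernel
# translates this into a priority.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0x10)

# Option 2: set the priority directly (Linux-specific socket option).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, 4)

print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY))  # → 4
```

Packets subsequently sent on this socket carry that priority, which the traffic class mapping described above can then consult.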

As was mentioned earlier, some network devices support hardware based traffic control systems. If num_tc is non-zero, this device supports hardware based traffic control. In that case, the priority map, which maps packet priority to hardware based traffic classes, will be consulted, and the appropriate traffic class for the data’s priority will be selected based on this map.

Next, the range of appropriate transmit queues for the traffic class will be generated. They will be used to determine the transmit queue.

If num_tc was zero (because the network device does not support hardware based traffic control), the qcount and qoffset variables are set to the number of transmit queues and 0 , respectively.

Using qcount and qoffset , the index of the transmit queue will be calculated: if (skb->sk && skb->sk->sk_hash) hash = skb->sk->sk_hash; else hash = (__force u16) skb->protocol; hash = __flow_hash_1word(hash); return (u16) (((u64) hash * qcount) >> 32) + qoffset; } EXPORT_SYMBOL(__skb_tx_hash);

Finally, the appropriate queue index is returned back up to __netdev_pick_tx .
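The final line of that code maps the 32-bit hash onto the queue range with a multiply-and-shift instead of a modulo. The same arithmetic, sketched in Python (function name is ours, for illustration):

```python
def pick_tx_queue(hash32, qcount, qoffset=0):
    # ((u64)hash * qcount) >> 32 maps a 32-bit hash uniformly onto
    # [0, qcount) without a modulo; qoffset shifts into the traffic
    # class's queue range, as in __skb_tx_hash.
    return ((hash32 * qcount) >> 32) + qoffset

# With 8 queues, hashes spread across the full 32-bit range land
# evenly on queues 0..7.
print(pick_tx_queue(0x00000000, 8))  # → 0
print(pick_tx_queue(0x80000000, 8))  # → 4
print(pick_tx_queue(0xFFFFFFFF, 8))  # → 7
```

With a non-zero qoffset (a hardware traffic class whose queues start partway into the device's queue list), the result is simply shifted into that range.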


Resuming __dev_queue_xmit

At this point the appropriate transmit queue has been selected. __dev_queue_xmit can continue: q = rcu_dereference_bh(txq->qdisc); #ifdef CONFIG_NET_CLS_ACT skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS); #endif trace_net_dev_queue(skb); if (q->enqueue) { rc = __dev_xmit_skb(skb, q, dev, txq); goto out; }

It starts by obtaining a reference to the queuing discipline associated with this queue. Recall that earlier we saw that the default for single transmit queue devices is the pfifo_fast qdisc, whereas for multiqueue devices it is the mq qdisc.

Next, the code assigns a traffic classification “verdict” to the outgoing data, if the packet classification API has been enabled in your kernel. Next, the queue discipline is checked to see if there is a way to queue data. Some queuing disciplines like the noqueue qdisc do not have a queue. If there is a queue, the code calls down to __dev_xmit_skb to continue processing the data for transmit. Afterward, execution jumps to the end of this function. We’ll take a look at __dev_xmit_skb shortly. For now, let’s see what happens if there is no queue, starting with a very helpful comment: /* The device has no queue. Common case for software devices: loopback, all the sorts of tunnels... Really, it is unlikely that netif_tx_lock protection is necessary here. (f.e. loopback and IP tunnels are clean ignoring statistics counters.) However, it is possible, that they rely on protection made by us here. Check this and shot the lock. It is not prone from deadlocks. Either shot noqueue qdisc, it is even simpler 8) */ if (dev->flags & IFF_UP) { int cpu = smp_processor_id(); /* ok because BHs are off */

As the comment illustrates, the only devices that could have a qdisc with no queues are the loopback device and tunnel devices. If the device is currently up, the current CPU is saved. It is used for the next check, which is a bit tricky. Let’s take a look: if (txq->xmit_lock_owner != cpu) { if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT) goto recursion_alert;

There are two cases: either the transmit lock on this device queue is owned by this CPU, or it is not. If it is not, a counter variable xmit_recursion , which is allocated per-CPU, is checked to determine if the count is over the RECURSION_LIMIT . It is possible that one program could attempt to send data and get preempted right around this place in the code. Another program could then be selected by the scheduler to run, attempt to send data as well, and land here, too. So, the xmit_recursion counter is used to prevent more than RECURSION_LIMIT programs from racing here to transmit data. Let’s keep going: HARD_TX_LOCK(dev, txq, cpu); if (!netif_xmit_stopped(txq)) { __this_cpu_inc(xmit_recursion); rc = dev_hard_start_xmit(skb, dev, txq); __this_cpu_dec(xmit_recursion); if (dev_xmit_complete(rc)) { HARD_TX_UNLOCK(dev, txq); goto out; } } HARD_TX_UNLOCK(dev, txq); net_crit_ratelimited("Virtual device %s asks to queue packet!\n", dev->name); } else { /* Recursion is detected! It is possible, * unfortunately */ recursion_alert: net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n", dev->name); } }

The remainder of the code starts by taking the transmit lock. The chosen transmit queue is then checked to see if transmit has been stopped. If it hasn’t, the xmit_recursion variable is incremented and the data is passed down closer to the device to be transmitted. We’ll see dev_hard_start_xmit in more detail later. Once this completes, the lock is released. If the queue was stopped, or the transmit did not complete successfully, the lock is released and a warning is printed.

Alternatively, if the current CPU is the transmit lock owner, or if the RECURSION_LIMIT is hit, no transmit is done; instead, a warning is printed. The remaining code in the function sets the error code and returns.

Since we are interested in real ethernet devices, let’s continue down the code path that would have been taken for those earlier via __dev_xmit_skb .

__dev_xmit_skb

And now we descend into __dev_xmit_skb from ./net/core/dev.c armed with the queuing discipline, network device, and transmit queue reference: static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q, struct net_device *dev, struct netdev_queue *txq) { spinlock_t *root_lock = qdisc_lock(q); bool contended; int rc; qdisc_pkt_len_init(skb); qdisc_calculate_pkt_len(skb, q); /* * Heuristic to force contended enqueues to serialize on a * separate lock before trying to get qdisc main lock. * This permits __QDISC_STATE_RUNNING owner to get the lock more often * and dequeue packets faster. */ contended = qdisc_is_running(q); if (unlikely(contended)) spin_lock(&q->busylock);

This code begins by using qdisc_pkt_len_init and qdisc_calculate_pkt_len to compute an accurate length for the data that will be used by the qdisc later. This is necessary for skbs that will pass through hardware based send offloading (such as UDP Fragmentation Offloading, as we saw earlier) as the additional headers that will be added when fragmentation occurs need to be taken into account.

Next, a lock is used to help reduce contention on the qdisc’s main lock (a second lock we’ll see later). If the qdisc is currently running, other programs attempting to transmit will contend on the qdisc’s busylock . This allows the running qdisc to process packets while contending with a smaller number of programs for the second, main lock. This trick increases throughput as the number of contenders is reduced. You can read the original commit message describing this here. Next, the main lock is taken: spin_lock(root_lock);

Now, we approach an if statement that handles 3 possible cases:

1. The qdisc is deactivated.
2. The qdisc allows packets to bypass the queuing system, there are no other packets to send, and the qdisc is not currently running. A qdisc permits packet bypass when it is “work-conserving”; in other words, when it does not delay packet transmit for traffic shaping purposes.
3. All other cases.

Let’s take a look at what happens in each of these cases, in order starting with a deactivated qdisc: if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { kfree_skb(skb); rc = NET_XMIT_DROP;

This is straightforward. If the qdisc is deactivated, free the data and set the return code to NET_XMIT_DROP . Next, a qdisc allowing packet bypass, with no other outstanding packets, that is not currently running: } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) && qdisc_run_begin(q)) { /* * This is a work-conserving queue; there are no old skbs * waiting to be sent out; and the qdisc is not running - * xmit the skb directly. */ if (!(dev->priv_flags & IFF_XMIT_DST_RELEASE)) skb_dst_force(skb); qdisc_bstats_update(q, skb); if (sch_direct_xmit(skb, q, dev, txq, root_lock)) { if (unlikely(contended)) { spin_unlock(&q->busylock); contended = false; } __qdisc_run(q); } else qdisc_run_end(q); rc = NET_XMIT_SUCCESS;

This if statement is a bit tricky. The entire statement evaluates as true if all of the following are true:

1. q->flags & TCQ_F_CAN_BYPASS : The qdisc allows packets to bypass the queuing system. This will be true for “work-conserving” qdiscs; i.e. qdiscs that do not delay packet transmit for traffic shaping purposes. The pfifo_fast qdisc allows packets to bypass the queuing system.
2. !qdisc_qlen(q) : The qdisc’s queue has no data in it that is waiting to be transmitted.
3. qdisc_run_begin(q) : This function call will either set the qdisc’s state to “running” and return true, or return false if the qdisc is already running.

If all of the above evaluate to true, then:

The IFF_XMIT_DST_RELEASE flag is checked. If enabled, this flag indicates that the kernel is allowed to free the skb’s destination cache structure. The code in this function checks if the flag is disabled and, if so, forces a reference count bump on that structure.

qdisc_bstats_update is used to increment the number of bytes and packets sent by the qdisc.

sch_direct_xmit is used to attempt to transmit the packet. We’ll dive more into sch_direct_xmit shortly, as it is used in the slower code path, too.

The return value of sch_direct_xmit is checked for two cases:

1. The queue is not empty ( > 0 was returned). In this case, the lock preventing contention from other programs is released, and __qdisc_run is called to restart qdisc processing.
2. The queue was empty ( 0 was returned). In this case, qdisc_run_end is used to turn off qdisc processing.

In either case, the return value NET_XMIT_SUCCESS is set as the return code. That wasn’t too bad. Let’s check the last case, the catch-all: } else { skb_dst_force(skb); rc = q->enqueue(skb, q) & NET_XMIT_MASK; if (qdisc_run_begin(q)) { if (unlikely(contended)) { spin_unlock(&q->busylock); contended = false; } __qdisc_run(q); } }

In all other cases:

1. skb_dst_force is called to force a reference count bump on the skb’s destination cache reference.
2. The data is queued to the qdisc by calling the qdisc’s enqueue function, and the return code is stored.
3. qdisc_run_begin(q) is called to mark the qdisc as running. If it was not already running, the busylock is released and __qdisc_run(q) is called to start qdisc processing.

The function then finishes up by releasing some locks and returning the return code: spin_unlock(root_lock); if (unlikely(contended)) spin_unlock(&q->busylock); return rc;


Tuning: Transmit Packet Steering (XPS)

For XPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask must be provided describing which CPUs should process packets for a given interface and TX queue.

These bitmasks are similar to the RPS bitmasks and you can find some documentation about these bitmasks in the kernel documentation.

In short, the bitmasks to modify are found in:

/sys/class/net/DEVICE_NAME/queues/QUEUE/xps_cpus

So, for eth0 and transmit queue 0, you would modify the file: /sys/class/net/eth0/queues/tx-0/xps_cpus with a hexadecimal number indicating which CPUs should process transmit completions from eth0 ’s transmit queue 0. As the documentation points out, XPS may be unnecessary in certain configurations.
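The bitmask format is one bit per CPU, written in hexadecimal: bit N set means CPU N is included. A small helper (hypothetical, not part of any kernel tooling) for building a mask from a list of CPUs:

```python
def xps_mask(cpus):
    # Build the hexadecimal bitmask written to
    # /sys/class/net/<dev>/queues/tx-<n>/xps_cpus.
    # Bit N set means CPU N may use that transmit queue.
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

print(xps_mask([0, 1]))  # → "3"
print(xps_mask([4, 5]))  # → "30"
```

So, to bind eth0's tx-0 to CPUs 0 and 1, you would write "3" to the xps_cpus file shown above (the device and queue names here are illustrative).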

Queuing disciplines!

To follow the path of network data, we’ll need to move into the qdisc code a bit. This post does not intend to cover the specific details of each of the different transmit queue options. If you are interested in that, check this excellent guide.

For the purpose of this blog post, we’ll continue the code path by examining how the generic packet scheduler code works. In particular, we’ll explore how qdisc_run_begin , qdisc_run_end , __qdisc_run , and sch_direct_xmit work to move network data closer to the driver for transmit.

Let’s start by examining how qdisc_run_begin works and proceed from there.

qdisc_run_begin and qdisc_run_end

The qdisc_run_begin function can be found in ./include/net/sch_generic.h: static inline bool qdisc_run_begin(struct Qdisc *qdisc) { if (qdisc_is_running(qdisc)) return false; qdisc->__state |= __QDISC___STATE_RUNNING; return true; }

This function is simple: the qdisc __state flag is checked. If it’s already running, false is returned. Otherwise, __state is updated to enable the __QDISC___STATE_RUNNING bit.

Similarly, qdisc_run_end is anti-climactic: static inline void qdisc_run_end(struct Qdisc *qdisc) { qdisc->__state &= ~__QDISC___STATE_RUNNING; }

It simply disables the __QDISC___STATE_RUNNING bit in the qdisc’s __state field. It is important to note that both of these functions simply flip bits; neither actually starts or stops processing itself. The function __qdisc_run , on the other hand, will actually start processing.
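To make the begin/end semantics concrete, here is a toy Python model of the same bit flipping (an illustration only, not kernel code):

```python
# Toy model of qdisc_run_begin / qdisc_run_end: both just flip the
# __QDISC___STATE_RUNNING bit; neither starts or stops processing.
QDISC_STATE_RUNNING = 1 << 0

class Qdisc:
    def __init__(self):
        self.state = 0

    def run_begin(self):
        # Return False if already running; otherwise mark as running
        # and return True, granting the caller ownership of processing.
        if self.state & QDISC_STATE_RUNNING:
            return False
        self.state |= QDISC_STATE_RUNNING
        return True

    def run_end(self):
        self.state &= ~QDISC_STATE_RUNNING

q = Qdisc()
print(q.run_begin())  # → True (this caller now "owns" processing)
print(q.run_begin())  # → False (already running)
q.run_end()
print(q.run_begin())  # → True
```

In the kernel, the caller that gets True back is the one that proceeds into __qdisc_run; everyone else simply enqueues and moves on.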

__qdisc_run

The code for __qdisc_run is deceptively brief: void __qdisc_run(struct Qdisc *q) { int quota = weight_p; while (qdisc_restart(q)) { /* * Ordered by possible occurrence: Postpone processing if * 1. we've exceeded packet quota * 2. another process needs the CPU; */ if (--quota <= 0 || need_resched()) { __netif_schedule(q); break; } } qdisc_run_end(q); }

This function begins by obtaining the weight_p value. This is set typically via a sysctl and is also used in the receive path. We’ll see later how to adjust this value. This loop does two things:

1. It calls qdisc_restart in a loop until qdisc_restart returns false (or the break below is triggered).
2. It checks whether the quota has dropped to zero or below, or whether need_resched() returns true. If either is true, __netif_schedule is called and the loop is broken out of.

Remember: up to now the kernel is still executing on behalf of the original call to sendmsg by the user program; the user program is currently accumulating system time. If the user program has exhausted its time quota in the kernel, need_resched will return true. If there’s still available quota and the user program hasn’t used its time slice up yet, qdisc_restart will be called again.

Let’s see how qdisc_restart(q) works and then we’ll dive into __netif_schedule(q) .

qdisc_restart

Let’s jump into the code for qdisc_restart : /* * NOTE: Called under qdisc_lock(q) with locally disabled BH. * * __QDISC_STATE_RUNNING guarantees only one CPU can process * this qdisc at a time. qdisc_lock(q) serializes queue accesses for * this queue. * * netif_tx_lock serializes accesses to device driver. * * qdisc_lock(q) and netif_tx_lock are mutually exclusive, * if one is grabbed, another must be free. * * Note, that this procedure can be called by a watchdog timer * * Returns to the caller: * 0 - queue is empty or throttled. * >0 - queue is not empty. * */ static inline int qdisc_restart(struct Qdisc *q) { struct netdev_queue *txq; struct net_device *dev; spinlock_t *root_lock; struct sk_buff *skb; /* Dequeue packet */ skb = dequeue_skb(q); if (unlikely(!skb)) return 0; WARN_ON_ONCE(skb_dst_is_noref(skb)); root_lock = qdisc_lock(q); dev = qdisc_dev(q); txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb)); return sch_direct_xmit(skb, q, dev, txq, root_lock); }

The qdisc_restart function begins with a useful comment describing some of the locking constraints for calling this function. The first operation this function performs is to attempt to dequeue an skb from the qdisc.

The function dequeue_skb will attempt to obtain the next packet to transmit. If the queue is empty qdisc_restart will return false (causing the loop in __qdisc_run above to bail).

Assuming there is data to transmit, the code continues by obtaining a reference to the qdisc queue lock, the qdisc’s associated device, and the transmit queue.

All of these are passed through to sch_direct_xmit. Let’s take a look at dequeue_skb and then we’ll come back to sch_direct_xmit.


dequeue_skb

Let’s take a look at dequeue_skb from ./net/sched/sch_generic.c. This function handles two major cases:

1. Dequeuing data that was requeued because it could not be sent before, or
2. Dequeuing new data from the qdisc to be processed.

Let’s take a look at the first case:

```c
static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
	struct sk_buff *skb = q->gso_skb;
	const struct netdev_queue *txq = q->dev_queue;

	if (unlikely(skb)) {
		/* check the reason of requeuing without tx lock first */
		txq = netdev_get_tx_queue(txq->dev, skb_get_queue_mapping(skb));
		if (!netif_xmit_frozen_or_stopped(txq)) {
			q->gso_skb = NULL;
			q->q.qlen--;
		} else
			skb = NULL;
```

Note that the code begins by taking a reference to gso_skb field of the qdisc. This field holds a reference to data that was requeued. If no data was requeued, this field will be NULL . If that field is not NULL , the code continues by getting the transmit queue for the data and checking if the queue is stopped. If the queue is not stopped, the gso_skb field is cleared and the queue length counter is decreased. If the queue is stopped, the data remains attached to gso_skb , but NULL will be returned from this function.
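The "requeued-data-first" pattern described above can be modeled in a few lines of user-space C. The struct and function names here (fake_qdisc, fake_dequeue_requeued) are hypothetical; the sketch only mirrors the first branch of dequeue_skb: check a single stashed pointer before touching the real queue, and hand it back out only if the transmit queue can accept it.

```c
#include <stddef.h>

/* Simplified model of dequeue_skb's requeue handling; illustrative only. */
struct fake_qdisc {
	void *requeued;    /* plays the role of q->gso_skb */
	int qlen;          /* plays the role of q->q.qlen */
	int txq_stopped;   /* plays the role of netif_xmit_frozen_or_stopped() */
};

/* Return the requeued entry if the queue can transmit, else NULL.
 * If the queue is stopped, the data stays stashed for a later attempt. */
static void *fake_dequeue_requeued(struct fake_qdisc *q)
{
	void *skb = q->requeued;

	if (skb != NULL) {
		if (!q->txq_stopped) {
			q->requeued = NULL; /* hand the packet back out */
			q->qlen--;
		} else {
			skb = NULL;         /* queue stopped: leave it attached */
		}
	}
	return skb;
}
```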

Let’s check the next case, where there is no data that was requeued:

```c
	} else {
		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
		    !netif_xmit_frozen_or_stopped(txq))
			skb = q->dequeue(q);
	}

	return skb;
}
```

In the case where no data was requeued, another tricky compound if statement is evaluated. If:

1. The qdisc does not have a single transmit queue, or
2. The transmit queue is not stopped

Then, the qdisc’s dequeue function will be called to obtain new data. The internal implementation of dequeue will vary depending on the qdisc’s implementation and features.

The function finishes by returning the data that is up for processing.
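As one concrete illustration of how dequeue implementations can differ, the simplest qdiscs behave like a FIFO: dequeue just pops the head of a linked list. The sketch below uses made-up types (fake_skb, fifo_qdisc); it is not kernel code, only the FIFO analogue of calling q->dequeue(q).

```c
#include <stddef.h>

/* A toy FIFO "qdisc": a singly linked list with head/tail pointers. */
struct fake_skb {
	struct fake_skb *next;
	int id;
};

struct fifo_qdisc {
	struct fake_skb *head;
	struct fake_skb *tail;
};

/* Append a packet to the tail of the queue. */
static void fifo_enqueue(struct fifo_qdisc *q, struct fake_skb *skb)
{
	skb->next = NULL;
	if (q->tail)
		q->tail->next = skb;
	else
		q->head = skb;
	q->tail = skb;
}

/* Pop the oldest packet: the FIFO analogue of q->dequeue(q). */
static struct fake_skb *fifo_dequeue(struct fifo_qdisc *q)
{
	struct fake_skb *skb = q->head;

	if (skb) {
		q->head = skb->next;
		if (!q->head)
			q->tail = NULL;
	}
	return skb;
}
```

More sophisticated qdiscs (fair queuing, token bucket, and so on) replace this list-pop with their own scheduling decision, which is exactly why dequeue is a function pointer on the qdisc.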

sch_direct_xmit

Now we come to sch_direct_xmit (in ./net/sched/sch_generic.c), which is an important participant in moving data down toward the network device. Let’s walk through it, piece by piece:

```c
/*
 * Transmit one skb, and handle the return status as required. Holding the
 * __QDISC_STATE_RUNNING bit guarantees that only one CPU can execute this
 * function.
 *
 * Returns to the caller:
 *	0  - queue is empty or throttled.
 *	>0 - queue is not empty.
 */
int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
		    struct net_device *dev, struct netdev_queue *txq,
		    spinlock_t *root_lock)
{
	int ret = NETDEV_TX_BUSY;

	/* And release qdisc */
	spin_unlock(root_lock);

	HARD_TX_LOCK(dev, txq, smp_processor_id());
	if (!netif_xmit_frozen_or_stopped(txq))
		ret = dev_hard_start_xmit(skb, dev, txq);

	HARD_TX_UNLOCK(dev, txq);
```

The code begins by unlocking the qdisc lock and then taking the transmit lock. Note that HARD_TX_LOCK is a macro:

```c
#define HARD_TX_LOCK(dev, txq, cpu) {			\
	if ((dev->features & NETIF_F_LLTX) == 0) {	\
		__netif_tx_lock(txq, cpu);		\
	}						\
}
```

This macro checks whether the device has the NETIF_F_LLTX flag set in its feature flags. This flag is deprecated and should not be used by new device drivers. Most drivers in this kernel version do not set this flag, so the check will evaluate to true and the lock for this data’s transmit queue will be taken.
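The feature test inside HARD_TX_LOCK is an ordinary bitmask check. The sketch below shows the pattern with an invented flag value and struct (FAKE_NETIF_F_LLTX, fake_dev); the real NETIF_F_* bits are defined in the kernel headers and the real lock is a per-queue spinlock, not a plain field.

```c
#include <stdbool.h>

/* Illustrative feature bit; real NETIF_F_* values live in kernel headers. */
#define FAKE_NETIF_F_LLTX (1u << 0)

struct fake_dev {
	unsigned int features;
	int tx_lock_held; /* stand-in for the per-queue transmit spinlock */
};

/* Mirrors HARD_TX_LOCK's logic: take the lock only when the device
 * does NOT advertise lockless transmit (LLTX). */
static void fake_hard_tx_lock(struct fake_dev *dev)
{
	if ((dev->features & FAKE_NETIF_F_LLTX) == 0)
		dev->tx_lock_held = 1;
}
```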

Next, the transmit queue is checked to ensure that it is not stopped and then dev_hard_start_xmit is called. As