We recently gave a presentation on Programming socket lookup with BPF at the Linux Plumbers Conference 2019 in Lisbon, Portugal. This blog post is a recap of the problem statement and proposed solution we presented.

CC0 Public Domain, PxHere

Our edge servers are crowded. We run more than a dozen public facing services, leaving aside the all internal ones that do the work behind the scenes.

Quick Quiz #1: How many can you name? We blogged about them! Jump to answer.

These services are exposed on more than a million Anycast public IPv4 addresses partitioned into 100+ network prefixes.

To keep things uniform every Cloudflare edge server runs all services and responds to every Anycast address. This allows us to make efficient use of the hardware by load-balancing traffic between all machines. We have shared the details of Cloudflare edge architecture on the blog before.

Granted not all services work on all the addresses but rather on a subset of them, covering one or several network prefixes.

So how do you set up your network services to listen on hundreds of IP addresses without driving the network stack over the edge?

Cloudflare engineers have had to ask themselves this question more than once over the years, and the answer has changed as our edge evolved. This evolution forced us to look for creative ways to work with the Berkeley sockets API, a POSIX standard for assigning a network address and a port number to your application. It has been quite a journey, and we are not done yet.

When life is simple - one address, one socket

The simplest kind of association between an (IP address, port number) and a service that we can imagine is one-to-one. A server responds to client requests on a single address, on a well known port. To set it up the application has to open one socket for each transport protocol (be it TCP or UDP) it wants to support. A network server like our authoritative DNS would open up two sockets (one for UDP, one for TCP):

(192.0.2.1, 53/tcp) ⇨ ("auth-dns", pid=1001, fd=3) (192.0.2.1, 53/udp) ⇨ ("auth-dns", pid=1001, fd=4)

To take it to Cloudflare scale, the service is likely to have to receive on at least a /20 network prefix, which is a range of IPs with 4096 addresses in it.

This translates to opening 4096 sockets for each transport protocol. Something that is not likely to go unnoticed when looking at ss tool output.

$ sudo ss -ulpn 'sport = 53' State Recv-Q Send-Q Local Address:Port Peer Address:Port … UNCONN 0 0 192.0.2.40:53 0.0.0.0:* users:(("auth-dns",pid=77556,fd=11076)) UNCONN 0 0 192.0.2.39:53 0.0.0.0:* users:(("auth-dns",pid=77556,fd=11075)) UNCONN 0 0 192.0.2.38:53 0.0.0.0:* users:(("auth-dns",pid=77556,fd=11074)) UNCONN 0 0 192.0.2.37:53 0.0.0.0:* users:(("auth-dns",pid=77556,fd=11073)) UNCONN 0 0 192.0.2.36:53 0.0.0.0:* users:(("auth-dns",pid=77556,fd=11072)) UNCONN 0 0 192.0.2.31:53 0.0.0.0:* users:(("auth-dns",pid=77556,fd=11071)) …

CC BY 2.0, Luca Nebuloni, Flickr

The approach, while naive, has an advantage: when an IP from the range gets attacked with a UDP flood, the receive queues of sockets bound to the remaining IP addresses are not affected.

Life can be easier - all addresses, one socket

It seems rather silly to create so many sockets for one service to receive traffic on a range of addresses. Not only that, the more listening sockets there are, the longer the chains in the socket lookup hash table. We have learned the hard way that going in this direction can hurt packet processing latency.

The sockets API comes with a big hammer that can make our life easier - the INADDR_ANY aka 0.0.0.0 wildcard address. With INADDR_ANY we can make a single socket receive on all addresses assigned to our host, specifying just the port.

s = socket(AF_INET, SOCK_STREAM, 0) s.bind(('0.0.0.0', 12345)) s.listen(16)

Quick Quiz #2: Is there another way to bind a socket to all local addresses? Jump to answer.

In other words, compared to the naive “one address, one socket” approach, INADDR_ANY allows us to have a single catch-all listening socket for the whole IP range on which we accept incoming connections.

In Linux this is possible thanks to a two-phase listening socket lookup, where it falls back to search for an INADDR_ANY socket if a more specific match has not been found.

Another upside of binding to 0.0.0.0 is that our application doesn’t need to be aware of what addresses we have assigned to our host. We are also free to assign or remove the addresses after binding the listening socket. No need to reconfigure the service when its listening IP range changes.

On the other hand if our service should be listening on just A.B.C.0/20 prefix, binding to all local addresses is more than we need. We might unintentionally expose an otherwise internal-only service to external traffic without a proper firewall or a socket filter in place.

Then there is the security angle. Since we now only have one socket, attacks attempting to flood any of the IPs assigned to our host on our service’s port, will hit the catch-all socket and its receive queue. While in such circumstances the Linux TCP stack has your back, UDP needs special care or legitimate traffic might drown in the flood of dropped packets.

Possibly the biggest downside, though, is that a service listening on the wildcard INADDR_ANY address claims the port number exclusively for itself. Binding over the wildcard-listening socket with a specific IP and port fails miserably due to the address already being taken ( EADDRINUSE ).

bind(3, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 bind(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)

Unless your service is UDP-only, setting the SO_REUSEADDR socket option, will not help you overcome this restriction. The only way out is to turn to SO_REUSEPORT , normally used to construct a load-balancing socket group. And that is only if you are lucky enough to run the port-conflicting services as the same user (UID). That is a story for another post.

Quick Quiz #3: Does setting the SO_REUSEADDR socket option have any effect at all when there is bind conflict? Jump to answer.

Life gets real - one port, two services

As it happens, at the Cloudflare edge we do host services that share the same port number but otherwise respond to requests on non-overlapping IP ranges. A prominent example of such port-sharing is our 1.1.1.1 recursive DNS resolver running side-by-side with the authoritative DNS service that we offer to all customers.

Sadly the sockets API doesn’t allow us to express a setup in which two services share a port and accept requests on disjoint IP ranges.

However, as Linux development history shows, any networking API limitation can be overcome by introducing a new socket option, with sixty-something options available (and counting!).

Enter SO_BINDTOPREFIX .

Back in 2016 we proposed an extension to the Linux network stack. It allowed services to constrain a wildcard-bound socket to an IP range belonging to a network prefix.

# Service 1, 127.0.0.0/20, 1234/tcp net1, plen1 = '127.0.0.0', 20 bindprefix1 = struct.pack('BBBBBxxx', *inet_aton(net1), plen1) s1 = socket(AF_INET, SOCK_STREAM, 0) s1.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix1) s1.bind(('0.0.0.0', 1234)) s1.listen(1) # Service 2, 127.0.16.0/20, 1234/tcp net2, plen2 = '127.0.16.0', 20 bindprefix2 = struct.pack('BBBBBxxx', *inet_aton(net2), plen2) s2 = socket(AF_INET, SOCK_STREAM, 0) s2.setsockopt(SOL_IP, IP_BINDTOPREFIX, bindprefix2) s2.bind(('0.0.0.0', 1234)) s2.listen(1)

This mechanism has served us well since then. Unfortunately, it didn’t get accepted upstream due to being too specific to our use-case. Having no better alternative we ended up maintaining patches in our kernel to this day.

Life gets complicated - all ports, one service

Just when we thought we had things figured out, we were faced with a new challenge. How to build a service that accepts connections on any of the 65,535 ports? The ultimate reverse proxy, if you will, code named Spectrum.

The bind syscall offers very little flexibility when it comes to mapping a socket to a port number. You can either specify the number you want or let the network stack pick an unused one for you. There is no counterpart of INADDR_ANY , a wildcard value to select all ports ( INPORT_ANY ?).

To achieve what we wanted, we had to turn to TPROXY, a Netfilter / iptables extension designed for intercepting remote-destined traffic on the forward path. However, we use it to steer local-destined packets, that is ones targeted to our host, to a catch-all-ports socket.

iptables -t mangle -I PREROUTING \ -d 192.0.2.0/24 -p tcp \ -j TPROXY --on-ip=127.0.0.1 --on-port=1234

TPROXY-based setup comes at a price. For starters, your service needs elevated privileges to create a special catch-all socket (see the IP_TRANSPARENT socket option). Then you also have to understand and consider the subtle interactions between TPROXY and the receive path for your traffic profile, for example:

does connection tracking register the flows redirected with TPROXY?

is listening socket contention during a SYN flood when using TPROXY a concern?

do other parts of the network stack, like XDP programs, need to know about TPROXY redirecting packets?

These are some of the questions we needed to answer, and after running it in production for a while now, we have a good idea of what the consequences of using TPROXY are.

That said, it would not come as a shock, if tomorrow we’d discovered something new about TPROXY. Due to its complexity we’ve always considered using it to steer local-destined traffic a hack, a use-case outside of its intended application. No matter how well understood, a hack remains a hack.

Can BPF make life easier?

Despite its complex nature TPROXY shows us something important. No matter what IP or port the listening socket is bound to, with a bit of support from the network stack, we can steer any connection to it. As long the application is ready to handle this situation, things work.

Quick Quiz #4: Are there really no problems with accepting any connection on any socket? Jump to answer.

This is a really powerful concept. With a bunch of TPROXY rules, we can configure any mapping between (address, port) tuples and listening sockets.

💡 Idea #1: A local-destined connection can be accepted by any listening socket.

We didn’t tell you the whole story before. When we published SO_BINDTOPREFIX patches, they did not just get rejected. As sometimes happens by posting the wrong answer, we got the right answer to our problem

❝BPF is absolutely the way to go here, as it allows for whatever user specified tweaks, like a list of destination subnetwork, or/and a list of source network, or the date/time of the day, or port knocking without netfilter, or … you name it.❞

💡 Idea #2: How we pick a listening socket can be tweaked with BPF.

Combine the two ideas together and we arrive at an exciting concept. Let’s run BPF code to match an incoming packet with a listening socket, ignoring the address the socket is bound to. 🤯

Here’s an example to illustrate it.

All packets coming on 192.0.2.0/24 prefix, port 53 are steered to socket sk:2 , while traffic targeted at 203.0.113.1 , on any port number lands in socket sk:4 .

Welcome BPF inet_lookup

To make this concept a reality we are proposing a new mechanism to program the socket lookup with BPF. What is socket lookup? It’s a stage on the receive path where the transport layer searches for a socket to dispatch the packet to. The last possible moment to steer packets before they land in the selected socket receive queue. In there we attach a new type of BPF program called inet_lookup .

If you recall, socket lookup in the Linux TCP stack is a two phase process. First the kernel will try to find an established (connected) socket matching the packet 4-tuple. If there isn’t one, it will continue by looking for a listening socket using just the packet 2-tuple as key.

Our proposed extension allows users to program the second phase, the listening socket lookup. If present, a BPF program is allowed to choose a listening socket and terminate the lookup. Our program is also free to ignore the packet, in which case the kernel will continue to look for a listening socket as usual.

How does this new type of BPF program operate? On input, as context, it gets handed a subset of information extracted from packet headers, including the packet 4-tuple. Based on the input the program accesses a BPF map containing references to listening sockets, and selects one to yield as the socket lookup result.

If we take a look at the corresponding BPF code, the program structure resembles a firewall rule. We have some match statements followed by an action.

You may notice that we don’t access the BPF map with sockets directly. Instead we follow an established pattern in BPF called “map based redirection”, where a dedicated BPF helper accesses the map and carries out any steps necessary to redirect the packet.

We’ve skipped over one thing. Where does the BPF map of sockets come from? We create it ourselves and populate it with sockets. This is most easily done if your service uses systemd socket activation. systemd will let you associate more than one service unit with a socket unit, and both of the services will receive a file descriptor for the same socket. From there it’s just a matter of inserting the socket into the BPF map.

Demo time!

This is not just a concept. We have already published a first working set of patches for the kernel together with ancillary user-space tooling to configure the socket lookup to your needs.

If you would like to see it in action, you are in luck. We’ve put together a demo that shows just how easily you can bind a network service to (i) a single port, (ii) all ports, or (iii) a network prefix. On-the-fly, without having to restart the service! There is a port scan running to prove it.

You can also bind to all-addresses-all-ports ( 0.0.0.0/0 ) because why not? Take that INADDR_ANY . All thanks to BPF superpowers.

Summary

We have gone over how the way we bind services to network addresses on the Cloudflare edge has evolved over time. Each approach has its pros and cons, summarized below. We are currently working on a new BPF-based mechanism for binding services to addresses, which is intended to address the shortcomings of existing solutions.

bind to one address and port

👍 flood traffic on one address hits one socket, doesn’t affect the rest

👎 as many sockets as listening addresses, doesn’t scale

bind to all addresses with INADDR_ANY

👍 just one socket for all addresses, the kernel thanks you

👍 application doesn’t need to know about listening addresses

👎 flood scenario requires custom protection, at least for UDP

👎 port sharing is tricky or impossible

bind to a network prefix with SO_BINDTOPREFIX

👍 two services can share a port if their IP ranges are non-overlapping

👎 custom kernel API extension that never went upstream

bind to all port with TPROXY

👍 enables redirecting all ports to a listening socket and more

👎 meant for intercepting forwarded traffic early on the ingress path

👎 has subtle interactions with the network stack

👎 requires privileges from the application

bind to anything you want with BPF inet_lookup

👍 allows for the same flexibility as with TPROXY or SO_BINDTOPREFIX

👍 services don’t need extra capabilities, meant for local traffic only

👎 needs cooperation from services or PID 1 to build a socket map

Getting to this point has been a team effort. A special thank you to Lorenz Bauer and Marek Majkowski who have contributed in an essential way to the BPF inet_lookup implementation. The SO_BINDTOPREFIX patches were authored by Gilberto Bertin.

Fancy joining the team? Apply here!

Quiz Answers

Quiz 1

Q: How many Cloudflare services can you name?

Go back

Quiz 2

Q: Is there another way to bind a socket to all local addresses?

Yes, there is - by not bind() ’ing it at all. Calling listen() on an unbound socket is equivalent to binding it to INADDR_ANY and letting the kernel pick a free port.

$ strace -e socket,bind,listen nc -l socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 listen(3, 1) = 0 ^Z [1]+ Stopped strace -e socket,bind,listen nc -l $ ss -4tlnp State Recv-Q Send-Q Local Address:Port Peer Address:Port LISTEN 0 1 *:42669

Go back

Quiz 3

Q: Does setting the SO_REUSEADDR socket option have any effect at all when there is bind conflict?

Yes. If two processes are racing to bind and listen on the same TCP port, on an overlapping IP, setting SO_REUSEADDR changes which syscall will report an error ( EADDRINUSE ). Without SO_REUSEADDR it will always be the second bind. With SO_REUSEADDR set there is a window of opportunity for a second bind to succeed but the subsequent listen to fail.

Go back

Quiz 4

Q: Are there really no problems with accepting any connection on any socket?

If the connection is destined for an address assigned to our host, i.e. a local address, there are no problems. However, for remote-destined connections, sending return traffic from a non-local address (i.e., one not present on any interface) will not get past the Linux network stack. The IP_TRANSPARENT socket option bypasses this protection mechanism known as source address check to lift this restriction.

Go back