Back in April, we noticed that several of our applications, though not all, were quite frequently timing out when querying either internal or external services, regardless of port or protocol. Reproducing the issue was as simple as running cURL from any of our containers, to any destination: the majority of the queries would stall for durations close to multiples of five seconds. Five seconds, you say? That is generally the red flag for DNS issues. Let's figure it out.
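A minimal reproduction sketch (the helper name and URL are illustrative, not from our tooling): cURL's `%{time_namelookup}` write-out variable prints only the DNS resolution time, so the five-second stalls stand out immediately.

```shell
# Hypothetical helper: print only the DNS resolution time of a request, in seconds.
# On an affected container, many runs show values just above 5.0, 10.0, and so on.
name_lookup_time() {
  curl -s -o /dev/null -w '%{time_namelookup}\n' "$1"
}

# Run it a few times from inside any container:
# for i in $(seq 1 20); do name_lookup_time https://example.com; done
```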

An initial look at the problem

With our Kubernetes stacks using recent and fairly trusted components (AWS, Kubernetes 1.10.2, Weave, CoreDNS), and given the experience I acquired while architecting and developing Tectonic at CoreOS, I was pretty confident our clusters were all well configured overall. Still, Kubernetes has various moving pieces, and subtle issues may be introduced at any of those layers; where might it be this time? Unlike in the OpenStack world, swapping one component for another in Kubernetes is fairly easy, thanks to the amazingly simple interfaces the system is built upon (e.g. CNI) and the incredible implementation diversity the community offers. I therefore decided to invest an hour using iptables rather than IPVS, then replacing Weave with Calico, and CoreDNS with the older KubeDNS, in order to rule them out. No luck: the issue was still there. I did notice, however, that disabling Weave's fastdp made the issue disappear, but running Weave without fastdp is not a realistic option.

I then remembered various networking issues we had at CoreOS while working on Tectonic for Azure, due to a TX checksum offloading malfunction in their hypervisor/load-balancer implementation, but that seemed unlikely to be relevant here. The clusters and their nodes were mostly idle, and the ARP tables looked totally fine: no stale entries (as has happened with kube-proxy before), and not full. Back to square one.

Going deeper, and discovering a time-saving feature of CoreDNS

Looking at the facts again, the issue occurs with some applications, but not all of them, most of the time, but not always. The base image of the containers does not seem to have an effect on the numbers. There were a few issues opened here and there during the past few months about DNS latency, some of them totally unrelated (e.g. scalability, misconfiguration, ARP tables being full).

What is the difference between an affected application, like cURL, and an application that worked totally fine? I opened a few tcpdump sessions on the different nodes and containers in the path of cURL in an attempt to answer this question, and understand the problem better.

Reading the container’s tcpdump capture, two lookups were made by libc, spaced by only a few CPU cycles, for the A and AAAA records. While the responses to the A queries came back quickly, the AAAA queries did not seem to be answered in a timely manner and were repeated after five seconds. IPv6 is disabled everywhere across our clusters, at the kernel level, and our network interfaces do not even have link-local addresses. Why the applications or libc would make AAAA lookups at all got me somewhat confused, but I could imagine potential use cases; moving on. Reading the DNS server’s capture, IPv6 turned out to be irrelevant: the server did not even receive most of the packets containing the AAAA queries, which are transported over IPv4 UDP. When it did receive them, it would query the upstream server in a similar fashion. So the only thing IPv6 about those AAAA lookups is that they ask for IPv6 records; the packets themselves travel over IPv4.

Because the resolv.conf file of Kubernetes’ containers has numerous search domains and ndots:5, libc generally has to look up several composed names before getting a positive result, unless the requested domain is fully qualified and has a trailing dot, which most applications do not use. For example, to resolve google.com, google.com.kube-system.svc.cluster.local., google.com.svc.cluster.local., google.com.cluster.local., google.com.ec2.internal. and finally google.com. must be looked up, for both A and AAAA records. That’s a lot of round trips, especially when most of the AAAA requests time out after five seconds and must be retried. I discovered that CoreDNS can actually limit the number of round trips required, thanks to its autopath feature: it automatically detects queries made with a known Kubernetes suffix, iterates server-side through the usual search domains, and leverages its own knowledge/cache of the available Kubernetes services to find a valid record (or falls back to querying the upstream server). It finally returns both a CNAME pointing to the domain name found to be valid, and an A/AAAA record with the actual IP address for that domain name (or NXDOMAIN if the record does not exist, obviously). I was baffled to see how smart and convenient that was; such an easy win.
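The expansion above can be sketched as a small helper (hypothetical, mirroring glibc's search-domain behavior for a pod in the kube-system namespace; the search list is an assumption taken from the captures below):

```shell
# Hypothetical sketch of glibc's search-domain expansion with ndots:5.
# Prints the candidate names in the order they would be tried.
expand_query() {
  name=$1
  search="kube-system.svc.cluster.local svc.cluster.local cluster.local ec2.internal"
  # A trailing dot marks the name as fully qualified: no expansion at all.
  case "$name" in
    *.) echo "$name"; return ;;
  esac
  # With fewer than ndots (5) dots, every search domain is tried first.
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt 5 ]; then
    for domain in $search; do
      echo "$name.$domain."
    done
  fi
  echo "$name."
}
```

`expand_query google.com` prints five candidates, ending with `google.com.`; doubled for A and AAAA, that is ten lookups for a single external name.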

Default resolv.conf on Kubernetes pods:

```
nameserver 172.17.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
```

CoreDNS configuration:

```
.:5353 {
    errors
    log
    health
    reload
    kubernetes cluster.local 172.16.0.0/16 172.17.0.0/16 {
        pods verified
        resyncperiod 1m
        fallthrough
    }
    cache 10 cluster.local 172.16.0.0/16 172.17.0.0/16
    autopath @kubernetes
    proxy . /etc/resolv.conf
    prometheus 0.0.0.0:9153
}
```

CoreDNS's autopath shortcuts search domains, but timeouts still occur:

```
19:27:05.990180 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 3888+ A? google.com.kube-system.svc.cluster.local. (58)
19:27:05.990253 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:05.990258 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:06.103767 IP ip-172-17-0-10.us-east-2.compute.internal.domain > ip-172-16-0-6.us-east-2.compute.internal.52031: 3888 2/0/0 CNAME google.com., A 172.217.15.110 (98)
19:27:10.994773 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 3888+ A? google.com.kube-system.svc.cluster.local. (58)
19:27:10.994791 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 3888+ A? google.com.kube-system.svc.cluster.local. (58)
19:27:10.995299 IP ip-172-17-0-10.us-east-2.compute.internal.domain > ip-172-16-0-6.us-east-2.compute.internal.52031: 3888 2/0/0 CNAME google.com., A 172.217.15.110 (98)
19:27:10.995330 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:10.995337 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:11.100456 IP ip-172-17-0-10.us-east-2.compute.internal.domain > ip-172-16-0-6.us-east-2.compute.internal.52031: 58213 2/0/0 CNAME google.com., AAAA 2a00:1450:8003::69 (110)
```

An initial workaround

This did not solve the root cause though: we were still seeing AAAA lookups taking up to five seconds.

After a bit of digging, I read in the resolv.conf(5) man page that two options relevant to the parallel lookup mechanism used by glibc are available: single-request and single-request-reopen, both of which enable sequential lookups. After specifying either of those options, using the relatively new dnsConfig configuration block (alpha in Kubernetes 1.9), I finally saw only sub-second queries, and got immediately excited at the prospect of simply adding this to our templates and calling it a day. I applied the changes and happily went home; it was late anyway.

glibc's single-request & single-request-reopen, from resolv.conf(5):

```
single-request (since glibc 2.10)
        Sets RES_SNGLKUP in _res.options. By default, glibc performs
        IPv4 and IPv6 lookups in parallel since version 2.9. Some
        appliance DNS servers cannot handle these queries properly
        and make the requests time out. This option disables the
        behavior and makes glibc perform the IPv6 and IPv4 requests
        sequentially (at the cost of some slowdown of the resolving
        process).

single-request-reopen (since glibc 2.9)
        Sets RES_SNGLKUPREOP in _res.options. The resolver uses the
        same socket for the A and AAAA requests. Some hardware
        mistakenly sends back only one reply. When that happens the
        client system will sit and wait for the second reply.
        Turning this option on changes this behavior so that if two
        requests from the same port are not handled correctly it
        will close the socket and open a new one before sending the
        second request.
```

Adding the single-request-reopen option to a Kubernetes pod:

```yaml
dnsConfig:
  options:
    - name: single-request-reopen
```

Setback & netfilter race conditions

That was until I discovered that the workaround had no effect on Alpine containers. It was at that moment that I knew musl was going to give me a hard time, again; I should have known better. Its resolver only supports ndots, attempts, and timeout. Awesome. I went to talk to Rich Felker on #musl, only to learn that no change would be made, as sequential lookups are against their architecture and because, according to other users on the IRC channel, Kubernetes’ use of ndots is a heresy anyway. Wherever the actual issue lies (be it in the general concept of Kubernetes’ networking), it should be fixed there.

Sequential queries work; parallel ones sometimes do, but not always. That has got to be a race condition: with the number of networking trickeries Kubernetes performs to get packets from one end to the other, it would not be too surprising after all. After some additional research, I found some existing literature about netfilter race conditions, such as this one or that one. Looking at conntrack -S, we had thousands of insert_failed: this was it. It turns out that a few engineers had noticed the issue and gone through the troubleshooting process as well, identifying a SNAT race condition, ironically briefly documented in netfilter’s code. The solution would be to add --random-fully on all masquerading rules, which are set by several components in Kubernetes: kubelet, kube-proxy, Weave and Docker itself. There is only one little problem here: this is an early feature, available neither on Container Linux, nor in Alpine’s iptables package, nor in the Go wrapper for iptables. Regardless, it seems generally accepted that this would be the solution to the issue, and some developers are now implementing the missing flag support. But behold, it does not stop here.
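Checking for the drops is straightforward with conntrack-tools: `conntrack -S` prints one line of counters per CPU, so summing insert_failed across them shows the damage (a sketch; the helper name is mine, and the counter layout follows the conntrack-tools per-CPU output):

```shell
# Sum conntrack insertion failures across all CPUs from `conntrack -S` output.
# A counter that keeps growing here points at the netfilter races described above.
sum_insert_failed() {
  awk '{ for (i = 1; i <= NF; i++)
           if ($i ~ /^insert_failed=/) { split($i, kv, "="); total += kv[2] } }
       END { print total + 0 }'
}

# On a node: conntrack -S | sum_insert_failed
```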

Based on various traces, Martynas Pumputis discovered that there was also a race with DNAT, as the DNS server is reached via a virtual IP. Because UDP is a connectionless protocol, connect(2) does not send any packet, and therefore no entry is created in the conntrack hash table. During the translation, the following netfilter hooks are called in order: nf_conntrack_in (creates the conntrack hash object and adds it to the list of unconfirmed entries), nf_nat_ipv4_fn (does the translation and updates the conntrack tuple), and nf_conntrack_confirm (confirms the entry and adds it to the hash table). The two parallel UDP requests race for the entry confirmation and end up using different DNS endpoints, as there are multiple DNS server replicas available. When that happens, insert_failed is incremented and the request is dropped. This means that adding --random-fully does not mitigate the packet loss, as that flag only helps with the SNAT race! The only reliable fix would be to patch netfilter directly, which Martynas Pumputis is currently attempting to do.

A short and efficient workaround

Getting a patch into the kernel, and having it released, is not something that happens overnight. I therefore started writing my own workaround, based on all the knowledge gathered while troubleshooting the issue. Fortunately, I had learned how to use tc(8) back when I was administering a large infrastructure of containers for my startup Harmony Hosting, in order to provide bandwidth guarantees to our customers and help mitigate DDoS attacks. Coping with such a race condition requires nothing but introducing a small amount of artificial latency into AAAA packets. Using iptables, we can mark UDP traffic destined to the port exposing our DNS server that has the DNS query bits set (an inexpensive check) and that contains at least one question with QTYPE=AAAA. We need to be cautious about existing marks and use a proper mask. With tc, we can route the marked traffic through a two-band priomap to a netem that introduces a few milliseconds’ worth of latency, and the rest to a standard fq_codel. Additionally, we need to do our DPI and traffic shaping on the right interface, as Weave encapsulates and encrypts traffic using IPSec (ESP), obfuscating everything. The good news is that the Weave interface is a virtual interface and is therefore set to noqueue by default: we won’t need to worry about mq, or about grafting qdiscs onto specific TX/RX queues or CPU cores, which makes the script extremely simple.

Traffic shaping AAAA queries to work around netfilter's races:

```shell
# Force the kernel to re-create the dummy mq scheduler on the default interface,
# - as the child qdiscs may have been set to pfifo_fast at boot even if the default
#   appears to be 'fq_codel' (we also set the default to fq_codel regardless, for older
#   systems)
# - as the qdiscs are using a quantum based on the boot MTU, which may have changed
#   after DHCP has gotten the proper MTU.
#
# Setting mq will only work if the NIC supports multiple TX/RX queues, thereby
# creating and grafting each class/qdisc to specific CPU cores. In case the NIC
# does not support that, we simply ignore the error.
sysctl -w net.core.default_qdisc=fq_codel
tc qdisc del dev $(route | grep '^default' | grep -o '[^ ]*$') root 2>/dev/null || true
tc qdisc add dev $(route | grep '^default' | grep -o '[^ ]*$') root handle 0: mq || true

# Traffic leaving the weave interface onto the default interface will be encapsulated
# and encrypted in IPSec (ESP), therefore we may only do traffic shaping work on this
# interface.
#
# The weave interface is a virtual interface, which is set to noqueue by default and does
# not support mq nor multiq. Therefore, we go straight to the point and create a 2-band
# priomap that sends all traffic (regardless of the TOS octet) to the 2nd band, a simple
# fq_codel. We then define the 1st band as a netem with a small delay, which appears to
# avoid the race in a statistically satisfying manner, controlled by a pareto
# distribution (k=4ms, a=1ms), and route traffic marked with 0x100/0x100 to it.
#
# Using iptables, we mark with 0x100/0x100 the UDP traffic destined to port 5353 that has
# the DNS query bits set (fast check) and contains at least one question with QTYPE=AAAA.
while ! ip link | grep "weave:" > /dev/null; do sleep 1; done
tc qdisc del dev weave root 2>/dev/null || true
tc qdisc add dev weave root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc add dev weave parent 1:2 handle 12: fq_codel
tc qdisc add dev weave parent 1:1 handle 11: netem delay 4ms 1ms distribution pareto
tc filter add dev weave protocol all parent 1: prio 1 handle 0x100/0x100 fw flowid 1:1
iptables -A POSTROUTING -t mangle -p udp --dport 5353 -m string -m u32 --u32 "28 & 0xF8 = 0" --hex-string "|00001C0001|" --algo bm --from 40 -j MARK --set-mark 0x100/0x100

while sleep 3600; do :; done
```

Finally, we can build a very simple container image with only the iproute2 package, and run it alongside Weave’s containers in its DaemonSet.
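A sketch of such an image (the base image, tag, and script name are assumptions; iptables and net-tools are added here only because the script above calls iptables and route):

```dockerfile
# Hypothetical Dockerfile for the weave-tc sidecar.
FROM alpine:3.8
RUN apk add --no-cache iproute2 iptables net-tools
COPY weave-tc /usr/local/bin/weave-tc
ENTRYPOINT ["/usr/local/bin/weave-tc"]
```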

Setting up the traffic shaping on every node, as part of Weave's DaemonSet:

```yaml
- name: weave-tc
  image: 'qmachu/weave-tc:0.0.1'
  securityContext:
    privileged: true
  volumeMounts:
    - name: xtables-lock
      mountPath: /run/xtables.lock
    - name: lib-tc
      mountPath: /lib/tc
```

Conclusion

All in all, given the current adoption of Kubernetes, it is quite surprising that only a few Kubernetes engineers have noticed this omnipresent and highly disruptive issue. That may be because networking conditions are not as favorable to the race everywhere, or it may be a symptom of a general lack of monitoring.

However, I am thrilled that we ended up with a workaround consisting of 10 lines of bash and 10 lines of YAML, which does not require maintaining patches anywhere or pushing any changes down to our users, and which reduces the likelihood of the races happening to far less than one percent. Along the way, we also picked up a change that cuts the number of DNS round trips dramatically!

Edit: As mentioned by Duffie Cooley, it would also be possible to run the DNS server on every node using a DaemonSet, and to specify the node’s IP as the clusterDNS in kubelet’s configuration. This solution is unfortunately unusable for us, as containers with cluster-wide permissions (even read-only) are not allowed to run on our worker nodes, and as containers do not have direct network access to any of our nodes, for security reasons.