We had moved most of our production workload to Kubernetes and were enjoying the advantages it offered: a lot of the plumbing was abstracted away and needed no manual intervention from us. But those same abstractions became a problem when the whole network stack of our cluster went down, and we were left with no option but to create a new cluster and migrate all our applications to it.

With this cluster-wide outage, our whole infrastructure came under scrutiny, and we were left clueless as to why it had happened. That was when we decided to take a deep dive into Kubernetes networking and understand every abstracted layer, from pod-to-pod networking using CNIs to service networking using iptables.

Around the same time, we also noticed intermittent delays in pod-to-pod communication and intermittent overall slowness in our web application.

Problem statement

How to understand and avoid the intermittent delays, usually in multiples of 5 seconds, in pod-to-pod communication.

Pod A sent a request at time x, but the request reached Pod B at x + 5s, x + 10s or x + 15s, and the overall request took much longer than expected.

First things first

We noticed that the delays were in multiples of 5 seconds, and we found this extremely helpful article, which explains that this delay in pod-to-pod communication is caused by DNS lookups.

The libc resolver (glibc on most Linux distributions, musl on Alpine), which handles DNS resolution for the application, sends out the A (IPv4) and AAAA (IPv6) lookups in parallel on the same port. The A record typically comes back almost immediately, whereas the AAAA response never arrives, so the resolver waits for it until its 5-second timeout expires and treats the whole lookup as timed out. It then retries, which is why DNS resolution ends up delayed by multiples of 5 seconds.

This delay is common on hosts where DNAT or SNAT translation is taking place (as it is for Kubernetes Services) and is caused by a race condition in conntrack; to understand it in more depth, follow this blog.
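If you want to corroborate the conntrack race on a node, one signal worth checking (our suggestion here, not something from the article above) is the insert_failed counter exposed by conntrack-tools; this assumes the conntrack CLI is installed on the node:

conntrack -S

A non-zero, growing insert_failed count while the delays are occurring is consistent with DNS packets being dropped by the race.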

Resolution

Linux

With the help of the article mentioned above, we simply added single-request-reopen to our pod DNS config and voila! Our intermittent delays were gone. Just to be sure, we checked the tcpdump output and saw that the A and AAAA lookups were now going out via different ports.
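For reference, this is roughly what the pod’s /etc/resolv.conf ends up looking like once the option is applied (the nameserver IP and search domains here are illustrative and will differ per cluster):

nameserver 100.64.0.10
search prod.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 single-request-reopen

Kubernetes merges the dnsConfig options into the resolv.conf it generates for the pod, so the option shows up alongside defaults such as ndots:5.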

Alpine

We didn’t change anything for our Alpine images, since most of our applications run on glibc-based Linux images anyway. If you notice intermittent delays with Alpine images, you can follow this blog.

GRPC

For gRPC applications we were still seeing DEADLINE_EXCEEDED errors with the timeout set to 3 seconds, and a DNS lookup taking more than 3 seconds is a lot. So we took the following steps:

1. We added single-request-reopen to the pod spec and captured the tcpdump output. The A and AAAA lookups were still going out via the same port.

2. We figured out that gRPC’s default DNS resolver is c-ares (when it is available), so gRPC was not using libc for its resolution and our setting was not being picked up.

3. Luckily, this DNS resolver can easily be overridden by setting the variable GRPC_DNS_RESOLVER=native. We added this to our Dockerfiles (a sketch follows below) and, lo and behold, problem solved. We captured the tcpdump output again and saw that the A and AAAA lookups were going out via different ports.
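As a minimal sketch of that Dockerfile change (only the ENV line is the actual fix; everything else in your Dockerfile stays as it is):

# in the application’s Dockerfile
ENV GRPC_DNS_RESOLVER=native

The same variable can also be set per container through the deployment spec, as shown in the conclusion below.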

Conclusion

To avoid intermittent delays of 5, 10, or 15 seconds, or DEADLINE_EXCEEDED errors in gRPC applications:

1. For Linux, add single-request-reopen to your deployment spec:

spec.template.spec:
  dnsConfig:
    options:
      - name: single-request-reopen

2. For gRPC applications on Linux, add single-request-reopen to your deployment spec as shown above and set the environment variable GRPC_DNS_RESOLVER=native.
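If you would rather set the variable in the deployment spec than in the Dockerfile, a minimal sketch would look like this (the container name and image are placeholders):

spec.template.spec:
  containers:
    - name: my-grpc-app
      image: my-grpc-app:latest
      env:
        - name: GRPC_DNS_RESOLVER
          value: "native"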

3. For Alpine, follow this blog.

PS: You can check the tcpdump output for DNS traffic using the following command:

tcpdump port 53
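The easiest place to run this is inside the pod’s own network namespace, for example via kubectl exec (this assumes tcpdump is available in the image or in a debug container attached to the pod, which may not be the case for slim images):

kubectl exec -it <pod-name> -- tcpdump port 53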

Output when the lookups for A and AAAA records go out via the same port, with a retry after 5 seconds:

18:09:09.252168 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 10118+ AAAA? sample-svc.prod.svc.cluster.local.prod.svc.cluster.local. (84)

18:09:09.252433 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 32141+ A? sample-svc.prod.svc.cluster.local.prod.svc.cluster.local. (84)

18:09:09.277656 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 10118 NXDomain*- 0/1/0 (177)

18:09:09.277731 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 64566+ AAAA? sample-svc.prod.svc.cluster.local.svc.cluster.local. (79)

18:09:09.279051 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 64566 NXDomain*- 0/1/0 (172)

18:09:09.279091 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 35772+ AAAA? sample-svc.prod.svc.cluster.local.cluster.local. (75)

18:09:09.280330 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 35772 NXDomain*- 0/1/0 (168)

18:09:09.280374 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 50317+ AAAA? sample-svc.prod.svc.cluster.local.ap-southeast-1.compute.internal. (93)

18:09:09.283045 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 50317 NXDomain 0/0/0 (93)

18:09:09.283088 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 50250+ AAAA? sample-svc.prod.svc.cluster.local. (61)

18:09:09.284291 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 50250*- 0/1/0 (154)

18:09:09.934566 IP tcpdump-1565114940-8h4g5.33044 > kube-dns.kube-system.svc.cluster.local.domain: 28773+ PTR? 10.0.64.100.in-addr.arpa. (42)

18:09:09.937109 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.33044: 28773*- 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (118)

18:09:15.251741 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 32141+ A? sample-svc.prod.svc.cluster.local.prod.svc.cluster.local. (84)

18:09:15.253162 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 32141 NXDomain*- 0/1/0 (177)

18:09:15.253196 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 25412+ A? sample-svc.prod.svc.cluster.local.svc.cluster.local. (79)

18:09:15.254387 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 25412 NXDomain*- 0/1/0 (172)

18:09:15.254457 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 12261+ A? sample-svc.prod.svc.cluster.local.cluster.local. (75)

18:09:15.255646 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 12261 NXDomain*- 0/1/0 (168)

18:09:15.255668 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 23917+ A? sample-svc.prod.svc.cluster.local.ap-southeast-1.compute.internal. (93)

18:09:15.258122 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 23917 NXDomain 0/0/0 (93)

18:09:15.258151 IP tcpdump-1565114940-8h4g5.36319 > kube-dns.kube-system.svc.cluster.local.domain: 24648+ A? sample-svc.prod.svc.cluster.local. (61)

18:09:15.259403 IP kube-dns.kube-system.svc.cluster.local.domain > tcpdump-1565114940-8h4g5.36319: 24648*- 1/0/0 A 100.71.57.175 (120)

Output when the lookups for A and AAAA records go out via different ports: