TL;DR

It wasn’t a network problem

At first glance, it seemed likely that the problem was somewhere in the network inside our environment — which is exactly why a network engineer was asked to troubleshoot it.

Step 1: Try to analyze the network data

The analysis turned up nothing: no interesting logs, no alerts, no anomalies, and no packet drops. CPU usage on the network devices was low, buffers were empty, and the physical links were nowhere near over-utilized. Microbursts can happen, but TCP usually handles them.

Step 2: Look for network problems on end devices

Nothing interesting was found here either — no drops, not much traffic, not many sessions in general, no relevant logs on the Kubernetes side, and no errors in Calico (our Kubernetes CNI).

Step 3: Inspect Redis configuration

This inspection was mostly to look for config lines that could affect existing sessions: timeouts, keepalives, session expiration, and so on. We didn’t find anything interesting here either, only a standard config like our other clusters (except those other clusters weren’t experiencing such problems). Other services outside Kubernetes didn’t have similar issues, so most likely Redis isn’t to blame. Moving along…

Step 4: Check TCP communication

We checked the TCP keepalives on both sides, just to make sure an idle session wouldn’t drop by itself. The number of sessions from each pod and from Redis was also checked: one established session per pod, and everything looked ideal.
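For reference, keepalive behavior on Linux is governed by a few sysctls. The values below are illustrative (consistent with the five-minute keepalives mentioned later), not necessarily our exact settings:

```
# Illustrative values; check your own with `sysctl net.ipv4.tcp_keepalive_time` etc.
net.ipv4.tcp_keepalive_time = 300    # seconds of idle before the first keepalive probe
net.ipv4.tcp_keepalive_intvl = 60    # seconds between probes
net.ipv4.tcp_keepalive_probes = 5    # failed probes before the connection is declared dead
```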

Step 5: Check protocol communication

It could be that Redis just doesn’t like some messages from our service and drops the connection; that would explain the behavior. How do you check which circumstances cause Redis to close the session? The easiest way we found was good old tcpdump. The problem here is that the events are quite rare and random, so it takes time to catch the interesting traffic, and there’s a lot of data to store. Luckily, tcpdump supports rotating its capture files, something like this:

# -C 500 caps each capture file at ~500 MB; -W 1 keeps a single file in the rotation
tcpdump -w redis.pcap -C 500 -W 1 -n

After spending some time waiting, we caught the reset event in the traffic dump!

The reset was on a new session with only a few packets, and the old session had simply stopped sending traffic right before the new session appeared. While trying to understand where that new session came from, we compared the sequence and acknowledgment numbers in the TCP headers — it was the same session, yet the source ports were different!

Why?

Kubernetes had decided that the old NAT session was over and that it needed to create a new one. There were no FINs (the packets that signal TCP wants to close a connection) and no unacknowledged segments (missing packets). TCP timeouts? It’s definitely worthwhile to check the sysctl values — maybe the timeouts were configured too small. No: the TCP timeouts were 24 hours, and we had keepalives every five minutes.

Maybe the maximum number of sessions had been reached? Again no; we were far from the limit. Kubernetes uses network namespaces, and they can have their own parameters, so we checked the values again inside the network namespaces: they were the same or even larger.
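The conntrack-side knobs in question are sysctls like these. The values are illustrative, consistent with the 24-hour established timeout and the session headroom described above; verify your own with sysctl:

```
net.netfilter.nf_conntrack_tcp_timeout_established = 86400  # 24h, matching what we saw
net.netfilter.nf_conntrack_max = 262144                     # table size; we were far below it
```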

Step 6: Who is killing NAT sessions?

We began logging conntrack events for the Redis connections. Since the event is quite rare, it took a while to catch it, but eventually the closed connections appeared in our logs.
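Conntrack events can be streamed with `conntrack -E`; the DESTROY events for the Redis port are the interesting ones. As a sketch, filtering a captured event log looks like this (the event lines and addresses below are illustrative samples, not our real traffic):

```shell
# On the node we ran something along the lines of:
#   conntrack -E -p tcp --dport 6379 > conntrack-events.log
# Sample lines stand in for real captured output:
cat > conntrack-events.log <<'EOF'
 [UPDATE] tcp      6 86400 ESTABLISHED src=10.0.0.5 dst=10.0.1.9 sport=42318 dport=6379 [ASSURED]
[DESTROY] tcp      6 src=10.0.0.5 dst=10.0.1.9 sport=42318 dport=6379 [ASSURED]
EOF
# Count the deletions:
grep -c DESTROY conntrack-events.log
```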

Conntrack showed that the sessions were simply deleted, as if we had removed them with a command. We tried to reproduce the behavior by deleting sessions manually and got the same amount of the same errors. So why were sessions being deleted in the middle of the day without a reason? What could be doing this? Since the Kubernetes CNI can manage conntrack entries, some quick Googling confirmed that yes, Calico indeed has such code:

c.newCmd("conntrack", "--family", family, "--delete", direction, ipAddr.String())

Looks like we found the issue, but why does this happen? I spent some time getting to know Calico’s source code and found out that this function is used to delete stale connections when spawning a new pod with the same IP. The thing is, we don’t spawn any pods with duplicate IPs; we would have far more serious problems if we did!

Step 7: Look for a bug in the Kubernetes CNI

The search is on!

After spending a lot of time on the internets, we found plenty of bugs similar to ours, but they either weren’t quite the same or had already been fixed in our version of the CNI.

If it was still a Calico bug, it should affect other containers as well. To test that theory, we checked conntrack events not only for Redis but for all connections from the pod, and all of them were eventually deleted at once!

After this, we spent some time in the code trying to find what else could trigger this function, but with no luck.

We enabled Calico debug messages on a single node. It took some time until the problem occurred there, but it was worth it. Among tons of useless messages we found a couple of interesting ones: two messages, 10 ms apart, saying that the IP address on the interface had changed. Why would Kubernetes think the VM’s IP had changed when it had been the same for a month? We checked the systemd-networkd logs, and there it was:

systemd-networkd: DHCP lease lost

Step 8: DHCP deep dive

Okay, so we have a VM hosting Kubernetes pods that loses its IP address, which triggers the deletion of all NAT sessions associated with that IP. The problem is, my networking background tells me that’s not quite right: a DHCP lease should be renewed without the IP ever being removed from the interface. Did the VM even try to renew its lease?

We went to the OpenStack DHCP server logs, and they showed that we would sometimes get a DHCPNAK for Kubernetes nodes, which is a problem. A full DHCPDISCOVER->DHCPOFFER->DHCPREQUEST->DHCPACK exchange for a new lease worked fine, but when a node sent only a DHCPREQUEST to renew an existing lease, the server replied with “DHCPNAK Lease not found”.

New leases worked without a problem, but renewal requests failed with a NAK; yet for some VMs, lease renewal worked fine. We quickly found that only some Ubuntu VMs had the problem: Debian worked fine, and some Ubuntu machines were fine as well. So what was the difference between those requests?

Running tcpdump in the qdhcp network namespace on the OpenStack controller node would hopefully give us the answer. What we found was that the refused requests contained a client-id (option 61), while the successful ones didn’t. For some reason, the OpenStack DHCP server couldn’t renew a lease if the renewal request contained a client-id. Ubuntu uses a UUID as the client-id, while Debian uses the MAC address for the same purpose. Maybe we have a legacy DHCP server? (Spoiler: no)

Step 9: Solve the problem

The solution was simple: use the MAC address as the client-id on Ubuntu VMs. This required some Chef code to update the netplan configuration, and then waiting for the change to deploy across the environment.
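With netplan, the switch is a one-line setting: `dhcp-identifier: mac` tells systemd-networkd to use the interface MAC as the DHCP client identifier instead of the machine-id-derived UUID. A sketch (the file path and interface name `ens3` are assumptions, not our actual config):

```yaml
# /etc/netplan/50-cloud-init.yaml (path and interface name are illustrative)
network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true
      dhcp-identifier: mac   # send the MAC, not the UUID-based client-id
```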

Step 10: Solve the real problem

While applying the change to the VMs, we decided to check which VMs had already switched to MAC addresses and which still used a UUID.

# ssh <VMs> 'cat /run/systemd/netif/leases/2' | grep CLIENTID

CLIENTID=ffb55e67ff00020000ab11e98bb3c5eb5d0d55
CLIENTID=ffb55e67ff00020000ab11e98bb3c5eb5d0d55
CLIENTID=ffb55e67ff00020000ab11e98bb3c5eb5d0d55
CLIENTID=ffb55e67ff00020000ab11e98bb3c5eb5d0d55
CLIENTID=ffb55e67ff00020000ab11e98bb3c5eb5d0d55

We were quite shocked to get a huge list of identical UUIDs! A UUID is supposed to be unique, and this one is derived from the machine-id. So the next command was:

# ssh <VMs> 'cat /etc/machine-id'

46a37da3142748e59807ed017994e80d

46a37da3142748e59807ed017994e80d

46a37da3142748e59807ed017994e80d

46a37da3142748e59807ed017994e80d

46a37da3142748e59807ed017994e80d
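In hindsight, piping the collected IDs through a counter makes duplicates jump out immediately. A sketch, with printf and sample data standing in for the actual ssh loop over the VMs:

```shell
# Stand-in for: ssh <VMs> 'cat /etc/machine-id'
printf '%s\n' \
  46a37da3142748e59807ed017994e80d \
  46a37da3142748e59807ed017994e80d \
  46a37da3142748e59807ed017994e80d \
| sort | uniq -c | sort -rn
# A count greater than 1 next to any ID means duplicated machine-ids.
```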

This file had somehow made it into the Ubuntu image we use to spin up new VMs. All the VMs used the same DHCP client-id, which obviously didn’t work — end of story.

The solution was pretty simple after all — just delete the machine-id from our image, change the machine-id on some existing VMs, or redeploy some clusters.
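For an existing VM, the usual recovery (hedged — verify against your distribution’s documentation) is to empty /etc/machine-id and let systemd-machine-id-setup regenerate it. A fresh machine-id is just 32 lowercase hex digits:

```shell
# On a real host (requires root):
#   truncate -s 0 /etc/machine-id && systemd-machine-id-setup
# A machine-id is 32 hex characters; a sketch of producing a random one:
head -c16 /dev/urandom | od -An -tx1 | tr -d ' \n'
```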

Summary

A few ERR_REDIS_NOT_CONNECTED errors from one service, occurring only a few times every 24 hours, led us on a multi-day journey to discover that the Ubuntu image for our Kubernetes nodes was broken. The fix ended up being extremely easy compared with the effort it took just to understand what to fix.

What could have been done faster or in a more optimized way:

When we understood that the sessions from all the pods on a node were being cleared, instead of diving into Kubernetes networking we could have simply checked the node’s syslog, which contained “dnsmasq-dhcp: lease abandoned” messages. We had seen them a lot, and I had even checked what triggers them; there is a comment about that situation in the dnsmasq source code: “If a host is configured with more than one MAC address, it’s OK to ‘nix a lease from one of it’s MACs to give the address to another.” Unfortunately, I didn’t understand its meaning at the time; now it’s pretty obvious.

As a final note

Troubleshooting complex distributed systems can be challenging, even when you have all the logs in place. There’s such a huge amount of data that finding the useful parts of it can be tricky. And the root cause can turn out to be very random and surprising.