Or why do my DNS queries no longer resolve?

I use a Pi-Hole as a DNS Server on my home network and was enjoying the lack of adverts and geeking out over being able to see all the DNS queries my devices make, but then whilst watching Netflix on my PS4 all the other devices on the network stopped working.

I noticed Chrome was hanging on "Resolving host" and immediately suspected the Pi-Hole was malfunctioning somehow.

I was feeling lazy, so I just rebooted the Pi. It started working again, but only for a few minutes. Clearly this wasn’t some random flake that I could fix with a quick reboot, and I was going to have to work out what was actually wrong.

A quick Google led me to https://discourse.pi-hole.net/t/pihole-intermittent-dns-failures-solved-hardware-issue/10980/25

That thread suggested other people had seen intermittent issues on failing hardware. I’ve had terrible luck with MicroSD cards failing, so I was willing to believe this could be it.

A quick browse of /var/log/syslog showed nothing interesting, but dmesg did contain many warnings like this:

[30892.350075] Under-voltage detected! (0x00050005)

So I grabbed a different charger, rebooted the Pi and unpaused Sabrina. A short while later DNS resolution was broken again, but at least the warnings about under-voltage had gone!

I Googled some more around PS4s breaking DNS but found nothing, and now I was questioning whether it was actually my PS4 and/or Netflix breaking the Pi-Hole.

I tried Netflix on a MacBook and it worked fine, which made me more skeptical that the PS4 was the cause.

I disconnected all the other devices from the Pi-Hole apart from the PS4 and a laptop (so I could check if DNS resolution was working easily) and flipped the PS4 on.

Browsing around the UI didn’t break the Pi-Hole, so I tried Amazon Prime Video to see if that would break anything. It didn’t.

I flipped over to Netflix and once again my DNS resolution stopped working. I was pretty sure now that something Netflix on the PS4 did was breaking the Pi-Hole and wondered if you could kill the Pi-Hole with a special DNS query.

Tailing the Pi-Hole logs whilst turning Netflix on showed a whole flurry of requests going to the Pi-Hole, and my first thought was that maybe it was getting overwhelmed with requests.

I wrote a quick Python script to make lots of DNS requests in quick succession and let it run:

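import dns.resolver

resolver = dns.resolver.Resolver()

# Look up made-up names built from one repeated character, 2 to 9 characters long
for x in range(2, 10):
    for i in range(85, 200):
        try:
            print(resolver.query(chr(i) * x + '.com', 'A'))
        except Exception:
            # Most of these names won't resolve; the point is only to generate load
            pass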

It generated far, far more DNS queries than Netflix did and worked fine, so that was that theory out the window.

Next I pondered if performing a lookup on one of the Netflix domains returned a response that killed the Pi-Hole.

I grabbed a whole bunch of them from the logs and, using dig, started becoming well versed in Netflix’s IPs.

Aarons-iMac:Desktop aaronkalair$ dig nrdp.nccp.netflix.com

; <<>> DiG 9.10.6 <<>> nrdp.nccp.netflix.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29456
;; flags: qr rd ra; QUERY: 1, ANSWER: 10, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;nrdp.nccp.netflix.com. IN A

;; ANSWER SECTION:
nrdp.nccp.netflix.com. 10 IN CNAME nrdp.nccp.geo.netflix.com.
nrdp.nccp.geo.netflix.com. 10 IN CNAME nrdp.nccp.us-east-1.prodaa.netflix.com.
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 52.0.189.221
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 52.44.168.100
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 52.44.171.111
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 52.86.225.41
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 52.200.236.228
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 52.207.201.71
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 54.174.141.2
nrdp.nccp.us-east-1.prodaa.netflix.com. 10 IN A 35.172.153.212

;; Query time: 189 msec
;; SERVER: 192.168.1.82#53(192.168.1.82)
;; WHEN: Sat Dec 29 23:03:01 GMT 2018
;; MSG SIZE rcvd: 619

… repeat many times for many domains

They all worked fine, so I had to start digging deeper. I wondered if there was a rogue DNS request that would take out the Pi-Hole and not even get logged, so I turned to tcpdump to capture the network traffic coming to the Pi.

For the captures to make sense, it helps to understand what my setup looks like. I run Pi-Hole in a Docker container, with Cloudflared as the upstream DNS resolver running in a container alongside it.

The Pi-Hole exposes port 53 (UDP and TCP) to the host Pi, and other devices on the network have the Raspberry Pi’s IP address set as their DNS server.

The packet capture was largely uninteresting: lots of queries and responses.

There were no domains I hadn’t seen before. I repeated the capture a few times trying to spot a pattern, and noticed that a lookup for

api-global.netflix.com

always seemed to happen at around the time it stopped working.

But it was returning a result correctly, and no amount of performing lookups in a loop with dig could replicate the failure.

Then I spotted something I’d glossed over before. Right after the DNS query for api-global.netflix.com, a TCP connection was established between the Pi-Hole and the PS4, and it appeared to be performing a DNS lookup for the same address again, this time over TCP.

I wasn’t even sure if DNS over TCP was a thing, but a quick Google suggests that it indeed is: https://tools.ietf.org/html/rfc7766

Some more Googling suggests that it’s used when a response is over 512 bytes, the traditional size limit for a DNS message sent over UDP.

Interestingly the response for api-global.netflix.com is 674 bytes, but it does return the answer successfully 🤷.
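As a rough illustration of how the UDP-to-TCP fallback works at the byte level, here is a minimal Python sketch using only the standard library. The function names and the query ID are made up for the example. It relies on two facts from the DNS specs: a response cut short over UDP has the TC (truncated) bit set in its header flags, and every DNS message sent over TCP is preceded by a two-byte length field.

```python
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query for `name` (qtype 1 is an A record)."""
    # Header: id, flags (only "recursion desired" set), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length and the name ends with a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS 1 (IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def is_truncated(message):
    """True if the TC bit is set, meaning the UDP answer did not fit."""
    flags = struct.unpack("!H", message[2:4])[0]
    return bool(flags & 0x0200)


def frame_for_tcp(message):
    """Over TCP, each DNS message is preceded by a two-byte big-endian length."""
    return struct.pack("!H", len(message)) + message
```

A client following this pattern would send build_query("api-global.netflix.com") over UDP, and if is_truncated() is true for the reply, open a TCP connection to port 53 and send frame_for_tcp() of the same query.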

But there are no errors here, and this is a standard, so why is it killing my Pi-Hole?

Scrolling through the packet capture some more, I noticed something else that happens when I turn the PS4 off and the DNS responses start flowing again.

The TCP connection gets torn down. That suggests the TCP connection to the Pi-Hole is held open once it’s been established (presumably to avoid the overhead of establishing the connection again for future queries) and only gets torn down once the application is closed. (Maybe it has a timeout, maybe it sends keepalives at some point; I never actually investigated.)

If the Pi-Hole DNS server can only handle one request at a time then as this TCP connection has been held open maybe it just sits waiting to serve this request and can’t service the rest.
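To make that hypothesis concrete, here is a toy model in Python. It is a sketch only and not pihole-FTL's actual code: the function names are hypothetical, an echo reply stands in for a DNS answer, and everything runs on localhost with ports picked by the OS. The server loop handles UDP and TCP in a single thread and does a blocking read on any TCP client it accepts, so an idle TCP connection starves the UDP clients.

```python
import select
import socket
import threading
import time


def naive_server(udp, tcp, stop):
    """Single-threaded loop that blocks on whichever TCP client it accepted."""
    while not stop.is_set():
        ready, _, _ = select.select([udp, tcp], [], [], 0.1)
        for sock in ready:
            if sock is udp:
                data, addr = udp.recvfrom(512)
                udp.sendto(data, addr)  # echo stands in for a DNS answer
            else:
                conn, _ = tcp.accept()
                with conn:
                    # Blocking read: nothing else is served until this
                    # client disconnects.
                    while conn.recv(512):
                        pass


def udp_answered(addr, timeout=0.5):
    """Send one UDP probe and report whether a reply arrived in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(b"ping", addr)
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False


def demo():
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(("127.0.0.1", 0))
    tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp.bind(("127.0.0.1", 0))
    tcp.listen()
    stop = threading.Event()
    worker = threading.Thread(target=naive_server, args=(udp, tcp, stop), daemon=True)
    worker.start()

    results = {"before": udp_answered(udp.getsockname())}
    idle = socket.create_connection(tcp.getsockname())  # plays the part of the PS4
    time.sleep(0.3)  # give the server time to accept and block on the read
    results["while_idle"] = udp_answered(udp.getsockname())
    idle.close()  # "turning the PS4 off"
    results["after_close"] = udp_answered(udp.getsockname(), timeout=2)

    stop.set()
    worker.join()
    udp.close()
    tcp.close()
    return results


if __name__ == "__main__":
    print(demo())
```

If the model behaves like the Pi-Hole did, the UDP probe should be answered before the TCP connection is opened, go unanswered while it sits idle, and be answered again once it is closed.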

To check if this was what was happening, I attached strace to the pihole-FTL process, which does the actual DNS resolution on the Pi-Hole, and looked at what it was doing.

When everything is working, it’s a steady stream of reading in data from the socket, resolving the name and writing it back out to the socket.

And when DNS resolution stops working, the process hangs on a read syscall on file descriptor 15, which is indeed a socket:

root@61b30b99d173:/# ls -lah /proc/509/fd/15
lrwx------ 1 root root 64 Dec 30 00:02 /proc/509/fd/15 -> 'socket:[1131721]'

(As an aside, I tried various methods to map that socket back to information about the connection, but none seemed to work. We know from the packet capture what it is anyway.)
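For reference, one approach that can work on Linux is to look the socket inode up in the kernel's TCP table for that process, /proc/<pid>/net/tcp. The sketch below is an assumption about how that could be done, not something from the original investigation, and it may well be one of the methods that did not work here. The function names are made up, it only covers IPv4 (IPv6 sockets are listed in a separate tcp6 table), and the address decoding assumes a little-endian machine such as a Raspberry Pi or a typical PC.

```python
import socket
import struct


def decode_addr(hex_addr):
    """Turn an entry like '0100007F:0035' into ('127.0.0.1', 53)."""
    ip_hex, port_hex = hex_addr.split(":")
    # Assumes a little-endian host, where the IPv4 bytes appear reversed
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def find_socket(inode, table_text):
    """Return (local, remote) for the row with the given inode, or None."""
    for line in table_text.splitlines()[1:]:  # the first line is a header
        fields = line.split()
        # Columns: sl, local, remote, state, queues, timer, retrnsmt, uid, timeout, inode
        if len(fields) > 9 and fields[9] == str(inode):
            return decode_addr(fields[1]), decode_addr(fields[2])
    return None


def lookup(pid, inode):
    """Search the TCP table as seen from the network namespace of `pid`."""
    with open(f"/proc/{pid}/net/tcp") as table:
        return find_socket(inode, table.read())
```

With the values from the listing above, the call would be lookup(509, 1131721), run from inside the container so that PID 509 refers to the same process.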

So it does appear that the Pi-Hole gets stuck waiting for the open TCP connection to send it more data and stops servicing any other requests.

With some new words to Google, I was finally able to track down other people with this issue and confirm my theory: https://discourse.pi-hole.net/t/cloudflare-doh-netflix-problems-on-smarttv/8677/21

Apparently this happens on Smart TVs and Xboxes too!

Interestingly, the thread claims the issue is fixed in what looks like a dev commit, but I’m running the latest version of Pi-Hole and still see the issue. GitHub suggests the commit referenced just updates docs: https://github.com/pi-hole/FTL/commit/3656ba229de502e50dcbd51143329f4652b8d532

I needed a fix to get my network back online so I removed the mapping of port 53 for TCP connections from the container to the host and now Netflix works fine on the PS4 without taking down DNS resolution for the entire network!
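A quick way to confirm that a change like this took effect is to check that nothing accepts TCP connections on port 53 any more. This is a minimal sketch using only Python's standard library; the function name is made up for the example.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out or unreachable: treat all of these as "closed"
        return False
```

Using the server address from the dig output earlier, tcp_port_open("192.168.1.82", 53) should return False once the TCP mapping is gone, while ordinary UDP lookups with dig keep working.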

I haven’t noticed any weird issues from not being able to resolve DNS queries over TCP, which is nice. If I do start having issues because of this, I’ll probably put Nginx with a bunch of workers in front of the Pi-Hole, so that it can handle the persistent TCP connections and only pass work off to the Pi-Hole when necessary.

Follow me on Twitter @AaronKalair