The Rapid7 Project Sonar datasets are amazing resources. They represent scans across the internet, compressed and easy to download. This blog post will focus on two of these datasets:

https://opendata.rapid7.com/sonar.rdns_v2/ (rdns)

https://opendata.rapid7.com/sonar.fdns_v2/ (fdns_a)

Unfortunately, working with these datasets can be a bit slow as the rdns and fdns_a datasets each contain over 10GB of compressed text. My old workflow for using these datasets was not efficient:

ubuntu@client:~$ time gunzip -c fdns_a.json.gz | grep "erbbysam.com"

{"timestamp":"1535127239","name":"blog.erbbysam.com","type":"a","value":"54.190.33.125"}

{"timestamp":"1535133613","name":"erbbysam.com","type":"a","value":"104.154.120.133"}

{"timestamp":"1535155246","name":"www.erbbysam.com","type":"cname","value":"erbbysam.com"}

real 11m31.393s

user 12m29.212s

sys 1m37.672s

I suspected there had to be a faster way of searching these two datasets.

(TL;DR: reverse and sort the domains, then binary search.)

DNS Structure

A defining feature of the DNS system is its tree-like structure. Visiting this page, you are three levels below the root domain:

com

com.erbbysam

com.erbbysam.blog

The grep query above is really searching for names under a domain in this tree, not for an arbitrary string in the file. If we could shape our dataset into a DNS tree, the equivalent lookup would require only a quick traversal of that tree.

Binary Search

The task of transforming a large dataset into a tree on disk and traversing this tree can be simplified further using a binary search algorithm.

The first step in using a binary search algorithm is to sort the data. One option, matching the format above, is the form “com.erbbysam.blog”. This would require a slightly more complex DNS reversal algorithm than necessary. To simplify, reverse each line in its entirety instead:

moc.masybbre.golb,521.33.091.45

moc.masybbre,331.021.451.401

moc.masybbre.www,moc.masybbre
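To make the transformation concrete, here is a toy in-memory Go sketch (the real pipeline below does this with jq, tr, and rev over the full dataset). It shows the lowercase-and-reverse step and why it helps: once reversed, a domain and all of its subdomains share a common prefix, so they sort into one contiguous run that a binary search can find. The sample lines here are illustrative stand-ins, not real dataset contents.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// reverseLine mirrors the `tr '[:upper:]' '[:lower:]' | rev` pipeline step:
// lowercase the line, then reverse its bytes (DNS names are ASCII, so byte
// reversal is safe).
func reverseLine(s string) string {
	b := []byte(strings.ToLower(s))
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	// "value,name" lines as produced by the jq step.
	lines := []string{
		"54.190.33.125,blog.erbbysam.com",
		"104.154.120.133,erbbysam.com",
		"8.8.8.8,dns.google",
	}
	rev := make([]string, len(lines))
	for i, l := range lines {
		rev[i] = reverseLine(l)
	}
	sort.Strings(rev)

	// After reversal, erbbysam.com and all of its subdomains share the
	// prefix "moc.masybbre", so they occupy one contiguous run of the
	// sorted slice, found here with a stdlib binary search.
	prefix := reverseLine("erbbysam.com")
	for i := sort.SearchStrings(rev, prefix); i < len(rev) && strings.HasPrefix(rev[i], prefix); i++ {
		fmt.Println(rev[i])
	}
}
```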

There are no one-command solutions to sort a dataset that does not fit into memory (that I am aware of). To sort these large files, split the data into sorted chunks and then merge the results together:

# fetch the fdns_a file
wget -O fdns_a.gz https://opendata.rapid7.com/sonar.fdns_v2/2019-01-25-1548417890-fdns_a.json.gz

# extract and format our data
gunzip -c fdns_a.gz | jq -r '.value + "," + .name' | tr '[:upper:]' '[:lower:]' | rev > fdns_a.rev.lowercase.txt

# split the data into chunks to sort
# via https://unix.stackexchange.com/a/350068 -- split and merge code
split -b100M fdns_a.rev.lowercase.txt fileChunk

# remove the old files
rm fdns_a.gz
rm fdns_a.rev.lowercase.txt

# sort each of the pieces and delete the unsorted one
# via https://unix.stackexchange.com/a/35472 -- use LC_COLLATE=C to sort ., chars
for f in fileChunk*; do LC_COLLATE=C sort "$f" > "$f".sorted && rm "$f"; done

# merge the sorted files with local tmp directory
mkdir -p sorttmp
LC_COLLATE=C sort -T sorttmp/ -muo fdns_a.sort.txt fileChunk*.sorted

# clean up
rm fileChunk*

More detailed instructions for running this script and the rdns equivalent can be found here:

https://github.com/erbbysam/DNSGrep#run

DNSGrep

Now we can search the data! To accomplish this, I built a simple golang utility that can be found here:

https://github.com/erbbysam/DNSGrep

ubuntu@client:~$ ls -lath fdns_a.sort.txt
-rw-rw-r-- 1 ubuntu ubuntu 68G Feb 3 09:11 fdns_a.sort.txt
ubuntu@client:~$ time ./dnsgrep -f fdns_a.sort.txt -i "erbbysam.com"
104.154.120.133,erbbysam.com
54.190.33.125,blog.erbbysam.com
erbbysam.com,www.erbbysam.com

real 0m0.002s
user 0m0.000s
sys 0m0.000s

That is significantly faster!

The algorithm is pretty simple:

Use a binary search algorithm to seek through the file, looking for a substring match against the query. Once a match is found, the file is scanned backwards in 10KB increments looking for a non-matching substring. Once a non-matching substring is found, the file is scanned forwards until all exact matches are returned.

PoC

PoC disclaimer: There is no uptime/performance guarantee of this service and I likely will take this offline at some point in the future. Keep in mind that the datasets here are from a scan on 1/25/19 — DNS records may have changed by the time you read this.

As these queries are so quick, I set up an AWS EC2 t2.micro instance with a spinning disk (Cold HDD sc1) and hosted a server that allows queries into these datasets:

https://github.com/erbbysam/DNSGrep/tree/master/experimentalServer

https://dns.bufferover.run/dns?q=erbbysam.com

ubuntu@client:~$ curl 'https://dns.bufferover.run/dns?q=erbbysam.com'
{
  "Meta": {
    "Runtime": "0.000361 seconds",
    "Errors": [
      "rdns error: failed to find exact match via binary search"
    ],
    "FileNames": [
      "2019-01-25-1548417890-fdns_a.json.gz",
      "2019-01-30-1548868121-rdns.json.gz"
    ],
    "TOS": "The source of this data is Rapid7 Labs. Please review the Terms of Service: https://opendata.rapid7.com/about/"
  },
  "FDNS_A": [
    "104.154.120.133,erbbysam.com",
    "54.190.33.125,blog.erbbysam.com",
    "erbbysam.com,www.erbbysam.com"
  ],
  "RDNS": null
}

Having a bit of fun with this, I queried every North Korean domain name, grepping for the IPs not in North Korean IP space:

https://dns.bufferover.run/dns?q=.kp

ubuntu@client:~$ curl 'https://dns.bufferover.run/dns?q=.kp' 2> /dev/null | grep -v "\"175\.45\.17"
{
  "Meta": {
    "Runtime": "0.000534 seconds",
    "Errors": null,
    "FileNames": [
      "2019-01-25-1548417890-fdns_a.json.gz",
      "2019-01-30-1548868121-rdns.json.gz"
    ],
    "TOS": "The source of this data is Rapid7 Labs. Please review the Terms of Service: https://opendata.rapid7.com/about/"
  },
  "FDNS_A": [
    "175.45.0.178,ns1.portal.net.kp",
  ],
  "RDNS": [
    "185.33.146.18,north-korea.kp",
    "66.23.232.124,sverjd.ouie.kp",
    "64.86.226.78,ns2.friend.com.kp",
    "103.120.178.114,dedi.kani28test.kp",
    "198.98.49.51,gfw.kp",
    "185.86.149.212,hey.kp"
  ]
}

That’s it! Hopefully this was useful! Give it a try: https://dns.bufferover.run/dns?q=<hostname>

