Rapid7 publish a reverse DNS result set every few weeks as part of their Project Sonar study. It's made up of the DNS pointer (PTR) record responses they've seen across the entire publicly-routable IPv4 address space. These results won't return every hostname pointing at a given IPv4 address, but the hostnames they do contain are often the ones created by the network provider. This can be very useful for distinguishing between IPs used for residential, enterprise and cloud connectivity.
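For reference, a PTR query resolves an address like 1.10.230.137 via a reverse-pointer name in which the octets are reversed and the in-addr.arpa suffix is appended. A minimal sketch of how that name is formed, using only Python's standard library:

```python
from ipaddress import ip_address


def reverse_pointer(ip):
    # Build the in-addr.arpa name a PTR query is issued against:
    # the octets are reversed and the reverse-lookup suffix appended.
    return ip_address(ip).reverse_pointer


print(reverse_pointer('1.10.230.137'))  # 137.230.10.1.in-addr.arpa
```

A live lookup against that record could then be done with socket.gethostbyaddr, which asks the system resolver rather than a pre-collected dataset like Rapid7's.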

In a recent survey, 37% of Python developers said they use the language to build web scraping bots. Many of these bots run on cloud platforms and other hosting providers. Knowing the hostname of a given IPv4 address helps in excluding synthetic traffic from web log analysis. This isn't a complete answer to identifying bot traffic and can certainly throw up false positives but, nonetheless, it's a reasonably clear signal of the infrastructure being used.
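As a sketch of how that signal might be used, a log-analysis job could flag traffic whose resolved hostname stub belongs to a hosting provider. The stub list here is illustrative only, not a complete catalogue of cloud providers:

```python
# Illustrative set of hostname stubs associated with hosting providers.
# A real deployment would maintain a much larger, curated list.
CLOUD_STUBS = {'googleusercontent', 'amazonaws', 'linode', 'ovh'}


def likely_synthetic(hostname_stub):
    # True if the stub suggests the request came from hosted
    # infrastructure rather than a residential connection.
    return hostname_stub in CLOUD_STUBS


print(likely_synthetic('googleusercontent'))  # True
print(likely_synthetic('totinternet'))        # False
```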

Executing DNS lookups on an ad-hoc basis in order to enrich network traffic logs can introduce an unpredictable amount of network latency, and running the same query in quick succession is wasteful. Batch lookups wouldn't lend themselves well to this problem either: returning answers in real time is more valuable because of the freshness of the data and its value to network operations. It's reasonable to expect that a lookup agent would hold open a connection, or pool of connections, to a database, so for this exercise I'll look up IPv4 addresses sequentially while maintaining the same database connection.
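One way a lookup agent could avoid repeating identical queries in quick succession is to memoise answers in-process while keeping its database connection open. This is only a sketch: query_db is a hypothetical stand-in for the real database round trip, and it assumes cached answers don't need to expire.

```python
from functools import lru_cache

calls = {'count': 0}


def query_db(ip):
    # Hypothetical stand-in for a round trip over the held-open
    # database connection.
    calls['count'] += 1
    return 'totinternet'


@lru_cache(maxsize=65536)
def lookup(ip):
    return query_db(ip)


lookup('1.10.230.137')
lookup('1.10.230.137')  # served from the cache, no second round trip
print(calls['count'])   # 1
```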

My interest here is in seeing the performance differences between using PostgreSQL with a B-Tree index versus ClickHouse and its MergeTree engine for this use case.

Installing Prerequisites

The machine I'm using has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of DDR3 RAM and a 250 GB Samsung 960 EVO NVMe SSD connected via a PCIe expansion card. Below I'll install PostgreSQL 12.1 and ClickHouse 19.17.6.36 as well as a variety of other tools and libraries. The following was run on a fresh installation of Ubuntu 16.04.2 LTS.

$ sudo apt-key adv \
    --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv E0C56BD4
$ wget -qO- \
    https://www.postgresql.org/media/keys/ACCC4CF8.asc \
    | sudo apt-key add -
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" \
    | sudo tee /etc/apt/sources.list.d/clickhouse.list
$ echo "deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" \
    | sudo tee /etc/apt/sources.list.d/pgdg.list
$ sudo apt update
$ sudo apt install \
    clickhouse-client \
    clickhouse-server \
    jq \
    libpq-dev \
    pigz \
    postgresql-12 \
    postgresql-client-12 \
    postgresql-contrib \
    python-dev \
    python-pip \
    sqlite3 \
    virtualenv

I'll create a Python virtual environment with packages for PostgreSQL and ClickHouse connectivity as well as a domain name parser.

$ virtualenv ~/.lookup
$ source ~/.lookup/bin/activate
$ pip install \
    'clickhouse-driver[lz4]' \
    psycopg2 \
    tldextract

I'll launch ClickHouse Server. PostgreSQL's server was launched automatically when it was installed.

$ sudo service clickhouse-server start

I'll create a default ClickHouse client configuration file for my UNIX account. I set the password for the default ClickHouse user to root when I installed the server software.

$ mkdir -p ~/.clickhouse-client
$ vi ~/.clickhouse-client/config.xml

<config>
    <password>root</password>
</config>

The Reverse DNS Dataset

The following is an ANSI diagram of the stages the dataset will go through in this tutorial.

Original GZIP-compressed JSON from Rapid7 (11 GB)
└── JSON split into 4 files (11 GB)
    └── Four GZIP-compressed CSV files of full hostnames and
        IPv4 addresses in dot-decimal format (6 GB)
        └── Four uncompressed CSV files of domain name stubs and
            IPv4 addresses in 32-bit unsigned integer format (23 GB)
            └── SQLite3 database (28 GB excluding the index)
                └── Shrunken uncompressed CSV (0.5 GB)
                    ├── PostgreSQL database (1.4 GB)
                    └── ClickHouse Log Engine database (0.1 GB)
                        └── ClickHouse MergeTree Engine database (0.3 GB)

I'll download the Reverse DNS dataset from Rapid7. I've noticed they remove older versions of their database when newer versions are published, so in the future you will probably need to modify the URL below with a newer one found on their downloads page.

$ wget -c -O rdns.json.gz \
    https://opendata.rapid7.com/sonar.rdns_v2/2019-11-27-1574813015-rdns.json.gz

The dataset is delivered as an 11 GB, ~1.27 billion-line GZIP-compressed file with one JSON-serialised record per line. The following is an example of one record.

$ pigz -dc rdns.json.gz \
    | head -n1 \
    | python -mjson.tool

{
    "name": "1.10.230.137",
    "timestamp": "1574835472",
    "type": "ptr",
    "value": "node-k95.pool-1-10.dynamic.totinternet.net"
}

I want to process this data using four processes, each running on a separate CPU core. The dataset has around 1.27 billion lines so I'll break it into files of, at most, 320 million lines each.

$ pigz -dc rdns.json.gz \
    | split --lines=320000000 \
            --filter="pigz > rdns_\$FILE.json.gz"

These are the resulting files produced by the above. Each file is ~2.8 GB.

rdns_xaa.json.gz
rdns_xab.json.gz
rdns_xac.json.gz
rdns_xad.json.gz

I will then extract the full domain name and IPv4 address from each file in parallel and place them in GZIP-compressed, comma-delimited CSV files.
$ ls rdns_*.json.gz \
    | xargs -P4 -n1 -I {} sh -c \
        "pigz -dc {} \
            | jq '.name + \",\" + .value' \
            | sed 's/\"//g' \
            | pigz > {}.csv.gz"

The above produced 6 GB of GZIP-compressed CSV files. Here is an example of the output of the above operation.

$ gunzip -c rdns_xaa.json.gz.csv.gz | tail -n3

104.197.172.33,33.172.197.104.bc.googleusercontent.com
104.197.172.34,34.172.197.104.bc.googleusercontent.com
104.197.172.35,35.172.197.104.bc.googleusercontent.com
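For readers without jq to hand, the same per-line extraction can be sketched in Python. This assumes the record layout shown earlier and mirrors the jq expression .name + "," + .value:

```python
import json


def to_csv_line(json_line):
    # Mirror the jq expression: .name + "," + .value
    record = json.loads(json_line)
    return '%s,%s' % (record['name'], record['value'])


# The sample record from the dataset shown above.
line = ('{"name": "1.10.230.137", "timestamp": "1574835472", '
        '"type": "ptr", "value": "node-k95.pool-1-10.dynamic.totinternet.net"}')

print(to_csv_line(line))
# 1.10.230.137,node-k95.pool-1-10.dynamic.totinternet.net
```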

Extracting Domain Name Stubs

I want to remove any CNAMEs and top-level domains from the domain names. In the above example, 33.172.197.104.bc.googleusercontent.com would simply become googleusercontent. This makes it easier to group domains from entities that use a wide variety of top-level domains and removes the noise of the CNAMEs, which can be unique to each IPv4 address. Below I'll use tldextract for this task. The code below will also convert each IPv4 address from its dot-decimal format into an unsigned 32-bit integer.

$ vi domain_stub.py

import gzip
from glob import glob
from multiprocessing import Pool
from socket import inet_aton
from struct import unpack
from uuid import uuid4

from tldextract import extract as tld_ex

ip2int = lambda x: unpack("!I", inet_aton(x))[0]


def extract(manifest):
    input_file, output_file = manifest

    with open(output_file, 'a') as out:
        for line in gzip.open(input_file, 'r'):
            try:
                ip, domain = line.split(',')
                out.write('%d,%s\n' % (ip2int(ip),
                                       tld_ex(domain)[1]))
            except:
                pass


filenames = [(csv_gz, 'out.%s.csv' % str(uuid4())[:6])
             for csv_gz in glob('rdns_*.json.gz.csv.gz')]

pool = Pool(4)
pool.map(extract, filenames)

$ python domain_stub.py

The above took over a day to run on my machine. I believe the tldextract library is well-written and reasonable steps have been taken to optimise it. Nonetheless, I raised a ticket to see if there is anything that could be done to speed up performance even further. I suspect a string-focused library like this could be ported to C, C++, Rust or Go and yield greater performance.

The above produced 23 GB of uncompressed CSV files. Here is a sample from one of the resulting output files.

$ head -n3 out.7dcbe9.csv

17491593,totinternet
17491594,totinternet
17491595,totinternet
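The integer form above is produced by packing the four octets in network byte order and reading them back as a single 32-bit unsigned value, which is what the ip2int lambda in domain_stub.py does. A round-trip sketch, with int2ip added here for illustration:

```python
from socket import inet_aton, inet_ntoa
from struct import pack, unpack


def ip2int(ip):
    # '!I' = network (big-endian) byte order, unsigned 32-bit integer.
    return unpack('!I', inet_aton(ip))[0]


def int2ip(n):
    # Inverse operation, handy for debugging the converted dataset.
    return inet_ntoa(pack('!I', n))


print(ip2int('1.10.230.137'))  # 17491593, matching the sample above
print(int2ip(17491593))        # 1.10.230.137
```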

Reducing the Search Space

In the above example, totinternet is listed as the hostname stub for at least three IPv4 addresses that run in sequential order. I want to see if I can build ranges where a hostname stub is paired with the lowest and highest IPv4 addresses that run sequentially and uninterrupted. That way there will be fewer records describing the same dataset, which will ultimately shrink the search space for the lookup tasks later on in this post.

I'll import the dataset into a single table in SQLite3, apply an index and then walk through each record, ordered by the value of the IPv4 address, printing out any uninterrupted sequence for any given hostname stub.

$ sqlite3 lookup.db

CREATE TABLE ip_host (
    "ipv4"        INTEGER,
    "domain_stub" TEXT
);

The following produced a 28 GB database prior to the index being applied.

$ cat out.*.csv \
    | sqlite3 -csv \
              -separator ',' \
              lookup.db \
              '.import /dev/stdin ip_host'

$ sqlite3 lookup.db

CREATE INDEX ip_host_ipv4 ON ip_host (ipv4);

$ vi shrink.py

import sqlite3

lookup_conn = sqlite3.connect('lookup.db')
lookup_cur = lookup_conn.cursor()

sql = '''SELECT ipv4, domain_stub
         FROM ip_host
         ORDER BY ipv4'''
lookup_cur.execute(sql)

last_ip, last_domain, consecutive = 0, None, 0

for ipv4, domain_stub in lookup_cur:
    # Start a new range when the IPv4 sequence breaks or the stub changes.
    if ipv4 != last_ip + (consecutive + 1) or domain_stub != last_domain:
        if last_domain:
            print '%s,%d,%d' % (last_domain,
                                last_ip,
                                last_ip + consecutive)
        last_ip = ipv4
        last_domain = domain_stub
        consecutive = 0
    else:
        consecutive = consecutive + 1

if last_domain:
    print '%s,%d,%d' % (last_domain, last_ip, last_ip + consecutive)

$ python shrink.py > shrunken.csv

The resulting CSV file is 508 MB uncompressed and is made up of 17,784,359 lines, 71x fewer than the source dataset. Here is a sample of the file produced.

$ head -n3 shrunken.csv

one,16777217,16777217
bigredgroup,16778244,16778244
gtelecom,16778501,16778501

I'll produce a random set of records that will be used to do lookups in the benchmarks below.
This will ensure that every lookup is a hit and every query is unique.

$ sort -R shrunken.csv | head -n1000 > unique_ips
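The range-collapsing idea can also be sketched in-memory for small inputs. This stand-alone version, separate from shrink.py above, extends a run only while the rows share a stub and each address continues the previous one by exactly one:

```python
def collapse(rows):
    # rows: (ipv4, stub) pairs sorted by ipv4.
    # Returns (stub, first_ip, last_ip) for each uninterrupted run.
    ranges = []

    for ipv4, stub in rows:
        if ranges and ranges[-1][0] == stub and ranges[-1][2] + 1 == ipv4:
            ranges[-1][2] = ipv4               # extend the current run
        else:
            ranges.append([stub, ipv4, ipv4])  # start a new run

    return [tuple(r) for r in ranges]


rows = [(17491593, 'totinternet'),
        (17491594, 'totinternet'),
        (17491595, 'totinternet'),
        (17491600, 'totinternet')]

print(collapse(rows))
# [('totinternet', 17491593, 17491595), ('totinternet', 17491600, 17491600)]
```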

Populating PostgreSQL

I'll set up a PostgreSQL account and create a database that will be populated with the "shrunken" dataset.

$ sudo -u postgres \
    bash -c "psql -c \"CREATE USER mark WITH PASSWORD 'test' SUPERUSER;\""
$ createdb ip_ranges
$ psql ip_ranges

CREATE TABLE ip_ranges (
    domain_stub VARCHAR(255),
    "start"     BIGSERIAL,
    "end"       BIGSERIAL
);

\copy ip_ranges FROM 'shrunken.csv' DELIMITER ',' CSV

PostgreSQL can scan its indices both forwards and backwards at nearly the same speed, but when running a scan it cannot change direction without starting a new scan. I'll create an index on the dataset that sorts the data by the first IPv4 address in the range and then by the last IPv4 address in the range in reverse. That way, when there is a hit on the first IPv4 address in the range, the second column's hit will always come afterward, avoiding PostgreSQL having to start a second scan.

CREATE INDEX ip_ranges_inverse
    ON ip_ranges ("start", "end" DESC);

I'll then re-order the table based on the ordering in the above index.

ALTER TABLE ip_ranges
    CLUSTER ON ip_ranges_inverse;

I'll also update the statistics used by PostgreSQL's query planner.

VACUUM ANALYZE ip_ranges;

The resulting database is 1,479 MB in PostgreSQL's internal format.

\l+ ip_ranges
                                 List of databases
   Name    | Owner | Encoding |   Collate   |    Ctype    | Access privileges |  Size   | Tablespace | Description
-----------+-------+----------+-------------+-------------+-------------------+---------+------------+-------------
 ip_ranges | mark  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                   | 1479 MB | pg_default |

Below I'll be using a prepared query for the lookup process. The PREPARE statement will parse the SQL, analyse it and perform any rewriting / macro expansion of the query ahead of time, and only once during this exercise. Later on, when the EXECUTE statement is called, the query will only need to be planned and executed, saving a good amount of overhead.
This post originally didn't turn off sequential scanning; without that setting, PostgreSQL's query planner would pick a sequential scan over scanning its index. Thanks to Justin Azoff for pointing this out.

EXPLAIN EXECUTE find_stub(2884839823);

                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.10 rows=1 width=8)
   ->  Seq Scan on ip_ranges  (cost=0.00..386967.84 rows=3903622 width=8)
         Filter: ((start <= '2884839823'::bigint) AND ("end" >= '2884839823'::bigint))

SET enable_seqscan = off;

With that setting in place the planner now uses the index.

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..0.58 rows=1 width=8)
   ->  Index Scan using ip_ranges_inverse on ip_ranges  (cost=0.44..568625.21 rows=3903622 width=8)
         Index Cond: ((start <= '2884839823'::bigint) AND ("end" >= '2884839823'::bigint))

I've added this setting to the code below.

$ vi run_pg.py

from psycopg2 import connect as PG

unique_ips = \
    [int(x.split(',')[1])
     for x in open('unique_ips', 'r+b')
                 .read()
                 .strip()
                 .splitlines()]

conn = PG(database='ip_ranges')
cursor = conn.cursor()

cursor.execute('SET enable_seqscan = off;')
cursor.execute('''PREPARE find_stub AS
                  SELECT domain_stub
                  FROM ip_ranges
                  WHERE "start" <= $1
                  AND "end" >= $1
                  LIMIT 1;''')

sql = 'EXECUTE find_stub(%(ip)s);'

for ip in unique_ips:
    cursor.execute(sql, {'ip': ip})
    resp = cursor.fetchone()
    assert len(resp[0].strip()) > 0

cursor.execute('DEALLOCATE find_stub;')

$ time python run_pg.py

The above completed in 3 minutes and 41 seconds, giving a lookup rate of 16,290 per hour.
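The index scan above is, in effect, a binary search over ranges sorted by their start address: find the greatest start less than or equal to the IP, then check that the range's end covers it. A small in-memory sketch of that logic, using the sample rows from shrunken.csv:

```python
from bisect import bisect_right

# (start, end, stub) ranges sorted by start, as in the index.
ranges = [(16777217, 16777217, 'one'),
          (16778244, 16778244, 'bigredgroup'),
          (16778501, 16778501, 'gtelecom')]
starts = [r[0] for r in ranges]


def find_stub(ip):
    # Greatest start <= ip; it's a miss if ip lies beyond that range's end.
    i = bisect_right(starts, ip) - 1

    if i >= 0 and ranges[i][1] >= ip:
        return ranges[i][2]

    return None


print(find_stub(16778244))  # bigredgroup
print(find_stub(16777300))  # None, falls between ranges
```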

Populating ClickHouse

I'll first load the CSV data into a Log Engine table in ClickHouse.

$ clickhouse-client

CREATE TABLE ip_ranges_log (
    domain_stub String,
    start       UInt32,
    end         UInt32
) ENGINE = Log;

$ cat shrunken.csv \
    | clickhouse-client \
        --query="INSERT INTO ip_ranges_log FORMAT CSV"

I'll then produce a MergeTree Engine table which will convert the row-oriented data from the previous table into a form that is faster to search against. The MergeTree engine demands a date to partition the data against, so I've picked a placeholder date of January 1st, 1970.

$ clickhouse-client

CREATE TABLE ip_ranges
    ENGINE = MergeTree(const_date, (start, end), 8192)
    AS SELECT
        toDate('1970-01-01') AS const_date,
        domain_stub,
        start,
        end
    FROM ip_ranges_log;

The Log Engine table is 166 MB in ClickHouse's internal format and the MergeTree Engine table is 283 MB.

I couldn't find an equivalent of PostgreSQL's prepared statements in ClickHouse so the following will simply execute a SELECT statement in full for each and every record.

$ vi run_ch.py

from clickhouse_driver import Client as CH

unique_ips = \
    [int(x.split(',')[1])
     for x in open('unique_ips', 'r+b')
                 .read()
                 .strip()
                 .splitlines()]

client = CH('localhost',
            port=9000,
            password='root',
            secure=False,
            verify=False,
            compression=False)

sql = '''SELECT domain_stub
         FROM ip_ranges
         WHERE start <= %d
         AND end >= %d
         LIMIT 1'''

for ip in unique_ips:
    resp = client.execute(sql % (ip, ip))[0]
    assert len(resp[0].strip()) > 0

$ time python run_ch.py

The above completed in 9.19 seconds, giving a lookup rate of just under 392,000 per hour. That is 24x faster than PostgreSQL.
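The rates quoted in this post follow from the wall-clock times: 1,000 lookups scaled up to an hour. Checking the arithmetic:

```python
def hourly_rate(lookups, seconds):
    # Scale a benchmark run of N lookups over S seconds to a per-hour rate.
    return lookups / seconds * 3600


pg_rate = hourly_rate(1000, 221)    # 3 minutes 41 seconds = 221 seconds
ch_rate = hourly_rate(1000, 9.19)

print(round(pg_rate))    # 16290
print(round(ch_rate))    # 391730
print(round(221 / 9.19)) # 24
```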