Now that browser plugins blocking JavaScript-based tracking beacons enjoy a nine-figure user base, web server traffic logs can be a better place to get a feel for how many people are visiting your website. But anyone who has watched a web traffic log for more than a few minutes knows there is an army of bots crawling websites, and separating bot-generated from human-generated traffic in those logs can be challenging.

In this blog post I'll walk through the steps I took to build a bot detection script based on IPv4 ownership and browser string analysis.

The code used in this blog can be found in this gist.

IP Address Ownership Databases

I'll first install Python and some dependencies. The following was run on a fresh Ubuntu 14.04.3 LTS installation.

```bash
$ sudo apt update
$ sudo apt install \
    python-dev \
    python-pip \
    python-virtualenv
```

I'll then create a Python virtual environment and activate it. This should ease any permissions issues when installing libraries via pip.

```bash
$ virtualenv findbots
$ source findbots/bin/activate
```

MaxMind offer a free database of country- and city-level registration information for IPv4 addresses. Alongside this dataset they've released a Python library called "geoip2" that maps their datasets into memory-mapped files and uses a C-based Python extension to perform very fast lookups. The following will install their library and then download and unpack their city-level dataset.

```bash
$ pip install geoip2
$ curl -O http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz
$ gunzip GeoLite2-City.mmdb.gz
```

I had a look at some web traffic logs and grep'ed out hits where "robots.txt" was being requested. From that list I spot-checked some of the more frequently-appearing IP addresses and found a number of hosting and cloud providers listed as their owners. I wanted to see if it was possible to put together a list, however incomplete, of the IPv4 addresses under these providers' ownership.

Google have a DNS-based mechanism for publishing the IP addresses they use for their cloud offering. This first call will give you a list of hosts to query.

```bash
$ dig -t txt _cloud-netblocks.googleusercontent.com | grep spf
```

```
_cloud-netblocks.googleusercontent.com. 5 IN TXT "v=spf1 include:_cloud-netblocks1.googleusercontent.com include:_cloud-netblocks2.googleusercontent.com include:_cloud-netblocks3.googleusercontent.com include:_cloud-netblocks4.googleusercontent.com include:_cloud-netblocks5.googleusercontent.com ?all"
```

The above states that _cloud-netblocks[1-5].googleusercontent.com will each contain SPF records listing the IPv4 and IPv6 CIDR blocks they use. Querying all five hosts like the following should give you an up-to-date listing.

```bash
$ dig -t txt _cloud-netblocks1.googleusercontent.com | grep spf
```

```
_cloud-netblocks1.googleusercontent.com. 5 IN TXT "v=spf1 ip4:8.34.208.0/20 ip4:8.35.192.0/21 ip4:8.35.200.0/23 ip4:108.59.80.0/20 ip4:108.170.192.0/20 ip4:108.170.208.0/21 ip4:108.170.216.0/22 ip4:108.170.220.0/23 ip4:108.170.222.0/24 ?all"
```

Last March I published a blog post in which I attempted to scrape WHOIS details for the entirety of the IPv4 address space using a Hadoop-based MapReduce job. The job ran for about two hours before terminating prematurely, leaving me with an incomplete but still sizeable dataset of 235,532 WHOIS records. The dataset is a year old now, so while somewhat dated it should still prove valuable.

```bash
$ ls -l
```

```
-rw-rw-r-- 1 mark mark 5946203 Mar 31  2016 part-00001
-rw-rw-r-- 1 mark mark 5887326 Mar 31  2016 part-00002
...
-rw-rw-r-- 1 mark mark 6187219 Mar 31  2016 part-00154
-rw-rw-r-- 1 mark mark 5961162 Mar 31  2016 part-00155
```

When I spot-checked the IP ownership of bots hitting "robots.txt", six firms came up a lot in addition to Google: Amazon, Baidu, Digital Ocean, Hetzner, Linode and New Dream Network. I ran the following commands to try and pick out their IPv4 WHOIS records.
```bash
$ grep -i 'amazon' part-00* > amzn
$ grep -i 'baidu' part-00* > baidu
$ grep -i 'digital ocean' part-00* > digital_ocean
$ grep -i 'hetzner' part-00* > hetzner
$ grep -i 'linode' part-00* > linode
$ grep -i 'new dream network' part-00* > dream
```

The six files above contain double-encoded JSON strings with the file name and a frequency count embedded in each line, which needed parsing out. I used the following code in iPython to get the distinct CIDR blocks.

```python
import json


def parse_cidrs(filename):
    recs = []

    for line in open(filename, 'r+b'):
        try:
            recs.append(json.loads(json.loads(
                ':'.join(line.split('\t')[0].split(':')[1:]))))
        except ValueError:
            continue

    return set([str(rec.get('network', {}).get('cidr', None))
                for rec in recs])


for _name in ['amzn', 'baidu', 'digital_ocean',
              'hetzner', 'linode', 'dream']:
    print _name, parse_cidrs(_name)
```

Here is an example WHOIS record once it's been cleaned up. I've truncated out the contact information.

```json
{
    "asn": "38365",
    "asn_cidr": "182.61.0.0/18",
    "asn_country_code": "CN",
    "asn_date": "2010-02-25",
    "asn_registry": "apnic",
    "entities": ["IRT-CNNIC-CN", "SD753-AP"],
    "network": {
        "cidr": "182.61.0.0/16",
        "country": "CN",
        "end_address": "182.61.255.255",
        "events": [
            {
                "action": "last changed",
                "actor": null,
                "timestamp": "2014-09-28T05:44:22Z"
            }
        ],
        "handle": "182.61.0.0 - 182.61.255.255",
        "ip_version": "v4",
        "links": [
            "http://rdap.apnic.net/ip/182.0.0.0/8",
            "http://rdap.apnic.net/ip/182.61.0.0/16"
        ],
        "name": "Baidu",
        "parent_handle": "182.0.0.0 - 182.255.255.255",
        "raw": null,
        "remarks": [
            {
                "description": "Beijing Baidu Netcom Science and Technology Co., Ltd...",
                "links": null,
                "title": "description"
            }
        ],
        "start_address": "182.61.0.0",
        "status": null,
        "type": "ALLOCATED PORTABLE"
    },
    "query": "182.61.48.129",
    "raw": null
}
```

The list of seven firms isn't an exhaustive list of where bot traffic originates from.
I found a lot of bot traffic coming from residential IPs in Ukraine and from Chinese IPs whose owning organisation is difficult to determine, in addition to a distributed army of bots connecting from all around the world. If I wanted a more exhaustive list of IPs used by bots I could look into HTTP header ordering, examine TCP/IP behaviour, hunt down forged IP registrations (see page 28) and so on; to be honest, it's a bit of a cat-and-mouse game.
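As an aside, pulling the CIDR blocks out of SPF records like the Google ones shown earlier doesn't need anything beyond a regular expression. A minimal sketch (the record text below is a shortened copy of the earlier dig output):

```python
import re


def spf_cidrs(txt_record):
    # Pull every ip4:/ip6: CIDR block out of an SPF TXT record.
    return re.findall(r'ip[46]:(\S+)', txt_record)


record = ('v=spf1 ip4:8.34.208.0/20 ip4:8.35.192.0/21 '
          'ip4:8.35.200.0/23 ip4:108.59.80.0/20 ?all')

print(spf_cidrs(record))
# → ['8.34.208.0/20', '8.35.192.0/21', '8.35.200.0/23', '108.59.80.0/20']
```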

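Worth noting: CIDR lists gathered this way contain overlapping prefixes, such as a /17 sitting inside an already-listed /16. They work fine as-is for membership tests, but they can be collapsed first to avoid redundant lookups. A small sketch using Python 3's stdlib ipaddress module (the script later in this post targets Python 2 and netaddr instead) against a couple of the Digital Ocean blocks:

```python
from ipaddress import collapse_addresses, ip_network

cidrs = ['188.166.0.0/16', '188.166.0.0/17',
         '46.101.0.0/17', '46.101.128.0/17']

# Merge adjacent networks and drop blocks already covered by a wider one.
merged = [str(n) for n in collapse_addresses(ip_network(c) for c in cidrs)]
print(merged)
# → ['46.101.0.0/16', '188.166.0.0/16']
```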
Installing Libraries

For this project I'll be using a number of well-written libraries. Apache Log Parser can parse lines in Apache- and Nginx-generated traffic logs. The library supports parsing over 30 different types of information from log files and I've found it remarkably flexible and reliable. Python User Agents can parse user agent strings as well as perform some basic classification of the agent being used. Colorama assists in creating colourful ANSI output. Netaddr is a mature and well-maintained network address manipulation library.

```bash
$ pip install \
    -e git+https://github.com/rory/apache-log-parser.git#egg=apache-log-parser \
    -e git+https://github.com/selwin/python-user-agents.git#egg=python-user-agents \
    colorama \
    netaddr
```
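To give a feel for the kind of information the log parser extracts, here's a rough stdlib-only approximation of parsing one line in the common "combined" log format. The regex and log line are made up for illustration; the real library handles far more directives and edge cases than this.

```python
import re

# A hypothetical, simplified matcher for the "combined" log format.
COMBINED = re.compile(
    r'(?P<remote_host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

line = ('54.208.100.253 - - [10/Mar/2017:13:38:07 +0000] '
        '"GET /robots.txt HTTP/1.1" 200 124 "-" '
        '"Mozilla/5.0 (compatible; bingbot/2.0)"')

hit = COMBINED.match(line).groupdict()
print(hit['remote_host'], hit['status'], hit['user_agent'])
```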

The Bot Monitoring Script

The following walks through the contents of monitor.py. The script accepts web traffic logs piped in from stdin, which means you can tail a log on a remote server via ssh and run this script locally.

I'll first import two libraries from the Python Standard Library and the five external libraries installed via pip.

```python
import sys
from urlparse import urlparse

import apache_log_parser
from colorama import Back, Style
import geoip2.database
from netaddr import IPNetwork, IPAddress
from user_agents import parse
```

In the following I've set up MaxMind's geoip2 library to use the "GeoLite2-City.mmdb" city-level database. I've also set up apache_log_parser to work with the format my web logs are stored in. Your log format may vary, so please take the time to compare your web server's traffic logging configuration against the library's format documentation. Finally, I have a dictionary of the CIDR blocks I found to be owned by the seven firms. Included in this list is Baidu, which isn't a hosting or cloud provider per se but nonetheless runs bots that haven't always identified themselves by their user agent.

```python
reader = geoip2.database.Reader('GeoLite2-City.mmdb')

_format = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
line_parser = apache_log_parser.make_parser(_format)

CIDRS = {
    'Amazon': ['107.20.0.0/14', '122.248.192.0/19', '122.248.224.0/19',
               '172.96.96.0/20', '174.129.0.0/16', '175.41.128.0/19',
               '175.41.160.0/19', '175.41.192.0/19', '175.41.224.0/19',
               '176.32.120.0/22', '176.32.72.0/21', '176.34.0.0/16',
               '176.34.144.0/21', '176.34.224.0/21', '184.169.128.0/17',
               '184.72.0.0/15', '185.48.120.0/26', '207.171.160.0/19',
               '213.71.132.192/28', '216.182.224.0/20', '23.20.0.0/14',
               '46.137.0.0/17', '46.137.128.0/18', '46.51.128.0/18',
               '46.51.192.0/20', '50.112.0.0/16', '50.16.0.0/14',
               '52.0.0.0/11', '52.192.0.0/11', '52.192.0.0/15',
               '52.196.0.0/14', '52.208.0.0/13', '52.220.0.0/15',
               '52.28.0.0/16', '52.32.0.0/11', '52.48.0.0/14',
               '52.64.0.0/12', '52.67.0.0/16', '52.68.0.0/15',
               '52.79.0.0/16', '52.80.0.0/14', '52.84.0.0/14',
               '52.88.0.0/13', '54.144.0.0/12', '54.160.0.0/12',
               '54.176.0.0/12', '54.184.0.0/14', '54.188.0.0/14',
               '54.192.0.0/16', '54.193.0.0/16', '54.194.0.0/15',
               '54.196.0.0/15', '54.198.0.0/16', '54.199.0.0/16',
               '54.200.0.0/14', '54.204.0.0/15', '54.206.0.0/16',
               '54.207.0.0/16', '54.208.0.0/15', '54.210.0.0/15',
               '54.212.0.0/15', '54.214.0.0/16', '54.215.0.0/16',
               '54.216.0.0/15', '54.218.0.0/16', '54.219.0.0/16',
               '54.220.0.0/16', '54.221.0.0/16', '54.224.0.0/12',
               '54.228.0.0/15', '54.230.0.0/15', '54.232.0.0/16',
               '54.234.0.0/15', '54.236.0.0/15', '54.238.0.0/16',
               '54.239.0.0/17', '54.240.0.0/12', '54.242.0.0/15',
               '54.244.0.0/16', '54.245.0.0/16', '54.247.0.0/16',
               '54.248.0.0/15', '54.250.0.0/16', '54.251.0.0/16',
               '54.252.0.0/16', '54.253.0.0/16', '54.254.0.0/16',
               '54.255.0.0/16', '54.64.0.0/13', '54.72.0.0/13',
               '54.80.0.0/12', '54.72.0.0/15', '54.79.0.0/16',
               '54.88.0.0/16', '54.93.0.0/16', '54.94.0.0/16',
               '63.173.96.0/24', '72.21.192.0/19', '75.101.128.0/17',
               '79.125.64.0/18', '96.127.0.0/17'],
    'Baidu': ['180.76.0.0/16', '119.63.192.0/21',
              '106.12.0.0/15', '182.61.0.0/16'],
    'DO': ['104.131.0.0/16', '104.236.0.0/16', '107.170.0.0/16',
           '128.199.0.0/16', '138.197.0.0/16', '138.68.0.0/16',
           '139.59.0.0/16', '146.185.128.0/21', '159.203.0.0/16',
           '162.243.0.0/16', '178.62.0.0/17', '178.62.128.0/17',
           '188.166.0.0/16', '188.166.0.0/17', '188.226.128.0/18',
           '188.226.192.0/18', '45.55.0.0/16', '46.101.0.0/17',
           '46.101.128.0/17', '82.196.8.0/21', '95.85.0.0/21',
           '95.85.32.0/21'],
    'Dream': ['173.236.128.0/17', '205.196.208.0/20', '208.113.128.0/17',
              '208.97.128.0/18', '67.205.0.0/18'],
    'Google': ['104.154.0.0/15', '104.196.0.0/14', '107.167.160.0/19',
               '107.178.192.0/18', '108.170.192.0/20', '108.170.208.0/21',
               '108.170.216.0/22', '108.170.220.0/23', '108.170.222.0/24',
               '108.59.80.0/20', '130.211.128.0/17', '130.211.16.0/20',
               '130.211.32.0/19', '130.211.4.0/22', '130.211.64.0/18',
               '130.211.8.0/21', '146.148.16.0/20', '146.148.2.0/23',
               '146.148.32.0/19', '146.148.4.0/22', '146.148.64.0/18',
               '146.148.8.0/21', '162.216.148.0/22', '162.222.176.0/21',
               '173.255.112.0/20', '192.158.28.0/22', '199.192.112.0/22',
               '199.223.232.0/22', '199.223.236.0/23', '208.68.108.0/23',
               '23.236.48.0/20', '23.251.128.0/19', '35.184.0.0/14',
               '35.188.0.0/15', '35.190.0.0/17', '35.190.128.0/18',
               '35.190.192.0/19', '35.190.224.0/20', '8.34.208.0/20',
               '8.35.192.0/21', '8.35.200.0/23'],
    'Hetzner': ['129.232.128.0/17', '129.232.156.128/28', '136.243.0.0/16',
                '138.201.0.0/16', '144.76.0.0/16', '148.251.0.0/16',
                '176.9.12.192/28', '176.9.168.0/29', '176.9.24.0/27',
                '176.9.72.128/27', '178.63.0.0/16', '178.63.120.64/27',
                '178.63.156.0/28', '178.63.216.0/29', '178.63.216.128/29',
                '178.63.48.0/26', '188.40.0.0/16', '188.40.108.64/26',
                '188.40.132.128/26', '188.40.144.0/24', '188.40.48.0/26',
                '188.40.48.128/26', '188.40.72.0/26', '196.40.108.64/29',
                '213.133.96.0/20', '213.239.192.0/18', '41.203.0.128/27',
                '41.72.144.192/29', '46.4.0.128/28', '46.4.192.192/29',
                '46.4.84.128/27', '46.4.84.64/27', '5.9.144.0/27',
                '5.9.192.128/27', '5.9.240.192/27', '5.9.252.64/28',
                '78.46.0.0/15', '78.46.24.192/29', '78.46.64.0/19',
                '85.10.192.0/20', '85.10.228.128/29', '88.198.0.0/16',
                '88.198.0.0/20'],
    'Linode': ['104.200.16.0/20', '109.237.24.0/22', '139.162.0.0/16',
               '172.104.0.0/15', '173.255.192.0/18', '178.79.128.0/21',
               '198.58.96.0/19', '23.92.16.0/20', '45.33.0.0/17',
               '45.56.64.0/18', '45.79.0.0/16', '50.116.0.0/18',
               '80.85.84.0/23', '96.126.96.0/19'],
}
```

I've created a utility function where I can pass in an IPv4 address and a list of CIDR blocks and it'll tell me if the IP address belongs to any of those blocks.

```python
def in_block(ip, block):
    _ip = IPAddress(ip)
    return any([True for cidr in block
                if _ip in IPNetwork(cidr)])
```

The following function takes objects representing the request and the browser agent used and tries to determine if the source of traffic and/or the browser agent is that of a bot. The browser agent object is put together by the Python User Agents library and already has some tests for determining if a user agent string is that of a known bot. I've further expanded these tests with a number of tokens I saw slip through the library's classification system. I also iterate through the CIDR blocks to see if the remote host's IPv4 address is within any of them.

```python
def bot_test(req, agent):
    ua_tokens = ['daum/',  # Daum Communications Corp.
                 'gigablastopensource',
                 'go-http-client',
                 'http://',
                 'httpclient',
                 'https://',
                 'libwww-perl',
                 'phantomjs',
                 'proxy',
                 'python',
                 'sitesucker',
                 'wada.vn',
                 'webindex',
                 'wget']

    is_bot = agent.is_bot or \
             any([True for cidr in CIDRS.values()
                  if in_block(req['remote_host'], cidr)]) or \
             any([True for token in ua_tokens
                  if token in agent.ua_string.lower()])

    return is_bot
```

The following is the main section of the script. Here web traffic logs are read in from stdin line by line.
Each line is parsed into tokenised objects representing the request, the user agent and the URI being requested. These objects make it easier to work with the data without the complexity of having to parse it on the fly. I attempt to look up the city and country associated with the IPv4 address using MaxMind's library; if there is any sort of failure these are simply set to None. After the bot test I prepare the output. If the request is seen to be that of a bot the output is highlighted with a red background.

```python
if __name__ == '__main__':
    while True:
        try:
            line = sys.stdin.readline()
        except KeyboardInterrupt:
            break

        if not line:
            break

        req = line_parser(line)
        agent = parse(req['request_header_user_agent'])
        uri = urlparse(req['request_url'])

        try:
            response = reader.city(req['remote_host'])
            country, city = response.country.iso_code, response.city.name
        except:
            country, city = None, None

        is_bot = bot_test(req, agent)

        agent_str = ', '.join([item
                               for item in agent.browser[0:3] +
                                           agent.device[0:3] +
                                           agent.os[0:3]
                               if item is not None and
                                  type(item) is not tuple and
                                  len(item.strip()) and
                                  item != 'Other'])

        ip_owner_str = ' '.join([network + ' IP'
                                 for network, cidr in CIDRS.iteritems()
                                 if in_block(req['remote_host'], cidr)])

        print Back.RED + 'b' if is_bot else 'h', \
              country, \
              city, \
              uri.path, \
              agent_str, \
              ip_owner_str, \
              Style.RESET_ALL
```
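The CIDR membership test at the heart of the script can also be reproduced without the netaddr dependency. A sketch using Python 3's stdlib ipaddress module (the helper name is mine, not the script's, and the second address is a documentation-range example):

```python
from ipaddress import ip_address, ip_network


def in_any_block(ip, cidrs):
    # True if the IPv4 address falls inside any of the given CIDR blocks.
    _ip = ip_address(ip)
    return any(_ip in ip_network(cidr) for cidr in cidrs)


print(in_any_block('182.61.48.129', ['180.76.0.0/16', '182.61.0.0/16']))  # True
print(in_any_block('203.0.113.7', ['180.76.0.0/16', '182.61.0.0/16']))    # False
```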