UPDATE: Since writing this blog post I've had other developers get in touch with ideas for improving the script described in this post. I've created a repo on GitHub where pull requests can be submitted.

I recently published a blog post on finding the fastest way to look up the country mapping of any given IP address. Within a day I found some interesting insights from gigarray on /r/Python:

"MaxMind's data is widely known to be fairly garbage." followed by "converting the IP address ... to an assigned block and doing a single lookup for that block ... it converts 4 billion IPv4 addresses down to 65K ASNs"

With that I thought, "How hard could it be to scrape WHOIS databases for all known IPv4 addresses?" If I only needed to make a little more than 65,000 WHOIS queries then it shouldn't take long to see the world's IPv4 mappings and have some interesting data to analyse.

Making the problem ~66,153 times smaller

When querying WHOIS for an IP address you often get back the net range the IP address sits in:

$ whois 24.0.0.0
...
NetRange:       24.0.0.0 - 24.15.255.255

When you see that 24.0.0.0 - 24.15.255.255 is the net range, you can make your next query for 24.16.0.0 instead of 24.0.0.1.
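Two small helpers make that jump possible: one to turn a dotted-quad address into an integer and one to find the last address in a net range. The script's own versions aren't shown in this post, so here is a minimal sketch of how they could look (the names ip2long and get_netrange_end match the ones used in the code further down, but the bodies are my reconstruction):

import socket
import struct


def ip2long(ip_address):
    """Convert a dotted-quad IPv4 address, e.g. '24.0.0.0', to an integer."""
    return struct.unpack('!L', socket.inet_aton(ip_address))[0]


def get_netrange_end(network_cidr):
    """Return the last IPv4 address in a CIDR block.

    >>> get_netrange_end('24.0.0.0/12')
    '24.15.255.255'
    """
    base_ip, prefix = network_cidr.split('/')
    block_size = 2 ** (32 - int(prefix))
    last_address = ip2long(base_ip) + block_size - 1
    return socket.inet_ntoa(struct.pack('!L', last_address))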

Parsing WHOIS records

WHOIS records look uniform but there are many differences between them.

$ whois 24.0.0.0
...
NetRange:       24.0.0.0 - 24.15.255.255
CIDR:           24.0.0.0/12
NetName:        EASTERNSHORE-1
NetHandle:      NET-24-0-0-0-1
Parent:         NET24 (NET-24-0-0-0-0)
NetType:        Direct Allocation
OriginAS:
Organization:   Comcast Cable Communications, Inc. (CMCS)
RegDate:        2003-10-06
Updated:        2012-03-02
Comment:        ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE
Ref:            http://whois.arin.net/rest/net/NET-24-0-0-0-1
...

So the first library I looked for was one that could perform the WHOIS query and parse it; ipwhois does just that. Looking at some of its code I could see it handled a lot of edge cases when parsing records and returned the result as a dictionary.

In [1]: from ipwhois import IPWhois

In [2]: IPWhois('24.24.24.24').lookup_rws()
Out[2]:
{'asn': '11351',
 'asn_cidr': '24.24.0.0/18',
 'asn_country_code': 'US',
 'asn_date': '2000-06-09',
 'asn_registry': 'arin',
 'nets': [{'abuse_emails': 'abuse@rr.com',
   'address': '13820 Sunrise Valley Dr',
   'cidr': '24.24.0.0/14, 24.28.0.0/15',
   'city': 'Herndon',
   'country': 'US',
   'created': '2000-06-09T00:00:00-04:00',
   'description': 'Time Warner Cable Internet LLC',
   'handle': u'NET-24-24-0-0-1',
   'misc_emails': None,
   'name': 'ROAD-RUNNER-1',
   'postal_code': '20171',
   'range': u'24.24.0.0 - 24.29.255.255',
   'state': 'VA',
   'tech_emails': 'abuse@rr.com',
   'updated': '2011-07-06T16:44:52-04:00'}],
 'query': '24.24.24.24',
 'raw': None}
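With a parsed record like this, the crawler can read the net range out of the response and skip straight past it. Here is a rough sketch of that glue, assuming the first net's 'range' field (or, failing that, the 'asn_cidr' field) covers the address that was just queried; it leans on the ip2long/get_netrange_end helpers sketched above and the get_next_ip function shown later in this post:

def next_ip_after_record(whois_result):
    """Return the first address after the net range covered by an ipwhois
    lookup result, or None if no usable range was returned."""
    nets = whois_result.get('nets') or []

    if nets and nets[0].get('range'):
        # e.g. '24.24.0.0 - 24.29.255.255'
        last_address = nets[0]['range'].split('-')[-1].strip()
        return get_next_ip(last_address)

    asn_cidr = whois_result.get('asn_cidr')

    if asn_cidr and asn_cidr != 'NA':
        # e.g. '24.24.0.0/18'
        return get_next_ip(get_netrange_end(asn_cidr))

    return None

Being defensive here matters: as shown further down, some lookups come back with an empty nets list and an asn_cidr of 'NA'.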

4.3 billion addresses but not all are for public use

Next I needed to make sure I didn't waste queries on IP ranges that would never return a proper result. 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 are well known for being private network addresses, 127.0.0.0/8 is for loopback, 224.0.0.0/4 is for multicast, the first and last 256 addresses in the 169.254.0.0/16 block are 'reserved for future use', the list goes on... I found the ipaddr.py module had a decent list of defined networks so I built my list from there:

import socket
import struct

import ipcalc


def get_next_ip(ip_address):
    """
    :param str ip_address: ipv4 address
    :return: next ipv4 address
    :rtype: str

    >>> get_next_ip('0.0.0.0')
    '0.0.0.1'
    >>> get_next_ip('24.24.24.24')
    '24.24.24.25'
    >>> get_next_ip('24.24.255.255')
    '24.25.0.0'
    >>> get_next_ip('255.255.255.255') is None
    True
    """
    assert ip_address.count('.') == 3, \
        'Must be an IPv4 address in str representation'

    if ip_address == '255.255.255.255':
        return None

    try:
        return socket.inet_ntoa(struct.pack('!L', ip2long(ip_address) + 1))
    except Exception, error:
        print 'Unable to get next IP for %s' % ip_address
        raise error


def get_next_undefined_address(ip):
    """
    Get the next non-private IPv4 address if the address sent is private

    :param str ip: IPv4 address
    :return: ipv4 address of next non-private address
    :rtype: str

    >>> get_next_undefined_address('0.0.0.0')
    '1.0.0.0'
    >>> get_next_undefined_address('24.24.24.24')
    '24.24.24.24'
    >>> get_next_undefined_address('127.0.0.1')
    '128.0.0.0'
    >>> get_next_undefined_address('255.255.255.256') is None
    True
    """
    try:
        # Should weed out many invalid IP addresses
        ipcalc.Network(ip)
    except ValueError, error:
        return None

    defined_networks = (
        '0.0.0.0/8',
        '10.0.0.0/8',
        '127.0.0.0/8',
        '169.254.0.0/16',
        '172.16.0.0/12',
        '192.0.0.0/24',
        '192.0.2.0/24',
        '192.88.99.0/24',
        '192.168.0.0/16',
        '198.18.0.0/15',
        '198.51.100.0/24',
        '203.0.113.0/24',
        '224.0.0.0/4',
        '240.0.0.0/4',
        '255.255.255.255/32',
    )

    for network_cidr in defined_networks:
        if ip in ipcalc.Network(network_cidr):
            return get_next_ip(get_netrange_end(network_cidr))

    return ip

Now I could start from 0.0.0.0 and work my way up to 255.255.255.255. But before querying each IP address I check to see if it's defined and, if it is, get back the next undefined address:

>>> get_next_undefined_address('0.0.0.0')
'1.0.0.0'

The ipcalc module came in handy when checking whether an IP address sits within a CIDR-defined range.
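As used in the loop above, ipcalc.Network supports Python's in operator, so the membership check reads naturally:

>>> import ipcalc
>>> '10.1.2.3' in ipcalc.Network('10.0.0.0/8')
True
>>> '11.0.0.0' in ipcalc.Network('10.0.0.0/8')
False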

How many IPs are unassigned?

One problem I came up against was that when I found a range of unassigned IP addresses I wasn't told how large the range was. So when I hit 192.0.1.0 I just had to keep walking through addresses one at a time until I found an assigned address again. This took a long time and felt unproductive.

$ whois 192.0.1.0

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#

No match found for n + 192.0.1.0.

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#

My script would just print out the following and try the next IP address:

Missing ASN CIDR in whois resp: {'asn_registry': 'arin', 'asn_date': '', 'asn_country_code': '', 'raw': None, 'asn_cidr': 'NA', 'query': '192.0.1.86', 'nets': [], 'asn': 'NA'}

To minimise the amount of time spent where no data was being collected I decided to break up the job. My script accepts a number of threads and divides the IPv4 address space into equal chunks, one per thread.

def break_up_ipv4_address_space(num_threads=8):
    """
    >>> break_up_ipv4_address_space() == \
    [('0.0.0.0', '31.255.255.255'), ('32.0.0.0', '63.255.255.255'),\
    ('64.0.0.0', '95.255.255.255'), ('96.0.0.0', '127.255.255.255'),\
    ('128.0.0.0', '159.255.255.255'), ('160.0.0.0', '191.255.255.255'),\
    ('192.0.0.0', '223.255.255.255'), ('224.0.0.0', '255.255.255.255')]
    True
    """
    ranges = []
    multiplier = 256 / num_threads

    for marker in range(0, num_threads):
        starting_class_a = (marker * multiplier)
        ending_class_a = ((marker + 1) * multiplier) - 1
        ranges.append(('%d.0.0.0' % starting_class_a,
                       '%d.255.255.255' % ending_class_a))

    return ranges

gevent is used to launch each thread asynchronously:

threads = [gevent.spawn(get_netranges, starting_ip, ending_ip, ...)
           for starting_ip, ending_ip in break_up_ipv4_address_space(num_threads)]
gevent.joinall(threads)
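The get_netranges function handed to gevent.spawn isn't shown above. Roughly, each worker walks its slice of the address space, skips defined ranges, queries WHOIS and then jumps past whatever block comes back. Here is a simplified sketch of that loop, leaning on the helpers above (the sleep_seconds parameter and the broad exception handling are my own additions, and the real script also stores each block at the commented point rather than just skipping ahead):

import time

from ipwhois import IPWhois


def get_netranges(starting_ip, ending_ip, sleep_seconds=1):
    current_ip = starting_ip

    while current_ip is not None and ip2long(current_ip) <= ip2long(ending_ip):
        # Skip over private / reserved ranges.
        current_ip = get_next_undefined_address(current_ip)

        if current_ip is None:
            break

        try:
            whois_resp = IPWhois(current_ip).lookup_rws()
        except Exception:
            # Couldn't look this address up; move on to the next one.
            current_ip = get_next_ip(current_ip)
            continue

        if not whois_resp.get('asn_cidr') or whois_resp['asn_cidr'] == 'NA':
            print 'Missing ASN CIDR in whois resp: %s' % whois_resp
            current_ip = get_next_ip(current_ip)
            continue

        # Record the block here (e.g. index it into Elasticsearch), then
        # jump straight past the end of the range that was returned.
        current_ip = get_next_ip(get_netrange_end(whois_resp['asn_cidr']))

        time.sleep(sleep_seconds)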

What was collected?

I stored the various items of information ipwhois returned in Elasticsearch, along with the starting and ending IP address for each range and the number of addresses within each range. I then created a small method to show (up to) the top 10 countries and cities by number of IP addresses assigned to networks within them, along with the respective address counts. I didn't run the scrape across the whole of the IPv4 space as this was just an experiment. The following stats come from just a few minutes of collected data:

$ ./whois.py stats http://127.0.0.1:9200/ netblocks

Top 10 netblock locations by country
67,836,672 us
327,680 eu
73,728 ca
65,536 gb
65,536 ie
32,768 th
20,480 jp
15,872 cn
6,656 ro
2,048 dk

Top 10 netblock locations by city
16,842,752 columbus
16,785,408 houston
16,777,216 lake mary
16,252,928 ann arbor
524,288 littleton
262,144 herdon
131,072 nashville
131,072 sioux falls
65,536 toronto
61,184 spanish fork
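The stats method itself isn't shown, but the top-10 listings above map naturally onto an Elasticsearch terms aggregation that sums the address count per country or city. Here is a rough sketch using the official elasticsearch Python client, assuming each indexed document carries country, city and num_addresses fields (those field names are my guess, not necessarily what the script used):

from elasticsearch import Elasticsearch


def top_netblock_locations(es_host, index_name, field='country', size=10):
    es = Elasticsearch([es_host])

    resp = es.search(index=index_name, body={
        'size': 0,
        'aggs': {
            'by_location': {
                'terms': {
                    'field': field,
                    'size': size,
                    'order': {'total_addresses': 'desc'},
                },
                'aggs': {
                    'total_addresses': {'sum': {'field': 'num_addresses'}},
                },
            },
        },
    })

    for bucket in resp['aggregations']['by_location']['buckets']:
        print '%11s %s' % ('{:,}'.format(int(bucket['total_addresses']['value'])),
                           bucket['key'])

Calling top_netblock_locations('http://127.0.0.1:9200/', 'netblocks') and then the same with field='city' would produce listings in the same shape as the output above.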

There probably is a better way of doing this

I wouldn't be surprised if someone already provides a data dump of all WHOIS records for the IPv4 space somewhere online. In 2014, this seems like a lot of effort just to see the state of IPv4 assignments. There were a lot of edge cases I came up against and I'm in deep admiration of anyone who can scrape this data consistently, quickly and reliably. I time-boxed my efforts on this code to one day so it's far from a shining example of what I can do when I'm at my best. I welcome any feedback or suggestions on the code.