One of my Automating OSINT students, Michael Rossi (@RossiMI01), pinged me with an interesting challenge. He had mentioned that the Common Crawl project is an excellent source of OSINT, as you can explore any page snapshots they have stored for a target domain. Michael wanted to take this a step further and mine out all external links from the returned HTML. This can help you find relationships between one domain and another, and potentially discover links to social media accounts or other websites of interest. This blog post walks you through how to automate the retrieval of external links for a target domain whose pages have been stored in Common Crawl. Here we go!

What is Common Crawl?

Common Crawl is a gigantic dataset that is created by crawling the web. They provide the data as a (gigantic) download, or you can query their indices and retrieve only the information you are after. It is also 100% free, which makes it even more awesome.

They provide an API that allows you to query a particular index (each index is a snapshot of one crawling run) for a particular domain, and it returns results that point you to where the actual HTML content for that snapshot lives. The API documentation can be found here.
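To make the query format concrete, here is a minimal sketch of how an index query URL can be put together. It uses the same endpoint and parameters as the script later in this post; the helper name build_index_query is hypothetical, and the sketch assumes Python 3 (the script in this post targets Python 2).

```python
from urllib.parse import urlencode

# Index server host, as used by the script in this post.
INDEX_HOST = "http://index.commoncrawl.org"


def build_index_query(index, domain):
    """Return the query URL for one crawl index and one target domain.

    build_index_query is a hypothetical helper name used for illustration.
    """
    params = urlencode({
        "url": domain,           # the domain we are interested in
        "matchType": "domain",   # match the domain and its subdomains
        "output": "json",        # ask for one JSON object per line
    })
    return "%s/CC-MAIN-%s-index?%s" % (INDEX_HOST, index, params)


print(build_index_query("2015-27", "bellingcat.com"))
```

Building the query string with urlencode also takes care of escaping if you ever pass in a full URL instead of a bare domain.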

Once the API returns the results you are looking for, you then need to reach into the compressed archive files stored on Amazon S3 and pull out the actual content. This is where Stephen Merity (@smerity) came to my rescue. He posted some example code here that demonstrated how to retrieve items from the archived files on S3. Beauty, let’s get started.

Coding It Up

First off you need to install a couple of Python modules (if you don’t know how to do that check out my tutorials here):

pip install requests bs4

Now let’s crack open a new file and call it commoncrawler.py and punch out the following code (you can download the source here):

    import requests
    import argparse
    import time
    import json
    import StringIO
    import gzip
    import csv
    import codecs

    from bs4 import BeautifulSoup

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    # parse the command line arguments
    ap = argparse.ArgumentParser()
    ap.add_argument("-d","--domain",required=True,help="The domain to target ie. cnn.com")

    args = vars(ap.parse_args())
    domain = args['domain']

    # list of available indices
    index_list = ["2014-52","2015-06","2015-11","2015-14","2015-18","2015-22","2015-27"]

Let’s look at what this first bit of code does:

Lines 12-14: this is a hack to force the use of UTF-8 character encoding to keep the csv module happy when we are outputting results to our spreadsheet.

Lines 16-21: here we are just parsing out our command line arguments and storing the result in our domain variable.

Line 24: this is a list of all of the Common Crawl indices that we can query for snapshots of the target domain.

Alright now let’s put the function in that will deal with making queries to the Common Crawl API and handling the results. Add the following code to your script:

    #
    # Searches the Common Crawl Index for a domain.
    #
    def search_domain(domain):

        record_list = []

        print "[*] Trying target domain: %s" % domain

        for index in index_list:

            print "[*] Trying index %s" % index

            cc_url  = "http://index.commoncrawl.org/CC-MAIN-%s-index?" % index
            cc_url += "url=%s&matchType=domain&output=json" % domain

            response = requests.get(cc_url)

            if response.status_code == 200:

                records = response.content.splitlines()

                for record in records:
                    record_list.append(json.loads(record))

                print "[*] Added %d results." % len(records)


        print "[*] Found a total of %d hits." % len(record_list)

        return record_list

Ok so our searching function is complete, let’s look at the important parts:

Lines 35-40: here we are iterating over the available indices (35) and then building a search URL (39-40) that we’ll hit to retrieve the results.

Lines 46-51: as the response is multiple chunks of JSON data, we have to split the lines (46), and then parse each line (48,49). We finish by printing out a friendly message to let us know if we had any hits (51).
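The reason each line is parsed on its own is that the response body as a whole is not one valid JSON document; it is one JSON object per line. Here is a small offline sketch of that parsing step, assuming Python 3. The sample body and the helper name parse_index_response are made up for illustration; only the keys used by the script (url, filename, offset, length) are shown.

```python
import json


def parse_index_response(body):
    """Parse a newline-delimited JSON body into a list of dicts.

    parse_index_response is a hypothetical helper name.
    """
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        records.append(json.loads(line))
    return records


# Hypothetical two-line response body, shaped like the index output.
sample = (
    '{"url": "https://www.example.com/", '
    '"filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}\n'
    '{"url": "https://www.example.com/about", '
    '"filename": "crawl-data/example.warc.gz", "offset": "150", "length": "75"}\n'
)

records = parse_index_response(sample)
print(len(records), records[0]["url"])
```

Note that offset and length arrive as strings in this sample, which is why the download function converts them with int() before doing any arithmetic.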

When the function is finished it returns our full list of results that we can then use to retrieve the actual data from the index. Let’s implement that function now:

    #
    # Downloads a page from Common Crawl - adapted graciously from @Smerity - thanks man!
    # https://gist.github.com/Smerity/56bc6f21a8adec920ebf
    #
    def download_page(record):

        offset, length = int(record['offset']), int(record['length'])
        offset_end = offset + length - 1

        # We'll get the file via HTTPS so we don't need to worry about S3 credentials
        # Getting the file on S3 is equivalent however - you can request a Range
        prefix = 'https://aws-publicdatasets.s3.amazonaws.com/'

        # We can then use the Range header to ask for just this set of bytes
        resp = requests.get(prefix + record['filename'], headers={'Range': 'bytes={}-{}'.format(offset, offset_end)})

        # The page is stored compressed (gzip) to save space
        # We can extract it using the GZIP library
        raw_data = StringIO.StringIO(resp.content)
        f = gzip.GzipFile(fileobj=raw_data)

        # What we have now is just the WARC response, formatted:
        data = f.read()

        response = ""

        if len(data):
            try:
                warc, header, response = data.strip().split('\r\n\r\n', 2)
            except:
                pass

        return response

This is borrowed heavily from Stephen’s code (thanks again Stephen). Let’s take a look at what he’s doing here:

Lines 64-65: we use two of the keys from our record variable that contain the offset and the length of the data stored in the compressed archive.

Lines 69-72: we build up a URL to Amazon’s S3 and utilize an HTTP range request to request the specific byte offset and length of the result in the compressed archive. This saves you from having to download the entire compressed file, brilliant move by Stephen again.

Lines 76-80: here we use the StringIO module to get a file-like descriptor to the returned data (76) which we pass along to the gzip module (77) to decompress the data. Once the data is decompressed we read it out into our data variable (80).

Lines 82-90: now we split the data into three parts: the warc variable holds the metadata for the page in the WARC archive, the header variable has the HTTP headers retrieved when Common Crawl hit the target domain and the response variable contains the body of the HTML we want. If all is well we return the HTML data.
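If you want to see the range arithmetic and the three-way split without touching the network, here is a rough offline sketch, assuming Python 3. The fake archive, the record contents and the helper names byte_range_header and split_warc_record are all made up for illustration; a real archive is far larger, but the idea of slicing out one independently compressed record is the same.

```python
import gzip
import io


def byte_range_header(offset, length):
    """Build the Range header; HTTP byte ranges are inclusive, hence the -1."""
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def split_warc_record(compressed):
    """Decompress one gzip member and split it into its three parts:
    WARC metadata, HTTP headers and the response body."""
    data = gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()
    parts = data.strip().split(b"\r\n\r\n", 2)
    if len(parts) != 3:
        return None  # not the layout we expected
    return parts


# A made-up record in the same three-part layout.
fake_record = (
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html><body>hello</body></html>"
)

# Simulate an archive: our record sits between two other compressed records.
before = gzip.compress(b"some other record")
member = gzip.compress(fake_record)
archive = before + member + gzip.compress(b"yet another record")

offset, length = len(before), len(member)
print(byte_range_header(offset, length))

# Slicing the bytes stands in for what the range request would return.
warc, header, body = split_warc_record(archive[offset:offset + length])
print(body)
```

In Python 3 the decompressed data is bytes, so the separator has to be a bytes literal as well; the Python 2 script above gets away with plain strings.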

Now that we can search for a domain and extract the raw HTML from the Common Crawl indices, we need to extract all of the links. Let’s implement a function that will do just that:

    #
    # Extract links from the HTML
    #
    def extract_external_links(html_content,link_list):

        parser = BeautifulSoup(html_content)

        links = parser.find_all("a")

        if links:

            for link in links:
                href = link.attrs.get("href")

                if href is not None:

                    if domain not in href:
                        if href not in link_list and href.startswith("http"):
                            print "[*] Discovered external link: %s" % href
                            link_list.append(href)

        return link_list

Let’s have a peek at what this code does. We’re nearly there!

Line 95: we define our function, which takes the raw HTML and a list of links. The list of links ensures that we aren’t storing duplicates of the same link, although if you wanted to do some visualization you could keep the duplicates instead and measure the weight of how commonly some sites are connected to your target domain. I’ll leave that to you for homework.

Lines 97-99: we pass the raw HTML content to BeautifulSoup (97) and then we ask it to parse out all of the links (99).

Lines 103-111: now we iterate over the list of links (103) and pull out the href attribute (104). If the target domain is not in the link and it’s not already in our list of links we add it to our master link list and carry on.
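One thing to be aware of: the check above is a substring test, so a link to another site that merely contains the target domain somewhere in its URL (a share link, for example) gets skipped. Here is a sketch of a stricter variant that compares hostnames instead. To keep it self-contained it swaps BeautifulSoup for the standard library’s html.parser and assumes Python 3; the names LinkCollector, is_external and external_links are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_external(href, domain):
    """True for absolute http(s) links whose host is not the target
    domain or one of its subdomains."""
    parsed = urlparse(href)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    domain = domain.lower()
    return not (host == domain or host.endswith("." + domain))


def external_links(html, domain):
    """Return the external links in the HTML, de-duplicated, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        if is_external(href, domain) and href not in found:
            found.append(href)
    return found


# Made-up page content for illustration.
sample_html = """
<a href="https://www.bellingcat.com/news/">internal</a>
<a href="/about">relative</a>
<a href="http://www.twitter.com/bellingcat">twitter</a>
<a href="https://example.org/share?u=https://www.bellingcat.com/">share</a>
<a href="http://www.twitter.com/bellingcat">twitter again</a>
"""

print(external_links(sample_html, "bellingcat.com"))
```

With the substring test the share link on example.org would be dropped because bellingcat.com appears in its query string; comparing hostnames keeps it.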

Ok our main functions are in place, and now we can put the final code in place to actually run our functions and log the results.

    record_list = search_domain(domain)
    link_list   = []

    for record in record_list:

        html_content = download_page(record)

        print "[*] Retrieved %d bytes for %s" % (len(html_content),record['url'])

        link_list = extract_external_links(html_content,link_list)


    print "[*] Total external links discovered: %d" % len(link_list)

    with codecs.open("%s-links.csv" % domain,"wb",encoding="utf-8") as output:

        fields = ["URL"]

        logger = csv.DictWriter(output,fieldnames=fields)
        logger.writeheader()

        for link in link_list:
            logger.writerow({"URL":link})

Alright, this is the last of it, let’s review:

Line 118: we call our search_domain function to retrieve search results from Common Crawl.

Lines 121-127: we iterate over our search results (121), pull down the raw HTML (123) and then extract the list of links from the page (127).

Lines 132-140: we crack open a new CSV file named using our target domain (132) and then set a single column name for our spreadsheet (134). We initialize the DictWriter class with our logfile descriptor and column name (136) and then write out the header row in the CSV (137). Then we simply iterate over the list of links (139) and add each URL to our CSV (140).

Whew! Now of course you could capture other metadata from the search results (such as the timestamp) to store alongside each URL if you like, or you could make this whole operation recursive by running the same search against each additional domain you have discovered to find the sites that they link to. Really, the world is your OSINT oyster at this point. Let’s take it for a spin.
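As a rough idea of what storing extra metadata could look like, here is a sketch that writes the source page and its timestamp next to each discovered link. It assumes Python 3 and assumes each search result carries a timestamp key next to url; the column names, the sample record and the helper name write_links_with_metadata are made up for illustration. It writes to an in-memory buffer so you can run it without creating a file.

```python
import csv
import io


def write_links_with_metadata(rows, output):
    """Write one CSV row per discovered link, with where and when it was seen.

    write_links_with_metadata and the column names are hypothetical.
    """
    writer = csv.DictWriter(output, fieldnames=["URL", "Source", "Timestamp"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up search result and the links found on that page.
record = {"url": "https://www.example.com/", "timestamp": "20150701000000"}
links = ["http://www.twitter.com/example"]

rows = [
    {"URL": link, "Source": record["url"], "Timestamp": record["timestamp"]}
    for link in links
]

buffer = io.StringIO()
write_links_with_metadata(rows, buffer)
print(buffer.getvalue())
```

To fit this into the script you would have the extraction function return the links per page rather than one flat list, so that each link can be paired with the record it came from.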

Let it Rip

You can run this from inside Wing (or other Python IDE) or from the command line like so:

C:\Python27\> python commoncrawler.py -d bellingcat.com

[*] Trying target domain: bellingcat.com

[*] Trying index 2014-52

[*] Trying index 2015-06

[*] Trying index 2015-11

[*] Added 7 results.

[*] Trying index 2015-14

[*] Added 7 results.

[*] Trying index 2015-18

[*] Added 7 results.

[*] Trying index 2015-22

[*] Added 7 results.

[*] Trying index 2015-27

[*] Added 6 results.

[*] Found a total of 34 hits.

[*] Retrieved 49622 bytes for https://www.bellingcat.com/

[*] Discovered external link: http://www.twitter.com/bellingcat

[*] Retrieved 49622 bytes for https://www.bellingcat.com/

[*] Retrieved 61330 bytes for https://www.bellingcat.com/news/uk-and-europe/2014/07/22/evidence-that-russian-claims-about-the-mh17-buk-missile-launcher-are-false/

[*] Discovered external link: https://www.youtube.com/watch?v=4bNPInuSqfs#t=1567

[*] Discovered external link: http://rt.com/news/174496-malaysia-crash-russia-questions/

<SNIP>

[*] Total external links discovered: 67

Cool, we discovered 67 external links that Bellingcat points to. Not bad! If you check the same directory as your script you will see a CSV file containing all of the external URLs, which you can then feed into your own crawler or investigate manually. Of course you could use some text processing on the external pages as well.
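A quick first pass over the results is to group the links by hostname, which tells you which sites show up most often. Here is a minimal sketch, assuming Python 3. The first three links come from the run above, the fourth is a made-up extra so that one host appears twice, and count_hosts is a hypothetical helper name. In practice you would load the URL column of the CSV instead of a hard-coded list.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(urls):
    """Count how many of the given URLs point at each hostname."""
    hosts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host:
            hosts[host.lower()] += 1
    return hosts


links = [
    "http://www.twitter.com/bellingcat",
    "https://www.youtube.com/watch?v=4bNPInuSqfs#t=1567",
    "http://rt.com/news/174496-malaysia-crash-russia-questions/",
    "https://www.youtube.com/watch?v=example",  # made-up extra link
]

for host, count in count_hosts(links).most_common():
    print(host, count)
```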