Not long ago I was intrigued by the Oct282011.com Internet mystery (if you haven’t heard of it check out this podcast). Friends of the Hunchly mailing list and I embarked on a brief journey to see if we could root out any additional clues or, of course, solve the mystery. One of the major sources of information for the investigation was the Wayback Machine, which is a popular resource for lots of investigations.

For this particular investigation there were a lot of weird images strewn around as clues, and I wondered if it would be possible to retrieve those photos from the Wayback Machine and then examine them for EXIF data to see if we could find authorship details or other tasty nuggets of information. Of course I was not going to do this manually, so I thought it was a perfect opportunity to build out a new tool to do it for me.

We are going to leverage a couple of great tools to make this magic happen. The first is a Python module written by Jeremy Singer-Vine called waybackpack. While you can use waybackpack on the command line as a standalone tool, in this blog post we are going to simply import it and leverage pieces of it to interact with the Wayback Machine. The second tool is ExifTool, by Phil Harvey. This little beauty is the gold standard when it comes to extracting EXIF information from photos and is trusted the world over.

The goal is for us to pull down all images for a particular URL on the Wayback Machine, extract any EXIF data and then output all of the information into a spreadsheet that we can then go and review.

Let’s get rocking.

Prerequisites

This post involves a few moving parts, so let’s get this boring stuff out of the way first.

Installing ExifTool

On Ubuntu-based Linux you can do the following:

$ sudo apt-get install exiftool

Mac OS X users can use Phil’s installer here.

For you folks on Windows you will have to do the following:

Download the ExifTool binary from here. Save it to your C:\Python27 directory (you DO have Python installed, right?)

Rename it to exiftool.exe

Make sure that C:\Python27 is in your Path. Don’t know how to do this? Google will help. Or just email me.
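Before moving on, it’s worth confirming the binary is actually reachable. Here’s a tiny throwaway sanity check, written in Python 3 syntax (shutil.which doesn’t exist in Python 2, and exiftool_available is just my own helper name):

```python
import shutil

def exiftool_available():
    # pyexifinfo works by shelling out to the exiftool binary,
    # so it has to be discoverable on your PATH
    return shutil.which("exiftool") is not None

print(exiftool_available())
```

If this prints False, revisit the PATH step above before going any further.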

Installing The Necessary Python Libraries

Now we are ready to install the various Python libraries that we need:

pip install bs4 requests pandas pyexifinfo waybackpack

Alright, let’s get down to it, shall we?

Coding It Up

Now crack open a new Python file, call it waybackimages.py (download the source here) and start pounding out (use both hands) the following:

 1  import bs4
 2  import hashlib
 3  import json
 4  import os
 5  import pandas
 6  import pyexifinfo
 7  import requests
 8  import sys
 9  import urlparse
10  import waybackpack
11
12  # ensure to place the trailing / for base domains
13  url = "http://www.oct282011.com/"
14
15  reload(sys)
16  sys.setdefaultencoding("utf-8")
17
18  if not os.path.exists("waybackimages"):
19      os.mkdir("waybackimages")

Nothing too surprising here. We are just importing all of the required modules, setting our target URL and then creating a directory where all of our images will be stored. The reload(sys) and setdefaultencoding calls are a Python 2 trick to keep UTF-8 content from blowing up with encoding errors later on.

Let’s now implement the first function that will be responsible for querying the Wayback Machine for all unique snapshots of our target URL:

20
21  #
22  # Searches the Wayback machine for the provided URL
23  #
24  def search_archive(url):
25
26      # search for all unique captures for the URL
27      results = waybackpack.search(url, uniques_only=True)
28
29      timestamps = []
30
31      # build a list of timestamps for captures
32      for snapshot in results:
33          timestamps.append(snapshot['timestamp'])
34
35      # request a list of archives for each timestamp
36      packed_results = waybackpack.Pack(url, timestamps=timestamps)
37
38      return packed_results

Line 24: we define our search_archive function to take the url parameter which represents the URL that we want to search the Wayback Machine for.

Line 27: we leverage the search function provided by waybackpack to search for our URL, and we also specify that we only want unique captures so that we aren’t having to examine a bunch of duplicate captures.

Lines 29-33: we create an empty list of timestamps (29) and then begin walking through the results of our search (32) and add the timestamp that corresponds to a particular capture in the Wayback Machine (33).

Lines 36-38: we pass in the original URL and the list of timestamps to create a Pack object (36). A Pack object assembles the timestamps and the URL into a Wayback Machine friendly format. We then return this object from our function (38).
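To make the flow concrete, here’s a rough sketch of what search_archive boils down to, using a hand-made sample of snapshot records rather than a live waybackpack query (the record shapes and the helper names unique_timestamps and archive_url are my own illustration, not waybackpack’s API), shown in Python 3 syntax:

```python
# Hypothetical sample of snapshot records, shaped roughly like the
# dictionaries waybackpack.search() hands back (14-digit capture timestamps).
sample_snapshots = [
    {"timestamp": "20110823161411", "original": "http://www.oct282011.com/"},
    {"timestamp": "20110830211214", "original": "http://www.oct282011.com/"},
    {"timestamp": "20110830211214", "original": "http://www.oct282011.com/"},  # duplicate capture
]

def unique_timestamps(snapshots):
    # preserve capture order while dropping duplicate timestamps
    seen = []
    for snapshot in snapshots:
        if snapshot["timestamp"] not in seen:
            seen.append(snapshot["timestamp"])
    return seen

def archive_url(timestamp, url):
    # the Wayback Machine serves captures at /web/<timestamp>/<original URL>
    return "https://web.archive.org/web/%s/%s" % (timestamp, url)

timestamps = unique_timestamps(sample_snapshots)
urls = [archive_url(ts, "http://www.oct282011.com/") for ts in timestamps]
```

Notice how these URLs match the ones you’ll see scroll by in the script output later: https://web.archive.org/web/&lt;timestamp&gt;/&lt;original URL&gt;.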

Now that our search function is implemented, we need to process the results, retrieve each captured page and then extract all image paths stored in the HTML. Let’s do this now.

39
40  #
41  # Retrieve the archived page and extract the images from it.
42  #
43  def get_image_paths(packed_results):
44
45      images = []
46      count = 1
47
48      for asset in packed_results.assets:
49
50          # get the location of the archived URL
51          archive_url = asset.get_archive_url()
52
53          print "[*] Retrieving %s (%d of %d)" % (archive_url, count, len(packed_results.assets))
54
55          # grab the HTML from the Wayback machine
56          result = asset.fetch()
57
58          # parse out all image tags
59          soup = bs4.BeautifulSoup(result)
60          image_list = soup.findAll("img")
61
62          # loop over the images and build full URLs out of them
63          if len(image_list):
64              for image in image_list:
65
66                  # join relative paths to the archive URL
67                  if not image.attrs['src'].startswith("http"):
68                      image_path = urlparse.urljoin(archive_url, image.attrs['src'])
69                  else:
70                      image_path = image.attrs['src']
71
72                  if image_path not in images:
73                      print "[+] Adding new image: %s" % image_path
74                      images.append(image_path)
75
76          count += 1
77
78      return images

Line 43: we set up our get_image_paths function to receive the Pack object.

Lines 48-56: we walk through the list of assets (48), and then use the get_archive_url function (51) to hand us a usable URL. We print out a little helper message (53) and then we retrieve the HTML using the fetch function (56).

Lines 59-60: now that we have the HTML we hand it off to BeautifulSoup (59) so that we can begin parsing the HTML for image tags. The parsing is handled by using the findAll function (60) and passing in the img tag. This will produce a list of all IMG tags discovered in the HTML.

Lines 63-70: we walk over the list of IMG tags found (64) and we build URLs (67-70) to the images that we can use later to retrieve the images themselves.

Lines 72-74: if we don’t already have the image URL (72) we print out a message (73) and then add the image URL to our list of all images found (74).

Alright! Now that we have extracted all of the image URLs that we can, we need to download them and process them for EXIF data. Let’s implement this now.

 79
 80  #
 81  # Download the images and extract the EXIF data.
 82  #
 83  def download_images(image_list, url):
 84
 85      image_results = []
 86      image_hashes = []
 87
 88      for image in image_list:
 89
 90          # this filters out images not from our target domain
 91          if url not in image:
 92              continue
 93
 94          try:
 95              print "[v] Downloading %s" % image
 96              response = requests.get(image)
 97          except:
 98              print "[!] Failed to download: %s" % image
 99              continue
100
101          if "image" in response.headers['content-type']:
102
103              sha1 = hashlib.sha1(response.content).hexdigest()
104
105              if sha1 not in image_hashes:
106
107                  image_hashes.append(sha1)
108
109                  image_path = "waybackimages/%s-%s" % (sha1, image.split("/")[-1])
110
111                  with open(image_path, "wb") as fd:
112                      fd.write(response.content)
113
114                  print "[*] Saved %s" % image
115
116                  info = pyexifinfo.get_json(image_path)
117
118                  info[0]['ImageHash'] = sha1
119
120                  image_results.append(info[0])
121
122      return image_results

Let’s pick this code apart a little bit:

Line 83: we define our download_images function that takes in our big list of image URLs and the original URL we are interested in.

Lines 85-86: we create our image_results variable (85) to hold all of our EXIF data results and the image_hashes variable (86) to keep track of all of the unique hashes for the images we download. More on this shortly.

Lines 88-98: we walk through the list of image URLs (88) and if our base URL is not in the image path (91) then we ignore it. We then download the image (96) so that we can perform our analysis.

Lines 101-103: if we have successfully downloaded an image (101), we SHA-1 hash its contents so that we can track unique images. This lets us keep multiple images that share a file name: if their contents differ (even by one byte) we track them separately, while exact duplicates are only stored once.

Lines 105-114: if it is a new unique image (105) we add the hash to our list of image hashes (107) and then we write the image out to disk (111).

Line 116: here we are calling the pyexifinfo function get_json. This function extracts the EXIF data and returns the results as a list containing a Python dictionary, which is why we index into info[0] afterward.

Lines 118-120: we add our own key to the info dictionary that contains the SHA-1 hash of the image (118) and then we add the dictionary to our master list of results (120).
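The hash-based bookkeeping is the heart of the dedup logic, so here is the same idea stripped down to a standalone sketch (Python 3 syntax, fed dummy byte strings rather than downloaded images; dedupe_by_hash is my own illustrative helper, not part of the script):

```python
import hashlib

def dedupe_by_hash(blobs):
    # keep only the first copy of each unique payload, keyed by SHA-1
    seen = set()
    unique = []
    for blob in blobs:
        digest = hashlib.sha1(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((digest, blob))
    return unique

# two identical payloads collapse to one; a one-byte change is a new image
unique = dedupe_by_hash([b"imagedata", b"imagedata", b"imagedataX"])
```

Two byte-identical downloads collapse into a single entry no matter what they were named, while a single changed byte produces a brand new hash and therefore a new image on disk.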

We are almost finished. Now we just need to tie all of these functions together and get some output in CSV format so that we can easily review all of the EXIF data that we have discovered. Time to put the finishing touches on this script!

123
124  results = search_archive(url)
125
126  print "[*] Retrieved %d possible stored URLs" % len(results.assets)
127
128  image_paths = get_image_paths(results)
129
130  print "[*] Retrieved %d image paths." % len(image_paths)
131
132  image_results = download_images(image_paths, url)
133
134  # return to JSON and have pandas build a csv
135  image_results_json = json.dumps(image_results)
136
137  data_frame = pandas.read_json(image_results_json)
138
139  csv = data_frame.to_csv("results.csv")
140
141  print "[*] Finished writing CSV to results.csv"

Let’s break down this last bit of code:

Lines 124-132: we call all of our functions in turn, starting with the Wayback Machine search (124), then extracting the image paths (128) and finally downloading and processing all of the images (132).

Lines 135-139: we convert the returned list of results to a JSON string (135) and then pass that JSON into the pandas read_json function (137), which creates a dataframe. We then leverage the wonderful to_csv function (139), which converts that dataframe to a CSV file, complete with automatic headers. This saves us from having to code up a complicated CSV creation routine. The CSV file is stored in results.csv in the same directory as the script.
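pandas does all the heavy lifting here, and it’s worth seeing what that lifting actually is: EXIF dictionaries don’t all carry the same tags, so the spreadsheet header has to be the union of every key seen, with blanks for missing values. If you ever want to drop the pandas dependency, the standard library’s csv module can do the same job. A sketch in Python 3 syntax, with dummy stand-ins for the pyexifinfo dictionaries (exif_to_csv is my own helper name):

```python
import csv
import io

def exif_to_csv(records, fileobj):
    # EXIF dictionaries don't all share the same tags, so build the
    # header row from the union of every key before writing any rows
    fieldnames = sorted({key for record in records for key in record})
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)

# dummy stand-ins for the dictionaries pyexifinfo hands back
records = [
    {"SourceFile": "waybackimages/abc-st.jpg", "ImageHash": "abc"},
    {"SourceFile": "waybackimages/def-ignoring.png", "EXIF:Artist": "unknown"},
]
buffer = io.StringIO()
exif_to_csv(records, buffer)
```

The read_json/to_csv pair handles exactly this header generation and missing-key padding for us; that’s the complicated CSV creation routine we’re avoiding.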

Let It Rip!

Ok now for the fun part. Set the URL you are interested in, and just run the script from the command line or from your favourite Python IDE. You should see some output like the following:

[*] Retrieved 41 possible stored URLs

[*] Retrieving https://web.archive.org/web/20110823161411/http://www.oct282011.com/ (1 of 41)

[*] Retrieving https://web.archive.org/web/20110830211214/http://www.oct282011.com/ (2 of 41)

[+] Adding new image: https://web.archive.org/web/20110830211214/http://www.oct282011.com/st.jpg

…

[*] Saved https://web.archive.org/web/20111016032412/http://www.oct282011.com/material_same_habits.png

[v] Downloading https://web.archive.org/web/20111018162204/http://www.oct282011.com/ignoring.png

[v] Downloading https://web.archive.org/web/20111018162204/http://www.oct282011.com/material_same_habits.png

[v] Downloading https://web.archive.org/web/20111023153511/http://www.oct282011.com/ignoring.png

[v] Downloading https://web.archive.org/web/20111023153511/http://www.oct282011.com/material_same_habits.png

[v] Downloading https://web.archive.org/web/20111024101059/http://www.oct282011.com/ignoring.png

[v] Downloading https://web.archive.org/web/20111024101059/http://www.oct282011.com/material_same_habits.png

[*] Finished writing CSV to results.csv