Warning! This post includes some links to NSFW (not suitable for work) galleries. You had better study this post at home :)

Problem

On the web you can find lots of free XXX galleries. There are also sites that collect these galleries and update their lists daily. When you visit such a gallery, you get either (1) images, or (2) links to images through thumbnails. But! Besides these relevant images, there is always some noise: banners, other thumbnails, links to other galleries, etc.

How do you write a universal scraper that takes the URL of a gallery and extracts just the relevant images, without any noise? How do you separate real content from noise?

Example

Let’s see a soft gallery: http://biertijd.xxx/index.php?itemid=44329 (NSFW!). Extracting all the images we get the following list:

"urls": [
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png",
    "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg",
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg",
    "http://biertijd.com/nucleus/plugins/rating/4.gif",
    "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=0da28fe49a3d6d2fa7e17d15b9a05d28",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png",
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png",
    "http://s4.histats.com/stats/0.gif?37757&1"
]

As you can see, the relevant images conform to this pattern: “http://media01.biertijd.com/galleries/metart/131107_night/{01..20}.jpg”. Altogether we have 35 images, of which only 20 are relevant. How do we find just these 20?

Solution

The good news is that the relevant images usually follow a pattern and thus they don’t differ much. As seen above, in this example only the numbering of the images differs.

Relevant images can be separated from the others using text clustering. I found a great solution by Rajesh M. (http://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/), who uses this method for clustering article titles. We will use it to cluster URLs, which are also just strings.

I put my solution in a class. Here it is:

    #!/usr/bin/env python

    # based on:
    # http://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/

    from helper import lev_dist as distance
    from pprint import pprint

    DISTANCE = 10


    class Cluster(object):
        """
        Clustering a list of (sorted!) strings.

        I use it for clustering URLs. After extracting all the links (or images)
        from a web page, I use this class to group together similar URLs.
        It also identifies the largest cluster.
        """
        def __init__(self):
            self.clusters = {'clusters': {}}

        def clustering(self, elems):
            """
            Clusterize the input elements.

            Input: list of words (e.g. list of URLs). It MUST be sorted!
            Process: build a dictionary where keys are cluster IDs (int)
                     and values are lists (elements in the given cluster)
            """
            clusters = {}
            cid = 0

            for i, line in enumerate(elems):
                if i == 0:
                    clusters[cid] = []
                    clusters[cid].append(line)
                else:
                    last = clusters[cid][-1]
                    # NOTE: the published listing is cut off between this
                    # comparison and the "len(v) > maxi_v" test in get_largest().
                    # Everything in between (including the "<=") is reconstructed
                    # from the description and the sample output: an assumption.
                    if distance(last, line) <= DISTANCE:
                        clusters[cid].append(line)
                    else:
                        cid += 1
                        clusters[cid] = []
                        clusters[cid].append(line)

            if clusters:
                # compute both values before adding the extra keys
                number = len(clusters)
                largest = self.get_largest(clusters)
                clusters['largest'] = largest
                clusters['number_of_clusters'] = number
            self.clusters = {'clusters': clusters}

        def get_largest(self, clusters):
            maxi_k = None
            maxi_v = 0
            for k, v in clusters.items():
                if len(v) > maxi_v:
                    maxi_v = len(v)
                    maxi_k = k
            #
            return clusters[maxi_k]

        def show(self):
            pprint(self.clusters)


    def get_clusters(elems):
        elems = sorted(elems)
        cl = Cluster()
        cl.clustering(elems)
        return cl.clusters['clusters']

    #############################################################################

    if __name__ == "__main__":
        import sys
        template = "https://jabbalaci.herokuapp.com/all_images?url={url}?&clusters=1"
        if len(sys.argv) == 1:
            print("Usage: {0} URL".format(sys.argv[0]))
            sys.exit(1)
        # else
        url = template.format(url=sys.argv[1])
        import requests
        r = requests.get(url)
        li = sorted(r.json()['urls'])
        cl = Cluster()
        cl.clustering(li)
        cl.show()

The extracted URLs are sorted first, so that URLs sharing a long common prefix end up next to each other. Then they are put into clusters. The idea is simple. Put the first element in the current cluster, which is the first cluster. If the next element is similar to the last element of the current cluster, put it into that cluster too. If it’s different, create a new cluster (it becomes the current cluster) and add the element to it. And so on.

To tell how similar two strings are, we use the Levenshtein distance. You can find an implementation here.
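
Since the helper module isn’t shown in this post, here is a minimal, self-contained sketch of the two steps described above. The lev_dist function below is a textbook dynamic-programming Levenshtein distance standing in for helper.lev_dist (an assumption, not necessarily the same implementation), cluster_sorted is a hypothetical name for the greedy pass, and the “<=” comparison against the threshold of 10 is likewise an assumption. The sample is a handful of URLs taken from the example gallery.

```python
def lev_dist(a, b):
    """Levenshtein distance, classic two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def cluster_sorted(urls, max_dist=10):
    """Greedy pass: compare each URL with the last member of the current cluster."""
    clusters = []
    for url in sorted(urls):
        if clusters and lev_dist(clusters[-1][-1], url) <= max_dist:
            clusters[-1].append(url)   # similar enough: same cluster
        else:
            clusters.append([url])     # too different: start a new cluster
    return clusters


base = "http://media01.biertijd.com/galleries/metart/131107_night/"
sample = [
    "http://biertijd.com/nucleus/plugins/rating/4.gif",
    base + "02.jpg",
    base + "../banners/1.jpg",
    base + "01.jpg",
    base + "03.jpg",
    "http://s4.histats.com/stats/0.gif?37757&1",
]

clusters = cluster_sorted(sample)
for cluster in clusters:
    print(len(cluster), cluster)

largest = max(clusters, key=len)
```

On this small sample the three numbered images end up in one cluster, while the rating icon, the banner and the stats pixel each get a cluster of their own. Two neighbouring gallery images differ by a single character, so any reasonable threshold keeps them together.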

Demo

This method is implemented as a web service. It has two versions: you can cluster links, or you can cluster images. Which one to use? It depends on the gallery. If the gallery page embeds the relevant images directly, then extract the images. If it contains thumbnails that point to the images, then extract the links.

Don’t forget to switch on the “text clustering” option. In the output you will get the clusters and, to make your life easier, the largest cluster is also indicated. In most cases, this is the cluster that contains the relevant images!

Sample output:

...
"clusters": {
    "0": [
        "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=d079746fd366f6f3509532688d595fcb"
    ],
    "1": [
        "http://biertijd.com/nucleus/plugins/rating/4.gif"
    ],
    "2": [
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png",
        "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png"
    ],
    "3": [
        "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg"
    ],
    "4": [
        "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ],
    "5": [
        "http://s4.histats.com/stats/0.gif?37757&1"
    ],
    "largest": [
        "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg",
        "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ],
    "number_of_clusters": 6
},
...
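
If you consume this output from your own script, picking out the relevant images could look like the sketch below. It assumes the decoded JSON has the shape of the sample output: a top-level "clusters" object with numbered clusters plus the "largest" and "number_of_clusters" entries. The relevant_images name is hypothetical, and the response dictionary is a shortened, hand-made stand-in rather than real service output.

```python
def relevant_images(response):
    """Return the largest cluster from a response shaped like the sample output."""
    clusters = response.get("clusters", {})
    if "largest" in clusters:
        return clusters["largest"]
    # Fallback: recompute it from the numbered clusters ("0", "1", ...),
    # skipping extra entries such as "number_of_clusters".
    numbered = [v for k, v in clusters.items() if str(k).isdigit()]
    return max(numbered, key=len) if numbered else []


base = "http://media01.biertijd.com/galleries/metart/131107_night/"
response = {
    "clusters": {
        "0": ["http://biertijd.com/nucleus/plugins/rating/4.gif"],
        "1": [base + "01.jpg", base + "02.jpg"],
        "largest": [base + "01.jpg", base + "02.jpg"],
        "number_of_clusters": 2,
    }
}

for url in relevant_images(response):
    print(url)
```

The fallback branch only matters if the "largest" entry is missing. Otherwise the function simply returns what the service already computed.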

Demo for the lazy pigs

I made a page that extracts relevant links/images from a gallery and presents them in a cleaned gallery. It’s available here: https://jabbalaci.herokuapp.com/gallery .

Usage: insert the gallery’s URL then click on the first button. If you click on an image and it’s just a thumbnail, then click on the second button.

It extracts the largest cluster and it gives good results in most cases.

Feedback is welcome.

Links

this post appeared in Python Weekly #114 (Nov. 2013)

discussion @reddit