The website in question, 9Gag, is a social media platform featuring user-generated content in the form of videos, images, and gifs. During my first semesters of university, consuming its content right before going to bed was part of my daily routine and a great time sink.

Since then, the quality of the submitted entries has gradually decreased and serious flaws in the website’s operation have become apparent:

Reposts: Similar to other well-known social media sites, users are incentivised to upload popular content, and some individuals reupload content that has already been posted by someone else. Seeing the same images over and over gets tiresome pretty quickly.

Filtering promotional content and debunking vote manipulation: The company behind the platform is accused of featuring promotional content disguised as normal posts. Let’s figure out if we can differentiate legitimate and artificial content. A bit more information can be found here.

Meme classification: There is no way to browse a specific meme category. Let’s figure out how to categorize the data in a suitable way.

The tasks above are well suited to being solved with a technique known as perceptual image hashing, which I stumbled upon earlier last year while reading Dr. Neal Krawetz’s blog post “Kind of Like That”.
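To give a feeling for what a perceptual hash does before diving in: the difference hash (dHash) described in that blog post downscales the image, converts it to grayscale, and records for each pixel whether it is darker than its right neighbour; visually similar images then differ in only a few bits. Below is a minimal standalone sketch of the idea (JImageHash ships proper implementations; the class and method names here are my own illustration):

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.BitSet;

// Minimal difference hash (dHash) sketch: 9x8 grayscale downscale,
// then compare each pixel against its right neighbour -> 64-bit hash.
class DHash {

    static BitSet hash(BufferedImage img) {
        // 9 columns x 8 rows so every row yields 8 left/right comparisons.
        BufferedImage small = new BufferedImage(9, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.drawImage(img, 0, 0, 9, 8, null);
        g.dispose();

        BitSet bits = new BitSet(64);
        int i = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                int left = small.getRaster().getSample(x, y, 0);
                int right = small.getRaster().getSample(x + 1, y, 0);
                bits.set(i++, left < right);
            }
        }
        return bits;
    }

    // Hamming distance between two hashes; small values indicate near-duplicates.
    static int distance(BitSet a, BitSet b) {
        BitSet xor = (BitSet) a.clone();
        xor.xor(b);
        return xor.cardinality();
    }
}
```

Comparing an image against a slightly recompressed or rescaled copy yields a distance close to zero, which is exactly what we need for repost detection.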

Fitting: An image found while looking at my scraped data ….

All source code is available on GitHub. If you want to follow along, clone the project https://github.com/KilianB/JImageHash.git and take a look at the example package.

Note: We will be working with around 15,000 images (~1 GB of data). I/O operations take up the majority of the time. To reduce the time spent on file loads, using an SSD as temporary storage is highly encouraged.

Data Retrieval

To find and remove duplicates, we first have to gain access to the images hosted on the website.

… and the correct answer

I was already about to fire up Eclipse and write a hideous web scraping script, possibly using Selenium. Luckily, spending 4 minutes looking through the network requests paid off and the (hidden) endpoint of the internal 9gag API surfaced.

https://9gag.com/v1/group-posts/group/default/type indubitably looks like an ordinary REST API call, and sure enough, when queried it returns a bunch of metadata in a well-structured JSON format. What a fortunate discovery.

{"meta":{"timestamp":1547216490,"status":"Success","sid":"9gVQ01EVjlHTUVkMMRVSzwEVJBTTn1TY"},"data":{"posts":[{"id":"amBvdPj","url":"http:\/\/9gag.com\/gag\/amBvdPj","title":"Gervais with a solution for The Oscars","type":"Photo","nsfw":0,"upVoteCount":825,"downVoteCount":27 ... ,"nextCursor":"after=a1QWwXY%2Ca5MWxZG%2Ca1QWw8P&c=10"}}

Appending the nextCursor attribute to the URL allows us to retrieve the next batch of data. Even better, slightly modifying the URL enables a more specific search.
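Since the response is plain JSON, the nextCursor token can even be pulled out without a full JSON library, for example with a small regex. A sketch (the field name comes from the response shown above; the helper class itself is hypothetical):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the "nextCursor" value from a raw API response.
class CursorExtractor {
    private static final Pattern NEXT_CURSOR =
            Pattern.compile("\"nextCursor\"\\s*:\\s*\"([^\"]*)\"");

    static Optional<String> nextCursor(String rawJson) {
        Matcher m = NEXT_CURSOR.matcher(rawJson);
        // Empty when the field is absent, i.e. when the end of the data is reached.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

An empty result doubles as the termination condition for the paging loop.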

Using https://9gag.com/v1/group-posts/group/default/type/SECTION_NAME/?[query=QUERY][&after=CURSOR] allows us to scrape data from different categories.

Note: The section name needs to be all lowercase. The nextCursor token goes missing at an arbitrary depth (usually around 600–1000 recursive calls), marking the end of the available data.
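Building the request URL for a given section and cursor is then simple string work. A minimal sketch (only the endpoint pattern comes from the observations above; the helper is hypothetical):

```java
import java.util.Locale;

// Hypothetical helper building the paginated API URL described above.
class NineGagUrl {
    private static final String BASE = "https://9gag.com/v1/group-posts/group/default/type/";

    static String build(String section, String cursor) {
        // Section names must be all lowercase (see note above).
        StringBuilder sb = new StringBuilder(BASE)
                .append(section.toLowerCase(Locale.ROOT))
                .append("/");
        if (cursor != null && !cursor.isEmpty()) {
            // cursor is the raw nextCursor value, e.g. "after=...&c=10"
            sb.append("?").append(cursor);
        }
        return sb.toString();
    }
}
```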

A post on 9Gag roughly follows this life cycle:

new posts are published in the fresh category,

if they receive enough attention they are moved to the trending section,

and if they are among the top commented, upvoted, and shared posts they are promoted to hot.

After downloading the metadata and saving it in a database, which allows future SQL queries to rapidly answer more specific questions, the images themselves were downloaded.

Source code to recreate the example. For specifics see GitHub.

Keep an eye out for rate limiting (spam prevention). Either my internet connection got clogged, their servers were busy at some point, or Cloudflare did not like the constant requests hitting their servers while I repeatedly re-ran the examples to ensure code stability.
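Whatever the exact cause, a common mitigation is to back off exponentially between retries instead of hammering the endpoint. A minimal sketch (the concrete delay values are assumptions, not anything 9gag documents):

```java
// Computes a capped exponential backoff delay between retry attempts.
class Backoff {
    static final long BASE_DELAY_MS = 500;   // assumed starting delay
    static final long MAX_DELAY_MS = 30_000; // assumed upper bound

    static long delayMillis(int attempt) {
        // attempt 0 -> 500 ms, 1 -> 1000 ms, 2 -> 2000 ms, ... capped at 30 s
        long delay = BASE_DELAY_MS << Math.min(attempt, 10);
        return Math.min(delay, MAX_DELAY_MS);
    }
}
```

Sleeping for delayMillis(attempt) before each retry keeps the scraper polite and usually avoids tripping spam detection.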

Taking a quick look at the gathered data

To stroll around in the data, I recommend installing the DBeaver Eclipse extension via the marketplace; alternatively, the H2 database engine can be used via the browser. None of these are required to proceed with the example project.

Roughly 17,000 images to work with. About 3/4 of them are images and gifs; videos make up a small fraction of the posts. For now we will only use the very first frame of a video or gif to find duplicates, since the thumbnail is conveniently available in the scraped data.

Content at different sections

We can see that gifs and videos are more likely to make it to the hot section. Articles are an arbitrary combination of text, images, and gifs (<0.2% of all submissions).

Example of different tables in the SQL database

Upon closer inspection it became apparent that the lifecycle isn’t exactly as described earlier: posts can be in trending, hot, and fresh at the same time. Of the 17,000 original entries we are left with 15,000 unique posts. The hot section covers about 11 days’ worth of data, trending about 4, and fresh about 1.5.

Interesting here is the promoted field (sadly unused), as well as the downvote count of a post, which was hidden from the website some time ago. Notably, downvotes and upvotes are returned even if the vote-masked field is set to true; the vote mask indicates that the votes should be hidden from the user to prevent a bandwagon effect.

Come on, 9Gag API team, don’t ever send data to the client that you don’t want the user to see. Someone could go right ahead and write a browser extension to expose the details.

Data preprocessing and cleansing

Remember that only a random thumbnail image is available for videos and gifs. Of the 6,000 hot entries, around 55 contain a black frame.
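Such black frames carry no visual information and would all hash to the same value, so it makes sense to filter them out before hashing. One simple approach is to check the average luminance of the thumbnail; a sketch (the threshold is an assumption and may need tuning):

```java
import java.awt.image.BufferedImage;

// Flags thumbnails that are (almost) entirely black, as produced by some video posts.
class BlackFrameFilter {

    // Average luminance below this value (0-255) counts as a black frame. Assumed threshold.
    static final double THRESHOLD = 10.0;

    static boolean isBlackFrame(BufferedImage img) {
        long sum = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                // Rec. 709 luma approximation, scaled to integer arithmetic
                sum += (2126 * r + 7152 * g + 722 * b) / 10000;
            }
        }
        double avg = sum / (double) (img.getWidth() * img.getHeight());
        return avg < THRESHOLD;
    }
}
```

Entries flagged this way can be skipped entirely or re-fetched with a different frame.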