A really basic thing we can start with is taking a black-and-white conversion of the images in the dataset and calculating the Hamming distance between them. I have a feeling this won't work particularly well, but it will be useful as a baseline to compare other methods against (plus it should be fairly easy to implement).

We begin with a toy dataset of ten images, which I selected by hand to give a good representative example. The images roughly fall into four groups: [Jade + Robot Jade], [Jade, John, and Terezi at computers], [yellow, green, human hands], [two random images]. Likewise, we will only bother looking at the first frame of each image, despite the fact that they are gifs. As with the flashes, it's not that this would be too difficult to do (splitting the gifs into their individual frames and instructing the program to ignore comparisons between frames of the same gif would be easy enough, as sketched below), it's just a bit more trouble than I think it's worth for now.
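For reference, a minimal sketch of that frame-splitting step might look something like the following; the output directory and naming scheme here are hypothetical placeholders, not something the rest of this post depends on.

# Sketch: split an animated gif into its individual frames with PIL.
# The output directory and naming scheme are placeholders.
from PIL import Image, ImageSequence
import os

def split_frames(gif_path, out_dir='./screens/frames/'):
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(gif_path))[0]
    with Image.open(gif_path) as gif:
        for i, frame in enumerate(ImageSequence.Iterator(gif)):
            # Each frame gets saved as its own image; comparisons could then
            # skip any pair of frames that came from the same gif.
            frame.convert('RGB').save(os.path.join(out_dir, '{}_frame{}.png'.format(base, i)))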

Ideally, the images in these groups should resemble each other more than they resemble the other images, with the two random images serving as a control. The images that are more direct art recycles should be more similar to each other than they are to merely-similar images (e.g. the images of John and Jade should resemble each other more than they resemble Terezi's, since John and Jade are in the same spot on the screen while Terezi is translated within the frame).

We can start by converting every image to a binary image consisting of only black and white pixels.

# Convert all images to binary images
from PIL import Image
import os

for image in os.listdir('./screens/img/'):
    img_orig = Image.open("./screens/img/" + image)
    img_new = img_orig.convert('1')
    dir_save = './screens/binary/' + image
    img_new.save(dir_save)

This will allow us to compare each image with a simple pixel-by-pixel comparison and count the number of pixels where the two images differ. While this is very straightforward, it sort of leaves us at the mercy of what colors are used in the panel, so the conversion isn't perfect.
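To make the idea concrete before the full implementation below, a single pairwise comparison could be sketched with numpy along these lines (the two file names are just placeholders for any pair of images in the binary directory):

# Sketch: count the pixels where two binary images disagree.
# The two file names below are placeholders.
import numpy as np
from PIL import Image

a = np.array(Image.open('./screens/binary/1525_1.gif').convert('1').resize((100, 100)))
b = np.array(Image.open('./screens/binary/1525_2.gif').convert('1').resize((100, 100)))

print(np.count_nonzero(a != b))  # number of differing pixels out of 100*100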

As an example, take the two hands panels converted to binary images: the backgrounds are assigned different colors, and the blood is completely eliminated in the first image but not the second.

There are also some issues with objects blending into the background, which could cause problems as well.

This method will likely work extremely well for detecting duplicate images (since they will produce the same binary image) but leave something to be desired for redraws (which have flaws like the two mentioned above).

Anyway, let's give it a shot.

import PIL
from PIL import Image
import io, itertools, os
from joblib import Parallel, delayed
import multiprocessing
import numpy as np

def hamming(x, y):
    if len(x) == len(y):
        # Choosing the distance between the image or the image's inverse, whichever is closer
        return min(sum(c1 != c2 for c1, c2 in zip(x, y)),
                   sum(c1 == c2 for c1, c2 in zip(x, y)))
    else:
        return -1

def compare_img(image1, image2, dire, resize):
    i1 = Image.open(dire + image1)
    if resize:
        i1 = i1.resize((100, 100))
    i1_b = i1.tobytes()
    i2 = Image.open(dire + image2)
    if resize:
        i2 = i2.resize((100, 100))
    i2_b = i2.tobytes()
    dist = hamming(i1_b, i2_b)
    return dist

# including here a helper function so I can call a function in parallel
def output_format(image1, image2, dire, resize):
    return [image1, image2, compare_img(image1, image2, dire, resize)]

def hamming_a_directory(dire, resize=True):
    num_cores = multiprocessing.cpu_count()
    return Parallel(n_jobs=num_cores)(delayed(output_format)(image1, image2, dire, resize)
                                      for image1, image2 in itertools.combinations(os.listdir(dire), 2))

def quantize(img_arr, dimx=8, dimy=8):
    quantized = []
    for x in img_arr:
        if x >= np.mean(img_arr):
            quantized.append(255)
        else:
            quantized.append(0)
    return quantized
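One detail in there worth spelling out: hamming() takes the minimum of the distance to the other image and the distance to its inverse, since (as we saw with the hands panels) the binary conversion can flip which color ends up as the background. A toy illustration of what that buys us, using short byte strings in place of real image data:

# Toy illustration of the min-with-inverse trick in hamming(), using short
# byte strings instead of real image data.
x = bytes([0, 0, 255, 255])
y = bytes([255, 255, 0, 0])   # the exact inverse of x

# The plain Hamming distance treats these as maximally different (4 mismatches),
# but the comparison against the inverse gives 0, so hamming(x, y) returns 0.
print(min(sum(c1 != c2 for c1, c2 in zip(x, y)),
          sum(c1 == c2 for c1, c2 in zip(x, y))))   # -> 0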

<<hamming-functions>>
full_list = hamming_a_directory('./screens/binary/')
full_list.sort(key=lambda x: int(x[2]))
return full_list[:10]

| 1525_1.gif | 1525_2.gif | 2179 |
| 2079_2.gif | 2338_1.gif | 2680 |
| 1033_1.gif | 1530_1.gif | 2691 |
| 2488_1.gif | 2079_2.gif | 2695 |
| 1870_1.gif | 1033_1.gif | 2917 |
| 1525_2.gif | 1530_1.gif | 3204 |
| 1034_1.gif | 1525_2.gif | 3240 |
| 1870_1.gif | 1530_1.gif | 3242 |
| 1034_1.gif | 1530_1.gif | 3330 |
| 2338_1.gif | 1530_1.gif | 3539 |

A surprisingly solid baseline! Here we can see that the most similar images under this method are 1525_1 and 1525_2 (John and Jade), which are redraws of each other. Likewise, it catches the similarity between 2079_2 and 2338_1 (the two hands), as well as between 2079_2 and 2488_1 (one of the hands + the human gag version).

There are some misses, though – 1530 is considered similar to 1033 despite the two panels being largely unrelated, which I suspect is mostly because both images have solid black backgrounds. Likewise, it misses the similarity between 1033_1 and 1034_1, and doesn't rank 2338_1 and 2488_1 as similar despite favorably comparing both of those panels to 2079_2.

So it's clear we can use this to compare images and find similarities, but let's see if we can't get something slightly better.