In this post, we’ll talk about implementation details and we’ll focus on images. Building a system that can process millions of images daily is one of the most difficult parts of building a duplicate detection system, which is why we cover this part in detail.

In particular, this post talks about:

Why images are important for detecting duplicates and how they can be used;

Perceptual image hashes and how to use them for duplicate detection;

Implementing a system for detecting image duplicates with AWS and Elasticsearch.

Like in the previous post, we talk about a generic approach to duplicate detection, and the system described here doesn’t necessarily 100% reflect what we use at OLX. We do this on purpose, so this material cannot be used for bypassing our moderation system and harming our users. On the other hand, the post gives enough information to build your own duplicate detection system.

A Picture is Worth a Thousand Words

Before we start, let’s quickly go over the main idea from the previous post: the two-step framework for duplicate detection.

The framework consists of two steps: a candidate selection step and a candidate scoring step.

Candidate selection step. At this step, we find candidate duplicates: listings that are likely to be duplicates. For that we use domain knowledge and look at listings published in the same category, from the same city; and listings published by the same author, from the same IP or the same device.

Candidate scoring step. At this step, we use machine learning for scoring the candidates from the previous step to get the actual duplicates. We covered simple features that we can use: the basic features like the difference in price, geographical distance and whether two listings are published from the same IP; and more complex features that involve calculating text similarity between the two titles or the two descriptions.

With these basic features and text similarities, we can build a model that does quite well for the second step.
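As a rough sketch, the pairwise features mentioned above could be computed like this. The listing fields (price, ip, lat, lon, title) are hypothetical names for illustration; a real system would use the haversine formula for geographical distance and more robust text similarity than the stdlib’s SequenceMatcher:

```python
import math
from difflib import SequenceMatcher

def features(a, b):
    """Build a small feature dict for a pair of listings (plain dicts)."""
    return {
        "price_diff": abs(a["price"] - b["price"]),
        "same_ip": int(a["ip"] == b["ip"]),
        # crude geographical distance in degrees; a real system would
        # use the haversine formula and return kilometers
        "geo_dist": math.hypot(a["lat"] - b["lat"], a["lon"] - b["lon"]),
        # character-level similarity between the titles, in [0, 1]
        "title_sim": SequenceMatcher(None, a["title"], b["title"]).ratio(),
    }

a = {"price": 100, "ip": "1.2.3.4", "lat": 52.2, "lon": 21.0,
     "title": "iPhone 12, great condition"}
b = {"price": 95, "ip": "1.2.3.4", "lat": 52.2, "lon": 21.0,
     "title": "iPhone 12 in great condition"}
print(features(a, b))
```

A feature dict like this can be fed directly into a tabular model such as logistic regression or gradient boosting.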

However, we didn’t talk about one particular type of content: images. In online classifieds, this is one of the most important aspects of a listing, which is why we need to use them in the duplicate detection model. We’ll see how to do it in the next section.

Image Hashes

Images play an important role in online classifieds, so we need to extract the information we have there in order to have a good model. One of the possible ways of doing it is by using image hashes: we extract hashes from images and use them for determining how similar two images are.

There are two types of hashes for images: cryptographic and perceptual.

The cryptographic hashes, like MD5, are general-purpose hashes: they work for all files, not just images. They are typically used for detecting whether a file has been tampered with, so hashes like MD5 are used as a checksum. Even the smallest modification in the file, for example a one-bit change, results in a completely different hash.

For example, we can take an image of a car and compute its MD5 hash:

If we modify it slightly by adding a small white rectangle at the corner, we’ll have a completely different MD5 hash:

Using MD5 hashes for duplicate detection is possible, and quite useful, but as we see, it only tells us whether two listings have byte-for-byte identical images, and nothing more.
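To illustrate, here is how a single-bit change flips the MD5 checksum completely. The byte string below stands in for real image bytes, which you would read with open(path, "rb").read():

```python
import hashlib

# Simulated image bytes; in practice: open(path, "rb").read()
image_bytes = bytes(range(256)) * 4

original_hash = hashlib.md5(image_bytes).hexdigest()

# Flip a single bit: the cryptographic hash changes completely.
modified = bytearray(image_bytes)
modified[0] ^= 1
modified_hash = hashlib.md5(bytes(modified)).hexdigest()

print(original_hash)
print(modified_hash)
```

The two hex digests share no visible similarity, which is exactly the avalanche property cryptographic hashes are designed for.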

Perceptual hashes are different: they are designed specifically for images. When an image is modified slightly, the hash shouldn’t change significantly, and often it won’t change at all. There are a few such hashes: ahash, dhash, phash, and whash. You can read more about these hashes here.

If we take the same images as previously and calculate the dhashes for them, they’ll be the same for both images:

In this particular case, dhash isn’t sensitive to adding a small white rectangle in the corner, so the resulting hash doesn’t change.

There are other modifications that shouldn’t affect the hash:

Resizing the image;

Saving in a different format, e.g. JPEG with a different compression level or WEBP;

Small modifications of the image, e.g. adding invisible noise, or a small watermark.

Unfortunately, there are some modifications that break these simple hashes. For example, cropping a part of the image, rotating it, or mirroring it does affect the hash. For catching such modifications, other methods, like neural networks, should be used.

Perceptual Hashes: dhash

The basic perceptual hashes are quite simple to implement. Let’s take a look at how to compute dhash.

Suppose we want to compute a 64-bit hash. For that, we first need to resize the image to 8x9 pixels (8 rows by 9 columns). This effectively divides the image into a grid of 8x9 blocks and calculates the average pixel value in each block:

Next, we take each column of this array and subtract it from the column next to it. That is, we calculate the difference between the 2nd column and the 1st, the 3rd and 2nd, 4th and 3rd and so on. After doing that, we have an 8x8 array with differences:

Now we look only at the sign of the difference: for each value, if it’s positive, we replace it with TRUE, otherwise with FALSE. This way we get an 8x8 array with boolean values:

Then we convert it into ones and zeros and treat each row of the array as an 8-bit integer. The first row is “10010100”, which is 148 when converted from binary to decimal, or “94” in hex.

We repeat it for each row, and this way, we get eight 8-bit numbers in hex. Finally, we put all the hexes together to get the final hash: in this case, it’s “94088af86c038327”.
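The steps above can be sketched in pure Python. The function below assumes the resize step has already produced an 8x9 grid of block averages; the values in the first row are made up so that they reproduce the “10010100” → “94” example:

```python
def dhash_from_grid(grid):
    """Compute a 64-bit dhash from an 8x9 grid of averaged pixel values.

    `grid` is 8 rows of 9 values each, i.e. the output of the resize
    step (which you'd do with an image library such as Pillow).
    """
    hex_parts = []
    for row in grid:
        bits = 0
        for left, right in zip(row, row[1:]):
            # 1 if the difference (next column minus current) is positive
            bits = (bits << 1) | (1 if right > left else 0)
        # each row becomes one 8-bit integer, written as two hex digits
        hex_parts.append(format(bits, "02x"))
    return "".join(hex_parts)

# First row engineered to produce the bit pattern 10010100 -> 0x94;
# the remaining rows are filled with zeros for brevity.
grid = [
    [10, 20, 15, 12, 18, 14, 19, 16, 13],
] + [[0] * 9 for _ in range(7)]

print(dhash_from_grid(grid))  # starts with "94"
```

Note that the 9 columns yield 8 differences per row, which is why the resize target has one extra column.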

It’s a heuristic, but it’s quite simple and efficient to compute. Other hashes, like phash, work similarly, but phash first applies the discrete cosine transform (a Fourier-related transform) to the image, and whash uses wavelets.

Hash-Based Similarity

When we have the hashes, we can use them for comparing the images of two listings. To check how different two hashes are, we count the positions where the bits of the hashes don’t match. This is called the “Hamming distance”: the number of differing bits in two binary arrays. We can compute the Hamming distances between all the images, and this gives us a set of good features for a machine learning model.

For example, suppose we have two hashes, “94088af86c038327” and “94088af86c038328”. These hashes are different only in the last character: it’s “7” for the first hash and “8” for the second.

The Hamming distance between these hashes is 4 bits: in binary, “7” is 0111 and “8” is 1000, so when we compare the binary representations of the hashes, exactly 4 bits are different:
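A minimal sketch of the Hamming distance computation over hex-encoded hashes:

```python
def hamming_distance(hash1, hash2):
    """Number of differing bits between two hex-encoded hashes."""
    # XOR leaves a 1 exactly in the positions where the bits differ,
    # then we count the set bits in the result.
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")

print(hamming_distance("94088af86c038327", "94088af86c038328"))  # 4
```

For 64-bit hashes this is fast enough to score millions of candidate pairs per second on a single core.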