Vision Image Similarity

Image similarity and image classification are not the same task, since two images that share a class label are not necessarily similar. An image classification model typically returns a set of labels as its output, whereas an image similarity request computes how similar two images are.

Classification falls under supervised learning, since the model is trained on labeled examples that map inputs to target outputs. Image similarity, on the other hand, can be treated as unsupervised: the input images carry no labels, and the comparison relies on feature extraction to find relevant similarities across images.

There are many techniques for computing image similarity. Comparing raw pixel values is the simplest and the least effective: the same image under different lighting or shading has different pixel values, so it would be judged different from the source image despite being very similar in content. The following illustration from the WWDC videos depicts this:
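The same effect can be reproduced with a few numbers. The sketch below is only an illustration: pixelRMSE is a hypothetical helper, and the tiny 2x2 grayscale "images" are made-up values. It compares an image with a brightened copy of itself and with an unrelated image.

```swift
import Foundation

// Hypothetical helper: root-mean-square difference between two equally
// sized grayscale images stored as flat arrays of 0...255 values.
func pixelRMSE(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "Images must have the same size")
    let sumOfSquares = zip(a, b).reduce(0.0) { sum, pair in
        let diff = pair.0 - pair.1
        return sum + diff * diff
    }
    return (sumOfSquares / Double(a.count)).squareRoot()
}

// Made-up 2x2 image, and the same content under brighter lighting.
let original: [Double] = [10, 50, 90, 130]
let brighter = original.map { min($0 + 60, 255) }

// A made-up image with different content.
let other: [Double] = [40, 20, 120, 100]

print(pixelRMSE(original, brighter)) // 60.0
print(pixelRMSE(original, other))    // 30.0
```

With these values, the brightened copy ends up farther from the original than the unrelated image does, which is exactly the failure mode of raw pixel comparison.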

Vision Feature Prints

Luckily, the Vision framework includes a classification network that's trained to produce feature descriptors of an image in its uppermost layer. This saves us from building our own feature-extraction models, since Vision already exposes these feature prints through its API. A feature print is a vector descriptor of the image.

The following code shows how to obtain the feature print from a Vision request and compute the Euclidean distance between two images. The distance indicates how close or far apart the images are in feature space: the smaller the distance, the more similar the images.

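import UIKit
import Vision

// `image` is the UIImage being compared. `sourceResult` is the
// VNFeaturePrintObservation already computed for the source image.
let requestHandler = VNImageRequestHandler(cgImage: image.cgImage!, options: [:])
let request = VNGenerateImageFeaturePrintRequest()
do {
    try requestHandler.perform([request])
    let result = request.results?.first as? VNFeaturePrintObservation

    var distance = Float(0)
    // Writes the distance between the two feature prints into `distance`.
    try result?.computeDistance(&distance, to: sourceResult)
} catch {
    print("Feature print request failed: \(error)")
}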

In the code above, VNGenerateImageFeaturePrintRequest returns a VNFeaturePrintObservation, which is used to compute a floating-point distance from the source image's feature print.
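To make the idea of a distance between vector descriptors concrete, here is a minimal sketch of the plain Euclidean formula on hand-written vectors. It is not Vision's implementation: euclideanDistance is a hypothetical helper, and the three-element vectors are made up (real feature prints are much longer).

```swift
import Foundation

// Plain Euclidean distance between two equal-length vectors.
func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Vectors must have the same length")
    var sum: Float = 0
    for (x, y) in zip(a, b) {
        let diff = x - y
        sum += diff * diff
    }
    return sum.squareRoot()
}

// Made-up "feature vectors", purely for illustration.
let source: [Float] = [1, 2, 2]
let similar: [Float] = [1, 2, 3]
let different: [Float] = [4, 6, 2]

print(euclideanDistance(source, similar))   // 1.0
print(euclideanDistance(source, different)) // 5.0
```

The vector that differs only slightly from the source yields the smaller distance, which is the property the sorting in the app relies on.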

Moving on, in the next section, we’ll be developing an iOS application that uses the feature prints of images to sort them by similarity.

Implementation

To start off, we'll set up a SwiftUI view that holds a source image (the reference image) and a List of images. The idea is to run VNGenerateImageFeaturePrintRequest over each image and compute its distance to the feature print of the source image. We'll then sort the SwiftUI List so that the most similar images appear at the top.
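As a rough sketch of that idea, the two helpers below wrap the request in a reusable function and apply it to a batch of images. The names featurePrint(for:) and distances(from:to:) are hypothetical and not part of Vision or of this project; only the Vision calls already shown above are used.

```swift
// Vision is only available on Apple platforms.
#if canImport(Vision)
import CoreGraphics
import Vision

// Hypothetical helper: runs a feature print request on one image.
func featurePrint(for cgImage: CGImage) throws -> VNFeaturePrintObservation? {
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    let request = VNGenerateImageFeaturePrintRequest()
    try handler.perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

// Hypothetical helper: distance from the source image to each candidate.
// An entry is nil when no feature print could be produced for that image.
func distances(from source: CGImage, to candidates: [CGImage]) throws -> [Float?] {
    guard let sourcePrint = try featurePrint(for: source) else {
        return candidates.map { _ in nil }
    }
    return try candidates.map { (candidate: CGImage) -> Float? in
        guard let candidatePrint = try featurePrint(for: candidate) else {
            return nil
        }
        var distance = Float(0)
        try candidatePrint.computeDistance(&distance, to: sourcePrint)
        return distance
    }
}
#endif
```

Computing the source feature print once and reusing it for every candidate avoids repeating the same request for each row.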

Here’s a glimpse of the initial state of our UI:

So we’ve kept an image of a car as the source image, and a few random images (including the ones of the same car model in different poses) in the SwiftUI List. We’ll soon see how accurately Vision’s image similarity works when computing the feature prints of the images. But first, let’s set up our List and its model.

Creating a Model for the SwiftUI List

Our SwiftUI List will hold the image and the computed distance against the source image in a struct as shown below:

struct ModelData: Identifiable {
    public let id: Int
    public var imageName: String
    public var distance: String = "NA"
}

Conforming to the Identifiable protocol gives each element a unique identifier, which the List needs to tell its rows apart.
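Because distance is stored as a String, sorting the rows directly would compare text rather than numbers (for example, "21.5" would sort before "4.0"). The sketch below shows one possible way to order rows numerically. The helper sortedBySimilarity, the image names, and the distance values are all hypothetical and only serve as an illustration.

```swift
import Foundation

// Mirrors the model above so the sketch is self-contained.
struct ModelData: Identifiable {
    let id: Int
    var imageName: String
    var distance: String = "NA"
}

// Hypothetical helper: orders rows by ascending numeric distance and
// places rows whose distance is not a number (such as "NA") last.
func sortedBySimilarity(_ items: [ModelData]) -> [ModelData] {
    items.sorted { lhs, rhs in
        let left = Float(lhs.distance) ?? .infinity
        let right = Float(rhs.distance) ?? .infinity
        return left < right
    }
}

// Made-up image names and distances, purely for illustration.
let rows = [
    ModelData(id: 0, imageName: "bike", distance: "21.5"),
    ModelData(id: 1, imageName: "car_side", distance: "9.25"),
    ModelData(id: 2, imageName: "tree"),
    ModelData(id: 3, imageName: "car_front", distance: "4.0"),
]

let sortedNames = sortedBySimilarity(rows).map { $0.imageName }
print(sortedNames) // ["car_front", "car_side", "bike", "tree"]
```

Rows that have not been processed yet keep the "NA" placeholder and fall to the bottom of the list.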

Building Our SwiftUI View

Next, we’ll set up the SwiftUI view to hold the modelData and the source image as shown below: