What’s a Descriptor, Anyway?

The reality is that extracting edges, corners, or blobs from an image isn’t, by itself, as helpful as we would like. We just have points within an image — how can we start to understand the image more deeply? How do we go from a bunch of points in an image to understanding that there’s an airplane or a person in it? Feature Descriptors are algorithms that look at additional information around a Feature Point to better understand what it represents. This additional information is a calculated summary of the pixels immediately surrounding each feature point. With it, we can more reliably identify that specific feature point across multiple frames of video, in stereo image pairs, in panoramic image sequences, or in any of the many other applications of Computer Vision.

Since this is an area I personally found very confusing when I started learning about CV, let’s recap the difference between a Feature Point and a Feature Descriptor. A Feature Point is a small area in an image (sometimes as small as a single pixel) which has some measurable property of the image or object. A Feature Descriptor is any additional data that helps give a Feature Point greater definition.

The additional data in a feature descriptor makes our features more robust against geometric and other variances. We’ve discussed Scale Invariance at length. But what about all of the other ways that a feature could change from image to image?

Geometric Variance comes in many flavors. Feature Descriptors allow us to better deal with other types of Geometric Variance. (Original Image Source: Carlos Spitzer)

We’ve already covered some types of geometric invariance when discussing Feature Points. For instance, the Harris Corner detector is Rotation Invariant and Difference of Gaussian is Scale Invariant. (However, it’s worth mentioning that invariance in Feature Detectors is a bit different from invariance in Feature Descriptors.)

Ideally, we can come up with an approach to creating Feature Descriptors which works in all of the above cases.

Simple Descriptor

A simple example of a Feature Descriptor would be to look at the pixels within a box surrounding the Feature. For example, we could look at a square of 15x15 pixels centered around our feature. We could then try to compare that 15x15 box of pixels against a similar box of pixels in other images. Think of this a lot like the template/image matching technique we used previously when we were detecting the moon. Instead of trying to match one entire image, we can look for various smaller chunks of pixels centered around features and match those instead.
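To make this concrete, here’s a minimal sketch of such a patch descriptor in Python, assuming grayscale images stored as 2D NumPy arrays. The helper names (extract_patch, patch_distance, best_match) are my own, not from any library:

```python
import numpy as np

def extract_patch(image, x, y, size=15):
    """Return the size x size block of pixels centered on (x, y).

    Assumes a grayscale image as a 2D NumPy array, and that the
    feature point sits far enough from the border for a full patch.
    """
    half = size // 2
    return image[y - half:y + half + 1, x - half:x + half + 1]

def patch_distance(a, b):
    """Sum of squared differences between two patches (lower = more similar)."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return np.sum(diff ** 2)

def best_match(image_a, feat_a, image_b, candidates):
    """Match one feature from image_a against candidate features in image_b
    by picking the candidate whose surrounding patch looks most similar."""
    patch_a = extract_patch(image_a, *feat_a)
    return min(candidates,
               key=lambda pt: patch_distance(patch_a, extract_patch(image_b, *pt)))
```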

A very simple approach to creating Feature Descriptors would be to include pixels immediately surrounding a Feature Point (Image Source: Pixabay)

Much like when we were working with templates, this simplistic descriptor approach has some issues. Mainly, if the feature changes in scale or orientation between images, our chances of a successful match are going to be low. For this reason, we’re going to be discussing SIFT, one of the most popular methods of creating feature descriptors.

SIFT

The SIFT detector is a scale-invariant feature descriptor algorithm which is built on the Difference of Gaussian detector. Scale Invariance is right in the name — SIFT stands for Scale Invariant Feature Transform. SIFT is not only scale-invariant; it’s also orientation-invariant. Modern Feature Descriptor algorithms such as SIFT consider scale, orientation, and other invariances.

SIDENOTE: The SIFT detector is actually patented by the University of British Columbia. Using the SIFT detector in a commercial application requires a license. The patent is expected to expire in March of 2020. SIFT is an incredibly influential algorithm in CV, and lots of other algorithms have concepts in common with it. Alternative algorithms that are not patent encumbered include ORB, BRISK, and FREAK, among others — some of which are available in OpenCV.
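For instance, here’s how one might pull ORB keypoints and descriptors out of a stock OpenCV install (the filename is a placeholder):

```python
import cv2

# ORB ships with the base OpenCV package; no patent license required.
image = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical filename
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)  # up to 500 keypoints, 32 bytes each
```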

Making the descriptor algorithm orientation or rotation invariant is a bit trickier. For every feature point we define, we need to extrapolate an orientation so that we can properly match that same descriptor in another image where the object has been rotated. Take a look at the following example. We have a common object, such as the Ace of Spades, in different orientations. In order to accurately match the features, we need to know which direction a feature “faces.” I like to think of it as “normalizing” the direction of a feature point so we can properly match it in any orientation.

Orientation Invariance is important if we are looking to robustly match feature points across various images. (Image Source: Will Porada and Aditya Chinchure)

We’ve already discussed that it’s important to take into account the pixel data immediately surrounding our feature point. A common method of determining the orientation of a descriptor is to use the gradient information surrounding the feature point to determine its dominant orientation.

The concept of image gradients is not something we’ve discussed yet, but it’s another concept that is really important in computer vision. For our purposes, a Gradient is the largest change in intensity at a given pixel. To calculate the gradient of an individual pixel, we look at all of its neighbors and determine which neighboring pixel is the most different from it. Let’s check out an example. Given the following image, we are trying to determine the gradient direction and magnitude for our feature point, which is highlighted below and has a value of 86.

The neighbors of our feature point (with a pixel value of 86) are 122, 42, 159, 129, 165, 200, 254 and 179. We then subtract our pixel value from each of its eight neighbors to determine which neighboring pixel has the greatest change in intensity.

|122 - 86| = 36
|42 - 86| = 44
|159 - 86| = 73
|129 - 86| = 43
|165 - 86| = 79
|200 - 86| = 114
|254 - 86| = 168
|179 - 86| = 93

Looking at the results, we can see that the largest change in intensity is between our pixel with a value of 86 and the pixel immediately below it with a value of 254. The orientation of the gradient of our feature point is in the downward direction and the magnitude is 168. (Since we have a direction and magnitude, we can represent a gradient using a 2D vector.)
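Here’s the same calculation as a quick Python sketch. Note that, apart from the 254 sitting directly below the feature point, the exact position of each neighbor is my assumption for illustration:

```python
# The eight neighbors of our feature point (value 86) from the example above.
center = 86
neighbors = {
    "up-left": 122,   "up": 42,    "up-right": 159,
    "left": 129,                   "right": 165,
    "down-left": 200, "down": 254, "down-right": 179,
}

# The gradient points toward the neighbor with the greatest change in intensity.
direction, value = max(neighbors.items(), key=lambda kv: abs(kv[1] - center))
magnitude = abs(value - center)
print(direction, magnitude)  # -> down 168
```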

We repeat this process for each of the pixels surrounding our feature point and we come up with something that might look like the following:

Once we have determined all of the gradients surrounding a feature point, we can figure out the dominant orientation — that is, the direction that comes up most often among the pixels surrounding a given feature point. This method is referred to as a histogram of gradients.

Looking at the above data, we can see that the dominant direction for our feature point is down and to the right.
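Here’s a minimal sketch of that idea, assuming we’ve already computed an angle and magnitude for each surrounding pixel (the values below are made up). Note that real SIFT uses a 36-bin histogram for orientation assignment; 8 bins here just keeps the example small:

```python
import numpy as np

def dominant_orientation(angles, magnitudes, bins=8):
    """Histogram of gradients: bin each gradient angle (in degrees),
    weighted by magnitude, and return the center of the fullest bin."""
    hist, edges = np.histogram(
        angles, bins=bins, range=(0, 360), weights=magnitudes)
    peak = np.argmax(hist)
    return (edges[peak] + edges[peak + 1]) / 2

# Hypothetical gradients from the pixels around a feature point:
angles = np.array([300, 310, 315, 320, 315, 90, 180, 315])
mags = np.array([40, 80, 168, 90, 60, 20, 30, 75])
print(dominant_orientation(angles, mags))  # -> 337.5 (the 315-360 bin dominates)
```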

SIFT goes one step further than just looking at the dominant direction of the feature point. SIFT looks at a larger patch of pixels in order to squeeze more information into the descriptor. The patch is a 16x16 square immediately around the feature point, with a total of 256 pixels. Storing all 256 pixels per feature point would be an unwieldy amount of data, so SIFT summarizes those pixels using a handful of gradient histograms.

For instance, consider this 8x8 patch of pixels below:

The SIFT paper uses a 16x16 patch of pixels; the above 8x8 patch is reduced in size for illustration purposes

Instead of saving the gradient of each of those pixels, each quadrant can be summarized with a histogram:

The SIFT paper uses a 16x16 patch of pixels; the above 8x8 patch is reduced in size for illustration purposes

The Feature Descriptor will contain summarized orientation and magnitude information in the form of histograms for smaller blocks of pixels within the patch. The SIFT algorithm further weighs the importance of each gradient using a Gaussian distribution whose sigma is derived from the scale at which the feature point was detected. This normalizes for scale while putting a larger emphasis on pixels that are closer to the feature point and less emphasis on pixels that are further away.
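Below is a rough sketch of that descriptor layout in Python: one 8-bin histogram per 4x4 cell of a 16x16 patch, Gaussian-weighted, giving a 4 × 4 × 8 = 128-value descriptor. This is a simplification, not Lowe’s exact implementation (real SIFT also rotates the patch to the dominant orientation, interpolates between bins, and derives the sigma from the feature’s scale rather than fixing it):

```python
import numpy as np

def sift_like_descriptor(patch, cell=4, bins=8, sigma=8.0):
    """Summarize a 16x16 grayscale patch into a 128-value descriptor."""
    patch = patch.astype(np.float64)

    # Gradient magnitude and angle at each pixel (simple finite differences).
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360

    # Gaussian weighting: emphasize gradients near the patch center.
    n = patch.shape[0]
    ys, xs = np.mgrid[0:n, 0:n]
    c = (n - 1) / 2
    mag = mag * np.exp(-((xs - c) ** 2 + (ys - c) ** 2) / (2 * sigma ** 2))

    # One 8-bin orientation histogram per 4x4 cell: 16 cells * 8 bins = 128.
    desc = []
    for by in range(0, n, cell):
        for bx in range(0, n, cell):
            hist, _ = np.histogram(
                ang[by:by + cell, bx:bx + cell], bins=bins, range=(0, 360),
                weights=mag[by:by + cell, bx:bx + cell])
            desc.extend(hist)

    desc = np.array(desc)
    return desc / (np.linalg.norm(desc) + 1e-12)  # normalize the magnitudes
```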

The result of the SIFT detector. The line through the middle of each circle represents the orientation of the feature descriptor, while the size of each circle depends on the scale at which the feature was detected. (Original Image Source: Carlos Spitzer)

The last step in building the SIFT descriptor is to normalize the magnitude data in the histograms. This makes the feature descriptor invariant to some types of (linear) brightness/contrast changes.
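Here’s a toy check of why that normalization helps, using made-up histogram values: a contrast (gain) change multiplies every pixel, and hence every gradient magnitude, by a constant, so L2-normalizing the histogram vector cancels it out. (An additive brightness shift drops out even earlier, since gradients are differences between pixels.)

```python
import numpy as np

hist = np.array([393.0, 120.0, 30.0, 20.0])    # hypothetical histogram magnitudes
brighter = 1.8 * hist                          # same scene, higher contrast
norm = lambda v: v / np.linalg.norm(v)
print(np.allclose(norm(hist), norm(brighter)))  # -> True
```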

At this point, the data for each feature point is the same size, because every feature descriptor is summarized by the same set of normalized histograms of gradients. Once they are “normalized”, we can directly compare descriptors from one image to another.
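In practice, you rarely have to implement any of this by hand. Here’s a sketch of end-to-end matching with OpenCV, which exposes SIFT as cv2.SIFT_create() in recent builds (4.4+; older builds need the opencv-contrib package and cv2.xfeatures2d). The filenames and the 0.75 ratio threshold are illustrative:

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical filenames
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Every descriptor is the same fixed-size vector, so we can compare them
# directly with Euclidean distance. Lowe's ratio test keeps a match only
# when the best candidate is clearly better than the runner-up.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} confident matches")
```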

Beyond SIFT

While SIFT is one of the most influential Feature Descriptor algorithms, it’s certainly not the only one. Other descriptors include ORB, BRISK, and FREAK, among many others. Computer Vision companies invest heavily in creating their own proprietary descriptor algorithms to get an edge over their competition.

This is where we start seeing some really amazing things in Computer Vision. Feature Descriptors enable us to make lots of reasonable comparisons between interesting points in an image and gain a greater understanding of what is happening in any particular image relative to other images.

Given a predetermined object, we can now determine whether that object is located within an image. We can also do fun things like image stitching — making a large panorama photo out of lots of smaller, overlapping photos. Feature Descriptors also play a critical role in working with stereo camera systems. Feature Descriptors open up lots of doors when it comes to Computer Vision.

TLDR

Feature Descriptors are like Feature Points on steroids. A Feature Descriptor considers the data immediately around a Feature Point in order to improve the robustness of matching descriptors across different images. The SIFT feature descriptor is an incredibly influential computer vision algorithm which uses various techniques to extract robust feature descriptors from images. SIFT starts with feature points detected through Difference of Gaussian, which makes the features scale invariant. In order to make the feature descriptors orientation invariant, SIFT then considers a large patch of pixels surrounding the feature point and orients the feature based on the dominant orientation of the surrounding pixel gradients. The data around the feature point is then summarized in various histograms of gradients and included as part of the feature descriptor. Lastly, the histogram magnitudes are normalized to make the descriptor invariant to linear changes in brightness or contrast.

Sources and More Info