Mid-Level:

Mid-level computer vision techniques tie images to other images and to the real world. They often build on low-level processing, either to complete tasks in their own right or to prepare input for high-level techniques.

Panorama Stitching:

One clear example of combining images with other images is panorama stitching. We also know some real-world information, such as how a phone is typically rotated between shots, so we can warp multiple images and blend them into a single wide view.
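As a concrete sketch (not the lecture's own code), OpenCV ships a high-level stitching pipeline that handles the feature matching, warping, and blending internally; the filenames here are placeholders:

```python
import cv2

# Overlapping shots taken while rotating the camera (placeholder paths).
images = [cv2.imread(p) for p in ["left.jpg", "middle.jpg", "right.jpg"]]

# The Stitcher matches features across the images, estimates the warps
# between them, and blends the results into a single wide image.
stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, panorama = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
else:
    print("Stitching failed with status code", status)
```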

Multi-view Stereo:

This is another image-combination technique that draws on our knowledge of the outside world. Because we have two eyes, we see in 3D and perceive depth. By comparing two images taken from slightly different positions, we can assess which parts of the scene shift a lot between views and which shift less, and use that disparity to judge depth.
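Here is a minimal sketch of that idea using OpenCV's block matcher, assuming a rectified left/right pair (the filenames are placeholders):

```python
import cv2

# A rectified stereo pair (placeholder paths), loaded as grayscale.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# For each patch in the left image, find how far it shifted in the right
# image. Large shifts (disparity) mean near; small shifts mean far.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right)

# Scale to 0-255 so the depth map can be viewed as an image.
view = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("disparity.png", view.astype("uint8"))
```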

Structured Light Scan:

Using patterned light emitters and receivers, we can construct three-dimensional models of real-world objects from the curvature in the patterns received. Once again, this combines real-world knowledge with multiple images to get the desired result.
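The core geometry reduces to triangulation. Below is a toy sketch under an assumed setup (all of the numbers are illustrative, not from the lecture): a projected stripe lands on the surface, and how far it appears displaced in the camera image tells us the depth at that point.

```python
import numpy as np

# Toy triangulation under an assumed geometry: the camera sits a known
# baseline b from the projector, with focal length f (in pixels).
# A surface point at depth z shifts the projected stripe sideways by
# roughly f * b / z pixels, so depth can be recovered from the shift.
f, b = 800.0, 0.1                      # assumed focal length and baseline
shift = np.array([40.0, 32.0, 20.0])   # measured stripe displacements (px)
depth = f * b / shift                  # metres; bigger shift = closer point
print(depth)                           # [2.  2.5 4. ]
```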

Range Finding:

As with structured light scans, light emitters can be used alongside cameras to analyze the world. The difference here is that range finding aims to judge the distance between the camera and an object rather than building a 3D model.

This is particularly useful in self-driving cars, for example. Emitters are attached to the car, and distances are judged from the time it takes the light to reflect back to the sensor. Laser light is used, hence the common names LADAR and LiDAR (Laser Detection And Ranging and Light Detection And Ranging, respectively).
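The distance arithmetic itself is simple: the pulse travels out and back, so the one-way range is half the round trip. A minimal sketch:

```python
# Time-of-flight range finding: the pulse travels to the object and back,
# so the one-way distance is (speed of light * round-trip time) / 2.
SPEED_OF_LIGHT = 299_792_458  # metres per second

def range_metres(round_trip_seconds: float) -> float:
    return SPEED_OF_LIGHT * round_trip_seconds / 2

# A laser pulse that returns after 100 nanoseconds:
print(range_metres(100e-9))  # roughly 15 metres
```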

Optical Flow:

In a similar fashion to multi-view stereo, differences between images can be used to compute optical flow. Instead of two images taken from slightly different positions, consecutive frames of a video are compared. By measuring how far each part of the image moves from one frame to the next, we can construct a field of motion vectors, the optical flow. This is extremely useful for object tracking (and therefore image tagging) as objects move between frames.
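One standard dense method (my choice for this sketch, not named in the lecture) is Farnebäck's algorithm, available in OpenCV; the video path is a placeholder:

```python
import cv2

cap = cv2.VideoCapture("clip.mp4")  # placeholder video path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow: a (dx, dy) motion vector for every pixel,
    # estimated from how patches move between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None, pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # flow[y, x] holds that pixel's motion; fast-moving regions stand out.
    prev_gray = gray
```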

Time-lapse:

The final mid-level computer vision technique Joseph Redmon covered in his first lecture was time-lapse creation. This seems like a relatively simple process in which many frames taken over time are combined, but it's actually more complex than I assumed. Inconsistencies and variations such as lighting differences, snowfall on one day, or objects like cars stopping in front of the camera need to be smoothed out to create a fluid time-lapse.
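One common trick for that smoothing, sketched below, is a per-pixel median over a sliding window of frames: a transient object such as a parked car or one snowy day occupies a pixel in only a few frames, so the median ignores it. The helper name is mine, not a library call:

```python
import numpy as np

def smooth_timelapse(frames, window=5):
    """Per-pixel sliding median over a stack of same-size frames.

    Transient outliers (a passing car, one snowy day) rarely dominate
    a pixel's window, so the median suppresses them.
    """
    stack = np.stack(frames)            # shape: (num_frames, H, W, 3)
    smoothed = []
    for i in range(len(frames)):
        lo = max(0, i - window // 2)
        hi = min(len(frames), i + window // 2 + 1)
        smoothed.append(np.median(stack[lo:hi], axis=0).astype(stack.dtype))
    return smoothed
```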

Uses:

Some of these mid-level computer vision techniques combine images into a finished product. Panorama stitching, time-lapse creation, and video stabilization, for example, are used for no other reason than to create their output.

Optical flow, however, is often a preliminary step to assist object tracking or content-aware resizing so important parts of video frames can be detected.

High-Level:

Computer vision techniques that are considered high-level bring semantics into the process. Extracting meaning from images is much more complicated, and it relies heavily on pipelines of low- and mid-level computer vision techniques.

Image Classification:

Grouping images into categories is known as image classification. The CV pipeline is given an image, which it then sorts into a bucket depending on the task. In the case of an emotion detector, the buckets would represent different emotions, and each image is tagged with the emotion the pipeline predicts it shows. A related use is object detection, as shown in the image and discussed below.
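As a minimal sketch, a pretrained ImageNet classifier from torchvision sorts a photo into one of 1,000 buckets (the image path is a placeholder):

```python
import torch
from torchvision import models
from PIL import Image

# A pretrained ImageNet classifier; "photo.jpg" is a placeholder path.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

image = weights.transforms()(Image.open("photo.jpg").convert("RGB"))
with torch.no_grad():
    logits = model(image.unsqueeze(0))  # add a batch dimension

# The predicted bucket is simply the class with the highest score.
print(weights.meta["categories"][logits.argmax().item()])
```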

Object Detection:

Extending image classification, object detection doesn't just return what is in the image but also where it thinks it is! This is an important distinction, and it is very difficult to do on a reasonable timescale.

Detecting that a person is in front of a self-driving car three seconds after they step into its path is, of course, not fast enough. Joseph Redmon is well known in this field, as he developed YOLO (You Only Look Once), an extremely fast object detection algorithm, which you can check out in the video below:
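Alongside the video, here's a minimal sketch of what a detector returns, using the third-party Ultralytics package, which implements a later YOLO descendant (not Redmon's original Darknet code); the image path is a placeholder:

```python
from ultralytics import YOLO  # third-party package: pip install ultralytics

model = YOLO("yolov8n.pt")     # small pretrained model, downloaded on demand
results = model("street.jpg")  # placeholder image path

# Each detection pairs a class label (the "what") with a box (the "where").
for box in results[0].boxes:
    label = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{label}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")
```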

Semantic Segmentation:

Also similar to image classification is semantic segmentation. Building upon low-level segmentation, this is essentially classification at the pixel level. It's clear how optical flow and range finding could be used to help classify the segments in the image below:
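Alongside that image, here's a minimal sketch using a pretrained DeepLabV3 model from torchvision, which assigns a class to every pixel (the image path is a placeholder):

```python
import torch
from torchvision import models
from PIL import Image

# A pretrained semantic segmentation model; "scene.jpg" is a placeholder.
weights = models.segmentation.DeepLabV3_ResNet50_Weights.DEFAULT
model = models.segmentation.deeplabv3_resnet50(weights=weights)
model.eval()

image = weights.transforms()(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    scores = model(image.unsqueeze(0))["out"]  # (1, num_classes, H, W)

# Classification at the pixel level: every pixel takes its top-scoring class.
mask = scores.argmax(dim=1)[0]
print(torch.unique(mask))  # the class ids present in the scene
```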

Instance Segmentation:

Similar to semantic segmentation, instance segmentation classifies pixels. The difference between the two is that instance segmentation can recognize multiples of the same object.

In the example illustrated below, semantic segmentation classifies chairs, a table, a building, and so on, while the instance segmentation example highlights each chair separately. This is useful in self-driving cars, which need to distinguish between multiple vehicles in one image.
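Here's a minimal sketch with a pretrained Mask R-CNN from torchvision; unlike the semantic model above, it returns one mask per detected object, so two chairs come back as two separate instances (the image path is a placeholder):

```python
import torch
from torchvision import models
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A pretrained instance segmentation model; "cafe.jpg" is a placeholder.
weights = models.detection.MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.maskrcnn_resnet50_fpn(weights=weights)
model.eval()

image = to_tensor(Image.open("cafe.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([image])[0]

# Each confident detection is a separate instance with its own pixel mask.
for label, score in zip(pred["labels"], pred["scores"]):
    if score > 0.8:
        print(weights.meta["categories"][int(label)], round(float(score), 2))
```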