With the advent of Convolutional Neural Networks (CNNs), we have made impressive progress in computer vision tasks like object detection, image segmentation, classification, etc. However, algorithms that truly understand images would also reason about the current scene from what happened in the past, predict what might happen next, and understand how the objects relate to each other. To develop algorithms with these capabilities, we need to go beyond deep, feed-forward CNNs. To understand the dynamics and interactions between objects in visual scenes, we need to model sequential and relational information that links objects in time and space. In deep learning, this can be achieved by recurrent neural networks and graph-based methods. However, before I introduce and explain these advanced methods, it is very helpful to first understand the evolution of state-of-the-art object detectors and the limitations that needed to be solved for further progress.

In this blog post, I will explore important work in deep learning for object detection. I will introduce how those methods evolved over time and compare their differences and similarities. By showing the innovations and limitations of each method, and how later methods solved earlier problems, you will see, quite interestingly, that progress often comes from a small but elegant adjustment.

Two-stage Methods: An Evolution of R-CNN Series

(1) R-CNN (Ross Girshick, et al., 2014): an initial step towards CNN-based object detection

R-CNN combines region proposals with CNN features and shows very promising results on object detection tasks. Region proposals are essentially candidate boxes with objectness scores, which indicate how likely it is that there is an object inside the bounding/candidate box.

R-CNN core idea. Image source: Ross Girshick, et al., 2014

R-CNN works in a very simple way. It first uses selective search (a bottom-up method, initialized by oversegmentation, that iteratively groups adjacent small segments into a hierarchy of successively larger regions based on their similarities) to generate ~2000 category-independent region proposals. Since most CNNs only work for fixed-size inputs, R-CNN resizes each region to the same size. Those warped images are passed through the CNN (in this case, a modified AlexNet), which contains 5 convolutional layers and 2 fully connected layers. Those features are finally fed into a classifier, which calculates a probability for each class for each region. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010 and 62% on PASCAL VOC 2012.
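
The pipeline above can be summarized in a few lines of toy numpy code. Note that `warp`, `cnn`, and `classifier` below are illustrative stand-ins (a nearest-neighbor resize, an average-pool "feature extractor", and a linear scorer), not the paper's actual components:

```python
import numpy as np

def warp(region, size=(227, 227)):
    """Naive nearest-neighbor resize standing in for R-CNN's warping step."""
    h, w = region.shape[:2]
    rows = (np.arange(size[0]) * h // size[0]).clip(0, h - 1)
    cols = (np.arange(size[1]) * w // size[1]).clip(0, w - 1)
    return region[rows][:, cols]

def rcnn_detect(image, proposals, cnn, classifier):
    """R-CNN: one CNN forward pass per warped proposal, then classify."""
    scores = []
    for (x, y, w, h) in proposals:
        region = image[y:y + h, x:x + w]
        features = cnn(warp(region))         # one forward pass per region
        scores.append(classifier(features))  # e.g. per-class SVM scores
    return np.array(scores)

# Toy stand-ins: an average-pool "CNN" and a linear "classifier"
image = np.random.rand(256, 256)
proposals = [(10, 20, 60, 40), (100, 80, 120, 90)]
cnn = lambda x: x.mean(axis=0)                # (227,) feature vector
classifier = lambda f: f @ np.ones((227, 3))  # 3 class scores
print(rcnn_detect(image, proposals, cnn, classifier).shape)  # (2, 3)
```

The key inefficiency is visible in the loop: every one of the ~2000 proposals triggers its own forward pass.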

(2) SPPnet (Kaiming He, et al., 2015): the first step towards feeding varying-size bounding boxes into CNNs (unlike R-CNN, where each bounding box is scaled/warped)

Let’s think about R-CNN again. In order to use fully-connected (fc) layers in CNNs, we need to make sure the inputs have the same size. However, there are ~2000 region proposals per image, all with different sizes. R-CNN warps varying-size regions into regions of the same size. This means that a small region of, say, 1000 pixels may have to be rescaled to one of 5000 pixels. When we feed it into the CNN to compute feature maps, those 5000-1000=4000 extra pixels cost additional computation, which leads to slow detection: ~13s/image on a GPU or ~53s/image on a CPU. In fact, this is a common technical issue in CNNs: they require a fixed input image size due to the fully-connected layers.

SPPnet cleverly addresses this limitation by adding spatial pyramid pooling (SPP) to CNNs, because SPP can output fixed-length representations for variable-size inputs. SPP was originally called spatial pyramid matching, one of the most successful methods for feature extraction and matching in computer vision. The idea was first proposed by Svetlana Lazebnik et al. in 2006, as an extension of the Bag-of-Words (BoW) model.

Spatial Pyramid Pooling Layer. Image source: Kaiming He, et al., 2015

SPP works by partitioning the image into increasingly fine sub-regions and computing local features inside each sub-region. Let’s take a closer look at how to apply SPPnet to object detection. Like R-CNN, it first generates ~2000 region proposals per image, but instead of warping them to the same size and generating feature maps for each region proposal, SPPnet only generates feature maps from the entire image. SPPnet then “projects” the candidate boxes onto the final feature maps of the convolutional layers of the entire image; in this way, only a single pass through the convolutional layers is needed. However, after we crop the feature maps for each candidate box, we need to make sure they have the same size before feeding them into the fully-connected layers. That’s where the SPP layer comes in.

Let’s say we have three regions of different sizes (as shown above). If we take one, divide it into 4 equal parts, and keep only the maximum value from each part (max pooling), we end up with 4 numbers, regardless of the size of the region. If we use SPP to divide a bounding box into a fixed number of parts, for example always dividing each region into 36 parts, we call this procedure Region of Interest Pooling with a 6×6 split. If we use multiple splits, for example repeating this procedure with a 3×3 split and a 6×6 split, we call this a 2-level spatial pyramid with 3×3 and 6×6 pooling. After max pooling, we collect the maxima into a vector of size 3×3+6×6 = 45.

SPPnet defines a 4-level spatial pyramid, with sub-regions of sizes 1×1, 2×2, 3×3 and 6×6. If we follow the procedure described above, we end up with a vector of 50 values (1×1+2×2+3×3+6×6=50) for each convolutional channel. There are 256 channels, so in total we get a feature representation with a fixed size of 256×50, regardless of region size. With this step, we can feed those fixed-size features through the fully-connected layers and into a classifier, for example a binary linear SVM. In this way, we run the CNN only once and share the features across region proposals. Compared to R-CNN, SPPnet is 24~102x faster at test time.
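
Here is a minimal numpy sketch of the SPP layer described above. Computing the bin boundaries with `np.linspace` is one reasonable way to get "nearly equal parts"; the real implementation may divide bins slightly differently:

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 3, 6)):
    """Spatial pyramid pooling: max-pool a C x H x W map at several grid
    sizes. Returns a fixed-length vector of C * sum(n*n for n in levels)
    numbers, regardless of H and W."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Split H and W into n (nearly) equal parts and max-pool each bin.
        rows = np.linspace(0, h, n + 1).astype(int)
        cols = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = feature_map[:, rows[i]:rows[i+1], cols[j]:cols[j+1]]
                pooled.append(bin_.max(axis=(1, 2)))  # one value per channel
    return np.concatenate(pooled)

# Two cropped regions of different sizes yield the same-length vector,
# 256 channels x (1+4+9+36)=50 bins = 12800 values:
a = spp(np.random.rand(256, 13, 13))
b = spp(np.random.rand(256, 9, 21))
print(a.shape, b.shape)  # both (12800,)
```

This is exactly why varying-size proposals can share one set of convolutional features: only the pooling grid adapts to the region size.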

(3) Fast R-CNN (Ross Girshick, 2015): a fast version of R-CNN

Now, we have the SPP layer to generate fixed-length representations and improve the efficiency of R-CNN, but both SPPnet and R-CNN still have to train three modules separately: a CNN module (to get features), a classification module (to obtain category scores), and a regression module (to tighten the bounding boxes). This makes training very inefficient.

Fast R-CNN architecture. Image source: Ross Girshick, 2015

The main contribution of Fast R-CNN is that it enables end-to-end training via hierarchical sampling and a multi-task loss. During training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically: first by sampling N images and then by sampling R/N proposals from each image, where N is the number of images per mini-batch and R is the total number of region proposals in the mini-batch. In this way, region proposals from the same image can share computation and memory in the forward and backward passes. The multi-task loss essentially combines the classification loss and the bounding-box regression loss, so that a softmax classifier and the bounding-box regressors can be jointly optimized. There are more details in Ross Girshick, 2015.
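
The multi-task loss can be sketched as follows. This is a toy numpy version for a single RoI: the smooth-L1 regression term follows the paper, but `multitask_loss` and its inputs are simplified illustrations, not the released implementation:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 for box regression: quadratic near 0, linear far away."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x**2, ax - 0.5)

def multitask_loss(cls_probs, cls_label, box_pred, box_target, lam=1.0):
    """Fast R-CNN loss for one RoI: L = L_cls + lam * [u >= 1] * L_loc.
    Background RoIs (label 0) contribute no box-regression loss."""
    l_cls = -np.log(cls_probs[cls_label])           # softmax log-loss
    l_loc = smooth_l1(box_pred - box_target).sum()  # over (tx, ty, tw, th)
    return l_cls + lam * (cls_label >= 1) * l_loc

probs = np.array([0.1, 0.7, 0.2])  # background + 2 object classes
print(multitask_loss(probs, 1, np.zeros(4), np.array([0.2, 0.1, 0.0, 0.0])))
print(multitask_loss(probs, 0, np.zeros(4), np.zeros(4)))  # cls term only
```

Because both terms are differentiable functions of the network outputs, one backward pass trains the classifier and the regressors jointly, which is what removes the separate training stages of R-CNN and SPPnet.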

Fast R-CNN uses Region of Interest (RoI) Pooling, which, as described above, is spatial pyramid pooling with a 1-level pyramid. Fast R-CNN improves R-CNN’s mAP on PASCAL VOC 2012 from 62% to 66%, with 9x faster training and 213x faster testing. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster and tests 10x faster.

2×2 RoI pooling layer example for a bounding box. Image source: deepsense.ai.

(4) Faster R-CNN (Shaoqing Ren, et al., 2016): towards real-time object detection

From the discussion above, you can see that two-stage object detectors first generate region proposals to find candidate locations, and then apply further processing on those locations to get the final detections. In R-CNN and Fast R-CNN, the region proposals are all generated by Selective Search, which takes about 2s/image. Besides Selective Search, SPPnet also uses EdgeBoxes (another object proposal method, which uses the number of enclosed contours as an indication of how likely a box is to contain an object), but that still takes 0.2s/image. This exposes region proposal computation as a bottleneck: we need more reliable proposals at a faster speed.

The main structure of Faster R-CNN. Image source: Shaoqing Ren, et al., 2016

Faster R-CNN is an important step towards real-time object detection. It is built on top of Fast R-CNN but replaces the Selective Search algorithm with a Region Proposal Network (RPN) to generate region proposals. This network takes an image of any size as input and computes a set of object proposals in about 10ms per image.

Explanation of the Region Proposal Network. Image source: Shaoqing Ren, et al., 2016

In the RPN, we slide a small network, which takes a 3×3 spatial window as input, over the convolutional feature map generated by the last convolutional layer. At the center of each sliding window, the RPN simultaneously predicts k=9 bounding boxes with various scales and aspect ratios, which we call anchors. We take the features from each window and feed them into a regression layer and a classification layer. In particular, the regression layer outputs each box’s center coordinates (x, y) and its width w and height h, so each anchor is associated with 4 coordinates, for 4k coordinates in total over all the anchors. The classification layer is a two-class softmax layer that gives 2k scores, predicting whether or not there is an object inside each anchor. In this way, the RPN produces a set of bounding boxes associated with objectness scores (2k scores) and box locations (4k coordinates). The RPN is unified with the Fast R-CNN detection network and can be trained end-to-end. However, Faster R-CNN is still below real time in speed: it achieves 140ms/image (7fps) with VGG16 and 55ms/image (18fps) with ZF.
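
A shape-level sketch of the RPN head may make the 4k/2k bookkeeping concrete. This is a toy numpy version: a per-position linear map stands in for the real 3×3 conv followed by two sibling 1×1 convs, and the weights are random placeholders:

```python
import numpy as np

def rpn_head(feature_map, w_reg, w_cls, k=9):
    """Toy RPN head: at every feature-map position, a shared function maps
    the local feature to 4k box offsets and 2k object/not-object scores."""
    c, h, w = feature_map.shape
    feats = feature_map.reshape(c, -1).T  # (h*w, c): one row per position
    coords = feats @ w_reg                # (h*w, 4k) box coordinates
    scores = feats @ w_cls                # (h*w, 2k) objectness logits
    return coords.reshape(h, w, 4 * k), scores.reshape(h, w, 2 * k)

c, h, w, k = 256, 14, 14, 9
fmap = np.random.rand(c, h, w)
coords, scores = rpn_head(fmap, np.random.rand(c, 4 * k),
                          np.random.rand(c, 2 * k), k)
print(coords.shape, scores.shape)  # (14, 14, 36) (14, 14, 18)
```

The important property is weight sharing: the same `w_reg` and `w_cls` are applied at every position, so the cost of proposals is just one extra convolutional pass.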

(5) Mask R-CNN (Kaiming He, et al., 2017): towards accurate pixel-level segmentation

So far, we have talked about object detection in the format of bounding boxes. However, bounding boxes only provide rough locations of objects. For many robotics tasks, it is important not only to detect where an object is, but also to know how to interact with it. This requires a fine segmentation of the object inside the bounding box. You can also imagine that if two objects are really close to each other, their bounding boxes might overlap a lot; the system needs to figure out that one bounding box could partially cover another object.

Mask R-CNN framework. Image source: Kaiming He, et al., 2017

Mask R-CNN addresses this segmentation problem by adding a Fully Convolutional Network (FCN) that runs in parallel as a second branch on top of the basic Faster R-CNN framework. This new branch focuses on predicting pixel-level segmentation masks for each region of interest, while the old branch from Faster R-CNN still focuses on classification and coordinate regression.

However, if we completely follow the technique in Faster R-CNN, we will have one problem in Mask R-CNN. In Faster R-CNN, we use region proposals to crop feature maps, but feature maps are usually much smaller than the original images (e.g., the original image is 256×256 while the feature map is 32×32). To “project” a region of interest to this smaller scale, we just calculate the ratio between the two sizes and find the new location in proportion. The computed location is usually a decimal number, and we round it to an integer (first quantization). The RoI pooling layer then quantizes it again through pooling (second quantization) to fit a fixed-size region. These coarse spatial quantizations create misalignments between the RoI and the extracted features. This is not a problem for Faster R-CNN, since it is not designed for pixel-to-pixel alignment between inputs and outputs, but it is crucial for the mask branch.

Bilinear interpolation in RoIAlign. Image source: Kaiming He, et al., 2017

Mask R-CNN solves this problem with a new layer called RoIAlign. Instead of rounding, it keeps the floating-point location computed by the “region proposal projection”. It then divides the region into bins. For each bin, Mask R-CNN uses bilinear interpolation from the nearby grid points on the feature map to create 4 sampling points, and then simply takes the maximum of those 4 points in each bin to finish the max pooling.
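
A minimal numpy sketch of RoIAlign for a single feature channel follows the description above: no rounding anywhere, 4 regularly spaced sample points per bin, then a max. Placing the samples at 0.25/0.75 of each bin is one common choice, not necessarily the paper's exact layout:

```python
import numpy as np

def bilinear(fmap, y, x):
    """Sample fmap at a fractional (y, x) by bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx
            + fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align(fmap, roi, out=2):
    """RoIAlign: keep fractional RoI borders, split the RoI into out x out
    bins, bilinearly sample 4 points per bin, and max-pool them."""
    y0, x0, y1, x1 = roi  # fractional feature-map coords, never rounded
    bh, bw = (y1 - y0) / out, (x1 - x0) / out
    result = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            samples = [bilinear(fmap, y0 + (i + fy) * bh, x0 + (j + fx) * bw)
                       for fy in (0.25, 0.75) for fx in (0.25, 0.75)]
            result[i, j] = max(samples)
    return result

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_align(fmap, (1.3, 2.7, 5.9, 6.1)))  # 2x2 pooled output
```

Compare this with RoI pooling: there, both the RoI borders and the bin borders would be snapped to integers, which is exactly the misalignment RoIAlign removes.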

So far, we’ve talked about the R-CNN family: a series of methods following the two-stage framework of (1) a region proposal stage and (2) a detection stage. However, these methods usually suffer from problems in speed and training.

Is there any way to simplify the framework?

Single Stage Methods — Detection without Proposals

(1) YOLO (v1): You Only Look Once (Joseph Redmon, et al., 2016)

The architecture of YOLO. Image source: Joseph Redmon, et al., 2016

YOLO (v1) is the first real-time object detector; it achieves 45fps on a Titan X GPU, and its faster version achieves 155fps (tested on PASCAL VOC 2007). Different from two-stage methods, the core idea behind this fast detector is a single convolutional network, consisting of convolutional layers followed by 2 fully connected layers, which simultaneously predicts bounding boxes and class scores.

Key steps in YOLO. Image source: Joseph Redmon, et al., 2016

YOLO takes the entire image as input and first divides it into an S×S grid. For each cell of this grid, YOLO predicts C conditional class probabilities. Each grid cell also predicts N bounding boxes and N corresponding objectness scores, which tell you whether there is an object inside each bounding box. At test time, the conditional class probabilities are multiplied by the objectness score of each box. This gives class-specific scores (different from objectness scores) for each box.
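
In code, the test-time score combination looks like this: a toy numpy sketch with random tensors standing in for the network outputs, using the S, N, C notation above (S=7, N=2, C=20 are YOLOv1's PASCAL VOC settings):

```python
import numpy as np

S, N, C = 7, 2, 20  # grid size, boxes per cell, number of classes

# Stand-ins for the network outputs:
class_probs = np.random.rand(S, S, C)  # P(class | object), shared per cell
objectness  = np.random.rand(S, S, N)  # P(object), one per predicted box

# Class-specific score for every box: P(class | object) * P(object),
# broadcast so each of the N boxes gets all C class scores.
scores = class_probs[:, :, None, :] * objectness[:, :, :, None]
print(scores.shape)  # (7, 7, 2, 20): S x S cells, N boxes, C scores each
```

Note that the class probabilities are predicted once per cell, not once per box; that is part of what makes YOLOv1 so fast, and also part of why it struggles with multiple small objects in one cell.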

YOLO reasons globally about the image. This is different from region proposal-based methods, which only consider features within the bounding boxes. This capability makes YOLO produce less than half the number of background errors compared to Fast R-CNN.

Since 2016, Joseph Redmon has been working on how to improve the accuracy of YOLO. There are three versions in total (up to today). I will talk about the other two later in this post.

(2) SSD: The Single Shot MultiBox Detector (Wei Liu, et al., 2016)

The Single Shot MultiBox Detector (SSD) makes an important contribution to the object detection area. It significantly outperforms previous work (on PASCAL VOC 2007): 59fps with 74.3% mAP, vs. Faster R-CNN at 7fps with 73.2% mAP and YOLOv1 at 45fps with 63.4% mAP.

SSD framework. Image source: Wei Liu, et al., 2016

Interestingly, we can understand SSD by comparing it with Faster R-CNN and YOLO. Similar to the anchors predicted by sliding a network over the feature map in Faster R-CNN, SSD also uses a network to predict a set of bounding boxes over various aspect ratios at each cell of the feature map. However, instead of using these boxes to crop features and feed them into a classifier, SSD simultaneously produces a score for each object category in each box. More specifically, for each box, we predict both the box location and the classification scores.

A comparison between SSD and YOLO. Image source: Wei Liu, et al., 2016

It’s also very interesting to compare SSD with YOLO (v1), since both are early single shot detectors. SSD is built on top of a base network (VGG in the early version and Residual-101 in some later versions) but adds extra multi-scale convolutional feature layers. These layers play a significant role: they decrease in size progressively, which allows predictions of detections at multiple scales. This means we can use the higher-resolution feature maps to detect small objects and the lower-resolution feature maps to detect big objects. SSD also achieves high accuracy using relatively low-resolution input (low-resolution inputs increase detection speed). If you now recall YOLOv1, it operates on only a single-scale feature map; as mentioned in Joseph Redmon, et al., 2016, this is one of the limitations of YOLOv1: it struggles to generalize to objects with new or unusual aspect ratios.
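
The multi-scale design can be made concrete with the default-box scale rule from the SSD paper: each of the m feature layers gets a scale interpolated between s_min and s_max, so early (high-resolution) layers handle small boxes and late layers handle large ones. A small sketch:

```python
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Default-box scale per feature layer (SSD paper):
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), k = 1..m.
    Scales are fractions of the input image size."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]

print([round(s, 2) for s in ssd_scales()])
# [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Each scale is then combined with several aspect ratios to produce the default boxes at every cell of the corresponding feature map.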

(3) Extensions of SSD and YOLO:

YOLO and SSD have encouraged more work on object detection using single stage methods. Since then, more progress has been made on top of these frameworks, achieving state-of-the-art performance. Some important works include:

* SSD with Recurrent Rolling Convolution (Jimmy Ren, et al., 2017)

Let’s recall SSD again. SSD uses multi-scale feature layers, and the feature maps in each layer are independently responsible for the outputs at their scale. The final detection is generated by integrating all the intermediate results from the feature layers. This assumes each feature layer must be independently good enough to represent fine details of the objects. Jimmy Ren, et al. propose Recurrent Rolling Convolution (RRC) to relax this “independence” by sharing computation across all the feature layers, while keeping each intermediate result for the final detection.

* DSSD: Deconvolutional Single Shot Detector (Cheng-Yang Fu, et al., 2017)

The Deconvolutional SSD model is built on top of SSD but replaces VGG with Residual-101. It also adds extra deconvolution layers to successively increase the resolution of the feature map layers. With this improvement, it achieves 81.5% mAP on VOC2007 at 6fps.

* YOLOv2 (2017) & YOLOv3 (2018): better versions of YOLO

How to improve YOLO from 63.4 mAP to 78.6 mAP. Image source: YOLO v2 (2017)

Joseph Redmon keeps improving YOLOv1, releasing two better versions in 2017 and 2018. YOLOv2 gets 76.8 mAP on VOC2007 at 67fps. It improves YOLOv1 by adding a variety of ideas, which are summarized in the table above. Just a few weeks ago, YOLOv3 was announced in a very interesting way, and I encourage you to take a look at the original paper.

(4) RetinaNet (Tsung-Yi Lin, et al., 2018): The Focal Loss!

Comparison among state-of-the-art detectors. Image source: Tsung-Yi Lin, et al., 2018

Now, let’s make a quick comparison of accuracy between two-stage and single-stage detectors: on the COCO dataset, the most accurate two-stage detector is the Feature Pyramid Network (FPN), which achieves 36.2% AP, while the most accurate single-stage detector is the Deconvolutional Single Shot Detector (DSSD), which achieves only 29.9% AP.

From the chart above, we can see that single stage detectors like YOLO, SSD, DSSD (and their extensions) generally have lower accuracy than two-stage detectors. This is because of something called class imbalance. Two-stage detectors use methods like Selective Search, EdgeBoxes or an RPN to generate 1k-2k candidate boxes in the first stage, which filters out many background samples and makes sure the candidate boxes are good enough before they are fed into the second, classification stage. Single stage detectors, however, have no prior idea of where the boxes are, so they have to predict N boxes at each window. To densely cover space, scales and aspect ratios, there are usually ~100k candidate locations in total per image. Those locations contain many background regions that are easy for the model to classify correctly as “no object”. However, since we average the loss over all bounding boxes, the background regions contribute to the learning equally with the hard foreground boxes. This makes training inefficient, because we do not need to learn what we already know. To make the network truly learn the different classes, we need some focus: the network needs to concentrate on hard samples while spending less effort on easy samples like backgrounds.

To address this class imbalance problem, RetinaNet proposes a new loss function called the focal loss, which adds a probability-dependent weight to modulate the cross-entropy loss:

FL(p, y) = (1 − p_t)^γ · CE(p, y), where p_t = p if y = 1 and p_t = 1 − p otherwise.

Here CE(p, y) is the binary cross-entropy loss, p is the predicted probability, y is the ground-truth label, and γ is a fixed number (γ = 2 works best in RetinaNet). You can intuitively understand this equation as follows: if the model is confident about a sample, the weight of its loss is almost zero. Since the sample is not that useful for learning, its loss is scaled down even further. However, if the model sees an object for the first time and is not confident in its estimate, the weight of the loss on this sample is higher, and the model knows this sample is more important.
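
A minimal numpy sketch of the binary focal loss (illustrative: the released RetinaNet also uses an α-balancing factor, which is omitted here):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss: down-weight easy samples by (1 - p_t)^gamma.
    p is the predicted probability of the positive class; y is 0 or 1."""
    p_t = np.where(y == 1, p, 1 - p)  # probability assigned to the true class
    return (1 - p_t) ** gamma * -np.log(p_t)  # modulated cross-entropy

# Relative to plain cross-entropy, an easy background box (p_t = 0.99)
# gets weight (1 - 0.99)^2 = 1e-4, while a hard one (p_t = 0.1) gets 0.81:
print(focal_loss(0.01, 0) / -np.log(0.99))  # ~1e-4
print(focal_loss(0.9, 0) / -np.log(0.1))    # ~0.81
```

So the thousands of easy background boxes contribute almost nothing to the gradient, and the hard samples dominate training.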

With this new loss, the model can focus more on hard samples and less on backgrounds. This makes RetinaNet outperform previous single stage detectors, with an accuracy of 37.8 AP on the COCO dataset, but at a slow speed: ~5fps.

Discussion Part 1: Which detector is better?

We usually need to find a good trade-off between speed and accuracy among detectors. The following chart shows a comparison among state-of-the-art detectors.

Comparison of different detectors on speed and accuracy on the COCO dataset. Image source: YOLOv3 (2018)

From the chart, YOLOv3 runs at the fastest speed but with the lowest accuracy, 22ms/image with 28.2 mAP, while the single-stage RetinaNet detector achieves the highest accuracy but the slowest speed, 198ms/image with 37.8 mAP.

We also need to take the size of the objects into account. The chart below compares YOLOv2, SSD, and Faster R-CNN on different object sizes.

Comparison among YOLO, SSD and Faster R-CNN on various object sizes. Image source: cv-tricks.com

We can try to understand this chart by looking at the network structures. We have talked about the limitations of SSD and YOLO before. Faster R-CNN uses multiple pooling layers, which means that by the time the feature map is propagated to the last layer, its resolution is much smaller. At low resolution, detectors can hardly capture the details of small objects.

Many of the methods discussed above have released open-source code; I encourage readers to try them out!

Discussion Part 2: Challenges

In this discussion, I want to share my thoughts on the remaining challenges in object detection.

(1) Unseen Object Categories

Could the detector know the objects that it has not seen before? Image source: zero-shot detectors

Most object detectors depend on heavy training on a dataset, but if a computer does not see an object during training, it has no idea what or where the object is. Humans, in contrast, can sometimes compare the physical properties of a new object with our knowledge base and make a rough guess. Recently, many works such as zero-shot detectors and semi-supervised learning have been proposed to overcome this limitation using as few labels as possible, but it remains a very difficult task.

(2) Reasoning about Different Objects in the Same Category

Can robots distinguish those objects by their unique properties? Image source

As a researcher in vision and robotics, one of my goals is to make robots truly assist people in their daily lives. Sometimes, we need robots to reason about objects in the same category and pick up what humans really need. If I tell a robot, “I need a plate to put this cupcake on,” I hope the robot will think about the size of the cupcake and pick up a small plate. Similarly, I hope robots can reason about objects of the same type but with different colors, shapes, patterns, locations, materials, etc.

(3) Task-driven Detections:

Attention has always been very interesting to me. For humans, our attention can vary a lot even on the same image. I think this is because we have different (a) purposes and interests, (b) backgrounds and past experiences, (c) emotions, and so on. I wonder if intention-driven object detection could make object detection smarter. My definition of intention-driven is: directly or indirectly detect the objects we need to pay attention to, based on our goal. For example, if our goal is to park the car near a shop but there is no empty spot available, can the system pay attention to (a) people who have just come out of the shop and are walking to their cars, or (b) cars with people in them? I can imagine it is also important for medical image analysis: can the system localize the parts of an image that need more attention? This could help doctors diagnose patients faster.

Of course, there are still many unsolved topics in this area, and there is still a long way to go towards human-level object detection.

Discussion Part 3: Applications

Object detection is widely used in many research areas. In robotics, object detection is a fundamental step, because a robot needs to find the things required to finish a task. For example, if a kitchen robot wants to cook pancakes, it has to detect where the pan, oven, eggs, flour, and so on are. Object detectors in self-driving cars help to detect pedestrians, traffic signs, vehicles and so on. Object detection is also commonly combined with object tracking in surveillance, to detect suspects and unusual scenes.

Object detection for self-driving cars. Image source: Shutterstock

Object detection for robotics: a robot view on a kitchen dataset. Image source: robaid

Acknowledgment:

I would like to thank Tim Dettmers and Christian Osendorfer for their valuable advice!

If you like this post and would like to be the first to read my next blog post, you can follow me on Twitter (@ZoeyC17). I will post updates about my next blog posts, which will give an introduction to Graph-based Convolutional Networks and their applications to visual problems that require understanding the relations between objects in a scene.