Challenging MSCOCO test image showing detection results at 11 Hz with a Titan X (Maxwell)

Here we introduce a modern alternative for CNN-based bounding box object detection called DeNet, which addresses some of the issues with two-stage object detectors (e.g. Faster/Mask RCNN, R-FCN, FPN, etc.). In particular, we address their slow evaluation rates relative to single-stage detectors (YOLO, SSD, RetinaNet, etc.), poor bounding box localization, and manual tuning requirements. This work was presented at ICCV 2017 and CVPR 2018, with source code available from https://github.com/lachlants/denet

MSCOCO MAP vs Inference Time comparison with a Titan X (Maxwell) GPU [2]

Region Proposal Networks

Most object detectors are designed around an anchor-based region proposal network (RPN). In two-stage designs, the RPN identifies which bounding boxes are likely to contain an object (the proposals) and feeds these proposals into a second network for classification and fine bounding box tuning. Single-stage methods simply perform the classification within the RPN, negating the need for a second stage and improving evaluation rates… often at the cost of localization performance.

Ideally, the RPN is formulated such that the network estimates Pr(object | bbox), i.e., given a bounding box we estimate the probability that it contains an object of interest. However, the number of possible bounding boxes in a standard-sized image is extremely large: a 640x480 image contains more than 10 billion possible bounding boxes (considering only integer coordinates). It is clearly intractable for a neural network to estimate a probability for every bounding box, so we rely on an engineer to select a much-reduced subset of bounding boxes, called anchors. Selecting an appropriate set of anchors is dataset dependent (it depends on the scale and aspect ratio of the objects), with more anchors typically improving localization performance at the cost of evaluation rate. State-of-the-art two-stage RPN-based detectors use approximately 10–100K anchors with 1000x600 or 1200x800 input images.
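The "more than 10 billion" figure follows from a simple counting argument: an axis-aligned box is fixed by choosing two distinct x coordinates and two distinct y coordinates. A quick sketch (hypothetical helper name):

```python
# Count axis-aligned boxes with integer corner coordinates in a W x H image.
# A box is defined by choosing 2 of the (W + 1) x-coordinates and
# 2 of the (H + 1) y-coordinates.
from math import comb

def num_boxes(width, height):
    return comb(width + 1, 2) * comb(height + 1, 2)

print(num_boxes(640, 480))  # 23,679,052,800 — well over 10 billion
```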

Going Anchorless…

In DeNet [1] we demonstrated an alternative formulation for the RPN: instead of directly estimating whether an anchor contains an object, we estimate the corners of object bounding boxes and perform a simple naive search over them to obtain the proposals. That is, our CNN estimates Pr(corner_type | x, y), where corner_type is top-left, top-right, bottom-left or bottom-right and (x, y) is a position within the image. We then match the most probable top-left corners with the most probable bottom-right corners and rank the resulting bounding boxes via a simple Naive Bayes classifier:

Pr(object | bbox) = Pr(top_left | x0, y0) × Pr(top_right | x1, y0) × Pr(bottom_left | x0, y1) × Pr(bottom_right | x1, y1)

This process is repeated by matching the top-right and bottom-left corners, after which the top N bounding boxes are selected to form the proposals. With this formulation our implementation can efficiently select from a set of ~67 million bounding boxes with 512x512 input images, without an engineer selecting suitable aspect ratios or scales.
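The corner matching and Naive Bayes ranking can be sketched as follows. This is a simplified illustration, not the paper's implementation: it matches only top-left with bottom-right corner maps (the full method also uses top-right and bottom-left), and the function name, `k`, and `top_n` are hypothetical, with NumPy arrays standing in for the CNN's corner probability maps:

```python
import numpy as np
from itertools import product

def corner_proposals(pr_tl, pr_br, k=8, top_n=16):
    """Match the k most probable top-left corners with the k most probable
    bottom-right corners and rank the boxes by the product of their corner
    probabilities (a naive-Bayes score). pr_tl, pr_br: 2-D arrays indexed
    [y, x] holding Pr(corner_type | x, y)."""
    def topk(pr):
        idx = np.argsort(pr, axis=None)[-k:]      # flat indices of the top-k cells
        ys, xs = np.unravel_index(idx, pr.shape)
        return list(zip(xs, ys))

    boxes = []
    for (x0, y0), (x1, y1) in product(topk(pr_tl), topk(pr_br)):
        if x1 <= x0 or y1 <= y0:                  # corners must form a valid box
            continue
        score = pr_tl[y0, x0] * pr_br[y1, x1]     # naive Bayes corner score
        boxes.append((score, (x0, y0, x1, y1)))
    boxes.sort(reverse=True)
    return boxes[:top_n]
```

With k candidate corners per type, the search considers only k² pairings per corner combination, which is what makes the naive search over millions of implied boxes cheap.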

This novel RPN method improved the localization of the proposals, making them significantly more likely to overlap objects of interest nicely. However, this advantage was only partially realized, because the better-fitting proposals were not necessarily being selected by the final classifier. We addressed this issue with Fitness NMS [2], described below.

Selecting the best proposal…

The second stage of a modern two-stage object detector takes a proposal and an associated feature vector (discussed later) and estimates an updated bounding box and the class probability of the object within, i.e., Pr(class | proposal). These probabilities and bounding boxes are then fed into a Non-Max Suppression (NMS) algorithm to identify unique object instances, since we want only a single detection for each object present. In practice, the standard algorithm selects the bounding boxes with the greatest class confidence… we found this to be a poor discriminator between bounding boxes with better/worse overlap.
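For reference, the standard greedy NMS described above can be sketched as follows (a minimal illustration with hypothetical names; real implementations are vectorized):

```python
def greedy_nms(dets, iou_thresh=0.5):
    """Standard NMS: dets is a list of (class_score, (x0, y0, x1, y1)).
    Greedily keep the highest-scoring box and suppress any remaining box
    that overlaps a kept box by more than iou_thresh."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    keep = []
    for score, box in sorted(dets, reverse=True):  # highest class score first
        if all(iou(box, kb) < iou_thresh for _, kb in keep):
            keep.append((score, box))
    return keep
```

Note that the only ranking signal here is the class score, which is exactly the weakness Fitness NMS targets.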

With Fitness NMS [2] we augment the NMS scoring function with an estimate of the intersection-over-union (IoU) overlap between the proposal and the object contained within, i.e., score = Pr(class | proposal) × estimated IoU. This forces the NMS algorithm to select detections with both high estimated IoU and high class probabilities. We estimate the IoU by simply modifying the final classifier to produce Pr(class, IoU | proposal), where IoU is discretized into a series of bins.
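A minimal sketch of the modified ranking score: the network's predicted distribution over discretized IoU bins is reduced to an expected IoU, which then multiplies the class probability. The bin centres below are illustrative, not the paper's exact discretization:

```python
import numpy as np

def fitness_score(class_prob, iou_bin_probs,
                  bin_centres=(0.55, 0.65, 0.75, 0.85, 0.95)):
    """Fitness NMS ranking score: Pr(class | proposal) times the expected IoU,
    where iou_bin_probs is the predicted distribution over IoU bins.
    Illustrative bin centres; the real discretization is a design choice."""
    expected_iou = float(np.dot(iou_bin_probs, bin_centres))
    return class_prob * expected_iou
```

Substituting this score into the NMS ranking lets a slightly less confident but much better-fitting box win over a confident, loosely fitting one.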

Pascal VOC 2007 results, same model with / without Fitness NMS [2]

MSCOCO results with / without Fitness NMS [2]

These results demonstrate that Fitness NMS improves localization at high IoUs. It remains to be seen whether Fitness NMS will work with other detectors, or whether it is only applicable to the DeNet design with its unique proposal distribution.

Nearest RoI Pooling

Once a proposal has been identified by the RPN, an RoI pooling operation is performed to extract a fixed-length feature vector which represents the contents of the proposal for the final classification stage. In most two-stage detectors this is performed by breaking the proposal into an NxN grid then performing a mean pooling operation within each grid cell on a selection of feature maps from the base RPN network.

In DeNet [1] we proposed replacing the mean pooling operation with simply selecting the feature nearest to each vertex of the grid. To aid pooling performance we use a relatively high-resolution final feature map. This operation proved significantly cheaper without any major drawback in terms of MAP, enabling our two-stage object detector to outperform even single-stage detectors.
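The nearest-neighbour pooling idea can be sketched as follows (a hypothetical simplification: a NumPy array stands in for the CNN feature map, and the box is given directly in feature-map coordinates):

```python
import numpy as np

def nearest_roi_pool(feat, box, n=7):
    """Nearest-neighbour RoI pooling sketch: instead of mean pooling within
    each grid cell, sample the feature map value nearest to each of the
    n x n grid vertices spanning the proposal. feat: (H, W, C) array;
    box: (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = box
    xs = np.linspace(x0, x1, n)
    ys = np.linspace(y0, y1, n)
    xi = np.clip(np.rint(xs).astype(int), 0, feat.shape[1] - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, feat.shape[0] - 1)
    return feat[np.ix_(yi, xi)]      # fixed-size (n, n, C) output
```

Each output cell is a single lookup rather than an average over many cells, which is where the cost saving comes from; the higher-resolution feature map compensates for the coarser sampling.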

Conclusion

Our corner-based proposal network provides high-quality proposals from much smaller input images. Combined with the nearest RoI pooling method, this yields a well-localized object detector that is significantly cheaper to evaluate than existing anchor-based single-stage or two-stage detectors.

With these novelties (and a few others described in the papers) I obtained the following MSCOCO results. Note that the detector was trained at 512x512 resolution and then evaluated at the other resolutions… fine-tuning at higher resolutions may provide additional benefits.

MSCOCO results [2] for Titan X (Maxwell) GPU

For full details please refer to the papers and source code linked below:

[1] DeNet: Scalable Real-time Object Detection with Directed Sparse Sampling, https://arxiv.org/abs/1703.10295

[2] Improving Object Localization with Fitness NMS and Bounded IoU Loss, https://arxiv.org/abs/1711.00164

Source code: https://github.com/lachlants/denet