Last week a Facebook AI Research team led by Kaiming He released the paper PointRend: Image Segmentation as Rendering, introducing a new “point-based rendering” neural network module with an iterative subdivision algorithm that can be integrated into state-of-the-art (SOTA) image segmentation models.

As a general module, PointRend produces sharper predictions on tricky object boundaries such as fingers, and can be added to both semantic and instance segmentation models.

Instance segmentation is an important computer vision task. Traditional methods take an input image, infer the instance label to which each pixel belongs, and distinguish pixels belonging to different instances. Such methods, however, tend to oversmooth object boundaries, resulting in misclassified edge pixels and blurred contours.

Convolutional neural networks (CNNs) used for image segmentation usually operate on regular grids: the input is a regular grid of image pixels, the hidden representations are feature vectors over a regular grid, and the output is a label map on a regular grid. Regular grids are convenient, but not necessarily ideal for image segmentation in terms of computation. The label maps predicted by these networks should be mostly smooth, that is, neighbouring pixels usually share the same label, because high-frequency regions are limited to the sparse boundaries between objects.

Regular grids may oversample smooth regions while undersampling object boundaries, which blurs the outlines in the predicted result. That is why image segmentation methods usually predict labels on a low-resolution regular grid, such as 1/8 of the input resolution in semantic segmentation or 28 × 28 in instance segmentation, as a compromise between undersampling and oversampling.

PointRend treats segmentation as a rendering problem, using a subdivision strategy to adaptively select a non-uniform set of points at which to compute labels. PointRend can be incorporated into common instance segmentation meta-architectures such as Mask R-CNN and semantic segmentation meta-architectures such as FCN.
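
To make the idea of adaptive subdivision concrete, here is a minimal pure-Python sketch of how such an inference loop might look. It is an illustration under stated assumptions, not the authors' implementation: the function names (upsample2x_bilinear, toy_point_head, subdivide) are hypothetical, the "point head" is a hand-written disc test standing in for a trained network, and a pixel's uncertainty is assumed to be how close its predicted probability is to 0.5.

```python
def upsample2x_bilinear(grid):
    """Bilinearly upsample a square 2D list of probabilities by a factor of 2."""
    n = len(grid)
    m = 2 * n
    out = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            # Map the output pixel centre back into input coordinates (clamped).
            y = min(max((i + 0.5) / 2 - 0.5, 0.0), n - 1.0)
            x = min(max((j + 0.5) / 2 - 0.5, 0.0), n - 1.0)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, n - 1), min(x0 + 1, n - 1)
            wy, wx = y - y0, x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            out[i][j] = top * (1 - wy) + bottom * wy
    return out


def toy_point_head(u, v):
    """Hypothetical stand-in for a learned point head: a disc of radius 0.3."""
    return 1.0 if (u - 0.5) ** 2 + (v - 0.5) ** 2 <= 0.3 ** 2 else 0.0


def subdivide(coarse, target, points_per_step, point_head):
    """Refine a coarse mask up to target resolution, re-predicting only uncertain pixels."""
    grid = coarse
    evaluated = 0
    while len(grid) < target:
        grid = upsample2x_bilinear(grid)
        m = len(grid)
        # Rank pixels by uncertainty: probabilities closest to 0.5 come first.
        cells = sorted(
            ((i, j) for i in range(m) for j in range(m)),
            key=lambda ij: abs(grid[ij[0]][ij[1]] - 0.5),
        )
        for i, j in cells[:points_per_step]:
            # Query the point head at the pixel centre, in normalized coordinates.
            grid[i][j] = point_head((j + 0.5) / m, (i + 0.5) / m)
            evaluated += 1
    return grid, evaluated


# Start from an 8 x 8 coarse prediction and refine to 64 x 64.
coarse = [[toy_point_head((j + 0.5) / 8, (i + 0.5) / 8) for j in range(8)]
          for i in range(8)]
mask, n_points = subdivide(coarse, target=64, points_per_step=256,
                           point_head=toy_point_head)
print(n_points, "point predictions instead of", 64 * 64)
```

With these toy settings the loop makes 768 point predictions (256 at each of three subdivision steps) instead of the 4,096 that a dense 64 × 64 evaluation would need; smooth interior and background regions are simply inherited from the interpolated coarse mask.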

When computing high-resolution segmentation maps, PointRend’s subdivision strategy is an order of magnitude more efficient in floating-point operations compared to direct computation. PointRend does not perform predictions on all points of the output grid, only on carefully selected points.
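
A quick back-of-the-envelope count shows where the savings come from. The sketch below counts point predictions only, which is a proxy for (not a measurement of) floating-point operations. The settings are assumptions chosen for illustration: a 7 × 7 starting grid, a 224 × 224 target, and a budget of 28 × 28 points per subdivision step. The helper name subdivision_point_count is hypothetical.

```python
import math


def subdivision_point_count(coarse_res, target_res, points_per_step):
    """Count point predictions made when doubling resolution until target_res."""
    steps = round(math.log2(target_res / coarse_res))
    total = 0
    res = coarse_res
    for _ in range(steps):
        res *= 2
        # A step can never re-predict more points than the grid contains.
        total += min(points_per_step, res * res)
    return steps, total


steps, adaptive = subdivision_point_count(7, 224, 28 * 28)
dense = 224 * 224
print(steps, adaptive, dense, round(dense / adaptive, 1))
```

Under these assumptions there are 5 subdivision steps and 3,332 point predictions, against 50,176 for a dense 224 × 224 grid, or roughly 15 times fewer.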

PointRend outperforms the default Mask R-CNN mask head on instance segmentation on both the COCO and Cityscapes benchmark datasets, and also improves results on semantic segmentation.

The number of floating-point operations (FLOPs) and the memory used by the subdivision inference strategy are less than 1/30 of those of the default 4 × convolution head, yet PointRend still obtains high-resolution predictions (224 × 224).

PointRend has three main components:

A Point Selection Strategy chooses a small number of real-valued points at which to make predictions, avoiding excessive computation over all pixels in the high-resolution output grid.

A Point-Wise Feature Representation is extracted for each selected point.

A Point Head, a small neural network applied to each point independently, predicts labels from these point-wise features.
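
The last two components can be sketched in a few lines of pure Python. This is a rough illustration rather than the paper's code: it assumes that the point-wise feature is obtained by bilinearly interpolating a fine-grained feature map and the coarse prediction at the selected point and concatenating the results, and that the point head is a tiny two-layer perceptron. The function names (bilinear_sample, point_head) and all weights are hypothetical, hand-set values; a real point head would be trained.

```python
import math


def bilinear_sample(feature_map, u, v):
    """Sample a C x H x W feature map (nested lists) at normalized (u, v) in [0, 1]."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    # Convert to pixel-centre coordinates and clamp to the valid range.
    x = min(max(u * w - 0.5, 0.0), w - 1.0)
    y = min(max(v * h - 0.5, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    sampled = []
    for channel in feature_map:
        top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
        bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
        sampled.append(top * (1 - wy) + bottom * wy)
    return sampled


def point_head(features, w1, b1, w2, b2):
    """Tiny MLP (one ReLU hidden layer, sigmoid output) applied to a single point."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Toy inputs: a 2-channel fine feature map and a 1-channel coarse mask.
fine = [
    [[0.0, 1.0], [2.0, 3.0]],  # channel 0
    [[1.0, 1.0], [1.0, 1.0]],  # channel 1
]
coarse = [[[0.2, 0.8], [0.2, 0.8]]]

# Point-wise feature: interpolated fine features concatenated with the coarse prediction.
u, v = 0.5, 0.5
features = bilinear_sample(fine, u, v) + bilinear_sample(coarse, u, v)

# Hand-set illustrative weights (hypothetical, not learned).
w1, b1 = [[1.0, 0.0, 1.0], [0.0, -1.0, 2.0]], [0.0, 0.0]
w2, b2 = [0.5, 1.0], -1.0
prob = point_head(features, w1, b1, w2, b2)
print(features, prob)
```

Because the same small network is applied to every selected point independently, the cost of the head scales with the number of points chosen rather than with the size of the output grid.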

Kaiming He has done earlier groundbreaking work in the field of semantic segmentation and instance segmentation, for example co-proposing panoptic segmentation, a task that unifies semantic and instance segmentation, and TensorMask, a dense sliding-window method for instance segmentation.

The paper PointRend: Image Segmentation as Rendering is on arXiv.