The main contribution of this paper is a seediness score learned for each pixel. The score tells us whether the pixel is a good candidate to expand a mask around. In the previous paper, the seed was chosen at random and the center was then refined using the mean-shift algorithm; here, only one expansion is made.

seediness score per pixel, taken as the maximum over all classes and bandwidths

The paper proposes to learn several possible seeds for each pixel: one seed for each radius (in embedding space) and class. So if we have C classes and we learn T bandwidths (radii), we have C×T seed “proposals” per pixel. For each pixel, only the proposal with the highest score is considered, as in the sketch below.
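As a minimal illustration (assuming the seediness head outputs an [H, W, C, T] tensor; shapes and names here are hypothetical, not the paper's code), reducing the C×T proposals to one score per pixel looks like:

```python
import numpy as np

H, W, C, T = 64, 64, 80, 4               # image size, classes, bandwidths (illustrative)
seediness = np.random.rand(H, W, C, T)   # stand-in for the seediness head output

# For each pixel, keep only the highest-scoring (class, bandwidth) proposal.
flat = seediness.reshape(H, W, C * T)
per_pixel_score = flat.max(axis=-1)                          # [H, W]
best_class, best_bandwidth = np.unravel_index(flat.argmax(axis=-1), (C, T))
```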

Embedding Loss. In this paper, the embedding is penalized over pairs of pixels. We consider pairs that belong to the same instance and pairs from different instances.

a logistic distance function in embedding space

The paper uses a modified logistic function that maps the Euclidean distance in embedding space into the [0, 1] range. Pairs that are close in the embedding space are assigned a value close to 1 by the function; distant pairs approach 0.
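A minimal sketch of such a mapping, using the modified logistic form σ(d) = 2 / (1 + exp(d²)); treat the exact form as an assumption if it differs from the paper:

```python
import numpy as np

def logistic_similarity(e_p, e_q):
    """Map the Euclidean distance between two embeddings into [0, 1].

    Identical embeddings give 1; far-apart embeddings approach 0.
    2 / (1 + exp(d^2)) is one modified logistic with this range.
    """
    d2 = np.sum((e_p - e_q) ** 2, axis=-1)
    return 2.0 / (1.0 + np.exp(d2))

# Close pair -> near 1, distant pair -> near 0.
print(logistic_similarity(np.array([0.0, 0.0]), np.array([0.1, 0.0])))  # ~0.995
print(logistic_similarity(np.array([0.0, 0.0]), np.array([3.0, 0.0])))  # ~0.0002
```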

Naturally, logloss is used as the loss function. Instance sizes may vary, so in order to mitigate this imbalance issue, pairs are weighted with respect to the size of the instance they are a part of.

logloss over logistic distance between pairs of pixels
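A hedged sketch of the weighted pairwise logloss described above (binary cross-entropy over the logistic similarity, with each pair down-weighted by the sizes of the instances its pixels belong to; the exact weighting scheme is an assumption):

```python
import numpy as np

def pairwise_embedding_loss(emb_a, emb_b, same_instance, size_a, size_b, eps=1e-8):
    """Logloss over the logistic similarity of pixel pairs.

    emb_a, emb_b:   [N, D] embeddings of the two pixels in each pair
    same_instance:  [N] booleans, True if the pair shares an instance
    size_a, size_b: [N] pixel counts of the instances the pixels belong to
    """
    sim = 2.0 / (1.0 + np.exp(np.sum((emb_a - emb_b) ** 2, axis=-1)))
    target = same_instance.astype(float)
    logloss = -(target * np.log(sim + eps) + (1 - target) * np.log(1 - sim + eps))
    # Down-weight pairs from large instances so small objects are not drowned out.
    weights = 1.0 / (size_a * size_b)
    return np.sum(weights * logloss) / np.sum(weights)
```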

Seediness Loss. For each pixel, the model learns several seediness scores, one for each combination of bandwidth (radius in embedding space) and class. Since the seediness score is close to, but not the same as, semantic segmentation, its ground truth is determined on the fly each time the embedding is evaluated: a mask is expanded around the embedding of a pixel, and if its IoU with a ground-truth instance exceeds a certain threshold, the pixel is considered a seed for the class of that instance. The loss then penalizes a low seediness score for this class.

seediness loss
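A sketch of how these on-the-fly seed labels might be computed (the bandwidth-ball mask expansion and IoU threshold follow the description above; function and parameter names are hypothetical):

```python
import numpy as np

def seed_label(pixel_emb, embeddings, bandwidth, gt_masks, gt_classes, iou_thresh=0.5):
    """Expand a mask around one pixel's embedding and decide if it is a seed.

    embeddings: [H, W, D] embedding map; gt_masks: list of [H, W] boolean masks.
    Returns the class of the best-matching instance, or None if no IoU clears
    the threshold.
    """
    d2 = np.sum((embeddings - pixel_emb) ** 2, axis=-1)
    mask = d2 < bandwidth ** 2            # pixels inside the bandwidth ball
    best_iou, best_class = 0.0, None
    for gt, cls in zip(gt_masks, gt_classes):
        inter = np.logical_and(mask, gt).sum()
        union = np.logical_or(mask, gt).sum()
        iou = inter / max(union, 1)
        if iou > best_iou:
            best_iou, best_class = iou, cls
    return best_class if best_iou > iou_thresh else None
```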

Only 10 or so seeds, picked randomly, are evaluated per image in each batch. Several such models are learned, one for each bandwidth; the wider the bandwidth, the larger the object. In a way, the bandwidth that receives the highest score is the model’s way of conveying its estimate of the instance size (with respect to the distances in the embedding space).

Training Procedure. The paper uses a ResNet-101 backbone pretrained on the COCO dataset. Training starts with no classification/seediness prediction, i.e., λ = 0, and λ progresses to 0.2 as the embedding becomes more stable.

The backbone is evaluated at different scales (0.25, 0.5, 1, 2) and the concatenated results are fed to the seediness and embedding heads.
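Schematically, the multi-scale evaluation might look like the following (a sketch only; `backbone` and the resize utility are stand-ins, not the paper's code):

```python
import numpy as np
from scipy.ndimage import zoom

def multiscale_features(image, backbone, scales=(0.25, 0.5, 1, 2)):
    """Run the backbone at several scales and concatenate the feature maps.

    image: [H, W, 3]; backbone: a callable returning [h, w, F] features.
    Each feature map is resized back to a common resolution before concatenation.
    """
    feats, base_hw = [], None
    for s in scales:
        scaled = zoom(image, (s, s, 1), order=1)
        f = backbone(scaled)
        if base_hw is None:
            base_hw = f.shape[:2]
        # Resize features to a common spatial size so they can be concatenated.
        fy, fx = base_hw[0] / f.shape[0], base_hw[1] / f.shape[1]
        feats.append(zoom(f, (fy, fx, 1), order=1))
    return np.concatenate(feats, axis=-1)  # fed to the seediness and embedding heads
```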

Parsing. The procedure is pretty straightforward since the seeds are learned. The paper proposes a procedure to pick the best seed set for an image, optimizing for a high seediness score on one hand and diversity in the embedding space on the other.

Seeds are chosen iteratively; each new seed is chosen to be distant in the embedding space from the previously selected seeds. The first seed selected is the pixel with the highest seediness score in the image. The next one is a seed that, on one hand, has a high seediness score and, on the other, is not close in the embedding space to those already chosen. The balance between the two requirements is controlled by the parameter alpha, a hyperparameter to be tuned; the range tested for this parameter is 0.1 to 0.6. Unlike NMS, diversity in embedding space is encouraged, rather than spatial diversity.
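A sketch of this greedy selection (trading seediness against embedding-space similarity to the already-chosen seeds via alpha; the exact scoring formula is an assumption):

```python
import numpy as np

def select_seeds(seediness, embeddings, num_seeds=10, alpha=0.3):
    """Greedily pick seeds with high seediness and diverse embeddings.

    seediness: [N] per-pixel scores; embeddings: [N, D].
    Each new seed maximizes its own score minus alpha times its maximum
    similarity to the seeds already selected.
    """
    chosen = [int(np.argmax(seediness))]
    for _ in range(num_seeds - 1):
        d2 = ((embeddings[:, None, :] - embeddings[chosen][None, :, :]) ** 2).sum(-1)
        sim = 2.0 / (1.0 + np.exp(d2))   # logistic similarity, as in the loss
        penalty = sim.max(axis=1)        # closeness to the nearest chosen seed
        score = seediness - alpha * penalty
        score[chosen] = -np.inf          # never re-pick a seed
        chosen.append(int(np.argmax(score)))
    return chosen
```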

some results from Semantic Instance Segmentation via Deep Metric Learning

Paper 3: Recurrent Pixel Embedding for Instance Grouping

Shu Kong, Charless Fowlkes

https://arxiv.org/abs/1712.08273

https://github.com/aimerykong/Recurrent-Pixel-Embedding-for-Instance-Grouping

This paper proposes to place the embedding on an n-sphere and to measure the proximity of pixels using the cosine distance. However, the main contribution of this paper is the recurrent grouping model, based on a modified version of the Gaussian Blurring Mean-Shift (GBMS) algorithm.

GBMS is an iterative algorithm, like the simple mean-shift algorithm used to find instance centers in the first paper. In this version, all pixels are considered potential seeds, and all of them are updated at each iteration with respect to the density around them, moving toward a “center of gravity”, as if the embedding space of the image were a nebula producing planets. The farther points are from each other, the less effect they have on one another. This falloff is controlled by the bandwidth of the Gaussian, its standard deviation, as is clear from the algorithm below.

GBMS has cubic convergence guarantees, so after applying the transform several times we should get very dense, almost point-like, clusters. For more on GBMS see here.

In order to incorporate the algorithm into the network, it has to be expressed using operations on matrices.
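A minimal sketch of one GBMS iteration in matrix form (the standard formulation; X holds one embedding per column):

```python
import numpy as np

def gbms_step(X, bandwidth):
    """One Gaussian Blurring Mean-Shift iteration; X: [D, N], one point per column.

    Every point moves to the Gaussian-weighted mean of all points, with weights
    that decay with squared distance (bandwidth = std of the Gaussian).
    """
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X  # [N, N] pairwise squared distances
    W = np.exp(-d2 / (2.0 * bandwidth ** 2))        # Gaussian affinity matrix
    D = W.sum(axis=0)                               # degree of each point
    return X @ (W / D[None, :])                     # each column: weighted mean update
```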

Simply applying the algorithm described above does not make sense, since the embeddings are on a sphere and their proximity is measured using cosine similarity. The affinity matrix, which describes the distances between all points, is calculated using the following transformation:

This measures distances on the sphere rather than using the L2 norm. In addition, after applying a GBMS step, the resulting embeddings must be normalized so that they lie back on the unit sphere.
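Under this spherical setup, the same step with a cosine-based affinity and renormalization might look like the following (a sketch; the exact affinity transformation used in the paper may differ):

```python
import numpy as np

def spherical_gbms_step(X, bandwidth):
    """GBMS step for unit-norm embeddings; X: [D, N] with unit-length columns.

    Affinity is computed from cosine similarity (distance on the sphere)
    instead of the plain L2 norm, and the result is projected back onto
    the unit sphere.
    """
    cos = X.T @ X                      # [N, N] cosine similarities
    d2 = 2.0 * (1.0 - cos)             # squared chordal distance between unit vectors
    W = np.exp(-d2 / (2.0 * bandwidth ** 2))
    X_new = X @ (W / W.sum(axis=0)[None, :])
    return X_new / np.linalg.norm(X_new, axis=0, keepdims=True)  # back onto the sphere
```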

Training. A pairwise pixel loss is used, similar to the previous paper, with a threshold (alpha) on the distance required between dissimilar pairs. Each pair is evaluated using a calibrated cosine distance that ranges over [0, 1] instead of [-1, 1].

calibrated cosine distance
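One natural calibration simply rescales the cosine similarity from [-1, 1] into [0, 1]; a sketch (the paper's exact calibration may differ):

```python
import numpy as np

def calibrated_cosine(e_p, e_q):
    """Cosine similarity rescaled from [-1, 1] into [0, 1].

    Returns 1 for identical directions and 0 for opposite directions.
    """
    cos = np.dot(e_p, e_q) / (np.linalg.norm(e_p) * np.linalg.norm(e_q))
    return (1.0 + cos) / 2.0
```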

The loss is back-propagated through each application of the recurrent grouping module. Later applications surface only the very difficult cases; the authors compare this property to the hard negative mining used in training Faster R-CNN, for example.

loss used in Recurrent Pixel Embedding for Instance Grouping

The authors use 0.5 as the value of alpha in the paper. Notice that the size of the instance is used to re-balance the loss between small and large instances.
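Putting the pieces together, a hedged sketch of such a pairwise loss (similar pairs pulled toward similarity 1, dissimilar pairs penalized only past the margin alpha, weighted by instance size; the exact functional form is an assumption based on the description above):

```python
import numpy as np

def pairwise_cosine_loss(sim, same_instance, inst_size, alpha=0.5):
    """Margin-based pairwise loss over calibrated cosine similarities.

    sim:           [N] calibrated similarities in [0, 1], one per pixel pair
    same_instance: [N] booleans
    inst_size:     [N] size of the instance each pair is drawn from
    """
    pos = same_instance * (1.0 - sim)                      # pull similar pairs together
    neg = (~same_instance) * np.maximum(0.0, sim - alpha)  # push dissimilar past margin
    weights = 1.0 / inst_size                              # re-balance small vs. large
    return np.sum(weights * (pos + neg)) / np.sum(weights)
```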

Parsing. After several applications of the grouping module the clusters should be very dense, so picking pixels at random should produce good enough seeds.

For practical purposes, it makes sense to use only a subset of the pixels in the GBMS steps, since computing the similarity matrix might prove prohibitively expensive. The number of pixels used is a speed/accuracy trade-off.
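For example, a sketch of such subsampling (the sampling fraction is the knob trading accuracy for speed; names are illustrative):

```python
import numpy as np

def subsample_for_gbms(embeddings, sample_frac=0.1, rng=None):
    """Pick a random subset of pixels so the [N, N] affinity matrix fits in memory.

    embeddings: [H, W, D] -> returns a [D, n] matrix of sampled embeddings plus
    the flat indices, so cluster labels can later be propagated back to the
    unsampled pixels (e.g., by nearest-neighbor assignment).
    """
    rng = rng or np.random.default_rng()
    H, W, D = embeddings.shape
    n = max(1, int(sample_frac * H * W))
    idx = rng.choice(H * W, size=n, replace=False)
    X = embeddings.reshape(-1, D)[idx].T   # [D, n], ready for a GBMS step
    return X, idx
```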