What is Panoptic Segmentation?

To really understand what panoptic segmentation is, there are a fair few ingredients we need first.

The simplest way to explain panoptic segmentation is to say it’s a combination of instance and semantic segmentation. But if those two concepts mean absolutely nothing to you, as they did to me when I first saw them, then let me guide you through those two tasks first. And before that, I’ll need to start off with…

Object Detection

To get into instance segmentation, it’s important we briefly cover what object detection is. Let me give you an example of what object detection is with a cute picture of some kitties:

image courtesy of pexels.com/@deathless

So if we were to run this picture through an object detection machine learning algorithm, we would want our algorithm to detect all three cats by correctly classifying them and correctly identifying where they are located.

Our ground truth (where we have marked the cats as being located) and the ideal prediction for our algorithm would look something like this:

Cats with bounding boxes

The task for object detection would then be to accurately predict these cats and the corresponding bounding boxes. In the prediction process, each of these predictions would be accompanied by a confidence score: the probability that our algorithm believes the object it found is a cat.

This confidence score comes from a probability distribution over all classes, so if you take any one of the predicted instances as an example, its distribution could look something like this:

classes = ["cat", "dog", "bicycle", "nothing"]
prediction = [0.8, 0.1, 0.05, 0.05]
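Turning a distribution like that into a final label is just a matter of picking the class with the highest probability. A minimal sketch, using the illustrative values above (not the output of any particular model):

```python
# Hypothetical class scores for one detected instance (illustrative values).
classes = ["cat", "dog", "bicycle", "nothing"]
prediction = [0.8, 0.1, 0.05, 0.05]

# The predicted label is the class with the highest probability,
# and that probability doubles as the confidence score.
best = max(range(len(prediction)), key=lambda i: prediction[i])
label, confidence = classes[best], prediction[best]
print(label, confidence)  # cat 0.8
```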

The algorithm’s output would also require a coordinate system in order to produce the bounding box around our object, which for each of our predictions could look something like:

legend = ["X-Position", "Y-Position", "Length", "Height"]
prediction = [130, 285, 100, 185]

The X & Y positions above represent the midpoint of the object, and the bounding box is then drawn using the length and height, anchored at that midpoint.
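To make the midpoint anchoring concrete, here is a small sketch converting that midpoint format into the corner coordinates most drawing libraries expect. The values are the illustrative numbers from above, not real model output:

```python
# Hypothetical midpoint-format box: [x_center, y_center, length, height].
x_c, y_c, length, height = 130, 285, 100, 185

# Step half the box size out from the midpoint in each direction
# to get corner format: (x_min, y_min, x_max, y_max).
x_min, y_min = x_c - length / 2, y_c - height / 2
x_max, y_max = x_c + length / 2, y_c + height / 2
print(x_min, y_min, x_max, y_max)  # 80.0 192.5 180.0 377.5
```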

While the probability outputs and the bounding box output are combined for our final prediction, it’s important to know that they are two separate tasks: the probability output is performing classification, while the bounding box is performing regression. To understand the difference between the two, you can check out this article.

Ok! So to summarize:

In an object detection task, we are trying to get an algorithm to predict the class and bounding box location of each instance in our image.

So now that we have that understood, it’s only a small step to instance segmentation. They say a picture is worth a thousand words, so let me show you what it is:

Instance Segmentation

Instance Segmentation on our cat image

Instance segmentation takes object detection a step further. Rather than simply asking our algorithm to draw a box around our instances, we now want it to identify which pixels belong to that instance too.

So building on top of our object detection task, our instance segmentation algorithm must now predict 3 things:

1. A class label

classes = ["cat", "dog", "bicycle", "nothing"]
prediction = [0.8, 0.1, 0.05, 0.05]

2. A bounding box

legend = ["X-Position", "Y-Position", "Length", "Height"]
prediction = [130, 285, 100, 185]

3. A binary mask

Binary Mask of one of the cats

Each instance we predict produces a binary mask like this: a 2D array with the same pixel width and height as the image, where each entry marks whether the corresponding pixel belongs to that instance.
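A binary mask is easy to picture in code. Here is a toy sketch with a made-up 6×6 image and a small rectangular instance (the sizes and positions are purely illustrative):

```python
import numpy as np

# A hypothetical binary mask for one instance in a 6x6 image:
# 1 marks pixels belonging to the instance, 0 marks everything else.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:5, 1:4] = 1  # a small 3x3 rectangular instance for illustration

print(mask.shape)       # (6, 6) -- same height & width as the image
print(int(mask.sum()))  # 9 -- number of pixels assigned to the instance
```

Selecting the instance’s pixels from the image is then just element-wise masking, e.g. `image * mask` for a single-channel image.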