Editor’s Note: Part 1 of this series was published over at Hacker Noon. Check it out here.

Welcome back to the second part of this series. In this section, we’ll dive into the YOLO object localization model.

Although you’ve probably heard the acronym YOLO before, this one’s different. For the purposes of this post, YOLO stands for “You Only Look Once”.

Why “look once,” you may wonder? Because other object detection models look at an image “more than once,” as we’ll discuss later. As you might expect, the “look once” property is what makes YOLO run so fast.

YOLO model details

In this section, we’ll introduce a few concepts: some are unique to the YOLO algorithm and some are shared with other object detection models.

Grid cells

The concept of breaking the image down into grid cells is unique to YOLO, compared to other object localization solutions. In practice, an image might be divided into a 19 x 19 grid, but for the purpose of this explanation we’ll use a 3 x 3 grid.

In the image above we have two cars, and we marked their bounding boxes in red.

Next, for each grid cell, we have the following labels for training, the same ones we showed earlier in Part 1 of the series.

So how do we associate objects to individual cells?

For the rightmost car, it’s easy. It belongs to the middle right cell since its bounding box is inside that grid cell.

For the truck in the middle of the image, its bounding box intersects several grid cells. The YOLO algorithm takes the midpoint of the bounding box and associates the object with the grid cell containing that point, as sketched below.
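Here’s a minimal Python sketch of that assignment rule (the function name assign_grid_cell and the 300 x 300 image size are illustrative assumptions, not from the original post):

```python
def assign_grid_cell(box, image_size, grid_size=3):
    """Return the (row, col) of the grid cell that owns a bounding box.

    `box` is (x_min, y_min, x_max, y_max) in pixels; the owning cell
    is the one containing the box's midpoint.
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    # Midpoint of the bounding box in pixels.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Scale to grid coordinates, clamping a midpoint on the far edge
    # of the image into the last cell.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col

# A box whose midpoint falls in the middle-right cell of a 3 x 3 grid.
print(assign_grid_cell((220, 110, 290, 190), (300, 300)))  # -> (1, 2)
```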

As a result, here are the output labels for each grid cell.

Notice that for grid cells with no object detected, pc = 0 and we don’t care about the rest of the values. That’s what the “?” means in the figure.

And the bounding box parameters are defined as follows:

bx: x coordinate of the object’s center, relative to the upper left corner of the grid cell; the value ranges from 0 to 1,

by: y coordinate of the object’s center, relative to the upper left corner of the grid cell; the value ranges from 0 to 1,

bh: height of the bounding box, relative to the grid cell size; the value can be greater than 1,

bw: width of the bounding box, relative to the grid cell size; the value can be greater than 1.
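Putting those definitions together, here’s a sketch of how an absolute pixel box might be encoded into (bx, by, bh, bw). The helper name encode_box is my own; the parameterization simply follows the definitions above:

```python
def encode_box(box, image_size, grid_size=3):
    """Encode an absolute pixel box as YOLO-style (bx, by, bh, bw).

    bx, by: box midpoint relative to the owning cell's upper left
            corner, as a fraction of the cell size (0~1).
    bh, bw: box height/width as fractions of the cell size, so they
            can exceed 1 when the box is larger than one cell.
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    col = min(int(cx / cell_w), grid_size - 1)
    row = min(int(cy / cell_h), grid_size - 1)
    bx = cx / cell_w - col           # offset inside the cell, 0~1
    by = cy / cell_h - row
    bw = (x_max - x_min) / cell_w    # may be greater than 1
    bh = (y_max - y_min) / cell_h
    return (row, col), (bx, by, bh, bw)
```

For example, a 200 x 140 pixel box centered in the middle cell of a 3 x 3 grid over a 300 x 300 image encodes to bx = by = 0.5, bw = 2.0, bh = 1.4, which shows how bw and bh can exceed 1.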

For the class labels, there are three types of targets we’re detecting:

1. pedestrian
2. car
3. motorcycle

Since “car” is the second class, c2 = 1 and the other class values are 0 (a quick sketch follows).
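In code, the class portion of the label is just a one-hot vector. A tiny sketch, with the class names and ordering assumed from the list above:

```python
CLASSES = ["pedestrian", "car", "motorcycle"]  # c1, c2, c3

def one_hot(class_name):
    """One-hot class vector: 1 for the matching class, 0 elsewhere."""
    return [1.0 if name == class_name else 0.0 for name in CLASSES]

print(one_hot("car"))  # -> [0.0, 1.0, 0.0], i.e. c2 = 1
```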

In reality, we may be detecting 80 different types of targets. In that case, each grid cell’s output y will have 5 + 80 = 85 values instead of the 8 shown here.

With that in mind, the target output combining all grid cells has the size 3 x 3 x 8.
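As a sketch of how that full target might be assembled, reusing the hypothetical encode_box and one_hot helpers from above (this is an illustration under those assumptions, not the post’s official implementation):

```python
import numpy as np

def build_target(objects, image_size, grid_size=3, num_classes=3):
    """Assemble the 3 x 3 x 8 training target for one image.

    `objects` is a list of (box, class_name) pairs. Cells with no
    object keep pc = 0; the remaining values stay at 0, standing in
    for the "?" (don't-care) entries described above.
    """
    target = np.zeros((grid_size, grid_size, 5 + num_classes))
    for box, class_name in objects:
        (row, col), (bx, by, bh, bw) = encode_box(box, image_size, grid_size)
        target[row, col, 0] = 1.0                    # pc: object present
        target[row, col, 1:5] = [bx, by, bh, bw]     # bounding box
        target[row, col, 5:] = one_hot(class_name)   # c1, c2, c3
    return target
```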

But there’s a limitation with only having grid cells.

Say we have multiple objects in the same grid cell. For instance, a person is standing in front of a car, and their bounding box centers are so close that they fall into the same cell. Should we label the cell as the person or the car?

To solve this problem, we’ll introduce the concept of anchor boxes.