A Comprehensive Guide To Object Detection Using YOLO Framework — Part I

The theory behind YOLO, network architecture and more

Cover Image (Source: Author)

Table Of Contents:

Introduction

Why YOLO?

How does it work?

Intersection over Union (IoU)

Non-max suppression

Network Architecture

Training

Limitations of YOLO

Conclusion

Introduction:

You Only Look Once (YOLO) is a newer, faster approach to object detection. Traditional systems repurpose classifiers to perform detection: to detect an object, the system takes a classifier for that object and evaluates it at various locations in the image. Other systems generate potential bounding boxes using region proposal methods and then run a classifier on these proposed boxes. This is somewhat more efficient, but after classification, post-processing is still needed to refine the bounding boxes, eliminate duplicate detections, and so on. These complex pipelines are slow and hard to optimize because each component has to be trained separately.

Object Detection with Confidence Score

Why YOLO?

The base model can process images in real time at 45 frames per second. A smaller version of the network, Fast YOLO, can process images at 155 frames per second while achieving double the mAP of other real-time detectors. It outperforms other detection methods, including DPM (Deformable Parts Models) and R-CNN.

How Does It Work?

YOLO reframes object detection as a single regression problem instead of a classification problem. This system only looks at the image once to detect what objects are present and where they are, hence the name YOLO.

The system divides the image into an S x S grid. Each of these grid cells predicts B bounding boxes and confidence scores for those boxes. The confidence score indicates how sure the model is that the box contains an object and also how accurate it thinks the predicted box is. The confidence score can be calculated using the formula:

C = Pr(Object) * IoU

IoU: Intersection over Union between the predicted box and the ground truth.

If no object exists in a cell, its confidence score should be zero.
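As a minimal sketch of the confidence formula above (the function name is mine, not from the paper):

```python
def box_confidence(pr_object, iou_with_truth):
    """Confidence score C = Pr(Object) * IoU for one predicted box."""
    return pr_object * iou_with_truth

# A cell with no object should score zero regardless of box quality.
print(box_confidence(0.0, 0.9))  # 0.0
print(box_confidence(1.0, 0.7))  # 0.7
```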

Bounding Box Predictions (Source: Author)

Each bounding box consists of five predictions: x, y, w, h, and confidence where,

(x,y): Coordinates representing the center of the box. These coordinates are calculated with respect to the bounds of the grid cells.

w: Width of the bounding box, predicted relative to the whole image.

h: Height of the bounding box, predicted relative to the whole image.

Each grid cell also predicts C conditional class probabilities Pr(Class_i|Object). It predicts only one set of class probabilities per grid cell, regardless of the number of boxes B. At test time, these conditional class probabilities are multiplied by the individual box confidence predictions, which gives class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the box fits the object.

Pr(Class_i|Object) * Pr(Object) * IoU = Pr(Class_i) * IoU

The final predictions are encoded as an S x S x (B*5 + C) tensor.
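A quick sanity check of the encoding, using the Pascal VOC configuration from the YOLO paper (S = 7, B = 2, C = 20); the class-score numbers below are made up for illustration:

```python
S, B, C = 7, 2, 20

# Each cell predicts B boxes (x, y, w, h, confidence = 5 values each)
# plus C class probabilities, so the output is S x S x (B*5 + C).
output_shape = (S, S, B * 5 + C)
print(output_shape)  # (7, 7, 30)

# Class-specific confidence for one box at test time:
# Pr(Class_i | Object) * Pr(Object) * IoU = Pr(Class_i) * IoU
pr_class_given_object = 0.8  # hypothetical conditional class probability
box_conf = 0.6               # hypothetical Pr(Object) * IoU
class_score = pr_class_given_object * box_conf
```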

Intersection Over Union (IoU):

IoU is used to evaluate object detection algorithms. It measures the overlap between the ground-truth and predicted bounding boxes, i.e., it quantifies how closely the predicted box matches the ground truth.
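The metric can be computed directly from box coordinates. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a common convention; the (x, y, w, h) boxes above would be converted first):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap at all.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0  (identical boxes)
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0  (disjoint boxes)
```

An IoU of 1 means a perfect match; 0 means no overlap. Detection benchmarks typically count a prediction as correct when its IoU with the ground truth exceeds a threshold such as 0.5.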