Screenshot from a video made by Joseph Redmon on Youtube

Originally published at https://www.yanjia.li on Dec 30, 2019.

For full source code, please go to https://github.com/ethanyanjiali/deep-vision/tree/master/YOLO/tensorflow. I really appreciate your STAR that supports my efforts.

Foreword

When a self-driving car runs on a road, how does it know where are other vehicles in the camera image? When an AI radiologist reading an X-ray, how does it know where the lesion (abnormal tissue) is? Today, I will walk through this fascinating algorithm, which can identify the category of the given image, and also locate the region of interest. There’s plenty of algorithms introduced in recent years to address object detection in a deep learning approach, such as R-CNN, Faster-RCNN, and Single Shot Detector. Among those, I’m most interested in a model called YOLO — You Only Look Once. This model attracts me so much, not only because of its funny name, but also some practical design that truly makes sense for me. In 2018, this latest V3 of this model had been released, and it achieved many new State of the Art performance. Because I’ve programmed some GANs and image classification networks before, and also Joseph Redmon described it in the paper in a really easy-going way, I thought this detector would just be another stack of CNN and FC layers that just works well magically.

But I was wrong.

Perhaps it’s because I’m just dumber than usual engineers, I found it really difficult for me to translate this model from the paper to actual code. And even when I managed to do that in a couple of weeks (I gave up once put it away for a few weeks), I found it even more difficult for me to make it work. There’re so quite a few blogs, GitHub repos about YOLO V3, but most of them just gave a very high-level overview of the architecture, and somehow they just succeed. Even worse, the paper itself is too chill that it fails to provide many crucial details of implementation, and I have to read the author’s original C implementation (when is the last time did I write C? Maybe at college?) to confirm some of my guesses. When there’s a bug, I usually have no idea why it would occur. Then I end up manually debugging it step by step and calculating those formulas with my little calculator.

Fortunately, I didn’t give up this time and finally made it work. But in the meantime, I also felt really strongly that there should be a more thorough guide out there on the internet to help dumb people like me to understand every detail of this system. After all, if one single detail is wrong, the whole system would go south quickly. And I’m sure that if I don’t write these down, I would forget all these in few weeks too. So, here I am, presenting you this “Dive Really Deep into YOLO V3: A Beginner’s Guide”. I hope you’ll like it.

Prerequisite

Before getting into the network itself, I’ll need to clarify with some prerequisites first. As a reader, you are expected to:

1. Understand the basics of Convolutional Neural Network and Deep Learning

2. Understand the idea of object detection task

3. Have curiosity about how the algorithm works internally

If you need help on first two items, there’re plenty of excellent resources like Udacity Computer Vision Nanodegree, Cousera Deep Learning Specialization and Stanford CS231n

If you just want to build something to detect some object with your custom dataset quickly, check out this Tensorflow Object Detection API

YOLO V3

YOLO V3 is an improvement over previous YOLO detection networks. Compared to prior versions, it features multi-scale detection, stronger feature extractor network, and some changes in the loss function. As a result, this network can now detect many more targets from big to small. And, of course, just like other single-shot detectors, YOLO V3 also runs quite fast and makes real-time inference possible on GPU devices. Well, as a beginner to object detection, you might not have a clear image of what do they mean here. But you will gradually understand them later in my post. For now, just remember that YOLO V3 is one of the best models in terms of real-time object detection as of 2019.

Network Architecture

Diagram by myself

First of all, let’s talk about how this network look like at a high-level diagram (Although, the network architecture is the least time-consuming part of implementation). The whole system can be divided into two major components: Feature Extractor and Detector; both are multi-scale. When a new image comes in, it goes through the feature extractor first so that we can obtain feature embeddings at three (or more) different scales. Then, these features are feed into three (or more) branches of the detector to get bounding boxes and class information.

Darknet-53

The feature extractor YOLO V3 uses is called Darknet-53. You might be familiar with the previous Darknet version from YOLO V1, where there’re only 19 layers. But that was like a few years ago, and the image classification network has progressed a lot from merely deep stacks of layers. ResNet brought the idea of skip connections to help the activations to propagate through deeper layers without gradient diminishing. Darknet-53 borrows this idea and successfully extends the network from 19 to 53 layers, as we can see from the following diagram.

Diagram from YOLOv3: An Incremental Improvement

This is very easy to understand. Consider layers in each rectangle as a residual block. The whole network is a chain of multiple blocks with some strides 2 Conv layers in between to reduce dimension. Inside the block, there’s just a bottleneck structure (1x1 followed by 3x3) plus a skip connection. If the goal is to do multi-class classification as ImageNet does, an average pooling and a 1000 ways fully connected layers plus softmax activation will be added.

However, in the case of object detection, we won’t include this classification head. Instead, we are going to append a “detection” head to this feature extractor. And since YOLO V3 is designed to be a multi-scaled detector, we also need features from multiple scales. Therefore, features from last three residual blocks are all used in the later detection. In the diagram below, I’m assuming the input is 416x416, so three scale vectors would be 52x52, 26x26, and 13x13. Please note that if the input size is different, the output size will differ too.

Diagram by myself

Multi-scale Detector

Once we have three features vectors, we can now feed them into the detector. But how should we structure this detector? Unfortunately, the author didn’t bother to explain this part this his paper. But we could still take a look at the source code he published on Github. Through this config file, multiple 1x1 and 3x3 Conv layers are used before a final 1x1 Conv layer to form the final output. For medium and small scale, it also concatenates features from the previous scale. By doing so, small scale detection can also benefit from the result of large scale detection.

Diagram by myself

Assuming the input image is (416, 416, 3), the final output of the detectors will be in shape of [(52, 52, 3, (4 + 1 + num_classes)), (26, 26, 3, (4 + 1 + num_classes)), (13, 13, 3, (4 + 1 + num_classes))]. The three items in the list represent detections for three scales. But what do the cells in this 52x52x3x(4+1+num_classes) matrix mean? Good questions. This brings us to the most important notion in pre-2019 object detection algorithm: anchor box (prior box).

Anchor Box

The goal of object detection is to get a bounding box and its class. Bounding box usually represents in a normalized xmin, ymin, xmax, ymax format. For example, 0.5 xmin and 0.5 ymin mean the top left corner of the box is in the middle of the image. Intuitively, if we want to get a numeric value like 0.5, we are facing a regression problem. We may as well just have the network predict for values and use Mean Square Error to compare with the ground truth. However, due to the large variance of scale and aspect ratio of boxes, researchers found that it’s really hard for the network to converge if we just use this “brute force” way to get a bounding box. Hence, in Faster-RCNN paper, the idea of an anchor box is proposed.

Anchor box is a prior box that could have different pre-defined aspect ratios. These aspect ratios are determined before training by running K-means on the entire dataset. But where does the box anchor to? We need to introduce a new notion called the grid. In the “ancient” year of 2013, algorithms detect objects by using a window to slide through the entire image and running image classification on each window. However, this is so inefficient that researchers proposed to use Conv net to calculate the whole image all in once (technically, only when your run convolution kernels in parallel.) Since the convolution outputs a square matrix of feature values (like 13x13, 26x26, and 52x52 in YOLO), we define this matrix as a “grid” and assign anchor boxes to each cell of the grid. In other words, anchor boxes anchor to the grid cells, and they share the same centroid. And once we defined those anchors, we can determine how much does the ground truth box overlap with the anchor box and pick the one with the best IOU and couple them together. I guess you can also claim that the ground truth box anchors to this anchor box. In our later training, instead of predicting coordinates from the wild west, we can now predict offsets to these bounding boxes. This works because our ground truth box should look like the anchor box we pick, and only subtle adjustment is needed, whhich gives us a great head start in training.

Diagram by myself

In YOLO v3, we have three anchor boxes per grid cell. And we have three scales of grids. Therefore, we will have 52x52x3, 26x26x3 and 13x13x3 anchor boxes for each scale. For each anchor box, we need to predict 3 things:

1. The location offset against the anchor box: tx, ty, tw, th. This has 4 values.

2. The objectness score to indicate if this box contains an object. This has 1 value.

3. The class probabilities to tell us which class this box belongs to. This has num_classes values.

In total, we are predicting 4 + 1 + num_classes values for one anchor box, and that’s why our network outputs a matrix in shape of 52x52x3x(4+1+num_classes) as I mentioned before. tx, ty, tw, th isn’t the real coordinates of the bounding box. It’s just the relative offsets compared with a particular anchor box. I’ll explain these three predictions more in the Loss Function section after.

Anchor box not only makes the detector implementation much harder and much error-prone, but also introduced an extra step before training if you want the best result. So, personally, I hate it very much and feel like this anchor box idea is more a hack than a real solution. In 2018 and 2019, researchers start to question the need for anchor box. Papers like CornerNet, Object as Points, and FCOS all discussed the possibility of training an object detector from scratch without the help of an anchor box.

Loss Function

With the final detection output, we can calculate the loss against the ground truth labels now. The loss function consists of four parts (or five, if you split noobj and obj): centroid (xy) loss, width and height (wh) loss, objectness (obj and noobj) loss and classification loss. When putting together, the formula is like this:

Loss = Lambda_Coord * Sum(Mean_Square_Error((tx, ty), (tx’, ty’) * obj_mask)

+ Lambda_Coord * Sum(Mean_Square_Error((tw, th), (tw’, th’) * obj_mask)

+ Sum(Binary_Cross_Entropy(obj, obj’) * obj_mask) + Lambda_Noobj * Sum(Binary_Cross_Entropy(obj, obj’) * (1 -obj_mask) * ignore_mask)

+ Sum(Binary_Cross_Entropy(class, class’))

It looks intimidating but let me break them down and explain one by one.

xy_loss = Lambda_Coord * Sum(Mean_Square_Error((tx, ty), (tx’, ty’)) * obj_mask)

The first part is the loss for bounding box centroid. tx and ty is the relative centroid location from the ground truth. tx’ and ty’ is the centroid prediction from the detector directly. The smaller this loss is, the closer the centroids of prediction and ground truth are. Since this is a regression problem, we use mean square error here. Besides, if there’s no object from the ground truth for certain cells, we don’t need to include the loss of that cell into the final loss. Therefore we also multiple by obj_mask here. obj_mask is either 1 or 0, which indicates if there’s an object or not. In fact, we could just use obj as obj_mask, obj is the objectness score that I will cover later. One thing to note is that we need to do some calculation on ground truth to get this tx and ty. So, let’s see how to get this value first. As the author says in the paper:

bx = sigmoid(tx) + Cx

by = sigmoid(ty) + Cy

Here bx and by are the absolute values that we usually use as centroid location. For example, bx = 0.5, by = 0.5 means that the centroid of this box is the center of the entire image. However, since we are going to compute centroid off the anchor, our network is actually predicting centroid relative the top-left corner of the grid cell. Why grid cell? Because each anchor box is bounded to a grid cell, they share the same centroid. So the difference to grid cell can represent the difference to anchor box. In the formula above, sigmoid(tx) and sigmoid(ty) are the centroid location relative to the grid cell. For instance, sigmoid(tx) = 0.5 and sigmoid(ty) = 0.5 means the centroid is the center of the current grid cell (but not the entire image). Cx and Cy represents the absolute location of the top-left corner of the current grid cell. So if the grid cell is the one in the SECOND row and SECOND column of a grid 13x13, then Cx = 1 and Cy = 1. And if we add this grid cell location with relative centroid location, we will have the absolute centroid location bx = 0.5 + 1 and by = 0.5 + 1. Certainly, the author won’t bother to tell you that you also need to normalize this by dividing by the grid size, so the true bx would be 1.5/13 = 0.115. Ok, now that we understand the above formula, we just need to invert it so that we can get tx from bx in order to translate our original ground truth into the target label. Lastly, Lambda_Coord is the weight that Joe introduced in YOLO v1 paper. This is to put more emphasis on localization instead of classification. The value he suggested is 5.

Diagram from YOLOv3: An Incremental Improvement

wh_loss = Lambda_Coord * Sum(Mean_Square_Error((tw, th), (tw’, th’)) * obj_mask)

The next one is the width and height loss. Again, the author says:

bw = exp(tw) * pw

bh = exp(th) * ph

Here bw and bh are still the absolute width and height to the whole image. pw and ph are the width and height of the prior box (aka. anchor box, why there’re so many names). We take e^(tw) here because tw could be a negative number, but width won’t be negative in real world. So this exp() will make it positive. And we multiply by prior box width pw and ph because the prediction exp(tw) is based off the anchor box. So this multiplication gives us real width. Same thing for height. Similarly, we can inverse the formula above to translate bw and bh to tx and th when we calculate the loss.

obj_loss = Sum(Binary_Cross_Entropy(obj, obj’) * obj_mask) noobj_loss = Lambda_Noobj * Sum(Binary_Cross_Entropy(obj, obj’) * (1 — obj_mask) * ignore_mask)

The third and fourth items are objectness and non-objectness score loss. Objectness indicates how likely is there an object in the current cell. Unlike YOLO v2, we will use binary cross-entropy instead of mean square error here. In the ground truth, objectness is always 1 for the cell that contains an object, and 0 for the cell that doesn’t contain any object. By measuring this obj_loss, we can gradually teach the network to detect a region of interest. In the meantime, we don’t want the network to cheat by proposing objects everywhere. Hence, we need noobj_loss to penalize those false positive proposals. We get false positives by masking prediciton with 1-obj_mask. The `ignore_mask` is used to make sure we only penalize when the current box doesn’t have much overlap with the ground truth box. If there is, we tend to be softer because it’s actually quite close to the answer. As we can see from the paper, “If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction.” Since there are way too many noobj than obj in our ground truth, we also need this Lambda_Noobj = 0.5 to make sure the network won’t be dominated by cells that don’t have objects.

class_loss = Sum(Binary_Cross_Entropy(class, class’) * obj_mask)

The last loss is classification loss. If there’re 80 classes in total, the class and class’ will be the one-hot encoding vector that has 80 values. In YOLO v3, it’s changed to do multi-label classification instead of multi-class classification. Why? Because some dataset may contains labels that are hierarchical or related, eg woman and person. So each output cell could have more than 1 class to be true. Correspondingly, we also apply binary cross-entropy for each class one by one and sum them up because they are not mutually exclusive. And like we did to other losses, we also multiply by this obj_mask so that we only count those cells that have a ground truth object.

To fully understand how this loss works, I suggest you manually walk through them with a real network prediction and ground truth. Calculating the loss by your calculator (or tf.math) can really help you to catch all the nitty-gritty details. And I did that by myself, which helped me find lots of bugs. After all, the devil is in the detail.

Implementation

If I stop writing here, my post will just be like another “YOLO v3 Review” somewhere on the web. Once you digest the general idea of YOLO v3 from the previous section, we are now ready to go explore the remaining 90% of our YOLO v3 journey: Implementation.

Framework

At the end of September, Google finally released TensorFlow 2.0.0. This is a fascinating milestone for TF. Nevertheless, new design doesn’t necessarily mean less pain for developers. I’ve been playing around TF 2 since very early of 2019 because I always wanted to write TensorFlow code in the way I did for PyTorch. If it’s not because of TensorFlow’s powerful production suite like TF Serving, TF lite, and TF Board, etc., I guess many developers will not choose TF for new projects. Hence, if you don’t have a strong demand for production deployment, I would suggest you implement YOLO v3 in PyTorch or even MXNet. However, if you made your mind to stick with TensorFlow, please continue reading.

TensorFlow 2 officially made eager mode a first-tier citizen. To put it simply, instead of using TensorFlow specific APIs to calculate in a graph, you can now leverage native Python code to run the graph in a dynamic mode. No more graph compilation and much easier debugging and control flow. In the case where performance is more important, a handy tf.function decorator is also provided to help compile the code into a static graph. But, the reality is, eager mode and tf.function are still buggy or not well documented sometimes, which makes your life even harder in a complicated system like YOLO v3. Also, Keras model isn’t quite flexible, while the custom training loop is still quite experimental. Therefore, the best strategy for you to write YOLO v3 in TF 2 is to start with a minimum working template first, and gradually add more logic to this shell. By doing so, we can fail early and fix the bug before it hides too deeply in a giant nested graph.

Dataset

Aside from the framework to choose, the most important thing for successful training is the dataset. In the paper, the author used MSCOCO dataset to validate his idea. Indeed, this is a great dataset, and we should aim for a good accuracy on this benchmark dataset for our model. However, a big dataset like this could also hide some bugs in your code. For example, if the loss is not dropping, how do you know if it just needs more time to converge, or your loss function is wrong? Even with 8x V100 GPU, the training is still not fast enough for you to quickly iterate and fix things. Therefore, I recommend you to build a development set which contains tens of images to make sure your code looks “working” first. Another option is to use VOC 2007 dataset, which only has 2500 training images. To use MSCOCO or VOC2007 dataset and create TF Records, you could refer to my helper scripts here: MSCOCO, VOC2007

Preprocessing

Preprocessing stands for the operations to translate raw data into a proper input format of the network. For the image classification task, we usually just need to resize the image, and one-hot encode the label. But things are a bit more complicated for YOLO v3. Remember I said the output of the network is like 52x52x3x(4+1+num_classes) and has three different scales? Since we need to calculate the delta between ground truth and prediction, we also need to format our ground truth into such a matrix first.

For each ground truth bounding box, we need to pick the best scale and anchor for it. For example, a tiny kite in the sky should be in the small scale (52x52). And if the kite is more like a square in the image, we should also pick the most square-shaped anchor in that scale. In YOLO v3, the author provides 9 anchors for 3 scales. All we need to do is to choose the one that matches our ground truth box the most. When I implement this, I thought I need the coordinates of the anchor box as well to calculate IOU. In fact, you don’t need to. Since we just want to know which anchor fits our ground truth box best, we can just assume all anchors and the ground truth box share the same centroid. And with this assumption, the degree of matching would be the overlapping area, which can be calculated by min width * min height.

During the transformation, one could also add some data augmentation to increase the variety of training set virtually. For example, typical augmentation includes random flipping, random cropping, and random translating. However, these augmentations won’t block you from training a working detector, so I won’t cover much about this advanced topic.

Training

After all these discussions, you finally have a chance to run “python train.py” and start your model training. And this is also when you meet most of your bugs. You could refer to my training script here when you are blocked. Meanwhile, I want to provide some tips that are helpful for my own training.

NaN Loss

Check your learning rate and make sure it’s not too high to explode your gradient. Check for 0 in binary cross-entropy because ln(0) is not a number. You can clip the value from (epsilon, 1 — epsilon). Find an example and walk through your loss step by step. Find out which part of your loss goes to NaN. For example, if width/height loss went to NaN, it could be because the way you calculate from tw to bw is wrong.

Loss remains high

Try to increase your learning rate to see if it can drop faster. Mine starts at 0.01. But I’ve seen 1e-4 and 1e-5 works too. Visualize your preprocessed ground truth to see if it makes sense. One problem I had before is that my output grid is in [y][x] instead of [x][y], but my ground truth is reversed. Again, manually walk through your loss with a real example. I had a mistake of calculating cross-entropy between objectness and class probabilities. My loss also remains around 40 after 50 epochs of MSCOCO. However, the result isn’t that bad. Double-check the coordinates format throughout your code. YOLO requires xywh (centroid x, centroid y, width and height), but most of dataset comes as x1y1x2y2 (xmin, ymin, xmax, ymax). Double-check your network architecture. Don’t get misled by the diagram from a post called “A Closer Look at YOLOv3 — CyberAILab”. tf.keras.losses.binary_crossentropy isn’t the sum of binary cross-entropy you need.

Loss is low, but the prediction is off

Adjusting lambda_coord or lambda_noobj to the loss based on your observation. If you are traininig on your own dataset, and the dataset is relative small (< 30k images), you should intialize weights from a COCO pretrained model first. Double-check your non max suppression code and adjust some threshold (I’ll talk about NMS later). Make sure your obj_mask in the loss function isn’t mistakenly taking out necessary elements. Again and again, your loss function. When calculating loss, it uses relative xywh in a cell (also called tx, ty, tw, th). When calculating ignore mask and IOU, it uses absolute xywh in the whole image, though. Don’t mix them up.

Loss os low, but there’s no prediction

If you are using a custom dataset, please check the distribution of your ground truth boxes first. The amount and quality of the boxes could really affect what the network learn (or cheat) to do. Predict on your training set to see if your model can overfit on the training set at least.

Multi-GPU training

Since the object detection network has so many parameters to train, it’s always better to have more computing power. For example, On MSCOCO 2017 Dataset, 8x Telsa V100 GPUs can train a whole epoch in 10 minutes, while 1x V100 would need more than 1 hour.

However, TensorFlow 2.0 doesn’t have great support over multi-GPU training so far. To do that in TF, you’ll need to pick a training strategy like MirroredStrategy, as I did here. Then wrap your dataset loader into a distributed version too. One caveat for distributed training is that the loss coming out of each batch should be divided by the global batch size because we are going to `reduce_sum` over all GPU results. For example, if the local batch size is 8, and there’re 8 GPUs, your batch loss should divide a global batch size of 64. Once you summed up losses from all replica, the final result will be the average loss of a single example.

Postprocessing

The final component in this detection system is a post-processor. Usually, postprocessing is just about trivial things like replacing machine-readable class id with human-readable class text. In object detection, though, we have one more crucial step to do to get final human-readable results. This is called non maximum suppression.

Let’s recall our objectness loss. When is false proposal has great overlap with ground truth, we won’t penalize it with noobj_loss. This encourages the network to predict close results so that we can train it more easily. Also, although not used in YOLO, when the sliding window approach is used, multiple windows could predict the same object. In order to eliminate these duplicate results, smart researchers designed an algorithm called non maximum supression (NMS).

Photo by Python Lessons from Analytics Vidhya

The idea of NMS is quite simple. Find out the detection box with the best confidence first, add it to the final result, and then eliminates all other boxes which have IOU over a certain threshold with this best box. Next, you choose another box with the best confidence in the remaining boxes and do the same thing over and over until nothing is left. In the code, since TensorFlow needs explicit shape most of the time, we will usually define a max number of detection and stop early if that number is reached. In YOLO v3, our classification is not mutually exclusive anymore, and one detection could have more than one true class. However, some existing NMS code doesn’t take that into consideration, so be careful when you use them.

Conclusion

YOLO v3 is a masterpiece in the rising era of Artificial Intelligence, and also an excellent summary of Convolution Neural Network techniques and tricks in the 2010s. Although there’re many turn-key solutions like Detectron out there to simplify the process of making a detector, a hands-on experience in coding such sophisticated detector is really a great learning opportunity for machine learning engineers because merely reading the paper is far from enough. Like Ray Dalio said about his philosophy:

Pain plus reflection equals progress.

I hope my article could be a lighthouse in your painful journey of implementing YOLO v3, and perhaps you can also share the delightful progress with us later. If you like my article or my source code of YOLO v3, please ⭐star⭐ my repo and that will be the biggest support for me.

References