3. Implementation of YOLO v3 detection layers.

Features extracted by Darknet-53 are fed into the detection layers. The detection module is built from a number of conv layers grouped in blocks, upsampling layers, and 3 conv layers with a linear activation function, making detections at 3 different scales. Let's start by writing the helper function _yolo_block:

def _yolo_block(inputs, filters):
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    route = inputs
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)
    return route, inputs

Activations from the 5th layer in the block are routed to another conv layer and upsampled, while activations from the 6th layer go to _detection_layer, which we are going to define now:

def _detection_layer(inputs, num_classes, anchors, img_size, data_format):
    num_anchors = len(anchors)
    predictions = slim.conv2d(inputs, num_anchors * (5 + num_classes), 1,
                              stride=1, normalizer_fn=None, activation_fn=None,
                              biases_initializer=tf.zeros_initializer())

    shape = predictions.get_shape().as_list()
    grid_size = _get_size(shape, data_format)
    dim = grid_size[0] * grid_size[1]
    bbox_attrs = 5 + num_classes

    if data_format == 'NCHW':
        predictions = tf.reshape(predictions, [-1, num_anchors * bbox_attrs, dim])
        predictions = tf.transpose(predictions, [0, 2, 1])

    predictions = tf.reshape(predictions, [-1, num_anchors * dim, bbox_attrs])

    stride = (img_size[0] // grid_size[0], img_size[1] // grid_size[1])

    anchors = [(a[0] / stride[0], a[1] / stride[1]) for a in anchors]

    box_centers, box_sizes, confidence, classes = tf.split(
        predictions, [2, 2, 1, num_classes], axis=-1)

    box_centers = tf.nn.sigmoid(box_centers)
    confidence = tf.nn.sigmoid(confidence)

    grid_x = tf.range(grid_size[0], dtype=tf.float32)
    grid_y = tf.range(grid_size[1], dtype=tf.float32)
    a, b = tf.meshgrid(grid_x, grid_y)

    x_offset = tf.reshape(a, (-1, 1))
    y_offset = tf.reshape(b, (-1, 1))

    x_y_offset = tf.concat([x_offset, y_offset], axis=-1)
    x_y_offset = tf.reshape(tf.tile(x_y_offset, [1, num_anchors]), [1, -1, 2])

    box_centers = box_centers + x_y_offset
    box_centers = box_centers * stride

    anchors = tf.tile(anchors, [dim, 1])
    box_sizes = tf.exp(box_sizes) * anchors
    box_sizes = box_sizes * stride

    detections = tf.concat([box_centers, box_sizes, confidence], axis=-1)

    classes = tf.nn.sigmoid(classes)
    predictions = tf.concat([detections, classes], axis=-1)
    return predictions
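The meshgrid/tile sequence that builds x_y_offset can be hard to visualize. Here is a minimal NumPy sketch of the same offset construction, using a toy 2x2 grid and 3 anchors (the real scales use 13, 26 and 52 cells per side):

```python
import numpy as np

grid_size = 2      # toy grid; the real scales use 13, 26 and 52
num_anchors = 3

# mirror tf.meshgrid with default 'xy' indexing:
# a[i, j] = j (x coordinate), b[i, j] = i (y coordinate)
a, b = np.meshgrid(np.arange(grid_size, dtype=np.float32),
                   np.arange(grid_size, dtype=np.float32))

x_offset = a.reshape(-1, 1)
y_offset = b.reshape(-1, 1)

# one (x, y) pair per grid cell...
x_y_offset = np.concatenate([x_offset, y_offset], axis=-1)

# ...repeated once per anchor, matching the flattened prediction layout
x_y_offset = np.tile(x_y_offset, (1, num_anchors)).reshape(1, -1, 2)

print(x_y_offset.shape)   # (1, 12, 2): grid_size * grid_size * num_anchors rows
print(x_y_offset[0, :3])  # the first cell's offset (0, 0) repeated 3 times
```

Each of the num_anchors predictions made in a cell ends up paired with that cell's (c_x, c_y) offset, which is exactly what the addition box_centers + x_y_offset relies on.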

This layer transforms the raw predictions according to the following equations from the YOLO v3 paper:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w * e^(t_w)
b_h = p_h * e^(t_h)

where t_x, t_y, t_w, t_h are the raw network outputs, (c_x, c_y) is the offset of the grid cell, and p_w, p_h are the anchor dimensions. Because YOLO v3 detects objects of different sizes and aspect ratios on each scale, the anchors argument is passed: a list of 3 (width, height) tuples for each scale. The anchors need to be tailored to the dataset (in this tutorial we will use the anchors for the COCO dataset). Just add this constant somewhere near the top of the yolo_v3.py file.

_ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119), (116, 90), (156, 198), (373, 326)]

Source: YOLO v3 paper
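As a sanity check, the same transformation can be reproduced in plain NumPy for a single raw prediction. The values below are made up for illustration; note that the detection layer divides the anchors by the stride and later multiplies the sizes back, so in pixel units the two stride factors cancel:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

stride = 32                      # 416 / 13 for the coarsest scale
anchor_w, anchor_h = 116, 90     # one of the COCO anchors, in pixels
cx, cy = 6, 6                    # grid cell containing the prediction

# made-up raw network outputs t_x, t_y, t_w, t_h
t_x, t_y, t_w, t_h = 0.2, -0.1, 0.3, 0.5

# b_x = (sigmoid(t_x) + c_x) * stride, and analogously for b_y,
# converting grid coordinates back to input-image pixels
b_x = (sigmoid(t_x) + cx) * stride
b_y = (sigmoid(t_y) + cy) * stride

# b_w = p_w * exp(t_w): the anchor/stride division in the layer is
# undone by the final multiplication by stride
b_w = anchor_w * np.exp(t_w)
b_h = anchor_h * np.exp(t_h)

print(b_x, b_y, b_w, b_h)   # ≈ 209.6 207.2 156.6 148.4
```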

We need one small helper function, _get_size, which returns the height and width of the input:

def _get_size(shape, data_format):
    if len(shape) == 4:
        shape = shape[1:]
    return shape[1:3] if data_format == 'NCHW' else shape[0:2]

As mentioned earlier, the last building block we need to implement YOLO v3 is the upsampling layer. The YOLO detector uses the bilinear upsampling method. Why can't we just use the standard tf.image.resize_bilinear method from the TensorFlow API? The reason is that, as of today (TF version 1.8.0), all upsampling methods use constant pad mode. The standard pad method in the YOLO authors' repo and in PyTorch is edge (a good comparison of padding modes can be found here). This minor difference has a significant impact on the detections (and cost me a couple of hours of debugging).

To work around this, we will manually pad the inputs with 1 pixel and mode='SYMMETRIC', which for a 1-pixel pad is equivalent to edge mode.
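The equivalence is easy to verify with NumPy's np.pad, which supports all three modes (this sketch is only an illustration; the model itself pads with tf.pad):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# for a 1-pixel pad, 'symmetric' mirrors the border row/column,
# which is exactly what 'edge' (replicate) produces
sym = np.pad(x, 1, mode='symmetric')
edge = np.pad(x, 1, mode='edge')
const = np.pad(x, 1, mode='constant')  # pads with zeros, like TF's upsampling

print(np.array_equal(sym, edge))    # True: identical for a 1-pixel pad
print(np.array_equal(sym, const))   # False: constant mode pads with zeros
```

Note that the equivalence only holds for a 1-pixel pad; for wider pads, symmetric and edge modes diverge.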

# we just need to pad with one pixel, so we set kernel_size = 3
inputs = _fixed_padding(inputs, 3, mode='SYMMETRIC')

The whole _upsample function looks as follows:

def _upsample(inputs, out_shape, data_format='NCHW'):
    # we need to pad with one pixel, so we set kernel_size = 3
    inputs = _fixed_padding(inputs, 3, mode='SYMMETRIC')

    # tf.image.resize_bilinear accepts input in format NHWC
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 2, 3, 1])

    # out_shape is [N, C, H, W] for NCHW and [N, H, W, C] for NHWC
    if data_format == 'NCHW':
        height = out_shape[2]
        width = out_shape[3]
    else:
        height = out_shape[1]
        width = out_shape[2]

    # we padded with 1 pixel on each side and upsample by a factor of 2,
    # so the new dimensions will be greater by 4 pixels after interpolation
    new_height = height + 4
    new_width = width + 4

    inputs = tf.image.resize_bilinear(inputs, (new_height, new_width))

    # trim back to the desired size
    inputs = inputs[:, 2:-2, 2:-2, :]

    # back to NCHW if needed
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 3, 1, 2])

    inputs = tf.identity(inputs, name='upsampled')
    return inputs

UPDATE: Thanks to Srikanth Vidapanakal, I checked the source code of darknet and found out that the upsampling method is nearest neighbor, not bilinear. We no longer need to pad the image. The updated code is already available in my repo.

The fixed _upsample function looks as follows:

def _upsample(inputs, out_shape, data_format='NCHW'):
    # tf.image.resize_nearest_neighbor accepts input in format NHWC
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 2, 3, 1])

    # out_shape is [N, C, H, W] for NCHW and [N, H, W, C] for NHWC
    if data_format == 'NCHW':
        new_height = out_shape[2]
        new_width = out_shape[3]
    else:
        new_height = out_shape[1]
        new_width = out_shape[2]

    inputs = tf.image.resize_nearest_neighbor(inputs, (new_height, new_width))

    # back to NCHW if needed
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 3, 1, 2])

    inputs = tf.identity(inputs, name='upsampled')
    return inputs
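For an integer scale factor, nearest-neighbor upsampling simply repeats each pixel along both spatial axes. A NumPy equivalent of what tf.image.resize_nearest_neighbor does for a factor-of-2 resize of an NHWC tensor is:

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbor upsampling of an NHWC tensor by an integer factor."""
    x = np.repeat(x, factor, axis=1)  # repeat rows (height)
    x = np.repeat(x, factor, axis=2)  # repeat columns (width)
    return x

x = np.arange(4, dtype=np.float32).reshape(1, 2, 2, 1)  # 1x2x2x1 feature map
y = upsample_nearest(x)

print(y.shape)          # (1, 4, 4, 1)
print(y[0, :, :, 0])    # each input pixel becomes a 2x2 block
```

Because each output value is an exact copy of an input value, no interpolation across the border happens, which is why the manual symmetric padding is no longer needed.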

Upsampled activations are concatenated along the channels axis with activations from Darknet-53. That is why we need to go back to the darknet53 function and return the activations from the conv layers before the 4th and 5th downsampling layers.
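For a 416x416 input, the shape arithmetic of that concatenation works out as follows. This is a NumPy sketch with dummy activations in NHWC layout; the 512 and 256 channel counts are taken from the reference YOLO v3 configuration:

```python
import numpy as np

# activations returned by darknet53 for a 416x416 input (dummy data):
# route2 is taken just before the 5th downsampling, at 1/16 resolution
route2 = np.zeros((1, 26, 26, 512), dtype=np.float32)

# activations from the 13x13 scale after a 1x1 conv (256 filters in the
# reference config) and nearest-neighbor upsampling from 13x13 to 26x26
upsampled = np.zeros((1, 26, 26, 256), dtype=np.float32)

# concatenation along the channels axis (last axis in NHWC)
merged = np.concatenate([upsampled, route2], axis=-1)
print(merged.shape)   # (1, 26, 26, 768)
```

The spatial dimensions must match exactly, which is why _upsample resizes to the routed activation's shape rather than simply doubling.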

def darknet53(inputs):
    """
    Builds Darknet-53 model.
    """
    inputs = _conv2d_fixed_padding(inputs, 32, 3)
    inputs = _conv2d_fixed_padding(inputs, 64, 3, strides=2)
    inputs = _darknet53_block(inputs, 32)
    inputs = _conv2d_fixed_padding(inputs, 128, 3, strides=2)

    for i in range(2):
        inputs = _darknet53_block(inputs, 64)

    inputs = _conv2d_fixed_padding(inputs, 256, 3, strides=2)

    for i in range(8):
        inputs = _darknet53_block(inputs, 128)

    route1 = inputs
    inputs = _conv2d_fixed_padding(inputs, 512, 3, strides=2)

    for i in range(8):
        inputs = _darknet53_block(inputs, 256)

    route2 = inputs
    inputs = _conv2d_fixed_padding(inputs, 1024, 3, strides=2)

    for i in range(4):
        inputs = _darknet53_block(inputs, 512)

    return route1, route2, inputs
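Counting the stride-2 convolutions makes the routed shapes explicit: route1 is taken after 3 downsamplings, route2 after 4, and the final output after 5. A quick check of the arithmetic for a 416x416 input:

```python
def feature_map_size(input_size, num_downsamplings):
    """Spatial size after a number of stride-2 convolutions."""
    return input_size // (2 ** num_downsamplings)

input_size = 416
print(feature_map_size(input_size, 3))  # route1: 52
print(feature_map_size(input_size, 4))  # route2: 26
print(feature_map_size(input_size, 5))  # final output: 13
```

These 52x52, 26x26 and 13x13 grids are exactly the three scales at which _detection_layer will make predictions.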

Now we are ready to define the detection module. Let's go back to the yolo_v3 function and add the following lines under the slim arg scope: