This is a simple, somewhat naive approach that can still be useful in many situations.

The scenario here is that you trained your own classifier, but want to reuse it for object detection without too much fuss.

Typical deep learning image classifiers have fully connected layers at the end of the network.

One common trick is to make the network fully convolutional, by replacing the fully connected layers with equivalent convolutional layers.

The converted network can now receive larger images, which effectively slides the original classifier across the image. The result is a score for every sliding-window center location.
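As a minimal sketch of this conversion, here is a toy Keras classifier whose Dense head is replaced by an equivalent Conv2D layer, so the network accepts arbitrarily sized inputs. The layer sizes (32x32 input, 16 filters, 10 classes) are assumptions for illustration, not from the original post:

```python
# Sketch: converting the Dense head of a small classifier into an
# equivalent Conv2D layer so the network accepts larger inputs.
import numpy as np
from tensorflow.keras import layers, models

# Original classifier: fixed 32x32 input, Dense head at the end.
clf = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Fully convolutional version: the Dense(10) acting on a 15x15x16
# feature map becomes a Conv2D(10, kernel_size=15); the input size
# is now unconstrained.
fcn = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(None, None, 3)),
    layers.MaxPooling2D(2),
    layers.Conv2D(10, 15, activation="softmax"),
])

# Copy the weights: the Dense kernel (3600, 10) is reshaped into a
# conv kernel (15, 15, 16, 10), matching Keras' channels-last layout.
fcn.layers[0].set_weights(clf.layers[0].get_weights())
w, b = clf.layers[3].get_weights()
fcn.layers[2].set_weights([w.reshape(15, 15, 16, 10), b])

# On a 64x64 image the output is a grid of class scores,
# one 10-way score vector per sliding-window location.
heatmap = fcn.predict(np.zeros((1, 64, 64, 3)))
```

On the 64x64 input the output heatmap is 17x17x10: one classification per position the original 32x32 classifier would visit with the network's stride.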

For more details check the Converting FC layers to CONV layers section here.

The example images above were taken from a Keras implementation of this that you can use.

Now you can threshold the resulting images to get locations of your objects.

If you want invariance to scale, you can then resize the images to various sizes (larger images for detecting smaller objects) and apply the detector.
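The resize-threshold-detect loop can be sketched as follows. `score_map` is a stand-in for the fully convolutional network's prediction, and the stride, window size, scales, and threshold are all assumed values; real code would also use an actual resize (e.g. `cv2.resize`) instead of the placeholder:

```python
# Hypothetical multi-scale detection sketch: score resized copies of the
# image, threshold each score map, and map detections back to
# original-image coordinates.
import numpy as np

STRIDE = 2    # assumed total downsampling factor of the network
WINDOW = 32   # assumed input size of the original classifier

def score_map(image):
    # Stand-in for the network's prediction: one toy score per window.
    h = (image.shape[0] - WINDOW) // STRIDE + 1
    w = (image.shape[1] - WINDOW) // STRIDE + 1
    return np.random.rand(h, w)

def detect_multiscale(image, scales=(0.5, 1.0, 2.0), threshold=0.99):
    detections = []
    for s in scales:
        # Placeholder for a real resize; larger scales find smaller objects.
        resized = np.zeros((int(image.shape[0] * s),
                            int(image.shape[1] * s), 3))
        scores = score_map(resized)
        for y, x in zip(*np.where(scores > threshold)):
            # Map the window center back to original-image coordinates.
            cy = (y * STRIDE + WINDOW / 2) / s
            cx = (x * STRIDE + WINDOW / 2) / s
            detections.append((cx, cy, WINDOW / s, scores[y, x]))
    return detections

dets = detect_multiscale(np.zeros((128, 128, 3)))
```

Each detection records a center, the effective window size at that scale, and the score, which is what a later non-maximum suppression step consumes.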

Typically you can base off an existing imagenet classifier, create a custom classifier by fine tuning it on your own data, and then use some image processing heuristics to get the most probable locations of objects in the image.

To get more accuracy, you might often need to add a new category to the classifier, which represents the background – a “not an object” category.

And now we enter an interesting zone of how to sample patches from the background.

Many windows in the image might contain only a small part of a ground-truth object, and some might contain fragments of two or more objects.

Our ideal detector should learn to distinguish many edge cases like these.

In the learning phase, many detectors like SSD, YOLO and even HOG, would sample random patches from the background, and use them as a “not an object” category.

Then, if there are false detections, "hard negative mining" is sometimes applied, and the problematic patches are given special attention.

Ideally we would want to use all the windows in the image, since they contain much more information than just a small random subset.

To do that, we would need a clever window-scoring function that can, for example, learn that although a window contains many details of an object, a different nearby window is probably responsible for the main part of the object.

After a sliding-window detection, non-maximum suppression is usually applied, typically with a greedy algorithm.

An example of such a greedy algorithm would be to sort the windows by their scores, and keep only the best-scoring windows that do not significantly overlap a window already kept.
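The greedy scheme just described can be sketched in a few lines (the 0.5 IoU threshold is an assumed, conventional value):

```python
# Greedy non-maximum suppression: keep each window, in score order,
# only if it does not significantly overlap an already-kept window.
def iou(a, b):
    # Boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[keep_j]) < iou_threshold
               for keep_j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping windows and one separate window:
# the weaker of the overlapping pair is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

Here `kept` is `[0, 2]`: window 1 overlaps window 0 with IoU of about 0.68, so it is dropped in favor of the higher-scoring one.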