I approached this task as a semantic segmentation problem. The goal of semantic segmentation is to identify objects in an image by making per-pixel classifications: every pixel is assigned the class of the object it belongs to.

This street image from the CamVid dataset is a standard example of how this works. Every pixel in the image has been labeled as belonging to some class of object, be it a tree, building, car or human.

The task then is to build a model that predicts the class of each pixel. For the purposes of detecting Waldo, there are only two classes for our images: Waldo, and not-Waldo.

The first step is to create these label images. I took 18 Where’s Waldo images from Valentino Constantinou’s collection, and created bounding boxes using labelImg.

Zoomed in for clarity.

This bounding box to the left is representative of the other 17: they are all squarely around Waldo’s head, and naturally include some of the surrounding background. Traditionally for semantic segmentation, we would want only pixels actually depicting Waldo to be labeled. Unfortunately I’m not familiar with any easy way to create pixel-by-pixel labels, nor did I think that effort necessary here.

Once I set these boxes, I built binary label images representing not-Waldo and Waldo. In general these look something like this:

Example label image, zoomed in. Purple — no Waldo, Yellow — Waldo
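Building these binary masks from the bounding boxes is straightforward. A minimal sketch (the function name and the `(x_min, y_min, x_max, y_max)` box format are my assumptions; parsing labelImg’s XML output is omitted):

```python
import numpy as np

def make_label_mask(height, width, box):
    """Build a binary label image: 1 (Waldo) inside the bounding box,
    0 (not-Waldo) everywhere else.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates --
    an assumed format; reading labelImg's annotation files is omitted.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    x_min, y_min, x_max, y_max = box
    mask[y_min:y_max, x_min:x_max] = 1
    return mask

# Toy example: a 10 x 10 image with a 4 x 4 Waldo box
mask = make_label_mask(10, 10, (3, 2, 7, 6))
```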

Great! Now I have inputs and targets, the essential ingredients for training our model. But before we get to that, we need to address a few problems:

Even with appropriate data augmentation (which is limited in this domain), 18 images is nowhere near enough data to train a neural network properly.

These images are far too large to load into memory, even on a Titan X.

Worse, downsampling makes these finely detailed images completely incomprehensible.

I addressed these problems by sampling 224 x 224 sub-images from the 18 originals online, i.e. dynamically during training:

With image sizes of 2800 x 1760 there are (2800 − 224 + 1) x (1760 − 224 + 1) ≈ 3.96 million crop positions per image; with random horizontal reflection and 18 images, this provides approx. 142 million unique sample images.

This image size is perfectly manageable resource-wise.

Waldo’s head is typically 60 x 60 pixels. Samples of size 224 x 224 are easily large enough to contain the local information needed to make accurate predictions.

The second point is crucial. Tiramisu is fully convolutional: it exploits local information only. This means that we can essentially train a network for the complete, high resolution images (our main goal) by training on carefully managed sample images. Of course the sample images will have edges that don’t exist in the whole image, but the effect is negligible.
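Concretely, the on-the-fly sampling can be sketched like this (a minimal numpy version under my own naming; the actual training pipeline details are omitted):

```python
import random
import numpy as np

CROP = 224  # sample size used throughout

def sample_crop(image, label, rng=random):
    """Draw one random CROP x CROP window from an (image, label) pair,
    with a 50% chance of horizontal reflection.

    `image` is an H x W x C array, `label` the matching H x W mask.
    """
    h, w = label.shape
    top = rng.randrange(h - CROP + 1)
    left = rng.randrange(w - CROP + 1)
    img = image[top:top + CROP, left:left + CROP]
    lbl = label[top:top + CROP, left:left + CROP]
    if rng.random() < 0.5:
        img = img[:, ::-1]   # flip the width axis
        lbl = lbl[:, ::-1]
    return img, lbl
```

Counting the possible `(top, left)` positions for a 2800 x 1760 image, doubled by the flip and multiplied across 18 images, is where the ~142 million figure comes from.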

The four samples furthest from the center of a “Waldo Image”; each completely contains Waldo.

About carefully managing sampling: the huge number of possible sample images and the small batch size (~6 on a Titan X) meant I had to make sure each batch contained an appropriate number of images containing Waldo.

In order to do this I isolated 18 “Waldo Images”, one for each original. These images are constructed so that every random 224 x 224 sample from them contains a complete Waldo.
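One way to construct such a region (my own sketch, not necessarily the exact construction used): every 224 x 224 window of the strip running from `x_max − 224` to `x_min + 224` (and likewise vertically) necessarily contains the whole bounding box, since the leftmost window ends exactly at `x_max` and the rightmost starts at `x_min`.

```python
def waldo_image_bounds(box, img_w, img_h, crop=224):
    """Return the (left, top, right, bottom) region of the original image
    such that *every* crop x crop window inside it contains the full box.

    Assumes the box is smaller than `crop`; near the image border the
    region is clipped, leaving fewer valid windows.
    """
    x_min, y_min, x_max, y_max = box
    left = max(x_max - crop, 0)
    top = max(y_max - crop, 0)
    right = min(x_min + crop, img_w)
    bottom = min(y_min + crop, img_h)
    return left, top, right, bottom

# A 60 x 60 head at (500, 300)-(560, 360) in a 2800 x 1760 image
print(waldo_image_bounds((500, 300, 560, 360), 2800, 1760))
# -> (336, 136, 724, 524)
```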

I also made sure that when sampling from the full images, I omitted any samples that contained Waldo, partially or otherwise. This ensured that full-image sampling produced only negatives, and avoided forcing the network to learn from unhelpful, incomplete positives (e.g. it wouldn’t be useful to learn from an image containing only the tip of Waldo’s hat).
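Filtering those samples out amounts to an axis-aligned overlap test between the candidate crop and each Waldo bounding box (a sketch with my own naming):

```python
def overlaps(crop_box, waldo_box):
    """True if two (x_min, y_min, x_max, y_max) boxes intersect at all.

    Used to reject any full-image sample that contains Waldo even
    partially, so full-image sampling yields only clean negatives.
    """
    ax0, ay0, ax1, ay1 = crop_box
    bx0, by0, bx1, by1 = waldo_box
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

# A crop clipping the corner of a Waldo box gets rejected...
print(overlaps((0, 0, 224, 224), (200, 200, 260, 260)))  # True
# ...while one that merely shares an edge is kept.
print(overlaps((0, 0, 224, 224), (224, 0, 448, 224)))    # False
```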