For a primer on Neural Network concepts, please visit our first post in this series.

Over the past few years, we here at Condé Nast have invested heavily in building Machine Learning (ML) tools to help us understand our content and how our users interact with it. These efforts have mainly focused on Natural Language Processing (NLP) where we have created tools that can automatically detect topics, entities (e.g., people, organizations, products), and keywords in an article published by any of our brands. This information serves as useful building blocks for other tools to improve both the user and editorial experience.

In addition to textual content, our stories include vibrant videos and images. These visual media offer exciting frontiers where we can apply Machine Learning to enhance our experiences.

Why Handbags?

After considering a few ideas, we decided to prototype a handbag brand classifier. We decided to focus on handbags because they are objects from the fashion domain, where Condé Nast already has a significant presence. Furthermore, from a computer vision perspective, handbags are rather complex objects. Many brands have features that distinguish them visually. These features range from the more obvious (e.g., patterns, logos) to the less visible (e.g., textures, pockets, latches, straps). Indeed, a “human expert” can make a reasonably good prediction of the handbag's brand without having seen that exact model.


Approach

Data Collection

The data used here was collected from Instagram using hashtags as well as brand and fan pages. All images were reviewed manually before being added to the dataset. The data contains selfies and other amateur photos, white-background studio-style images, and professional fashion and runway shots. An image was allowed to contain more than one handbag, but since we did not perform any object detection, we included images with multiple handbags only if they were all of the same brand. In total, we collected a relatively balanced dataset of approximately 17,000 images across these seven brands and one negative class:

Coach (1786)

Gucci (1256)

Prada (1533)

Louis Vuitton (1643)

Marc Jacobs (1433)

Kate Spade (1256)

Chanel (1454)

No Handbags (6330)

As we iterated on training, we used the current version of the model to ‘pre-tag’ new images, which greatly sped up data collection and validation.

Training the Model

The model was implemented in TensorFlow running on an AWS p2.xlarge instance, which is equipped with a single Tesla K80 GPU. We used the Inception-v3 architecture, initialized from a model pre-trained on the ImageNet dataset. A better organization would likely have been to split the problem into two separate classification tasks: a binary classifier (handbag vs. no handbag) and, if a handbag is present, a multiclass brand classifier. For this prototype, however, we used a single-level classifier. Two hundred images from each class were held out from the training data and used for testing. Training a model took about one day on average.

We found that the best-performing model was produced by first training the full network on the classification task, then taking the representation produced by that model and retraining only the final classification layer. (The representation here refers to the last layer of the neural network before the classification layer, often called an embedding.) This improvement appears to be related to regularization of the model and could not be replicated by increasing the regularization of the final layer while training the full network. Perhaps this is due to compensation upstream in the trunk of the network when the full network is being trained.
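The second stage described above, retraining only the final classification layer on frozen embeddings, can be sketched as plain multinomial logistic regression. The NumPy code below is a minimal illustration, not the actual TensorFlow implementation; the random "embeddings" stand in for the 2048-dimensional vectors a trained Inception-v3 would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def retrain_final_layer(embeddings, labels, n_classes, lr=0.1, epochs=200):
    """Fit a softmax classification layer on frozen embeddings."""
    n, d = embeddings.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]          # one-hot targets
    for _ in range(epochs):
        P = softmax(embeddings @ W + b)    # predicted class probabilities
        W -= lr * (embeddings.T @ (P - Y)) / n  # cross-entropy gradient step
        b -= lr * (P - Y).mean(axis=0)
    return W, b

# Toy stand-in: 2048-d "embeddings" for 8 classes (7 brands + no-handbag).
X = rng.normal(size=(64, 2048))
y = rng.integers(0, 8, size=64)
W, b = retrain_final_layer(X, y, n_classes=8)
preds = softmax(X @ W + b).argmax(axis=1)
print("training accuracy:", (preds == y).mean())
```

Freezing the embedding means only the small `W` and `b` are updated, which is the regularizing effect discussed above.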

Results

Brand Detection

The results from the best model are presented below. The numbers in the confusion matrix are scores from the held-out test set, where the rows are the true labels and the columns the predicted labels.

These are the precision, recall, and F1 scores from the same data:

Where precision, recall, and F1 scores are defined as:

$$ precision = {{true\space positives}\over{true\space positives \space+\space false\space positives}} $$

$$ recall = {{true\space positives}\over{true\space positives\space+\space false\space negatives}} $$

$$ F_1 = 2 \space \cdot \space {{precision\space \cdot \space recall}\over{{precision\space + \space recall}}} $$
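The three formulas above can be computed per class directly from a confusion matrix whose rows are true labels and columns are predictions, matching the layout described in the Results section. The 2×2 example values here are made up for illustration.

```python
def per_class_metrics(cm):
    """Precision, recall, and F1 per class from a square confusion matrix."""
    n = len(cm)
    out = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        out.append((precision, recall, f1))
    return out

cm = [[180, 20],
      [10, 190]]
metrics = per_class_metrics(cm)
print(metrics)
```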

Object Localization and Color Detection

While the above demonstrates the feasibility of building a handbag brand classifier, we wanted to dig a bit deeper. One feature, in addition to the brand, that we might be able to extract from these images is the color of the bag. Since the bag rarely makes up the majority of the image, this is not a trivial task: we first need to locate where in the image the bag appears.

Current work at Condé Nast is focused on building models specifically for identifying objects in images (not just classifying the entire image) using specialized architectures and training data labeled at the object level (that is, an image is labeled with both the object and the coordinates of a bounding box indicating where in the image the object is located). However, such data is rare and expensive to generate, so we wanted to see whether we could leverage this image-level model to obtain object-level information, in this case by extracting colors from the detected handbags. In 2016, Zhou et al. published a method for locating the region of the image that provides the strongest signal for a given class, even with only image-level labels. We used this method coupled with a color-extraction algorithm to detect additional information about the handbag in the image. This effectively gives us a classifier that can detect both the brand and the color of a handbag.

Our approach thus involves two steps.

Object Localization - Identify the region of the image where the handbag is located.

Color Extraction - Extract the primary colors from that region.

Let’s walk through these two steps:

Object Localization

At a high level, we do this by tracing the signal backward from the output layer to the last layer of the Convolutional Neural Network (CNN) that still maintains the width and height dimensions of the image. To understand this in more detail, we need to take a look at the last few layers of our network.

The last convolutional layer is an $\mathbb{R}^{8\times8\times2048}$ tensor. In the figure above, each depth layer is depicted as a slice of the volume (only a few are shown for clarity). Each slice is average-pooled (i.e., we take the mean of the slice), transforming the $\mathbb{R}^{8\times8\times2048}$ tensor into an $\mathbb{R}^{2048}$ vector, where each of the 2048 dimensions represents the signal from one depth layer of the last convolutional layer. We then take the dot product of this vector with the weight matrix of the output layer to get the class scores, and use a softmax function to project the class scores onto the probability simplex for our classification task.
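The forward pass just described can be sketched in a few lines of NumPy. The shapes match the text (an 8×8×2048 volume, here with 8 output classes); the random values stand in for a trained network's activations and weights.

```python
import numpy as np

rng = np.random.default_rng(0)
conv = rng.normal(size=(8, 8, 2048))   # last convolutional layer activations
W_out = rng.normal(size=(2048, 8))     # output-layer weights: 8 classes

pooled = conv.mean(axis=(0, 1))        # global average pooling -> (2048,) vector
scores = pooled @ W_out                # dot product with output weights -> class scores
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax onto the probability simplex
print(probs.argmax(), probs.sum())
```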

If we now want to identify which region of the image elicited the response, we need to backtrack this process. That is, given a prediction, we can see what the contributions from the different dimensions of the fully connected layer were, and thus the relative contribution to the signal from each slice in the convolutional volume. We can then find which areas in the height and width dimensions gave the strongest signals by taking a weighted average of all the slices.

If we look down into the convolutional layer, we can find the region in the height × width dimensions that elicited the classification response, illustrated above by the red region. More formally, we can create a heatmap by backtracking from the output layer:

$$ \alpha =y^{T}W $$

$$ M = \mathrm{ReLU}\left ( \sum_{i=0}^{n}\alpha_{i}A_{i} \right ) $$

Where $y$ is the one-hot vector for the class prediction, $W$ is the weight matrix of the output layer, and $\alpha$ is the vector representing the contribution of each depth slice in the last convolutional layer. $A_{i}$ is the $\mathbb{R}^{8\times8}$ matrix for the $i$-th depth layer. Recall that each depth layer (aka filter) looks for specific features in the input; this analysis exploits the fact that our object (a handbag) will have similar features no matter where in the image it is positioned. Here is an example of a heatmap overlaid on the input image:
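The two equations above can be sketched as follows: the one-hot prediction selects the output-layer weights for that class, and the ReLU of the weighted sum of depth slices gives the heatmap. Random values again stand in for a trained network, and the class index is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8, 2048))  # last conv layer; A[:, :, i] is depth slice A_i
W = rng.normal(size=(2048, 8))     # output-layer weight matrix
y = np.eye(8)[3]                   # one-hot vector for the predicted class (index 3)

# alpha = y^T W: picks out the column of W for the predicted class -> (2048,)
alpha = W @ y

# M = ReLU(sum_i alpha_i * A_i): weighted sum of 8x8 slices, negatives clipped
M = np.maximum(0.0, np.tensordot(A, alpha, axes=([2], [0])))

heatmap = M / M.max() if M.max() > 0 else M  # normalize to [0, 1] for overlaying
print(heatmap.shape)
```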

Color Detection

Once we have located the region of the image containing the handbag, we can do the color extraction. This is not trivial: at the pixel level, the variance of colors is quite large even when we perceive a region as one relatively uniform color. We can do a reasonable job in the following two steps:

Segment the region into superpixels. Superpixels are clusters of pixels grouped by both color similarity and physical proximity; we then average the color within each superpixel.

Cluster the superpixels by color. Once we have the superpixels, we use K-means to find clusters by color similarity, then report the average cluster colors and the proportion of pixels assigned to each. The resulting cluster colors appear to be close to what our eyes perceive. See some examples below.
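The clustering step can be sketched with a tiny K-means over RGB values. This is only an illustration: the superpixel segmentation is skipped, and we cluster raw pixels of a synthetic region directly, which captures the same idea of reporting dominant colors and their proportions.

```python
import numpy as np

def kmeans_colors(pixels, k=3, iters=20, seed=0):
    """Cluster RGB pixels; return cluster mean colors and pixel proportions."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # assign each pixel to its nearest cluster center
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned pixels
        for c in range(k):
            if (labels == c).any():
                centers[c] = pixels[labels == c].mean(axis=0)
    props = np.bincount(labels, minlength=k) / len(pixels)
    return centers, props

# Synthetic "handbag region": mostly red pixels plus some near-black ones.
rng = np.random.default_rng(1)
red = np.tile([200.0, 30.0, 40.0], (80, 1)) + rng.normal(0, 5, (80, 3))
black = np.tile([10.0, 10.0, 10.0], (20, 1))
colors, props = kmeans_colors(np.vstack([red, black]), k=2)
print(colors.round(), props)
```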

In these examples, it is evident that the model is able to capture the color of the handbag (i.e., object-level information) even though no information about the object's location was explicitly available in the training data.

Summary

This research shows that we can leverage existing open-source CNN architectures to build domain-specific computer vision models that perform close to human-expert level. It also suggests that we can potentially use Condé Nast's vast multimedia dataset and its rich metadata to bootstrap object-level computer vision models.