Product categories are the structural backbone of every online shop, but ensuring that all products are assigned to the correct categories can be quite a nightmare for e-commerce managers. The set of available categories is typically large (Amazon lists over 50,000), changes constantly, and new products have to be added on a daily basis. Mistakes can be costly: miscategorized products not only look confusing and unprofessional, they also cannot be sold if customers are unable to find them.

Product Categories for a Fashion Website

To improve the process of product categorization, we looked into methods from machine learning. Our goal was to develop a machine learning system that can predict which categories best fit a given product, in order to make the whole process easier, faster and less error-prone. In this blog post, I am going to walk you through the problems we faced on the way and how we decided to solve them.

Challenges

From the perspective of machine learning, there are some unique challenges to the problem of predicting product categories:

- The class set is very large. Machine learning applications typically only have to choose between a few selected classes (e.g. classifying an email as spam or not spam), but in e-commerce there are often hundreds or thousands of categories to predict. To train robust models for these cases, you need a particularly large amount of training data.
- Product data is diverse and unbalanced. One product can have a detailed set of attribute data that is completely missing in another product (e.g. color, size or material of a t-shirt vs. expiration date, fat content or volume of milk). Taking into account all available product variables would lead to an explosion of missing values, which makes model convergence that much harder. To handle this problem, we decided to keep it simple and only use product names, images and descriptions as our predictor variables, because they are available for the majority of products and arguably carry the most important information.
- Every online shop has a unique category structure. At commercetools, we offer a cloud-based API to manage commerce platforms (dealing with processes related to products, orders, customers, carts, payments, etc.). Our API is designed with flexibility in mind, so our customers operate in very different industries (such as fashion, groceries, agriculture, home supplies, winter sports or jewelry). When building our machine learning system, we need to make sure that only categories relevant to a specific shop are being recommended.

One way to handle this problem would be to train separate models for the products and categories of each online shop, but there are a bunch of problems with this approach: Some stores might not have enough product data for models to converge, classes are more likely to be unbalanced (e.g. a fashion shop might have a lot of data for the category “t-shirts” but only a few cases for “bandanas”), models need to be retrained frequently to account for newly added categories, and the infrastructure to handle all these models and their different versions in production can get quite complicated.

For these reasons, we took a different approach and trained one general-purpose model that covers a broad range of categories and can be used by all of our customers. To map these general categories to store-specific categories, we use a separate machine learning model that quantifies similarities between words (i.e. to identify that the general category “jeans” corresponds to the category “Fashion > Men > Jeans” in store A, but “Clothing > Pants” in store B; more on that later). With this approach, we lose some accuracy for categories that are very customer-specific, but we gain more flexibility, more data to handle unbalanced classes, better support for smaller stores, and lower maintenance costs.

To recap our goal: We want to build a machine learning system that predicts a broad range of product categories from names, images or descriptions. These categories are then mapped to store-specific categories through a separate machine learning model. Our main coding language to build this system is Python.

Class Set

Defining a good class set for this problem is a bit of an art form. If the set is too small, you might miss some categories that are important for a specific store. But if the set is too large, prediction accuracies will drop significantly. We tested both larger and smaller sets and ended up with a set of 723 categories in our current version. The set is composed of rather broad terms, typically consisting of just one or two words. Here are some examples:

Examples of Model Categories

When we started, we tried to predict as many categories as we could imagine to ensure good coverage. But it is important to realize that our customers are not evenly distributed across the entire category landscape; they are more like clustered islands of particular industries. Even though it goes against the very nature of every data scientist, we found that it is actually a good idea to “overfit” our models to these islands, since these are the actual use cases of our customers. When we get customers from entirely new industries, we can still adapt the class set, so that our models can evolve together with the needs of our customers.

Now that we know what we want to predict, we can start looking at the variables we want to use for the predictions.

Image Classifier

Examples of Product Images

When it comes to image classification, convolutional neural networks are undoubtedly the gold standard, mainly because of their ability to identify and combine low-level features (lines, edges, colors) into more and more abstract features (squares, circles, objects, faces). We find a similar mechanism in the visual cortex of the human brain, so this algorithm must be doing something right. Since the breakthrough of convolutional neural networks in 2012, researchers have consistently improved their architectures in the context of the annual ImageNet competition, which is something like the Olympics of computer vision. In 2017, the winning contribution reached an impressive top-5 accuracy of ~97.8%.

Remember when I told you that we are going to need a particularly large amount of data to train a robust model for all these product categories? We are going to cheat a little here with an approach called transfer learning. Training a state-of-the-art convolutional neural network from scratch takes a lot of time, data and computational resources, so we are just going to take a neural network (Inception v3) that has already been pre-trained on a large image dataset. This model has already learned a lot about extracting and combining image features, but it does not yet know which categories we want to have predictions for in our particular use case. To bend the network to our needs, we simply cut off the final classification layer, add a new layer with 723 units corresponding to our product categories, and then retrain only these weights with our own dataset, while all the other model weights are frozen. This allows us to build robust, customized classifiers with relatively little effort and data, in our case with a dataset of ~130,000 images (~100–200 for each category).
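Conceptually, this transfer-learning setup can be sketched with tf.keras (our actual training code was modified from a TensorFlow example; `weights=None` is used here only to keep the sketch self-contained instead of downloading the pre-trained ImageNet weights):

```python
import tensorflow as tf

# Load Inception v3 without its final classification layer.
# In practice you would pass weights="imagenet" to get the pre-trained
# features; weights=None keeps this sketch offline.
base = tf.keras.applications.InceptionV3(
    weights=None, include_top=False, input_shape=(299, 299, 3)
)
base.trainable = False  # freeze all pre-trained weights

# Add a new classification head with 723 units, one per category.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(723, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Only the weights of the final Dense layer are updated during training, which is what makes the approach feasible with a comparatively small dataset.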

Architecture of the Neural Network Inception v3

We used the library TensorFlow in Python to implement this approach and ran model training on the Google Cloud ML Engine (with modified code from this example). The code we wrote on top of that mainly deals with downloading images from URLs, converting them to JPEG, rescaling them, sorting them into subfolders for each category, removing duplicates or invalid files, and uploading bottlenecks and trained models to Google Storage. There is still quite some manual work involved in double-checking all the images, because unfortunately not all images assigned to a product are representative of a category that you want your network to learn (images of usage instructions, generic company logos, low-quality images, etc.).

Text Classifier

Next, we built our classifier for product names.

Examples for Product Names

After removing duplicates and uninformative names and balancing our dataset, we ended up with ~230,000 samples (~300 for each category). To make the names easier to deal with and reduce their dimensionality, we first run them through a preprocessing pipeline (mainly using the libraries re and spacy):

- Lowercasing all letters.
- Removing punctuation and special characters (like *, | or .). We keep hyphens to preserve information in cases like “t-shirts”.
- Removing stopwords (the, and, in, etc.), because we do not expect them to have much predictive value.
- Lemmatizing words (≈ finding word stems) to remove variance from word inflection (i.e. we want our model to know that “apples” and “apple” refer to the same thing).
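A minimal sketch of this pipeline, using only re (in production the lemmatization is done with spacy; the stopword set and lemma lookup below are toy stand-ins):

```python
import re

STOPWORDS = {"the", "and", "in", "of", "a", "for", "with"}  # toy subset
LEMMAS = {"apples": "apple", "t-shirts": "t-shirt"}  # stand-in for spacy's lemmatizer

def preprocess(name: str) -> str:
    text = name.lower()                               # 1. lowercase
    text = re.sub(r"[^a-z0-9\s-]", " ", text)         # 2. drop punctuation, keep hyphens
    tokens = [t for t in text.split() if t not in STOPWORDS]  # 3. remove stopwords
    tokens = [LEMMAS.get(t, t) for t in tokens]       # 4. lemmatize (toy lookup)
    return " ".join(tokens)

print(preprocess("Red Apples & T-Shirts for the Summer!"))
# → "red apple t-shirt summer"
```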

We experimented with removing very short words (1–3 letters) and automatic spelling correction, but excluded these steps in the end because they did not lead to a better performance.

After that, we want to convert our preprocessed text samples into numbers, because this is the only language that machine learning models can work with. We tested several methods to accomplish this:

- Bag-of-words: Each sample is converted to an n-dimensional vector corresponding to the set of unique words in the dataset, with values of the respective word frequencies in the current sample. Easy to do, but it ignores syntax and leads to very sparse vectors (≈ a high-dimensional space with a lot of zeros), which complicates model training.
- TF-IDF (term frequency–inverse document frequency): Similar to bag-of-words, but weighs word occurrences in a text sample higher when the words are rare in the rest of the dataset, since these words are likely to be more descriptive of the sample. Further, words with a high overall frequency in the dataset can be excluded from the lexicon. As a result, both the impact of non-informative words and the dimensionality of the vector space can be reduced.
- Word2Vec: Solves the sparsity problem by training a two-layer neural network that predicts the context for a given word (i.e. the word “Nike” will appear more often next to the word “shoes” than “bananas”). This is more complex to compute, but manages to create a low-dimensional text representation that encodes subtle semantic similarities between words and is easier for classifiers to train on.
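For illustration, TF-IDF vectorization with scikit-learn's TfidfVectorizer on a few toy product names (the max_df value is illustrative, not our production setting):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

names = [
    "nike running shoe",
    "adidas running shoe",
    "organic whole milk",
]

# max_df excludes words that appear in more than 90% of the samples,
# dropping non-informative high-frequency terms from the lexicon.
vectorizer = TfidfVectorizer(max_df=0.9)
X = vectorizer.fit_transform(names)

print(X.shape)                       # (number of samples, lexicon size)
print(sorted(vectorizer.vocabulary_))  # the words kept in the lexicon
```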

We achieved the best results with TF-IDF. Even though Word2Vec definitely outperforms TF-IDF in tasks that involve complex semantic relationships between text samples, it is overkill for our use case, since product names are rather simplistic and have barely any syntax in them.

After preprocessing and vectorization, we can finally build our actual text classifier. We tested prediction accuracies for a range of machine learning models in the library scikit-learn: Naive Bayes, Logistic Regression, k-Nearest Neighbors, Random Forests, Support Vector Machines and Gradient Boosting. Logistic Regression with TF-IDF vectorization performed best, which shows yet again that you do not always need the most sophisticated and complex techniques to achieve the best results.
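The winning combination can be sketched as a scikit-learn pipeline (the product names, categories and test input below are made up for illustration; the real model is trained on ~230,000 names across 723 categories):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: preprocessed names and their category labels.
names = ["nike running shoe", "leather hiking boot",
         "organic whole milk", "cheddar cheese block"]
labels = ["shoes", "shoes", "dairy", "dairy"]

# TF-IDF vectorization followed by Logistic Regression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(names, labels)

print(clf.predict(["trail running shoe"]))        # most likely category
print(clf.predict_proba(["trail running shoe"]))  # class probabilities
```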

We used the same setup to develop a classifier for product descriptions. Even though descriptions have more complex syntax than product names, the same combination of Logistic Regression and TF-IDF achieved the highest accuracies.

Category Matching

Now, we have classifiers for images, names and descriptions that generate probabilities for our 723 categories. How do we integrate the predictions from the different models? You could get fancier here and train a so-called ensemble model, which is basically a higher-order machine learning model that takes as its input the output of other models. But for our purposes, it was enough to compute the mean of the class probabilities of each model to generate our final predictions.
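In code, the averaging step is as simple as it sounds (the probabilities below are hypothetical, shown for only 4 of the 723 categories):

```python
import numpy as np

# Hypothetical class probabilities from the three classifiers
# for one product.
p_image = np.array([0.70, 0.20, 0.05, 0.05])
p_name  = np.array([0.60, 0.30, 0.05, 0.05])
p_desc  = np.array([0.50, 0.25, 0.15, 0.10])

# Final prediction: the per-category mean across the three models.
p_final = np.mean([p_image, p_name, p_desc], axis=0)
print(p_final)           # averaged class probabilities
print(p_final.argmax())  # index of the most likely category
```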

We can now get general category predictions, but we still need a mechanism to match these categories to store-specific categories, so that we do not bombard our customers (i.e. different online shops that use our API) with category recommendations that they have not defined. If our model says a product belongs to the category “bracelets”, we need to know which categories in the store that made the request this corresponds to, since there can be quite some variance. For this task, we used the library gensim to train a Word2Vec model on a large corpus of Google News articles, which is commonly used to estimate word similarity. Since it can take a while to compute the similarity between all model categories and all store categories with this model, we precompute these similarities and store them in a database, which is updated every night.

Like the class probabilities, category similarities are also scaled to the range between 0 and 1, and we count every value above 0.6 as “similar enough” to match a model category to a store-specific category. To account for the variance in the similarities between matches, we multiply the probabilities of our class predictions with these similarity scores to quantify our confidence in the final category predictions.
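A sketch of the matching step (the store categories and similarity scores are made-up examples; only the 0.6 threshold and the multiplication come from our actual procedure):

```python
# Hypothetical precomputed Word2Vec similarities between the model
# category "bracelets" and a store's own categories.
similarities = {
    "Jewelry > Bracelets": 0.92,
    "Jewelry > Necklaces": 0.64,
    "Accessories": 0.41,
}

model_category_prob = 0.80  # classifier probability for "bracelets"
THRESHOLD = 0.6             # minimum similarity to count as a match

# Keep matches above the threshold and weight the class probability
# by the similarity score to get the final confidence.
confidences = {
    store_cat: round(model_category_prob * sim, 3)
    for store_cat, sim in similarities.items()
    if sim > THRESHOLD
}
print(confidences)
# → {'Jewelry > Bracelets': 0.736, 'Jewelry > Necklaces': 0.512}
```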

We achieved our highest accuracies above 90% for stores in the fashion or jewelry industries, whereas our lowest accuracies of 70–80% were for stores in the grocery and home supplies industries, which mainly stems from the fact that the latter industries have a significantly larger category set and a lot more diversity in their data.

API

To expose our application, we wrote an HTTP API with the library Flask. We have two endpoints, one for general model predictions and one for store-specific predictions. The general endpoint is mainly used to test the behavior of our classifiers for different images, names or descriptions:

API endpoint for general category recommendations

In contrast, our store-specific endpoint has the following workflow: First, it takes as input parameters the project key of the store and the id of the product that needs category recommendations. We then look for data on images, names or descriptions in our database and pass it to our machine learning classifiers. After that, the model predictions are matched to the store-specific categories and the ids of the most likely categories are returned.
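A stripped-down sketch of such an endpoint in Flask (the route, parameter names and stubbed predictions are illustrative, not our actual API contract):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/recommendations/<project_key>/<product_id>")
def recommend(project_key, product_id):
    # 1. Look up the product's images, names and descriptions (stubbed here).
    # 2. Run the classifiers and average their class probabilities.
    # 3. Match the general categories to the store's own categories.
    predictions = [
        {"categoryId": "cat-123", "confidence": 0.42},
        {"categoryId": "cat-456", "confidence": 0.17},
    ]
    return jsonify(predictions)
```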

API endpoint for project-specific category recommendations

Confidence scores can get rather low due to the large class set and the similarity matching procedure, but the important part is the relative difference between confidence scores to identify the most relevant categories.

We use the store-specific API to generate recommendations in our user interface called the merchant center, where our customers have access to a range of features to manage their commerce platforms.

User Interface Integration of the API

The feature is currently in the beta testing phase and the API documentation can be found here. We look forward to iteratively adapting the feature through feedback from our customers, since there are many knobs in our development pipeline that we can fine-tune to improve the application.