Product Matching with Deep Learning

Prerequisites

Although this won't be a very technical post, it helps to know a few things about:

The Problem

At Cimri.com, we gather and aggregate data from over 400 e-commerce sites operating in Turkey, aiming to show users the best available prices for any given product along with its price history. To do that, we need a system that understands how to match different spellings of a given product and aggregate them, so we can show them on the corresponding product page. What do we mean by that?

Imagine that you want to create a single web page for the iPhone XR and show the user which e-commerce sites are selling it and for how much. Unfortunately, you can't just crawl other sites and group the same products by their titles. These e-commerce sites are developed and operated by different people, and all of these people have a different understanding of how to write "iPhone XR". Some think it's necessary to add that the product is a smartphone, some think it's important to put the product's technical specifications in the title, and in other cases the warranty type differs from the more common offers and the site feels obliged to mention that in the title, and so on. So how can one tell that "iPhone XR 64 GB Space Grey" and "iPhone XR (24 months Apple Warranty)" both describe an iPhone XR?

The “Simple” solution

Our old approach to this problem used a word tokenizer. A system called MechOp tokenized the titles, removed the stop words and checked how many tokens the two titles shared. In this system, after two titles get tokenized, we check:

Whether all of the tokens are the same, and if not, whether one title contains all the tokens that the other possesses.

If there are tokens that differ between the two titles, whether they are significant (we do some level of statistical analysis to decide that), or whether an operator (a real person who manually checks if two titles are a match) has defined a synonym or a rule for the token that we can use.

Whether EAN or ISBN codes are available to match these products automatically.

A system like this obviously does not understand what each token means or what the title describes, but it is relatively simple to quickly search for possible match candidates and leave the decision about whether they are the same to operators. Another benefit of such a system is that it can store and reuse previous decisions: two different websites can write exactly the same title for the same product, so if we match one of them, the other will be matched automatically.
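To make the idea concrete, here is a minimal sketch of the token-overlap check. The stop-word list, the tokenizer and the return values are simplified placeholders; the real MechOp also applies synonyms, operator rules and the statistical significance checks mentioned above:

```python
# Simplified token-overlap matching, in the spirit of MechOp.
STOP_WORDS = {"ve", "ile", "the", "a"}  # assumed stop-word list

def tokenize(title):
    # Lowercase, split on whitespace, drop stop words.
    return {t for t in title.lower().split() if t not in STOP_WORDS}

def candidate_match(title_a, title_b):
    a, b = tokenize(title_a), tokenize(title_b)
    if a == b:
        return "match"          # identical token sets
    if a <= b or b <= a:
        return "match"          # one title contains all tokens of the other
    return "needs_operator"     # leftover tokens: significance check / human

print(candidate_match("iphone xr 64 gb", "iphone xr 64 gb smart phone"))
# -> "match"
```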

Another approach would be using regexes for known patterns. We have roughly 2.5k different product categories in our system, and each of them needs some degree of domain expertise to figure out which products to match. Fortunately, the products are not distributed evenly between these categories. For example, wrist watches cover nearly half of the product offerings in our system, while the gaming consoles category has around 90 products. So we focused on the larger categories and checked whether we could discover some basic patterns to exploit. One way to achieve this is word clouds. When we visualize the tokens that occur most often between similarly written but different product titles in perfumes, we see something like this:

When we applied this technique to multiple categories, we found that in a few categories there is not much else to consider other than volume and size measurements and gender words. We utilized this to reduce the error rate of our system, and even to let the system decide on its own in some categories, reducing the manpower needed for this task.
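For example, a couple of hypothetical patterns for the perfume category could extract the attributes that actually distinguish offers, so two differently written titles can be compared on those alone (the patterns below are illustrative, not our production rules):

```python
import re

# Hypothetical perfume-category patterns: volume in millilitres and gender.
VOLUME = re.compile(r"(\d+)\s*ml\b", re.IGNORECASE)
GENDER = re.compile(r"\b(kad[ıi]n|erkek|women|men)\b", re.IGNORECASE)

def perfume_key(title):
    vol = VOLUME.search(title)
    gen = GENDER.search(title)
    # A real system would also map synonyms (e.g. "kadın" -> "women").
    return (vol.group(1) if vol else None,
            gen.group(1).lower() if gen else None)

# The same 100 ml women's perfume, written two different ways:
print(perfume_key("Marka X EDT 100 ml Kadın Parfüm"))    # ('100', 'kadın')
print(perfume_key("Marka X Eau de Toilette Women 100ml"))  # ('100', 'women')
```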

At the end of the day,

We didn't solve the problem: we still had around 10 to 20 operators working to match up to 50k products every working day, and if a new offer arrived on our platform over the weekend, it would probably have to wait until Monday to be published. But all these attempts gave us precious insights about the problem and, most importantly, a data set to test our more complex ideas on.

Enter “Deep Learning”

What we tried to come up with was a model that would compare two sentences and, hopefully, give us a score of how semantically similar they are. It was not a brand new research area, and many people had already come up with novel ideas that work pretty well on data sets like the Stanford Natural Language Inference Corpus or Quora Question Pairs. We started by experimenting with previously published ideas. But before all that, we needed to check and clean our data set, since if you give bad data to any machine learning model, it will perform badly.

First of all, what we had was matched pairs of products. If we wanted to teach our models anything, we had to find unrelated product titles that would make good negative examples. We could simply pair unrelated products randomly, but that would create a data set that was too easy for our model to fit, since the negative samples would be very different if not completely different, while the positives would share a lot of tokens. To overcome this, we used Apache Solr to search for titles that belong to different products but are written similarly. We also added measures to decrease the probability of picking up the same product that we should have matched before but couldn't, such as skipping the first few search results depending on the category.
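A rough sketch of that hard-negative search, assuming a pysolr client and a hypothetical products core with title and product_id fields (the real query, escaping and the number of skipped results vary per category):

```python
import pysolr

# Assumed Solr URL and core; not our actual deployment.
solr = pysolr.Solr("http://localhost:8983/solr/products")

def hard_negatives(title, product_id, skip=3, rows=10):
    """Find titles written similarly to `title` but belonging to other products."""
    results = solr.search(f"title:({title})", rows=skip + rows)
    negatives = []
    for i, doc in enumerate(results):
        if i < skip:
            continue  # skip top hits: they may be true duplicates we failed to match
        if doc["product_id"] != product_id:
            negatives.append(doc["title"])
    return negatives
```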

Another important point is that hand-curated data sets are not error-proof, and ours was no exception. We later figured out that an operator's error rate can go up to 3%. So we applied measures similar to the ones we used to increase the quality of our negative samples to our positive samples as well, to be a little safer.

After some trial and error, we came up with our own network architecture, which looked something like this:

The model consisted of an embedding layer that learns character embeddings for every input character sequence on the fly, two siamese layers of one-dimensional convolutions, a concatenation layer for the outputs of the convolutional layers, and feed-forward layers that connect to a single sigmoid output. The output neuron gives a similarity score between 0 and 1, where 1 means the two titles are the same. The results were pretty promising.
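A minimal Keras sketch of this kind of siamese character-level CNN; the vocabulary size, sequence length, filter counts and kernel sizes below are assumptions for illustration, not our production values:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 100  # assumed character vocabulary size
MAX_LEN = 120     # assumed maximum title length in characters
EMB_DIM = 32      # assumed character embedding size

def build_encoder():
    # Shared (siamese) branch: character embedding + stacked 1D convolutions.
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.Conv1D(128, 3, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return Model(inp, x)

encoder = build_encoder()

title_a = layers.Input(shape=(MAX_LEN,), dtype="int32")
title_b = layers.Input(shape=(MAX_LEN,), dtype="int32")

# Both titles pass through the same encoder (shared weights).
vec_a, vec_b = encoder(title_a), encoder(title_b)

# Concatenate the two encodings and feed them to dense layers.
merged = layers.Concatenate()([vec_a, vec_b])
hidden = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(hidden)  # similarity in [0, 1]

model = Model([title_a, title_b], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```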

We then tested the model at different thresholds for every category, using our operators' logs (recorded after we generated our data set), which contained all the approvals and declines for the match suggestions in our system.

The graphic above shows the distribution of error rates among categories for our previous system (MechOp) and our new deep learning model (CharCNN). Each bar shows the number of categories in which our new model (orange bars) or our old system (blue bars) has the error rate shown on the X axis. As you can see, the error rate of our new system is close to zero in many categories!

And if we choose to match only the pairs where our model is confident (meaning the output score is 1), the overall error rate decreases even more, but the trade-off is fewer decisions made by the model (which means more workload for the operators). The chart on the left shows how the error rate and coverage change as the decision threshold goes to 1; coverage is the blue line and error the orange line. Choosing 1 as the decision threshold decreases coverage to close to 40%, but it also pushes the error rate very close to 0 in all of the categories.
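The sweep itself is straightforward to reproduce; here is a sketch with randomly generated placeholder scores and labels standing in for the real model outputs and operator decisions:

```python
import numpy as np

# Illustrative only: replace with real model scores and operator labels.
rng = np.random.default_rng(0)
scores = rng.random(10_000)                                        # model scores
labels = (scores + rng.normal(0, 0.2, 10_000) > 0.5).astype(int)   # noisy truth

def coverage_and_error(scores, labels, threshold):
    decided = scores >= threshold            # pairs the model matches on its own
    coverage = decided.mean()                # fraction of pairs auto-decided
    error = (labels[decided] == 0).mean() if decided.any() else 0.0
    return coverage, error

for t in (0.5, 0.7, 0.9, 0.99, 1.0):
    cov, err = coverage_and_error(scores, labels, t)
    print(f"threshold={t:.2f}  coverage={cov:.1%}  error={err:.2%}")
```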

Next steps

So we now had a pretty good model that can even surpass human performance in some categories. The next step was to release it. Since the model was developed with Keras, we built a Python/Flask application (which also has a UI written with Vue.js where people can check the similarity scores of two titles) that grabs the latest trained model from AWS S3 and serves a similarity endpoint for other services to consume.

But this is not conventional software; the main logic in this service is a deep learning model. That means we don't fully know what is happening inside it, and we can't debug it like any other piece of code. So we created match lists containing product pairs that the model would match once released, to see if there were edge cases we should consider. For example, a PlayStation 4 bundle can be considered the same thing as a PlayStation 4 if the user doesn't care about the extras. Operators seem to have thought that way at some point, and thus involuntarily generated samples that taught the model not to bother with the extras. But after some time they seem to have changed their minds, and now the model's output is considered wrong. Edge cases like this arise easily: a model like this only learns general knowledge about the domain, while such cases are usually the result of verbal discussions, are undocumented and, in short, not easy to discover. So it is a good idea to check the results manually to prevent unpleasant surprises. On a platform like ours, where the popularity of each product varies, wrong matches on popular products would render all the validation and test scores pointless in end users' eyes. After consulting with the operators and some more testing on our side, we released the model for one third of our categories.
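The serving layer itself can be very small. Here is a hedged sketch; the bucket name, object key and the encode_titles preprocessing helper are hypothetical stand-ins, not our actual service:

```python
import boto3
from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)

# Hypothetical bucket and key: fetch the latest trained model from S3 at startup.
s3 = boto3.client("s3")
s3.download_file("models-bucket", "charcnn/latest.h5", "/tmp/model.h5")
model = keras.models.load_model("/tmp/model.h5")

@app.route("/similarity", methods=["POST"])
def similarity():
    payload = request.get_json()
    a, b = payload["title_a"], payload["title_b"]
    # encode_titles() stands in for the same char-level preprocessing used in
    # training; it must return the two integer sequences the model expects.
    score = model.predict(encode_titles(a, b))[0][0]
    return jsonify({"score": float(score)})
```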

The work does not end after the release. As I mentioned earlier in this post, every e-commerce site can set different titles for its products. Their ideas of how to name a product can also change, and there is nothing stopping any e-commerce company from adding product specifications or a campaign text to all of its products overnight. This is a common problem for any machine learning model deployed in production: the data the model faces can be dramatically different from the data used to train it. Having a system that compares the incoming data to the training data and can catch any dramatic change can be a life saver. So we implemented a Spark job that:

Checks the distribution of Jaro-Winkler similarity scores of the latest matches and compares it to the distribution of the previous matches (as sketched below). If it is significantly different for a company we work with, it sends a warning e-mail to the team.

Does data enrichment on the logs while checking the distribution, and persists the logs in an S3 bucket that we occasionally query with Athena and visualize with Redash, so that all stakeholders can see what's going on in our system.
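A rough PySpark sketch of the first check, assuming match logs stored as Parquet with hypothetical merchant/title/date columns and using the jellyfish library for Jaro-Winkler scores (the e-mail alerting and the proper statistical comparison of the distributions are omitted):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
import jellyfish  # provides jaro_winkler_similarity

spark = SparkSession.builder.appName("match-drift-check").getOrCreate()

# UDF scoring how similar the two matched titles are.
jw = F.udf(lambda a, b: jellyfish.jaro_winkler_similarity(a, b), DoubleType())

# Hypothetical log schema: (merchant, title_a, title_b, matched_at)
matches = spark.read.parquet("s3://match-logs/matches/")
scored = matches.withColumn("jw_score", jw("title_a", "title_b"))

# Mean Jaro-Winkler score per merchant, last week vs. history; a sharp
# change suggests a merchant changed how it writes its titles.
cutoff = F.date_sub(F.current_date(), 7)
recent = (scored.filter(F.col("matched_at") >= cutoff)
                .groupBy("merchant").agg(F.avg("jw_score").alias("recent_avg")))
history = (scored.filter(F.col("matched_at") < cutoff)
                 .groupBy("merchant").agg(F.avg("jw_score").alias("hist_avg")))

drift = (recent.join(history, "merchant")
               .withColumn("delta", F.abs(F.col("recent_avg") - F.col("hist_avg"))))
drift.filter("delta > 0.05").show()  # candidates for a warning e-mail
```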

The results

Our model now matches between 12k and 27k products a day, depending on the incoming data, with an error rate of around 0.1%.

Only 3 operators are left manually matching products.

What's next

We haven't used images in our new system, and we are now working on how to utilize our image data to improve our product matching system.

Summary