To actually visualize the data (i.e. represent each product as a mappable vector), we can only use the reviews for each product. Thus, we are dealing with a many-to-one embedding: many reviews map to one product. After filtering out ASINs with fewer than 10 reviews, we end up with 97,249 unique ASINs and 6,875,530 reviews.
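For reference, a minimal sketch of that filtering step, assuming the reviews sit in a pandas DataFrame data with an asin column (the same DataFrame used for indexing further below):

import pandas as pd

# Keep only products (ASINs) that have at least 10 reviews.
counts = data["asin"].value_counts()
data = data[data["asin"].isin(counts[counts >= 10].index)]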

We don’t perform any kind of preprocessing on the textual data.

Why? Because we don’t need to: FastText vectors handle out-of-vocabulary and misspelled words via subword information, and the SIF weighting introduced below already downweights uninformative high-frequency words such as stopwords.

From Review to Product Embedding

To obtain an embedding for each review, we first need some kind of pre-trained word embedding. Reviews are likely to contain many unknown words, which FastText handles gracefully through its subword information. Luckily, fse offers support for FastText models out of the box. We first load the publicly available FastText model:

from gensim.models.keyedvectors import FastTextKeyedVectors

# Pre-trained FastText vectors (Common Crawl, 300 dimensions, 2M words).
ft = FastTextKeyedVectors.load("../models/ft_crawl_300d_2m.model")
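If you don’t have this .model file yet, one way to create it (a sketch, assuming gensim >= 3.8 and the official crawl-300d-2M-subword.bin release from fasttext.cc) is gensim’s Facebook-format loader:

from gensim.models.fasttext import load_facebook_vectors

# Convert Facebook's binary release into gensim's format and save it,
# so it can be loaded as shown above.
ft = load_facebook_vectors("crawl-300d-2M-subword.bin")
ft.save("../models/ft_crawl_300d_2m.model")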

Next, we instantiate the SIF model from fse.

from fse.models import SIF

model = SIF(ft, components=10, lang_freq="en")

The number of components is set to 10: SIF removes the first principal components from the averaged sentence vectors, and 10 is the value reported in the STS benchmark reproducibility section.

Notice the lang_freq argument. Some pre-trained embeddings do not contain information about the word frequencies in a corpus. fse supports inducing word frequencies for pre-trained models in multiple languages, which is essential for the SIF and uSIF models (this may take a while, depending on the size of your vocabulary).
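To see why the frequencies matter, here is a toy sketch of the SIF weighting idea from Arora et al., not fse’s actual implementation (vectors and freq are hypothetical word-to-vector and word-to-relative-frequency lookups): each word is weighted by a / (a + p(w)) before averaging, and fse afterwards removes the first principal components from the resulting matrix.

import numpy as np

def sif_average(tokens, vectors, freq, a=1e-3):
    # Keep only words we actually have vectors for.
    words = [w for w in tokens if w in vectors]
    # Frequent words (large p(w)) receive small weights.
    weights = [a / (a + freq[w]) for w in words]
    return np.average([vectors[w] for w in words], axis=0, weights=weights)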

All fse models require the input to be a list of tuples. The first entry of each tuple is a list of tokens (the sentence); the second is the index of the sentence. The index determines the target row of the embedding matrix to which the sentence is added. We will later write multiple sentences (reviews) to a single row (many-to-one).

s = (["Hello", "world"], 0)

fse provides multiple input classes, each with different functionality. There are six input classes to choose from:

IndexedList: for already pre-split sentences.

CIndexedList: for already pre-split sentences with a custom index for each sentence.

SplitIndexedList: for sentences which have not been split. Will split the strings.

SplitCIndexedList: for sentences which have not been split, with a custom index for each sentence.

CSplitIndexedList: for sentences which have not been split. Will split the strings. You can provide a custom split function.

CSplitCIndexedList: for sentences where you want to provide a custom index and a custom split function.

In addition, IndexedLineDocument streams sentences from disk and is indexable, which eases searching for similar sentences.

These are ordered by speed: IndexedList is the fastest, while CSplitCIndexedList is the slowest variant (more calls = slower). Why multiple classes? Because I wanted each __getitem__ method to have as few lines of code as possible, so as not to slow down the computation.
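To make the difference concrete, here is a small sketch (with toy sentences) contrasting the two most common classes:

from fse import IndexedList, SplitIndexedList

sentences = ["Hello world", "This is another sentence"]

# SplitIndexedList stores the raw strings and splits lazily in __getitem__.
split_indexed = SplitIndexedList(sentences)
print(split_indexed[0])  # (['Hello', 'world'], 0)

# IndexedList expects the sentences to be pre-split.
indexed = IndexedList([s.split() for s in sentences])
print(indexed[1])  # (['This', 'is', 'another', 'sentence'], 1)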

For our review data we use SplitCIndexedList, as we do not want to pre-split the data (pre-splitting 7 million reviews takes up a huge amount of RAM). Internally, the class merely points to the reviews and performs the preprocessing only when __getitem__ is called.

from fse import SplitCIndexedList

review = ["I really like this product.", "Its nice and comfy."]
s = SplitCIndexedList(review, custom_index=[0, 0])
print(s[0])
print(s[1])

>>> (['I', 'really', 'like', 'this', 'product.'], 0)
>>> (['Its', 'nice', 'and', 'comfy.'], 0)

Notice that both sentences point to index 0. Therefore, they will both be added to the embedding at index 0. To map each ASIN to an index, we only need a few further convenience lines:

from fse import SplitCIndexedList

ASIN_TO_IDX = {asin: index for index, asin in enumerate(meta.index)}
indexed_reviews = SplitCIndexedList(data.reviewText.values, custom_index=[ASIN_TO_IDX[asin] for asin in data.asin])

Now that we have everything ready, we can just call

model.train(indexed_reviews)
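As a quick sanity check (assuming training went through), the embedding matrix should contain one row per unique ASIN:

# Expected: roughly (97249, 300), one row per unique ASIN.
print(model.sv.vectors.shape)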

The model was trained on a cloud instance with 16 cores and 32 GB of RAM. The whole process takes around 15 minutes, at about 8,500 reviews per second. Each review contains 86 words on average, and in total we encounter 593,774,622 words. We thus compress about 7 million reviews into a matrix of shape 100,000 × 300. Adding further workers makes no difference, because the preprocessing (splitting) is the bottleneck.

If your data is already pre-split, you may reach up to 500,000 sentences per second on a regular MacBook Pro. Take a look at the tutorial notebook if you want to learn more.

Using & Visualizing the Embedding

After training the sentence embedding, we can access each individual embedding by its index, or the complete embedding matrix. The syntax stays as close to Gensim's as possible for easy usage.

model.sv[0] # Access embedding with index 0

model.sv.vectors # Access embedding matrix

The corresponding SentenceVectors (sv) class provides quite a few functions to work with the resulting sentence embeddings. For example, you can use similarity, distance, most_similar, similar_by_word, similar_by_sentence, or similar_by_vector.
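For instance (a small sketch; the indices refer to rows of the embedding matrix, i.e. products):

# Products whose embeddings are closest to the product stored at row 0.
print(model.sv.most_similar(0))

# Cosine similarity between two product embeddings.
print(model.sv.similarity(0, 1))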

Visualizing this much data would take extremely long if we reverted to the standard sklearn t-SNE implementation. Thus, let us try a more optimized approach: FIt-SNE. This optimized t-SNE implementation uses Fourier transforms to speed up the computation; feel free to read the paper [4]. It works like a charm and uses all 16 cores of the machine.

import sys; sys.path.append('../FIt-SNE')

from fast_tsne import fast_tsne

mapping = fast_tsne(model.sv.vectors, perplexity=50, seed=42)

Having computed the mapping, we are effectively done with the most difficult part. After adding some information from the metadata to each point, we can finally export everything to Tableau.
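A sketch of that export step, assuming meta is indexed by ASIN in the same row order as the embedding (as in the indexing code above) and mapping is the (n, 2) array returned by fast_tsne; the file name and column names are illustrative:

import pandas as pd

# Join the 2-d coordinates with the product metadata and write a CSV
# that Tableau can read.
export = meta.copy()
export["x"] = mapping[:, 0]
export["y"] = mapping[:, 1]
export.to_csv("products_tsne.csv")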

Tableau Mapping

To access the corresponding graphics, visit my public Tableau page.

You can hover over each point and obtain information about product price, name, brand, and so on.

Looking at the embedding, we observe that there is quite a lot of information contained in each cluster.