In this post, we will see two different approaches to generating corpus-based semantic embeddings. Corpus-based semantic embeddings exploit statistical properties of the text to embed words in a vector space. We will be using Gensim, which provides algorithms for both LSA and Word2vec.

Basic differences

Word2vec is a prediction-based model, i.e., given the vector of a word, it predicts the context word vectors (skip-gram).

LSA/LSI is a count-based model: it builds a term-document count matrix in which similar terms have similar count distributions across documents. The dimensionality of this count matrix is then reduced using SVD.

For both models, similarity between vectors can be calculated using cosine similarity.
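Cosine similarity itself is a one-liner; here is a minimal sketch with NumPy (the three-dimensional vectors are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings, purely illustrative.
wine = np.array([0.9, 0.1, 0.0])
merlot = np.array([0.8, 0.2, 0.1])
spinach = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(wine, merlot) > cosine_similarity(wine, spinach))  # True
```

The same function works whether the vectors come from LSA's SVD factors or Word2vec's hidden layer, which is why it serves as a common yardstick for both approaches.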

Is Word2vec really better?

The Word2vec algorithm has been shown to capture similarity well; it is believed that prediction-based models capture similarity better than count-based ones. Should we always use Word2vec?

The answer is: it depends. LSA/LSI tends to perform better when your training data is small. On the other hand, Word2vec, being a prediction-based method, performs really well when you have a lot of training data. Since Word2vec has many parameters to train, it produces poor embeddings when the dataset is small.

Latent Semantic Analysis

Latent semantic analysis (or latent semantic indexing) literally means analyzing documents to find their underlying meaning or concepts. In this approach we pass in a set of training documents and define a possible number of concepts which might exist in them. The output of LSA is essentially a matrix mapping terms to concepts.

We basically start with a word-by-document co-occurrence matrix and apply normalization to down-weight uninformative words (think tf-idf). Finally we apply SVD (singular value decomposition) to this matrix to reduce the number of features from ~10,000 to around 100 to 300, condensing the important information into a small vector space.
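As a sketch of that pipeline (the tiny term-document count matrix below is made up; in practice it would be built from your corpus and tf-idf weighted before the SVD step):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents (made up).
# Terms: "wine", "braised", "ribs", "spinach", "dip".
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# SVD factors A into U * diag(s) * Vt; truncating to the top k singular
# values gives the reduced "concept" space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # each row: one term embedded in k concepts

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "wine" (row 0) lands near "braised" (row 1), far from "spinach" (row 3).
print(cos(term_vectors[0], term_vectors[1]))  # close to 1
print(cos(term_vectors[0], term_vectors[3]))  # close to 0
```

The truncation is the whole point: terms that co-occur across the same documents collapse onto the same concept axes, so "wine" and "braised" become near-identical in concept space even though their raw count rows differ.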

For more detail, see the Information Retrieval book.

Word2vec

Word2vec consists of two neural network language models, Continuous Bag of Words (CBOW) and Skip-gram. In both models, a window of predefined length is moved along the corpus, and in each step the network is trained with the words inside the window.

Whereas the CBOW model is trained to predict the word in the center of the window based on the surrounding words, the Skip-gram model is trained to predict the contexts based on the central word. Once the neural network has been trained, the learned linear transformation in the hidden layer is taken as the word representation.


Let’s take an example of identifying similar recipes. You can find the dataset here: https://www.kaggle.com/hugodarwood/epirecipes

import logging

import numpy as np
import pandas as pd
from annoy import AnnoyIndex
from gensim import corpora, models, similarities
from gensim.models import Word2Vec
from nltk import FreqDist
from nltk.corpus import stopwords

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
stopwords = stopwords.words('english')


class Similarity(object):
    def __init__(self, data, num_topics=10):
        self.WORD2VEC_LEN = 300
        self.num_topics = num_topics
        self.data = data
        self.tokenized_data = self._tokens()
        self.freqdist = FreqDist([x for y in self.tokenized_data for x in y])

    def _tokens(self):
        return [[word for word in str(document).lower().split() if word not in stopwords]
                for document in self.data]

    def filter_tokens(self):
        """Filter out tokens which have occurred only once."""
        return [[tk for tk in entry if self.freqdist[tk] > 1]
                for entry in self.tokenized_data]

    def build_dictionary(self):
        logging.info("In building dictionary")
        self.dictionary = corpora.Dictionary(self.tokenized_data)
        self.dictionary.save('similarity_dictionary.dict')

    def build_corpus(self):
        self.corpus = [self.dictionary.doc2bow(text) for text in self.tokenized_data]
        corpora.MmCorpus.serialize('similarity.mm', self.corpus)

    def build_lsi(self):
        logging.info("Building lsi model")
        self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary,
                                   num_topics=self.num_topics)
        # self.lsi.print_topics(self.num_topics)
        self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
        self.index.save('similarity.index')
        for t in np.random.choice(self.data, 10):
            logging.info("Which of the recipes are more similar to : {}".format(t))
            vec_bow = self.dictionary.doc2bow(t.lower().split())
            vec_lsi = self.lsi[vec_bow]
            sims = sorted(enumerate(self.index[vec_lsi]), key=lambda item: -item[1])
            seen = {t}
            for x, _ in sims:
                if self.data[x] not in seen:
                    logging.info(self.data[x])
                    seen.add(self.data[x])
                if len(seen) > 5:
                    break
            logging.info("*" * 10)

    def get_vector(self, data):
        # Max-pool the embeddings of all in-vocabulary tokens into one document vector.
        data = str(data).lower()
        return np.max([self.w2v_model.wv[x] for x in data.split() if x in self.w2v_model.wv],
                      axis=0)

    def build_word2vec(self):
        self.w2v_model = Word2Vec(self.tokenized_data, size=self.WORD2VEC_LEN,
                                  window=5, negative=10)
        self.annoy_index = AnnoyIndex(self.WORD2VEC_LEN, 'angular')
        for i, rname in enumerate(self.data):
            try:
                self.annoy_index.add_item(i, self.get_vector(rname))
            except ValueError:  # no in-vocabulary tokens
                pass
        self.annoy_index.build(50)
        for name in np.random.choice(self.data, 10):
            try:
                logging.info("*" * 50)
                logging.info("Source : {}".format(name))
                res = self.annoy_index.get_nns_by_vector(self.get_vector(name), 5,
                                                         include_distances=True)
                for i, rec in enumerate(self.data[x] for x in res[0]):
                    logging.info("Recipe {} : {}, score: {}".format(i + 1, rec, res[1][i]))
            except ValueError:
                pass

    def validate(self):
        for name in np.random.choice(self.data, 20):
            try:
                logging.info("-" * 50)
                logging.info("Which of the recipes are more similar to : {}".format(name))
                res = self.annoy_index.get_nns_by_vector(self.get_vector(name), 5,
                                                         include_distances=True)
                logging.info("************ Word2vec Engine **************")
                for i, rec in enumerate(self.data[x] for x in res[0]):
                    logging.info("Recipe {} : {}".format(i + 1, rec))
                logging.info("************ LSA Engine **************")
                vec_bow = self.dictionary.doc2bow(name.lower().split())
                vec_lsi = self.lsi[vec_bow]
                sims = sorted(enumerate(self.index[vec_lsi]), key=lambda item: -item[1])
                seen = {name}
                for x, _ in sims:
                    if self.data[x] not in seen and len(seen) <= 5:
                        logging.info("Recipe {} : {}".format(len(seen), self.data[x]))
                        seen.add(self.data[x])
            except ValueError:
                pass

    def build_lda(self):
        logging.info("Building lda model")
        logging.info("*" * 50)
        lda = models.LdaModel(self.corpus_tfidf, id2word=self.dictionary,
                              num_topics=self.num_topics)
        lda.print_topics(self.num_topics)
        logging.info("*" * 50)

    def build(self):
        logging.info("In build")
        self.build_dictionary()
        self.build_corpus()
        self.tfidf = models.TfidfModel(self.corpus)
        self.corpus_tfidf = self.tfidf[self.corpus]
        self.build_lsi()
        self.build_word2vec()
        # self.build_lda()
        self.validate()


if __name__ == "__main__":
    df = pd.read_csv('epi_r.csv')
    sim = Similarity(df.title.values, num_topics=100)
    sim.build()

And here are the results for the two methods. Let me know which one you think is doing a better job.

1) Which of the recipes are more similar to : Red Wine-Braised Short Ribs with Vegetables

************ Word2vec Engine **************

Recipe 1 : Wine-Braised Red Cabbage

Recipe 2 : Calamari with Roasted Tomato Sauce

Recipe 3 : Black Bean, Jícama, and Grilled Corn Salad

Recipe 4 : Grilled Corn on the Cob with Garlic Butter, Fresh Lime, and Queso Fresco

Recipe 5 : Green Goddess Spinach Dip

************ LSA Engine **************

Recipe 1 : Red Wine Brasato with Glazed Root Vegetables

Recipe 2 : Braised Short Ribs with Red Wine and Pureed Vegetables

Recipe 3 : Oxtail Soup with Red Wine and Root Vegetables

Recipe 4 : Red Wine–Braised Short Ribs

Recipe 5 : Red Snapper à la Niçoise

2) Which of the recipes are more similar to : Pan-Seared Salmon Over Red Cabbage and Onions with Merlot Gastrique

************ Word2vec Engine **************

Recipe 1 : Oxtail Soup with Red Wine and Root Vegetables

Recipe 2 : Celery Root and Potato Puree with Roasted Jerusalem Artichoke “Croutons”

Recipe 3 : Green Goddess Spinach Dip

Recipe 4 : Grilled Tuna with Provençal Vegetables and Easy Aioli

Recipe 5 : Slow-Braised Lamb Shanks with Guajillo-Pineapple Sauce, Roasted Vegetables, and Coconut Tamales

************ LSA Engine **************

Recipe 1 : Red Cabbage and Onions

Recipe 2 : Red Cabbage with Raspberries, Onions and Apples

Recipe 3 : Pickled Red Onions

Recipe 4 : Pickled Red Onions with Cilantro

Recipe 5 : Lime-Pickled Red Onions

3) Which of the recipes are more similar to : Julia’s Roast Chicken with Lemon and Herbs

************ Word2vec Engine **************

Recipe 1 : Crispy Roast Duck with Blackberry Sauce

Recipe 2 : Roast Cod with Potatoes, Onions, and Olives

Recipe 3 : Lemon Garlic Mayonnaise

Recipe 4 : Grilled Spiced Chicken Breasts

Recipe 5 : Grilled Lobster with Ginger, Garlic, and Soy Sauce

************ LSA Engine **************

Recipe 1 : Roast Chicken with Lemon and Thyme

Recipe 2 : Roast Chicken Legs with Lemon and Thyme

Recipe 3 : Tarragon and Lemon Roast Chicken

Recipe 4 : Roast Chicken with Lemon and Fresh Herbs

Recipe 5 : Roast Chicken With Lemon and Butter

Let me know if you have any feedback or want me to write about any other topics.