Natural Language Processing (NLP) is a messy, difficult affair: preprocessing, machine learning, relationships, entities, ontologies and what not.

Word embeddings/representations – ever since they arrived with the seminal work of Mikolov et al., they have been revolutionary, to say the least. The concept itself is very intuitive and motivates deeper understanding for a wide range of applications. The main advantage of distributed representations is that similar words are close in the vector space, which makes generalization to novel patterns easier and model estimation more robust. Distributed vector representations have been shown to be useful in many natural language processing tasks such as Named Entity Recognition (NER), Word Sense Disambiguation (WSD), parsing, tagging and machine translation.
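The "similar words are close in vector space" idea is usually measured with cosine similarity. Here is a minimal sketch with made-up 3-dimensional toy vectors (real embeddings have hundreds of dimensions learned from corpora; the words and numbers below are purely illustrative):

```python
import math

# Toy vectors, purely illustrative -- NOT real trained embeddings.
vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(vectors["king"], vectors["queen"]))  # close to 1: related words
print(cosine(vectors["king"], vectors["apple"]))  # much lower: unrelated words
```

With trained embeddings, this is exactly what Gensim's `most_similar` queries compute under the hood.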

They can be generated by two techniques – word2vec and GloVe, pioneered by Google and Stanford respectively. The two models take different approaches but produce similar results. Both learn word vectors from co-occurrence information. Broadly, they differ in that word2vec is a “predictive” model, whereas GloVe is a “count-based” model. You can read more in this paper.
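The co-occurrence information both models start from is just counts of which words appear near which other words. A minimal sketch of collecting those counts over a toy sentence (window size and tokenization are simplified assumptions; real pipelines use large corpora and weighting schemes):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered pair of words appears within
    `window` positions of each other -- the raw statistic that
    count-based models like GloVe build on."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens)
print(counts[("the", "sat")])  # "the" appears twice within 2 words of "sat"
```

GloVe factorizes (a transformation of) this matrix directly, while word2vec slides over the corpus predicting words from their neighbors, implicitly using the same statistics.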

I have been experimenting with both of them of late, using their models with Gensim. However, I used to get error messages like these when trying to import GloVe vectors into Gensim. After some research, I found that word2vec embedding files start with a header line containing the number of tokens and the number of dimensions in the file. This lets Gensim allocate memory up front for querying the model – larger dimensions mean more memory held captive. Accordingly, this line has to be inserted into the GloVe embeddings file. I have written a hack for the same purpose – somewhat ugly, but it handles memory constraints well. Take a look here.
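For reference, a minimal sketch of that conversion – two streaming passes over a plain-text GloVe file, so nothing large is held in memory. The function name and the tiny demo file are my own illustration, not the actual script linked above:

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Prepend the "<num_tokens> <num_dimensions>" header line that
    Gensim's word2vec text-format loader expects, streaming line by
    line to keep memory usage low."""
    num_tokens = 0
    num_dims = 0
    # First pass: count lines, read dimensionality off the first row
    # (one word followed by its vector components).
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            if num_tokens == 0:
                num_dims = len(line.rstrip().split()) - 1
            num_tokens += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_tokens} {num_dims}\n")
        for line in src:
            dst.write(line)
    return num_tokens, num_dims

# Demo on a tiny fake GloVe file: two words, three dimensions each.
tmpdir = tempfile.mkdtemp()
glove_path = os.path.join(tmpdir, "glove.txt")
out_path = os.path.join(tmpdir, "glove.w2v.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
n, d = add_word2vec_header(glove_path, out_path)
with open(out_path, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)  # "2 3"
```

The resulting file can then be loaded with Gensim's word2vec-format loader. (Later Gensim versions also ship a `gensim.scripts.glove2word2vec` helper that does this conversion for you.)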