Introduction

Extracting meaningful and useful text features is an art in itself, and sometimes simply choosing a model can be an uphill battle. In this particular problem, we were trying to understand the linguistic features of the current job market. There are two easy options for this: learning from job posts or learning from resumes, the latter being more difficult to obtain than the former. Job posts give you a sense of demand in a certain market, while resumes act as a supply-side quality metric trying to meet that demand. For our task of understanding skill similarity and the transferability of skill-assets, I turned to word2vec for its speed and ease of use.

This, however, was not an easy decision, as a myriad of awesome options exist. One such option was GloVe (Global Vectors for Word Representation), which learns word vectors by analyzing word co-occurrences within a large text corpus. It is a rather cool implementation, and I recommend reading this paper for a description of how the word vectors are manipulated. The reason I did not use this method is that I wanted to avoid storing a massive co-occurrence matrix in memory, even a sparse one. I like using online learning whenever I can, and gensim's word2vec conveniently keeps only the model itself in memory during training (a bit of code on that later), which made the decision a tad easier.

Another method I considered for this task was the well-documented Latent Dirichlet Allocation (LDA). My initial reaction was that LDA would be a great solution for my problem of extracting skills from text, but upon further reflection I realized that I wanted features that are similar with respect to their linguistic context, as opposed to words that are related via a 'topic'. For instance: python, data science, sklearn, R, and CRAN might appear in the same LDA 'topic', but if I only wanted python and R to appear in a given similarity query, my uneducated guess is that word2vec would offer a greater probability of that happening. That might not be the best example of where my mind was going, but in my experience with LDA a lot of noise ends up in the actual topic buckets, and I really wanted to find a way to minimize that noise. So with that, I went down the path of word2vec.

Word2Vec

Word2vec is a pretty cool tool. -Biggy Smalls

Word2vec computes vector representations of words using a few different techniques, two of which are the continuous bag-of-words (CBOW) model and an architecture called the skip-gram. The high-level training objective of the CBOW model is to combine the representations of surrounding words to predict the word in the middle, while the training objective of the skip-gram model is to learn word-vector representations that are good at predicting a word's context in the same sentence. It is important to note that both models are trainable in a short amount of time, but that CBOW is slightly faster and is better suited to larger datasets. Considering my toy dataset only consists of ~80,000 sentences, I will use the skip-gram architecture for this post. (In my actual model, consisting of a few billion sentences, I will be using CBOW.)
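To make the two objectives concrete, here is a toy sketch (the function name and window size are illustrative, not anything from the actual pipeline) that enumerates the training pairs each architecture would see for a single sentence:

```python
def training_pairs(tokens, k=2):
    """Illustrate the two objectives on one sentence: CBOW predicts the
    center word from its surrounding words, while skip-gram predicts
    each surrounding word from the center word. k is the window size."""
    cbow, skipgram = [], []
    for t, center in enumerate(tokens):
        context = [tokens[t + j] for j in range(-k, k + 1)
                   if j != 0 and 0 <= t + j < len(tokens)]
        if context:
            # CBOW example: (context words -> center word)
            cbow.append((context, center))
            # Skip-gram examples: (center word -> each context word)
            skipgram.extend((center, c) for c in context)
    return cbow, skipgram

cbow, sg = training_pairs(['deep', 'learning', 'of', 'word', 'vectors'], k=1)
```

With a window of 1, CBOW sees pairs like `(['deep', 'of'], 'learning')`, while skip-gram sees `('learning', 'deep')` and `('learning', 'of')` as two separate examples.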

Skipgram

Given a sequence of training words w_1, w_2, …, w_T, the objective of the Skipgram model is to maximize the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-k \leq j \leq k,\; j \neq 0}} \log p(w_{t+j} \mid w_t)$$

where k is the size of the training window. The inner summation runs from −k to k and computes the log probability of correctly predicting the word w_{t+j} given the word in the middle, w_t. The outer summation runs over all words in the training corpus. Every word w is associated with two learnable parameter vectors, u_w and v_w. The probability of correctly predicting the word w_i given the word w_j is defined as

$$p(w_i \mid w_j) = \frac{\exp\left(u_{w_i}^{\top} v_{w_j}\right)}{\sum_{l=1}^{V} \exp\left(u_{l}^{\top} v_{w_j}\right)}$$

where V is the number of words in the vocabulary (Mikolov et al. 2013). The cost of computing ∇ log p(w_i | w_j) is proportional to V, so an efficient alternative is the hierarchical softmax.

In the computation of the hierarchical softmax, the first step is to build a binary Huffman tree based on word frequencies, where each word is a leaf of the tree. Here you can find a bit more information on Huffman trees.
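As a concrete sketch of that first step, the following builds Huffman codes from a made-up word-frequency table using Python's heapq. This is roughly what gensim does internally when hierarchical softmax is enabled, so it is purely illustrative:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a binary Huffman tree from {word: frequency} and return
    each word's code (its path from the root); frequent words end up
    with short paths, rare words with long ones."""
    tiebreak = count()  # keeps heap comparisons well-defined on ties
    heap = [(f, next(tiebreak), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least-frequent nodes into one.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):   # inner node: recurse into children
            walk(node[0], path + '0')
            walk(node[1], path + '1')
        else:                         # leaf: record the word's code
            codes[node] = path
    walk(heap[0][2], '')
    return codes

codes = huffman_codes({'the': 50, 'python': 10, 'sklearn': 5, 'data': 20})
```

The most frequent word ('the' here) gets the shortest path, which is exactly why hierarchical softmax is cheap on average: predicting a word costs one sigmoid per inner node on its path rather than one dot product per vocabulary word.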

The binary tree acts as a representation of the output layer, whereby a random walk from the root assigns probabilities to the words at the leaves. Let n(w, j) be the j-th node on the path from the root to word w, and let L(w) be the length of this path, so that n(w, 1) = root and n(w, L(w)) = w. For any inner node n, let ch(n) be an arbitrary fixed child of n, let ⟦x⟧ be 1 if x is true and −1 otherwise, and let σ(x) = 1/(1 + exp(−x)). Hierarchical softmax then defines the task of predicting the target word wO given the input word wI, p(wO|wI), as

$$p(w_O \mid w_I) = \prod_{j=1}^{L(w_O)-1} \sigma\left(⟦n(w_O, j+1) = \mathrm{ch}(n(w_O, j))⟧ \cdot u_{n(w_O, j)}^{\top} v_{w_I}\right)$$

Unlike the standard softmax formulation, which assigns two representations u_w and v_w to each word w, the hierarchical softmax formulation has one representation v_w for each word w and one representation u_n for every inner node n of the binary tree.

So in the formula above, all the words and inner nodes have initialized embeddings that are gradually updated. If a given context is more similar to the inner nodes on the path to a certain word, then that word has a higher probability of being the target word in that context. The sigmoid converts each similarity score into a probability used to choose between subnodes: if the (j+1)-th ancestor of the target word w is the designated child of its j-th ancestor, the probability of stepping from j to j+1 is σ(+similarity between the context and ancestor j), and σ(−similarity) otherwise. Multiplying the branch probabilities along the path one by one gives the normalized probability of a target word given its context.
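A tiny numeric sketch of that normalization (the tree shape and similarity scores below are made up): because σ(x) + σ(−x) = 1 at every branch, the leaf probabilities always sum to one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A toy balanced tree over a four-word vocabulary: reaching each word
# takes two branch decisions. The scores are invented similarities
# between one context vector and the three inner-node vectors.
root, left_inner, right_inner = 0.8, -1.3, 2.1

probs = {
    'python': sigmoid(root)  * sigmoid(left_inner),    # root->left->left
    'r':      sigmoid(root)  * sigmoid(-left_inner),   # root->left->right
    'excel':  sigmoid(-root) * sigmoid(right_inner),   # root->right->left
    'word':   sigmoid(-root) * sigmoid(-right_inner),  # root->right->right
}

# Since sigmoid(x) + sigmoid(-x) = 1 at every inner node, these four
# leaf probabilities form a normalized distribution over the vocabulary.
```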

word2vec code

Instead of writing my own word2vec code, I turned to the already well documented version, gensim. Gensim's implementation of word2vec is incredibly intuitive and easy to use. I recommend looking at the docs to get a feel of how to integrate the tool into your existing pipeline.

In this instance, I collected job postings from a few job-posting websites using titles generated by Payscale. For the sake of sparing those websites' servers, I would advise going the API route rather than harassing them with scraping calls. The data in this case is a dictionary of the format:

```python
{'Account Manager Sales': [u'Job Description 1...', u'Job Description 2...'],
 'Sr. Data Scientist': [u'Job Description 1...'],
 ...,
 'George W. Bush': [u'I was the boss at USA...'],
 ...}
```

The goal was to find job postings and resumes varied enough that each job title had sufficient data, with sentence structure represented from both sides of the job equation. Below is the code used to generate a word2vec model on an in-memory list.

An alternative to in-memory analysis is to use Radim's method of iterating through files on multiple disks or instances, using in-memory computation only for the model and its training. (This is similar to what I will actually be using in my larger CBOW model.) Here is an example of what that might look like: