Learning Phrases From Unsupervised Text (Collocation Extraction)

We can easily create bi-grams from our unsupervised corpus and feed them as input to Word2Vec. For example, the sentence “I walked today to the park” will be converted to “I_walked walked_today today_to to_the the_park”, and each bi-gram will be treated as a uni-gram in the Word2Vec training phase (a short code sketch of this conversion follows the list below). It will work, but there are some problems with this approach:

- It will learn embeddings only for bi-grams, while many of these bi-grams are not really meaningful (for example, “walked_today”), and we will miss embeddings for uni-grams like “walked” and “today”.
- Working only with bi-grams creates a very sparse corpus. Think, for example, about the above sentence “I walked today to the park”. Let’s say the target word is “walked_today”; this term is not very common in the corpus, and we will not have many context examples to learn a representative vector for it.
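Here is a minimal sketch of this naive conversion (the function name is ours, for illustration):

def sentence_to_all_bi_grams(sentence):
    # Pair every token with its successor, e.g. 'I walked today to the park'
    # becomes 'I_walked walked_today today_to to_the the_park'
    tokens = sentence.split()
    return ' '.join(f'{first}_{second}' for first, second in zip(tokens, tokens[1:]))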

So, how do we overcome this problem? How do we extract only meaningful bi-grams, while keeping as uni-grams the words whose mutual information is not strong enough? As always, the answer is inside the question: mutual information.

Mutual Information (MI)

Mutual information between two random variables X and Y is a measure of the dependence between X and Y. Formally:

I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

Mutual Information (MI) of random variables X and Y.

In our case, X and Y range over the words in the corpus, and a pair (x, y) is a bi-gram in which y comes right after x.

Pointwise Mutual Information (PMI)

PMI is a measure of the dependence between concrete occurrences x and y. For example: x = walked, y = today. Formally:

\mathrm{pmi}(x;y) = \log \frac{p(x,y)}{p(x)\,p(y)}

PMI of concrete occurrences of x and y.

It’s easy to see that when two words x and y appear together many times, but not alone, PMI(x;y) will have a high value, while it will have a value of 0 if x and y are completely independent.

Normalized Pointwise Mutual Information (NPMI)

While PMI measures the dependence between occurrences of x and y, it has no upper bound on its values [3]. We want a measure that is comparable across all bi-grams, so that we can keep only the bi-grams above a certain threshold; specifically, we want the measure to reach a maximum value of 1 for perfectly correlated words x and y. Formally:

\mathrm{npmi}(x;y) = \frac{\mathrm{pmi}(x;y)}{-\log p(x,y)}

Normalized Pointwise Mutual Information (NPMI) of x and y.
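As an illustration, here is a toy sketch (the helper names are ours) that computes both measures directly from corpus counts:

import math

def pmi(count_xy, count_x, count_y, total):
    # Estimate p(x,y), p(x) and p(y) from raw counts
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

def npmi(count_xy, count_x, count_y, total):
    # Normalizing by -log p(x,y) bounds the score above by 1
    p_xy = count_xy / total
    return pmi(count_xy, count_x, count_y, total) / -math.log(p_xy)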

Data-driven Approach

Another way to extract phrases from text is to use the following formula [4], which takes into account the uni-gram and bi-gram counts and applies a discounting coefficient δ to prevent the creation of bi-grams from words that are too rare. Formally:

\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}

Read this article for more details.
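As a direct translation of the formula into code (the function name is ours; library implementations, such as Gensim’s default scorer, may additionally scale this value):

def mikolov_score(count_xy, count_x, count_y, delta):
    # The discounting coefficient delta ensures that a pair seen fewer
    # than delta times can never receive a positive score
    return (count_xy - delta) / (count_x * count_y)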

Extract Similar Phrases

Now that we have a way to extract meaningful bi-grams from our large unsupervised corpus, we can merge every bi-gram whose NPMI is above a certain threshold into a single uni-gram, for example: “inflection point” will be transformed to “inflection_point”. It’s easy to create tri-grams by taking the transformed corpus, which already contains bi-grams, and running the process again (with a lower threshold) to form tri-grams. Similarly, we can continue this process up to n-grams with a decreasing threshold.
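Using the Gensim Phrases API introduced in the code section below, this iteration can be sketched roughly as follows (assuming tokenized_sentences is a list of token lists; the threshold values are illustrative):

from gensim.models.phrases import Phrases, Phraser

# First pass: learn bi-grams over the tokenized corpus
bigram = Phraser(Phrases(tokenized_sentences, min_count=5, threshold=7))

# Second pass: run the same process over the bi-gram-merged corpus,
# with a lower threshold, to form tri-grams
trigram = Phraser(Phrases((bigram[sentence] for sentence in tokenized_sentences),
                          min_count=5, threshold=5))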

Our corpus consists of ~60 million sentences that contain 1.6 billion words in total. It took us 1 hour to construct bi-grams using the data-driven approach. The best results were achieved with a threshold of 7 and a minimum term count of 5.

We measured the results using an evaluation set that contains important bi-grams we want to identify, like financial terms, people’s names (mostly CEOs and CFOs), cities, countries, etc. The metric we used is a simple recall: what fraction of the evaluation set is covered by our extracted bi-grams? In this specific task we care more about recall than precision, so we allowed ourselves to use a relatively low threshold when extracting the bi-grams. We do take into consideration that our precision might get worse when lowering the threshold, and in turn we might extract bi-grams that are not very valuable, but that is preferable to missing important bi-grams when performing the Query Expansion task.
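This recall metric takes only a few lines to compute (the names are ours; evaluation_set is the set of important bi-grams described above):

def recall(extracted_bi_grams, evaluation_set):
    # Fraction of the evaluation bi-grams covered by the extracted ones
    return len(evaluation_set & set(extracted_bi_grams)) / len(evaluation_set)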

Example Code

Reading the corpus line by line (we assume each line contains one sentence) in a memory-efficient way:

def get_sentences(input_file_pointer):
    while True:
        line = input_file_pointer.readline()
        if not line:
            break
        yield line
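For example, assuming the corpus lives in a single text file (the path and the downstream processing are placeholders):

with open('corpus.txt', encoding='utf-8') as input_file_pointer:
    for sentence in get_sentences(input_file_pointer):
        ...  # clean, tokenize and transform each sentence, as shown below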

Cleaning each sentence by trimming leading and trailing spaces, lowercasing, removing punctuation and unnecessary characters, and collapsing duplicate spaces into a single space (note that this last step is not strictly necessary, because later on we tokenize our sentences by whitespace):

import re

def clean_sentence(sentence):
    # Lowercase and trim, keep only letters, digits and whitespace,
    # then collapse runs of spaces into a single space
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[^a-z0-9\s]', '', sentence)
    return re.sub(r'\s{2,}', ' ', sentence)

Tokenizing each sentence by a simple space delimiter (more advanced tokenization techniques exist, but tokenizing by a single space gave us good results and works well in practice), and removing stop-words. Removing stop-words is task dependent, and in some NLP tasks keeping the stop-words yields better results; one should evaluate both approaches. For this task, we used spaCy’s stop-word set.

from spacy.lang.en.stop_words import STOP_WORDS

def tokenize(sentence):
    return [token for token in sentence.split() if token not in STOP_WORDS]

Now that we have a representation of our sentences as a 2-d matrix of cleaned tokens, we can build bi-grams. We will use the Gensim library, which is highly recommended for semantic NLP tasks. Fortunately, Gensim has an implementation for phrase extraction, both with NPMI and with the data-driven approach of Mikolov et al. described above. One can easily control the hyperparameters, like the minimum term count, the threshold and the scoring method ('default' for the data-driven approach and 'npmi' for NPMI). Note that the threshold values differ between the two approaches, and one needs to take this into account.

from gensim.models.phrases import Phrases, Phraser

def build_phrases(sentences):
    phrases = Phrases(sentences,
                      min_count=5,
                      threshold=7,
                      progress_per=1000)
    return Phraser(phrases)

After we finish building the phrases model, we can save it easily and load it later:

phrases_model.save('phrases_model.txt')
phrases_model = Phraser.load('phrases_model.txt')

Now that we have a phrases model, we can use it to extract bi-grams for a given tokenized sentence:

def sentence_to_bi_grams(phrases_model, sentence):
    # 'sentence' is a list of tokens; the model merges detected
    # bi-grams into single tokens like 'inflection_point'
    return ' '.join(phrases_model[sentence])

We want to create, based on our corpus, a new corpus in which the meaningful bi-grams are concatenated together, for later use:
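A minimal sketch of that final step, chaining together the helpers defined above (the function name and file paths are illustrative):

def sentences_to_bi_grams(phrases_model, input_file_path, output_file_path):
    # Stream the raw corpus, clean and tokenize every sentence, merge
    # detected bi-grams, and write the transformed corpus to a new file
    with open(input_file_path, encoding='utf-8') as input_file_pointer, \
         open(output_file_path, 'w', encoding='utf-8') as output_file_pointer:
        for sentence in get_sentences(input_file_pointer):
            cleaned_sentence = clean_sentence(sentence)
            tokens = tokenize(cleaned_sentence)
            parsed_sentence = sentence_to_bi_grams(phrases_model, tokens)
            output_file_pointer.write(parsed_sentence + '\n')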