Data Science is more than just statistics and machine learning on numbers. A lot of data is "unstructured," which means text (or worse, both text and numbers). While natural language processing has been around for half a century, its importance in the fields of Big Data and Data Science is growing and can no longer be ignored if one is to maintain competitive advantage.

There is a planet full of tools, and herein I describe one grain of sand out of that planet: Semantic Similarity Metrics.

Given a document of text (e.g. a Facebook posting or an e-mail), we can turn it into a set of words or a bag of words. A bag of words is like a set of words, except it also includes the multiplicity. E.g. the miniature document "Now, come now" represented as a set of words would be {"now", "come"} whereas as a bag of words would be

Word    Freq
now     2
come    1

Sets of words and bags of words can alternatively be considered as Boolean vectors and numeric vectors, respectively.
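To make the two representations concrete, here is a minimal Python sketch of the miniature document above. The `tokenize` helper and its regular expression are assumptions chosen for illustration; real tokenization (punctuation, stemming, stop words) is usually more involved.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

doc = "Now, come now"
tokens = tokenize(doc)        # ['now', 'come', 'now']
word_set = set(tokens)        # the set of words: 'now' and 'come'
word_bag = Counter(tokens)    # the bag of words: 'now' twice, 'come' once

# Fixing a vocabulary order turns both views into vectors.
vocab = sorted(word_set)                      # ['come', 'now']
bool_vec = [w in word_set for w in vocab]     # [True, True]
count_vec = [word_bag[w] for w in vocab]      # [1, 2]
```

In practice the vocabulary would span every document being compared, so a word missing from one document shows up as False or 0 in that document's vector.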

A common need when processing documents is to evaluate their similarity, e.g. to determine if they are duplicates, or to determine how close a sample document might be to a "reference" document (e.g. for automated essay scoring). There are various similarity metrics available, for both Boolean and numeric vectors.

Similarity Metrics for Boolean Vectors

Recall that by "Boolean vectors" we really mean sets. It is easier to think about and discuss them as sets than as literal Boolean vectors, so we use set notation.

Jaccard Index

The Jaccard Index is the simplest metric, the number of words the two sets share divided by the number of distinct words across both:

\[\frac{\left|A \cap B\right|}{\left|A \cup B\right|}\]
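As a rough sketch, the Jaccard index is only a few lines of Python on top of the built-in `set` type. The function name and the handling of two empty sets are assumptions made here, not something the formula prescribes.

```python
def jaccard(a, b):
    """Jaccard index of two collections of words, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(union)

# One shared word ("now") out of three distinct words overall gives 1/3.
score = jaccard({"now", "come"}, {"now", "go"})
```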

Dice-Sørensen

The Dice-Sørensen coefficient (aka just Dice or just Sørensen) is similar to the Jaccard index:

\[\frac{2\left|A \cap B\right|}{\left|A\right| + \left|B\right|}\]

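They both give scores in the range [0,1]. But Dice emphasizes similarity: it counts the shared words twice, so for the same pair of sets it is never lower than Jaccard, and it is strictly higher whenever the sets overlap only partially. However, the distance derived from Dice (1 minus the Dice score) does not satisfy the triangle inequality, so unlike the Jaccard distance it is not a true metric in the mathematical sense of the word.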
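The sketch below implements Dice the same way and checks the triangle-inequality remark on a tiny example. The three sets are made up for illustration, and treating two empty sets as identical is again an assumption.

```python
def dice(a, b):
    """Dice-Sørensen coefficient of two collections of words, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return 2 * len(a & b) / (len(a) + len(b))

def dice_distance(a, b):
    return 1.0 - dice(a, b)

# Hypothetical example: A and B share nothing, but each is half of C.
A, B, C = {"a"}, {"b"}, {"a", "b"}
direct = dice_distance(A, B)                         # 1: no overlap at all
via_c = dice_distance(A, C) + dice_distance(C, B)    # about 1/3 + 1/3

# The direct "distance" exceeds the detour through C,
# which a true metric would never allow.
violates_triangle = direct > via_c
```

The two scores are also tied together by Dice = 2J / (1 + J), where J is the Jaccard index of the same pair, which is why Dice never ranks pairs differently from Jaccard even though its values are higher.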

Tversky

Tversky is a generalization of Jaccard and Dice, in that Jaccard and Dice become just special cases of Tversky:

\[\frac{\left|A \cap B\right|}{\left|A \cap B\right| + \alpha\left|A-B\right| + \beta\left|B-A\right|}\]

We arrive at Jaccard with \(\alpha=\beta=1\) and at Dice with \(\alpha=\beta=0.5\). But by setting \(\alpha\) and \(\beta\) to different values, we can apply Tversky to situations where we wish to treat documents asymmetrically. For example, suppose that instead of documents A and B that are treated equally, we have a reference set R (perhaps some sort of answer key) in the role of A and a user set U in the role of B. Then by setting \(\alpha\) high we can "punish" the user for missing words that were expected in R, because \(\alpha\) weights the words that are in R but not in U. Alternatively, we could set \(\beta\) high, which weights the words in U that are absent from R, to "punish" the scoring for not finding the R that best matches the user input U.
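A minimal sketch of Tversky follows. It uses the standard form of the index, with the intersection size alone in the numerator, which is the form that reduces to Jaccard and Dice at the parameter values given above. The reference and user word sets are invented for illustration.

```python
def tversky(a, b, alpha, beta):
    """Tversky index: alpha weights words only in a, beta weights words only in b."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    if denom == 0:
        return 1.0  # assumption: two empty documents count as identical
    return common / denom

# Hypothetical answer key R and user answer U.
R = {"cell", "nucleus", "membrane", "dna"}
U = {"cell", "dna", "banana"}

as_jaccard = tversky(R, U, 1, 1)         # 2 / 5, the Jaccard index of R and U
as_dice = tversky(R, U, 0.5, 0.5)        # 4 / 7, the Dice score of R and U
strict_on_misses = tversky(R, U, 2, 0.5)   # heavier penalty for words missing from U
strict_on_extras = tversky(R, U, 0.5, 2)   # heavier penalty for words not in R
```

Here the user missed two expected words and added one unexpected word, so weighting the misses more heavily produces the lower of the two asymmetric scores.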

Similarity Metrics for Numeric Vectors

Instead of having sets A and B, we now consider numeric vectors X and Y, which hold the frequency counts from our bags of words.

Tanimoto

The Tanimoto metric is the numeric vector generalization of the Jaccard index for Boolean vectors:

\[\frac{X \cdot Y}{\left|X\right|^2 + \left|Y\right|^2 - X \cdot Y}\]

Here, the dot represents the vector dot product, and \(\left|X\right|\) is the Euclidean length of X, so that \(\left|X\right|^2 = X \cdot X\).
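As a sketch, Tanimoto can be computed directly on bags of words stored as `collections.Counter` objects, without building explicit vectors. The helper names and the zero-vector convention are assumptions made for this example.

```python
from collections import Counter

def dot(x, y):
    """Dot product of two bags of words; words missing from y contribute 0."""
    return sum(count * y.get(word, 0) for word, count in x.items())

def tanimoto(x, y):
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    if denom == 0:
        return 1.0  # assumption: two empty documents count as identical
    return xy / denom

x = Counter("now come now".split())   # now: 2, come: 1
y = Counter("come now".split())       # come: 1, now: 1
score = tanimoto(x, y)                # 3 / (5 + 2 - 3) = 0.75

# With 0/1 counts the formula collapses to the Jaccard index.
binary = tanimoto(Counter(["now", "come"]), Counter(["now", "go"]))   # 1 / 3
```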

Cosine

The cosine similarity metric is similar in appearance to Tanimoto:

\[\frac{X \cdot Y}{\left|X\right| \left|Y\right|}\]

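The cosine has the appealing property that 0 means a 90 degree separation, or complete orthogonality, which for bags of words means the two documents have no words in common. Because word counts are never negative, the score stays in the range [0,1].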
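A comparable sketch for cosine similarity, again on `Counter` bags of words. Returning 0 when either document is empty is an assumption, since the formula is undefined for a zero-length vector.

```python
import math
from collections import Counter

def cosine(x, y):
    """Cosine similarity of two bags of words."""
    xy = sum(count * y.get(word, 0) for word, count in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return xy / (norm_x * norm_y)

short = Counter({"now": 2, "come": 1})
doubled = Counter({"now": 4, "come": 2})    # same proportions, twice as long
unrelated = Counter({"go": 3})

same_direction = cosine(short, doubled)     # approximately 1: only the angle matters
orthogonal = cosine(short, unrelated)       # 0: no shared words
```

Cosine depends only on the angle between the vectors, so a document and a copy repeated twice score as essentially identical. Tanimoto, by contrast, also reacts to the difference in length.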