Bot Brains: Word Mover’s Distance With a Twist

At its heart, the bot needs to be able to compare two messages (documents): the user’s input and the messages already present in a channel. A standard way to compare two documents is a bag-of-words (BoW) representation, typically combined with tf-idf weighting and cosine similarity. However, BoW does not capture the semantic properties of words, and problems arise when documents share related but not identical words (e.g. “press” and “media”).
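
To see the limitation concretely, here is a minimal sketch (using scikit-learn purely for illustration; the bot does not necessarily use it) in which two messages about the same topic share no words and therefore get a cosine similarity of exactly zero:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two messages about the same topic, but with no words in common.
docs = ["press conference", "media briefing"]

tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))  # [[0.]] -- BoW sees no overlap
```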

To address this, I used Word Mover’s Distance (WMD), a novel-ish similarity metric built on top of word embeddings. At a high level, word embeddings are high-dimensional vector representations of words that capture their semantic properties (i.e. distributional semantics). Words of similar meaning “live” close to one another in this high-dimensional space. I used pre-trained word embeddings from spaCy, trained on the Common Crawl corpus.
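
For instance, with a spaCy model that ships with word vectors (the exact model name below is an assumption; any of the medium or large English models works), semantically related words score much higher than unrelated ones:

```python
import spacy

# Any spaCy model with word vectors works; "en_core_web_lg" is an assumption.
nlp = spacy.load("en_core_web_lg")

press, media, banana = nlp("press media banana")
print(press.similarity(media))   # relatively high: related meanings
print(press.similarity(banana))  # much lower: unrelated meanings
```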

With word embeddings, a natural way to estimate how dissimilar (or distant) two documents are is to look at the distances between the word vectors of one document and those of the other and, roughly speaking, add those distances up. That is the main idea behind the Word Mover’s Distance approach, and neatly, it is an instance of the well-known Earth Mover’s Distance (EMD) optimization problem, only formulated in the word embedding space.

WMD = Earth Mover’s Distance for Document Similarity

The EMD assumes that one has two sets of vectors, call them the senders and the receivers, together with a matrix of their pair-wise distances. Additionally, each vector carries a weight (a non-negative number; in the normalized formulation the weights on each side sum to 1) that indicates how many “goods” a sender has to send or how much a receiver needs to receive. Given this formulation, the EMD can be posed as a transportation problem: given the distances (costs) between the sender-receiver pairs, determine the cheapest way to move the goods from the senders to the receivers, allowing for partial shipments (so that a sender can send a portion of its goods to one receiver and another portion to another). This is a non-trivial constrained optimization problem, but it has well-known solution methods, and the pyemd package provides a ready-made Python implementation.
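
As a toy illustration of pyemd’s interface (the numbers below are made up), two weight vectors and a square cost matrix are all it needs:

```python
import numpy as np
from pyemd import emd

# How much "goods" each sender position holds / each receiver position needs;
# both sides carry the same total mass (here, 1.0).
senders   = np.array([0.6, 0.4, 0.0], dtype=np.float64)
receivers = np.array([0.0, 0.5, 0.5], dtype=np.float64)

# Pair-wise cost of moving one unit of goods between positions i and j.
distance_matrix = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
], dtype=np.float64)

print(emd(senders, receivers, distance_matrix))  # cheapest total transport cost
```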

WMD is the application of the EMD problem in the context of word embeddings: the senders and receivers are the word embeddings of the words from the first and second documents being compared, respectively. The weight of each vector is proportional to the number of times the corresponding word appears in its document, and the distances between the vectors are the standard Euclidean distances in the word embedding space. In this way, the WMD between two documents can be easily calculated using the pyemd package.
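
Putting the two pieces together, a minimal WMD sketch might look as follows (my own illustration built from spaCy vectors and pyemd, not the bot’s exact code):

```python
import numpy as np
import spacy
from pyemd import emd

nlp = spacy.load("en_core_web_lg")  # the model name is an assumption

def wmd(text_a, text_b):
    """Word Mover's Distance between two texts: EMD over their word vectors."""
    tokens_a = [t for t in nlp(text_a) if t.has_vector and not t.is_punct]
    tokens_b = [t for t in nlp(text_b) if t.has_vector and not t.is_punct]

    # One EMD "position" per unique word across both documents.
    vectors = {}
    for t in tokens_a + tokens_b:
        vectors.setdefault(t.text.lower(), t.vector)
    vocab = sorted(vectors)
    index = {w: i for i, w in enumerate(vocab)}

    def weights(tokens):
        # Normalized bag-of-words: how much "mass" each word carries.
        hist = np.zeros(len(vocab), dtype=np.float64)
        for t in tokens:
            hist[index[t.text.lower()]] += 1.0
        return hist / hist.sum()

    # Pair-wise Euclidean distances between the word vectors.
    vecs = np.array([vectors[w] for w in vocab], dtype=np.float64)
    dist = np.sqrt(((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1))

    return emd(weights(tokens_a), weights(tokens_b), dist)

print(wmd("the press conference", "the media briefing"))
```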

O(p³ log(p)), Terrible Time Complexity

A practical obstacle in applying this method is that the EMD algorithm has terrible time complexity: O(p³ log(p)), where p is the number of unique words in the two documents. The naive approach would be to compare the user’s input to all of the previous messages in all the channels, calculate the average distance for each channel, and pick the channel with the smallest average distance as the prediction for where the user’s message belongs. If the user posts the message in the predicted channel, the bot does nothing; if not, the bot advises the user to consider posting it to the predicted (i.e., correct) channel. For Slack teams with a lot of messages spread over many channels, this approach is not feasible.
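
For concreteness, the naive procedure looks roughly like this (a sketch; the function and argument names are illustrative, not the bot’s actual data structures):

```python
import numpy as np

def predict_channel_naive(user_message, channel_messages, wmd):
    """Average the WMD against every past message, per channel.

    channel_messages: dict mapping a channel name to its list of past messages.
    wmd: a pairwise distance function, e.g. the sketch from the previous section.
    """
    average_distance = {
        channel: np.mean([wmd(user_message, message) for message in messages])
        for channel, messages in channel_messages.items()
    }
    # The predicted channel is the one with the smallest average distance.
    return min(average_distance, key=average_distance.get)
```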

Surely there are messages that are more “representative” of the channel than others. Comparing the input message to all the messages in a given channel seems excessive. It’s likely sufficient to compare the user input to those representative messages only. However, this approach would require expensive preprocessing, in which we essentially have to sort the channel messages using WMD as a key. Is it possible to construct a single message representative of an entire channel?

Slack Channel “Fingerprints”

Intuitively, we could achieve this representation by looking at the word distributions in a given channel. To a human, the 10 or so most frequently occurring words in a channel give a good sense of what that channel is about. A single message representative of that channel should therefore contain only those 10 (or so) words! This is where word embeddings are crucial: even if the user’s input belongs to a channel but does not contain any of the words from its representative message exactly, the WMD will still be rather small, thanks to the semantic similarity between the user’s word vectors and the word vectors in the representative message.
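
Building such a “fingerprint” boils down to counting words across a channel’s history; a sketch (the function name, top_n, and the stop-word filtering are my own choices, not necessarily the bot’s):

```python
from collections import Counter

def channel_fingerprint(channel_history, nlp, top_n=10):
    """The top-N most frequent content words in a channel, with their counts."""
    counts = Counter(
        token.text.lower()
        for message in channel_history
        for token in nlp(message)
        if token.has_vector and not token.is_stop and not token.is_punct
    )
    return dict(counts.most_common(top_n))
```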

To use the representative message in EMD / WMD, I needed to choose the weights of the vectors representing its words. Since the weights in standard WMD are directly proportional to how many times a given word appears in a message, the weights in my representative message can be proportional to the number of times a given word appears in the entire channel (and then normalized). Once I had constructed a representative message for each channel, all I needed to do was calculate the WMD between the user’s input message and each of the representative messages, find the shortest one, and predict the corresponding channel as the one the input message was supposed to go to. But are the top 10 words enough to form a representative message? How about 30? I found the optimal number of top words by treating it as a hyperparameter and tuning it on a validation set. It turned out to be 180 (see below).
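
A sketch of the whole prediction step, under the same assumptions as before (illustrative names, spaCy vectors, pyemd for the EMD itself):

```python
import numpy as np
from pyemd import emd

def predict_channel(user_message, fingerprints, nlp):
    """Pick the channel whose fingerprint is closest, in WMD, to the input.

    fingerprints: dict mapping channel name -> {word: channel-wide count},
    e.g. built with channel_fingerprint() above.
    """
    tokens = [t for t in nlp(user_message) if t.has_vector and not t.is_punct]
    distances = {}
    for channel, fingerprint in fingerprints.items():
        # Joint vocabulary: words from the input plus the fingerprint words.
        vocab = sorted({t.text.lower() for t in tokens} | set(fingerprint))
        index = {w: i for i, w in enumerate(vocab)}
        vecs = np.array([nlp.vocab[w].vector for w in vocab], dtype=np.float64)

        # Input weights: proportional to word counts in the message, normalized.
        first = np.zeros(len(vocab), dtype=np.float64)
        for t in tokens:
            first[index[t.text.lower()]] += 1.0
        first /= first.sum()

        # Fingerprint weights: proportional to channel-wide counts, normalized.
        second = np.zeros(len(vocab), dtype=np.float64)
        for word, count in fingerprint.items():
            second[index[word]] = count
        second /= second.sum()

        # Pair-wise Euclidean distances between the word vectors.
        dist = np.sqrt(((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1))
        distances[channel] = emd(first, second, dist)

    # Predict the channel with the smallest WMD to its fingerprint.
    return min(distances, key=distances.get)
```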