The automated anonymization of documents is an important requirement for many companies and industries. One canonical example is the medical industry, where the privacy of patient data is taken very seriously. However, automatically anonymizing text documents is a difficult task and an active area of research. Beyond simply recognizing entities, variation in the types of documents (e.g., financial or medical) and in the kinds of identifying information they contain poses a challenge for automated systems.

One common approach is to develop domain-specific anonymization tools, where knowledge about the structure and information content of the documents can be used to construct high-quality anonymization models. During my time at Insight I took on an ambitious project: developing an algorithm to anonymize general documents for a procurement company called Bonfire. En route to a solution, I explored an approach that incorporates word vector embeddings. With some hints from statistical physics, I was able to develop a new and improved embedding model that resulted in high-performance anonymization.

Anonymizing text documents for Bonfire

As an Insight Data Science Fellow I decided to work on a challenging problem for Bonfire, a company based in Ontario. Bonfire's mission is to shape the future of public purchasing by building a platform that helps organizations (universities, hospitals, etc.) set up their procurement processes. The traditional procurement procedure is often tedious and intimidating, especially during the initial steps of preparing numerous documents (jargon: Requests For Proposals, or RFPs). One of Bonfire's solutions for streamlining this process is to unlock document sharing for similar purchases. However, this requires user information (anything that identifies the user) to be removed. This is where I came in: building a system to anonymize approximately 200,000 documents produced by a few hundred users.

Accomplishing this task is by no means as simple as search-and-replace. When composing documents, users tend to refer to themselves in ways that cannot easily be captured by a pattern. For example, there are often descriptions of building names, street names, or even special or unique events. As an example, consider the following document:

As a starting point, it seems like standard named entity recognition (with packages such as Stanford NER, NLTK, or spaCy) would be suitable for finding the words and tokens we want. However, when testing this against hand-labeled examples I found a very low success rate on the FAQ-style documents that Bonfire has, perhaps due to the unnatural flow of the sentences.
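For reference, a baseline of this kind is only a few lines with spaCy. The entity types kept and the sample sentence below are placeholder choices for illustration, not Bonfire's data:

```python
import spacy

# Baseline NER pass: flag standard entity types as candidates for removal.
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def ner_candidates(text):
    """Return (text, label) pairs for entity types likely to identify a user."""
    keep = {"PERSON", "ORG", "GPE", "LOC", "FAC"}
    return [(ent.text, ent.label_) for ent in nlp(text).ents if ent.label_ in keep]

print(ner_candidates("Deliver sealed proposals to Kerr Hall, Utopia University."))
```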

Stepping back, the problem is essentially the reverse of text classification. In other words, how much information do we have to remove before the text can no longer be classified as belonging to a particular user? It boils down to how we define private information. Here is how “privacy” is defined on Wikipedia:

Privacy is the ability of an individual or group to seclude themselves, …

Following this definition, what I am after is the information that exclusively belongs to a particular user, but not to others. Moreover, if there are named entities in a document that are common across all documents, my algorithm should not (necessarily) pick them out. This means I will need to find a way to segment the information within a document according to its closeness to a particular user.

Word vectors and context binding

In order to tackle Bonfire's anonymization problem, we need a way of modeling the information content surrounding a particular word. This closeness of information can be quantified through word embedding models such as word2vec (here specifically the continuous skip-gram model). The algorithm trains a vector representation of the vocabulary by binding each word's vector to its continuous context (its neighboring words). For any word w and a context (or neighbor) word c, this is achieved by minimizing a negative log-likelihood which, in the standard negative-sampling formulation, reads
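$$\ell(w, c) \;=\; -\log \sigma\big(v(w)\cdot v(c)\big) \;-\; \sum_{i=1}^{N_c} \log \sigma\big(-v(w)\cdot v(n_i)\big), \tag{1}$$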

where σ is the logistic sigmoid and v is a mapping from the vocabulary to a vector space, typically encoded by a giant matrix whose dimensions are the vocabulary size (the number of unique words) by the embedding dimension. In many cases two different matrices, an input weight matrix and an output weight matrix, are used for the mapping from words to vectors and from vectors back to words.

Let's break the negative log-likelihood down. The first term favors mappings v which make word-context pairs ⟨w,c⟩ more likely. To ensure good embeddings, we also want to penalize mappings which create an association with out-of-context words. The second term does this by disfavoring similarity between the word and a number N_c of randomly sampled out-of-context words n. In many ways, this negative sampling can be viewed as regularizing the word embedding.

The context-binding properties of word2vec provide a way to separate out common words. This can be visualized in the cartoon below. Intuitively, when a common word shows up in multiple contexts it will also appear frequently in the negative sampling term of the model, causing common words to pull apart from the context that is being embedded. As a result, words that are private (non-common) will be bound closer together in vector space while common words are pulled farther away.

Illustration of the effects of word2vec embeddings with negative sampling. Since common words appear in multiple contexts, they are moved further away from a given word when constructing the embedding. What is left are more unique and private words which are relevant to a given document.

When investigating how neighboring words interact in word embedding models, I realized these models have an interesting connection with physics. To start, we can look at an alternative form of Eq. (1). If we ignore the negative sampling term for a moment, we will arrive at
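$$E \;=\; -\sum_{\langle w,\, c \rangle} \log \sigma\big(v(w)\cdot v(c)\big), \tag{2}$$

a sum over all neighboring word-context pairs that plays the role of a total "energy" of the embedding. Since −log σ(x) decreases monotonically in the dot product, minimizing this energy simply favors alignment between the vectors of neighboring words.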

Restricting v to be unit vectors, this is nothing but the (well-known) one-dimensional O(n) model in statistical physics. (Note: strictly speaking the analogy is exact only when the input weight matrix and the output weight matrix are identical.)

The O(n) model, or n-vector model, is a simplified physics model that explains how macroscopic magnetization emerges from a lattice of constituent tiny magnets (or spins) in a material. A strong magnetic field is formed if all the tiny magnets are oriented in the same direction, while little or no magnetism is present if they are randomly oriented. Because they interact with each other, every tiny magnet simultaneously affects (and is affected by) its neighbors. In the physical world, such systems will settle around the configuration where the total energy (the analog of Eq. (2)) is at a minimum.

Just as in the O(n) model, we are trying to find a configuration of all the word vectors such that the total "energy" (the negative log-likelihood) is at a minimum.

Building an initial solution using FastText

My first solution for Bonfire was based on FastText, an open source library developed by Facebook AI Research. FastText is an improvement on the skip-gram model we just discussed, adding an important ingredient that is highly relevant to our interest here: sub-word information. Instead of representing text only as whole words, FastText also considers character n-grams. This extra capacity reduces the out-of-vocabulary scenarios that word2vec might encounter, since FastText can also bind together words that are morphologically similar (such as typos or lemmas). This capability allows the model to achieve some of the functionality of regular expressions, which is crucial for Bonfire's anonymization task.

To construct my initial anonymization model, I turned the entire set of documents into one continuous word array and fed it to FastText to learn the word vector representations. Once computed, the word vectors allow us to directly compare and associate words with each other by simply computing the cosine similarity between them. Returning to our example, a term like "Utopia University" has a high similarity score with "UtopiaU" and "Kerr Hall". While simple, this turned out to work rather well: on a test data set it yielded a low false positive rate (3–4%) and identified 75–80% of the sensitive words.
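The actual solution used Facebook's fastText library directly; the sketch below reproduces the same idea with gensim's FastText implementation on a toy corpus (the sentences and hyperparameters are placeholders, not Bonfire's data or settings):

```python
from gensim.models import FastText

# Toy stand-in for the real corpus: tokenized sentences
# (the real pipeline used one continuous word array built from all documents).
sentences = [
    ["sealed", "proposals", "must", "be", "delivered", "to", "kerr", "hall"],
    ["utopia", "university", "also", "known", "as", "utopiau"],
]

# Skip-gram (sg=1) with character n-grams: the ingredients FastText adds to word2vec.
model = FastText(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, min_n=3, max_n=6, epochs=50)

# Cosine similarity between a known identifying term and any other token,
# including morphological variants that plain word2vec would miss.
print(model.wv.similarity("utopia", "utopiau"))
print(model.wv.most_similar("utopia", topn=5))
```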

Going Further: Building PolarizedText

Looking carefully at the above approach, there is one clear shortcoming: "Utopia U" may appear in similar contexts (and similar RFPs) as "Dystopia U". Due to these overlapping or similar contexts, word vectors that belong to different users may end up close to one another. Ultimately this means the model will be less able to find the pertinent entries for a given user or entity.

A workaround can be found in our analogy with the O(n) model. Compared with the full physical model, Eq. (2) leaves out one term of the O(n) energy. In the presence of an external magnetic field μ, the total energy should read
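$$E \;=\; -J\sum_{\langle i,\, j \rangle} \mathbf{s}_i \cdot \mathbf{s}_j \;-\; \boldsymbol{\mu} \cdot \sum_i \mathbf{s}_i, \tag{3}$$

where the s_i are the unit spin vectors and J > 0 is the coupling between neighboring spins.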

The physical implication here is straightforward: every tiny magnet is influenced by its neighbors while living under the shadow of the external field. On average, this external field makes the magnets more aligned.

We can treat the words in a labeled document as living under the shadow of an "external field", namely the label. In analogy with the O(n) model, the negative log-likelihood of the skip-gram model is missing this "external field" component. Inspired by this, I added a term to the skip-gram objective of the form
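$$-\sum_{w \,\in\, d} \log \sigma\big(v(w)\cdot v(\ell_d)\big),$$

where ℓ_d is the label attached to document d and the sum runs over the words w of that document (written here schematically).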

I call this the polarization term. A summary of the analogy is given in the following table:
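spins (unit vectors) ↔ word vectors v
coupling between neighboring spins ↔ binding between a word and its context words
external magnetic field μ ↔ document label
total energy ↔ negative log-likelihood
minimum-energy configuration ↔ trained embedding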

To integrate the polarization term into my solution for Bonfire, I modified the FastText source code and made it available in a GitHub repository called PolarizedText. PolarizedText puts each document label into its own unique dimension and applies the "external field" (the label) to shadow the text embedding process.

Incorporating this should help us find more probable configurations of the word vector representations, just as the physics suggests, because it makes use of label information that is important to our task.
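The actual change lives in the modified C++ source, but the idea can be sketched in a few lines of numpy. The function below is a toy illustration of where the label enters, not the PolarizedText implementation itself: during each skip-gram update, the document's label is treated as one extra always-positive target for the word, with a weight playing the role of the field strength.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def polarized_step(vecs, word, context, label, negatives, lr=0.05, field=1.0):
    """One SGD step for a (word, context, label) triple.

    vecs: dict mapping word / label ids to embedding vectors (updated in place).
    The first targets implement Eq. (1); the last one is the polarization term,
    where the document label acts as an "external field" pulling the word toward it.
    """
    targets = [(context, 1.0, 1.0)]                    # positive context word
    targets += [(n, 0.0, 1.0) for n in negatives]      # negative samples
    targets += [(label, 1.0, field)]                   # polarization ("external field")

    grad_w = np.zeros_like(vecs[word])
    for t, y, weight in targets:
        score = sigmoid(np.dot(vecs[word], vecs[t]))
        g = lr * weight * (y - score)       # gradient of the weighted log-sigmoid loss
        grad_w += g * vecs[t]
        vecs[t] = vecs[t] + g * vecs[word]  # update context / negative / label vector
    vecs[word] = vecs[word] + grad_w

# Toy usage: three tokens plus one label, all living in the same 10-dimensional space.
rng = np.random.default_rng(0)
vecs = {tok: rng.normal(size=10) * 0.1 for tok in ["utopia", "hall", "the", "__label__doc42"]}
polarized_step(vecs, "utopia", "hall", "__label__doc42", negatives=["the"])
```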

Looking at the Results

To validate the algorithm I hand labeled about 300 documents, isolating the identifying words Bonfire would like to remove. As a quick summary, the ROC curves of the various methods are compared below:

The above curves are obtained by varying the removal threshold (on word vector cosine similarity), so that curves closer to the top left perform better. "fasttext(supervised)" is a simple approach which uses the anonymization labels but ignores word context when forming embeddings. In contrast, "fasttext(skipgram)" does not use the labeled data but enforces context binding (as discussed above). In a sense, "polarizedtext" synthesizes the two approaches, and the plot indicates that it is beneficial to do so: the anonymizer based on PolarizedText outperforms the one based on FastText. These experiments indicate that the polarization step helps to construct good word embeddings, just as the O(n) model suggests.
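Once every token has a similarity score, tracing out such a curve from the hand labels takes only a few lines with scikit-learn. The scores below are placeholder values, and scoring each token by its similarity to the nearest known identifying term is just one reasonable choice of score:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# scores: per-token cosine similarity to the nearest known identifying term
#         (placeholder values); labels: 1 if the hand labels mark the token
#         as identifying, 0 otherwise.
scores = np.array([0.91, 0.12, 0.77, 0.05, 0.64, 0.33])
labels = np.array([1, 0, 1, 0, 1, 0])

# Sweeping the removal threshold over the scores traces out the ROC curve.
fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC:", auc(fpr, tpr))

# Operating point: the most sensitive threshold that keeps the
# false positive rate at or below ~4%.
ok = fpr <= 0.04
print("TPR:", tpr[ok].max(), "threshold:", thresholds[ok][tpr[ok].argmax()])
```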

Closing Thoughts

It was a great pleasure to work on Bonfire's anonymization problem. Going in, it was unclear how challenging the project would be or what the data would look like. It was a welcome challenge, and it was quite surprising and exciting that word embedding models could actually be improved using knowledge from well-known, well-studied physical models. On Bonfire's end, they were more than happy with the result, as it unlocks new applications for them and will be in production shortly.