An anomaly detector for writing style

“I decided to develop a system that could take a set of documents of known authorship and a query document of unknown authorship and decide whether or not they were written by the same person.”

Style is fundamental to how we perceive and interpret writing. A writer’s style can turn a boring subject into a work of art, or turn a fascinating subject into a mind-numbing slog. Moreover, writing style is a distinctive characteristic unique to every individual, and this distinctiveness has allowed people to unmask famous authors writing pseudonymously (1) and even determine authorship long after the writers are gone (2).

For my Insight project, I wanted to try to quantify the somewhat nebulous concept of writing style. In order to focus the problem, I decided to develop a system that could take a set of documents of known authorship and a query document of unknown authorship and decide whether or not they were written by the same person. Importantly, the system should be usable even if the author in question has never been seen before, relying only on the input data to determine whether the query is an outlier relative to the known samples. One useful application of such a tool would be for teachers trying to determine if a student is the true author of a term paper. Conversely, someone trying to write anonymously could use it to ensure that their writing is not easily identifiable.

In the course of developing this system, I drew upon the existing literature on feature engineering in natural language processing generally and stylometry in particular. The resulting high-dimensional dataset also posed challenges for my particular anomaly detection problem, which necessitated some tricks to reduce the dimensionality to a manageable size. After all was said and done, I managed to develop a system that performed pretty well on the task.

Engineering features to represent stylistic markers

I decided to use Project Gutenberg as the data source to set up my stylistic anomaly detector. The data were easily accessible and spanned a range of different styles and topics. I downloaded all the books written by the ten authors with the highest number of English language books on the site, for a total of 1,091 books. I then divided the books at sentence boundaries into chunks of roughly 1,000 characters each, yielding around 300,000 documents which were stored in a PostgreSQL database.
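The chunking step can be sketched as follows. This is a simplification of the actual pipeline: the regex here is a crude stand-in for a proper sentence tokenizer, and `chunk_text` is an illustrative name, not the function I actually used.

```python
import re

def chunk_text(text, target_size=1000):
    """Split text at sentence boundaries into chunks of roughly
    target_size characters each."""
    # Naive sentence splitter; a real pipeline would use a trained
    # sentence tokenizer instead of a regex.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ''
    for sentence in sentences:
        # Close out the current chunk once adding the next sentence
        # would push it past the target size.
        if current and len(current) + len(sentence) >= target_size:
            chunks.append(current.strip())
            current = ''
        current += sentence + ' '
    if current.strip():
        chunks.append(current.strip())
    return chunks

book = "First sentence. " * 200
chunks = chunk_text(book)
```

Because the split happens only at sentence boundaries, chunks come out slightly under or over the target rather than at exactly 1,000 characters.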

Once the documents were ready to go, I set out to extract features of writing style. I wanted to explicitly avoid features that were related to subject matter; the goal was to be able to identify authors regardless of what they happened to be writing about in any particular document. The distinction between style and content features is a bit fuzzy, but we can at least try to emphasize one over the other.

One thing that immediately jumps out as a stylistic marker is the author’s vocabulary. Is their writing full of ten-dollar words, or do they prefer simpler language? How many different words do they use in a given 1,000-character chunk? To that end, I calculated the average word length and the number of unique words per chunk.
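The vocabulary features amount to a couple of lines per chunk. A minimal sketch, using whitespace tokenization for illustration (the real pipeline used a proper tokenizer):

```python
def lexical_features(chunk):
    """Average word length and unique-word count for one chunk."""
    words = chunk.lower().split()  # crude tokenization for illustration
    avg_word_len = sum(len(w) for w in words) / len(words)
    n_unique = len(set(words))
    return avg_word_len, n_unique

avg_len, uniq = lexical_features("the quick brown fox jumps over the lazy dog")
```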

Another intuitive feature of style is sentence complexity, which can be roughly captured by the average sentence length. I also considered deriving a more refined metric of sentence complexity from the dependency-parse tree of each sentence. After talking this over with a colleague, however, I realized that what I was really after was a measure of how many dependent clauses each sentence had, which can be approximated by counting the number of verbs per sentence. I therefore added the mean and standard deviation (s.d.) of verb counts per sentence to the model. For good measure, I also included the mean and s.d. of noun, adjective, and adverb counts.
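The per-sentence statistics can be computed as below. This is only a sketch: the hard-coded `TOY_VERBS` set is a placeholder for a real part-of-speech tagger, which is what actually identified verbs, nouns, adjectives, and adverbs in the pipeline.

```python
import re
import statistics

# Placeholder for a real POS tagger: a tiny hard-coded verb list,
# purely for illustration.
TOY_VERBS = {"is", "was", "ran", "said", "walked", "thought"}

def verb_stats(chunk):
    """Mean and s.d. of verb counts per sentence in a chunk."""
    sentences = re.split(r'(?<=[.!?])\s+', chunk.strip())
    counts = [sum(w.strip('.,!?').lower() in TOY_VERBS for w in s.split())
              for s in sentences if s]
    # Population s.d., since we treat the chunk as the whole population.
    return statistics.mean(counts), statistics.pstdev(counts)

mean_v, sd_v = verb_stats("He ran. She walked and thought. It is fine.")
```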

Finally, I calculated single word counts, character counts, and character 3-gram and 5-gram counts for each document chunk. For the final model, I selected the most common counts of each type across all documents (top 200, 100, 100, and 200 respectively). This was done in order to limit the size of the dataset in the case of character n-grams, and to preferentially select generic words rather than topic-specific words in the case of word counts. These generic words, such as articles, conjunctions, and pronouns, are not specific to particular topics but the frequency of their usage is a useful distinguishing characteristic of particular authors (3).
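The selection step can be sketched in pure Python for the character 3-gram case; the same pattern extends to words and 5-grams. The function names here are illustrative, not from the actual codebase.

```python
from collections import Counter

def top_ngrams(docs, n=3, top_k=100):
    """Select the top_k most common character n-grams across all docs."""
    counts = Counter()
    for doc in docs:
        counts.update(doc[i:i + n] for i in range(len(doc) - n + 1))
    return [gram for gram, _ in counts.most_common(top_k)]

def ngram_features(doc, vocab, n=3):
    """Count occurrences of each vocabulary n-gram in one document."""
    counts = Counter(doc[i:i + n] for i in range(len(doc) - n + 1))
    return [counts[gram] for gram in vocab]

docs = ["the cat sat", "the dog sat"]
vocab = top_ngrams(docs, n=3, top_k=5)
vec = ngram_features("the cat", vocab, n=3)
```

Restricting to the most common grams keeps the feature vector small and, for words, biases the vocabulary toward high-frequency function words rather than topic words.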

With this set of features, each document is converted from a 1,000-character string into a 612-dimensional vector in feature space. Before proceeding further, I verified that these features are actually useful for identifying authors by running a simple classification task. Using scikit-learn, I generated an 80/20 train/test split of the data and fit a logistic regression model. I was pleasantly surprised to see that the classifier did quite well out of the box using just these features.
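The sanity-check classification takes only a few lines with scikit-learn. Here synthetic Gaussian blobs stand in for the real 612-dimensional feature matrix, since the point is just the train/split/fit/score pattern:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: two "authors" with shifted feature distributions.
X = np.vstack([rng.normal(0, 1, (200, 10)), rng.normal(1, 1, (200, 10))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)  # held-out accuracy
```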

Confusion matrix of logistic classifier for authorship. Columns are classifier predictions and rows are actual authors. Diagonal entries show how many times the classifier got it right for each author.

While the results of this model are reassuring, my ultimate goal for the project was not to make a simple classifier. Rather, I wanted a system that can determine authorship even if the author is not in the training dataset. To do this, I needed to recast the problem.

Reframing the problem from classification to anomaly detection

“I decided to compute distances in feature space between each sample document and the query document and use the average distance to determine authorship.”

Instead of classifying documents as belonging to one of the authors in the training set, I wanted to be able to take a writing sample assumed to be written by a certain author of interest and ask whether or not a new query document was written by the same person. In other words, I would like to be able to tell when the new document is conspicuously different from the rest of the writing samples. The problem is essentially one of detecting anomalous data points, but since the data don’t follow any simple parametric distribution, it isn’t possible to simply calculate the probability of the query document given the samples.

Instead, I decided to compute distances in feature space between each sample document and the query document and use the average distance to determine authorship (4). Using the documents downloaded from Project Gutenberg, I drew a set of ten writing samples from various authors along with a query document from the same or a different author. These generated examples were used to find the optimal threshold for the distance between sample and query.
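The core of the detector is simple: average the distances from the query to the samples and compare against a tuned threshold. A minimal sketch (function names are illustrative; the threshold would come from the labeled examples described above):

```python
import numpy as np

def mean_distance(samples, query):
    """Average Euclidean distance from the query to each sample vector."""
    return float(np.mean(np.linalg.norm(samples - query, axis=1)))

def same_author(samples, query, threshold):
    """Flag the query as same-author when its mean distance to the
    samples falls below the tuned threshold."""
    return mean_distance(samples, query) < threshold

rng = np.random.default_rng(1)
samples = rng.normal(0, 1, (10, 9))  # ten writing samples in feature space
near = rng.normal(0, 1, 9)           # drawn from the same distribution
far = rng.normal(5, 1, 9)            # conspicuously different
```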

I validated the procedure on a separate set of examples generated from the Gutenberg dataset. This task is much more difficult owing to the small size of the input data, and performance was accordingly much poorer than on the author classification task. The ROC and precision-recall curves are shown below. The point marked with a circle shows where the optimal accuracy threshold fell on the curve, with the best accuracy on this task being 56.1%.

Initial results for the stylistic anomaly detection task

The curse of dimensionality necessitates dimension reduction

While the anomaly detector does slightly better than chance, there’s clearly quite a bit of room for improvement. As it stands, a positive or negative conclusion from this system doesn’t really say much, since it gets it wrong almost as often as it gets it right. How can the system be improved?

The problem that we’re running into is the dreaded curse of dimensionality. In this case, the particular manifestation of the curse is that distances in high-dimensional space become less meaningful. This is illustrated in the figure below. Suppose we uniformly sample points in a unit sphere. As the dimension increases, the ratio of distances from any point to its nearest and furthest point approaches one (5). In other words, it becomes almost impossible to distinguish between nearby and faraway points.

Distances between near and far points become indistinguishable in high dimensions
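The concentration effect is easy to reproduce numerically. As a sketch, this samples points uniformly in a cube (a stand-in for the sphere in the figure) and compares the nearest and farthest distances from the origin:

```python
import numpy as np

def near_far_ratio(dim, n_points=1000, seed=0):
    """Ratio of nearest to farthest distance from the origin for points
    sampled uniformly in a dim-dimensional cube."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1, 1, (n_points, dim))
    dists = np.linalg.norm(pts, axis=1)
    return dists.min() / dists.max()

low = near_far_ratio(2)     # low dimension: ratio near zero
high = near_far_ratio(500)  # high dimension: ratio approaches one
```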

In order to get around this, we can avoid using L2 (Euclidean) distances and instead use the L1 or cosine distance, which are affected by this issue to a lesser extent. However, this approach turns out to provide only a marginal improvement. As an alternative approach, I used linear discriminant analysis to reduce the dimensionality of the data while preserving information about authorship.

Dimensionality reduction with linear discriminant analysis

There are several ways we could reduce the dimensionality of our dataset. A natural thing to try first is principal components analysis (PCA), which finds a transformation such that data is aligned to the orthogonal axes that coincide with the directions of maximal variance in the data. The first principal component maximizes the variance of the full data, and subsequent components maximize variance in the orthogonal complement of all previous components. When we apply this dimensionality reduction technique, we indeed get a reasonable boost in performance of our anomaly detector (see figure below).
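With scikit-learn, the PCA reduction is a two-liner. A sketch on synthetic data with one deliberately high-variance direction, standing in for the real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (300, 50))  # stand-in for the 612-dim feature matrix
X[:, 0] *= 10                    # one direction with much larger variance

# Project down to the top nine principal components.
pca = PCA(n_components=9).fit(X)
X_reduced = pca.transform(X)
```

The first component should capture most of the variance here, since we inflated one direction by hand.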

PCA is an unsupervised technique; it does not require any additional information about the data other than the data itself. This is often convenient, but in this case, we can actually do much better by choosing our reduced dimensions by taking into account the information about author identity that we already have in our dataset. Instead of selecting dimensions that maximize variance in the dataset as a whole, we can instead choose dimensions that provide the most benefit in performing the task at hand: identifying which author wrote a document.

I used linear discriminant analysis (LDA) to implement this idea. Linear discriminant analysis classifies points by first fitting a Gaussian distribution with a common covariance matrix Σ to each class, then finding the closest centroid to each point after sphering the data with respect to Σ. If we have k classes, the centroids lie in a k − 1 dimensional subspace of the feature space, and so we can just as well find the closest centroid to a projection of the data into this subspace. We can then perform a singular-value decomposition on these class centroids to obtain a transformation of feature space that maximizes the between-class variance (6). In our case, there are ten authors, so we can reduce the data down to nine dimensions. We thus find the nine dimensions of feature space that are most useful for authorship identification.
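The supervised reduction looks much like the PCA case, except that the author labels are passed to `fit`. A sketch on synthetic data, with ten Gaussian "authors" standing in for the Gutenberg features:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_authors, docs_per_author, dim = 10, 50, 40
# Synthetic stand-in: each "author" is a Gaussian blob with its own mean.
means = rng.normal(0, 3, (n_authors, dim))
X = np.vstack([rng.normal(m, 1, (docs_per_author, dim)) for m in means])
y = np.repeat(np.arange(n_authors), docs_per_author)

# With k = 10 classes, LDA projects into at most k - 1 = 9 dimensions.
lda = LinearDiscriminantAnalysis(n_components=9).fit(X, y)
X_reduced = lda.transform(X)
```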

After applying this transformation, we get a big boost in the performance of our anomaly detection system. Our best accuracy jumps up to 89.3% and the area under the ROC curve goes up to 0.954. For comparison, the best accuracy using the first nine principal components is 61.2% and the AUC is 0.662.

Performance before and after dimensionality reduction

One disadvantage of this procedure is that we have introduced a soft dependence on the training set. The dimensions that maximize the variance between the ten authors in our dataset might not be the same dimensions that are useful for another set of writing. The hope is that most human authors lie within or close to the subspace found by the Gutenberg dataset, but for best results, we might need to retrain an LDA classifier for different data. On the other hand, the boost in performance we get is significant enough that it’s probably worth the inconvenience.

Conclusion and possible extensions

To summarize, I was able to build an anomaly detection system that can identify with reasonable accuracy when a document is written in a style different from a sample of an author’s writing. We provide the system with a set of documents that we assume were written by an author and a test document that we wish to assess, and the algorithm returns a guess as to whether or not the sample and test were written by the same person.

The distinction between sample documents and query document is not strictly necessary. We could also feed in a sample of unlabeled documents and ask the system to find which documents are most dissimilar to the others, computing distances pairwise between all the documents. Another interesting application could be identifying which author wrote which paragraph in a collaborative writing scenario.
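The unlabeled variant described above can be sketched as a pairwise-distance computation, flagging the document whose mean distance to the others is largest (`most_dissimilar` is an illustrative name):

```python
import numpy as np

def most_dissimilar(docs):
    """Index of the document with the largest mean distance to the rest."""
    diffs = docs[:, None, :] - docs[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)        # pairwise distance matrix
    mean_dists = dists.sum(axis=1) / (len(docs) - 1)
    return int(np.argmax(mean_dists))

rng = np.random.default_rng(2)
docs = rng.normal(0, 1, (8, 9))  # eight documents in feature space
docs[3] += 6                     # plant an obvious outlier
```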

More generally, we could apply a similar distance-comparison approach to detecting anomalous observations in other domains. This approach would be particularly useful when other anomaly detection techniques are not applicable, such as when the data do not follow a standard distribution or when the data is very high-dimensional.