Baseline

One of the oldest techniques in stylometry dates back to the 1800s, when authors were compared by the frequencies at which they used words of different lengths. Some writers favor short two- and three-letter words, while others pull from a larger vocabulary. Comparing these word-length frequency distributions is known as Mendenhall's Characteristic Curves of Composition (MCCC). While fairly crude, the technique can still identify authorship with reasonable accuracy: the author of an anonymous text is taken to be the candidate whose curve of composition yields the lowest RMSE against the text's curve. MCCC was first tested on a small group of 7 randomly chosen Redditors after each of the 7 users' comment histories was split in half, forming a pseudo-user in Subset 1 and the original user in Subset 2.
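A minimal sketch of the approach; the tokenizer and the 15-letter cap here are my own assumptions, not details from the original experiment:

```python
import re
import numpy as np

def mendenhall_curve(text, max_len=15):
    """Relative frequency of word lengths 1..max_len (longer words are capped)."""
    lengths = [min(len(w), max_len) for w in re.findall(r"[a-zA-Z']+", text)]
    counts = np.bincount(lengths, minlength=max_len + 1)[1:].astype(float)
    return counts / counts.sum()

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def best_match(anon_text, candidates):
    """candidates: dict mapping username -> concatenated comment history."""
    anon = mendenhall_curve(anon_text)
    return min(candidates,
               key=lambda user: rmse(anon, mendenhall_curve(candidates[user])))
```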

By iterating through each pseudo-user in Subset 1 and comparing it to each user in Subset 2, MCCC correctly identified 5 of the 7 users. While promising, the technique is too simplistic to use at scale.

Modeling Burrows' Delta

Many NLP applications, such as topic modeling, typically use Term Frequency-Inverse Document Frequency (TF-IDF) to identify rare keywords that help describe a text. In stylometry, the most important words are actually function words: common words such as “about”, “because”, “upon”, and “while”. Authors write about a broad range of topics, so the content vocabulary they use varies heavily. Function words, however, show up in every text, and for a given author their frequency of use tends to stay fairly consistent across documents.

For my analysis, 150 of the most commonly used function words were used to identify user writing styles via the Delta method. The frequency of each function word was recorded and then standardized by subtracting the mean and dividing by the standard deviation, so that each feature value is a z-score. The result is a 150-dimensional vector that is positive in a feature dimension where the author uses that word more frequently than the average user, and negative where they use it less than average. A pseudo-user's vector can then be compared to each user's vector by cosine similarity.
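In code, the method looks roughly like this; the word list and comment texts below are toy stand-ins for the 150 function words and the real comment histories:

```python
import numpy as np

# Toy stand-ins: the real model used 150 common function words.
FUNCTION_WORDS = ["about", "because", "upon", "while", "the", "and"]

def freqs(text):
    """Relative frequency of each function word per token of text."""
    tokens = text.lower().split()
    counts = np.array([tokens.count(w) for w in FUNCTION_WORDS], dtype=float)
    return counts / max(len(tokens), 1)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

users = {
    "user_a": "while i was about town because of the rain it poured",
    "user_b": "upon reflection the plan was fine and it worked out",
}
pseudo_text = "because the rain kept on while we were about town"

names = list(users)
X = np.array([freqs(users[n]) for n in names])
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12

Z = (X - mu) / sigma                      # each user's z-score profile
pseudo_z = (freqs(pseudo_text) - mu) / sigma

best = max(range(len(names)), key=lambda i: cosine(pseudo_z, Z[i]))
print("best match:", names[best])
```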

Results improved considerably with this method: the 7 users analyzed previously were now matched back to their correct accounts with 100% accuracy. Identifying users out of a random group of 40 (after filtering out those with fewer than 200 comments, too small a history to capture writing tendencies) returned 95% accuracy.

[Figure: Cosines of pseudo-users to their original matching accounts vs. non-matches]

In addition to lexical analysis, we can also distinguish unique writing styles by syntax. Using nltk's part-of-speech (POS) tagger and skip-grams, I found the 100 most commonly used POS skip-gram sequences and vectorized each user's frequency of use for them, just as with the function words. On its own, this model returned 90% accuracy for the same 40 users; ensembling the two techniques together achieved 100% accuracy.
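A sketch of how these syntactic features can be built; the n and k skip-gram settings here are illustrative guesses, since the post does not specify them:

```python
from collections import Counter

import nltk
from nltk.util import skipgrams

# Requires the "punkt" and "averaged_perceptron_tagger" nltk data packages.

def pos_skipgram_counts(text, n=2, k=2):
    """Count skip-grams over a comment's POS tag sequence
    (n-grams that allow up to k skipped tokens)."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return Counter(skipgrams(tags, n, k))

user_texts = [
    "I went to the store because it was raining.",
    "Honestly, the movie was better than I expected it to be.",
]

# Vocabulary: the most common POS skip-gram sequences across all users
# (the real model kept the top 100).
totals = Counter()
for text in user_texts:
    totals += pos_skipgram_counts(text)
top_sequences = [seq for seq, _ in totals.most_common(100)]

def skipgram_vector(text):
    """A user's frequency of use for each vocabulary sequence."""
    counts = pos_skipgram_counts(text)
    total = max(sum(counts.values()), 1)
    return [counts[seq] / total for seq in top_sequences]
```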

Scaling Up

As random users were added, the model continued to predict authorship perfectly for pools of up to 100 users. Past that point, accuracy slowly deteriorated, falling to 92.2% when identifying a user out of a pool of 3,000.

The drop in accuracy is mostly attributable to a lack of data. Users with over 1,000 comments were still being predicted with 100% accuracy; it was those with smaller comment histories who were not as easily identified. I also found that certain users had writing styles that were not very distinctive, with the vast majority of their feature values hovering around the mean; this was the case with user KhasminFFBE, shown below. Further, the more users are introduced, the greater the chance that some of them write in highly similar styles. If a user tends to “code switch”, i.e. change their style of writing in different contexts, they may be mistakenly identified as another user with very similar writing tendencies.

[Figure: Comparing the feature vectors of two users]

To further improve the model, I added to the feature vector the use of punctuation and certain markdown formatting constructs common on Reddit (such as the “[ ]( )” syntax used to display hyperlinks). I also added a few slang words that are common on social media and behave like the function words from before, such as “yeah”, “gonna”, and “haha”. This boosted the model's performance to 93.8% accuracy when matching users out of the pool of 3,000.
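These surface features are straightforward to extract with regular expressions. A rough sketch with an illustrative, not exhaustive, feature set:

```python
import re

# Illustrative patterns: markdown hyperlinks and bold text, punctuation
# habits, and a few slang terms; the real feature set was larger.
PATTERNS = {
    "md_link":  re.compile(r"\[[^\]]*\]\([^)]*\)"),   # [text](url)
    "bold":     re.compile(r"\*\*[^*]+\*\*"),
    "ellipsis": re.compile(r"\.\.\."),
    "exclaim":  re.compile(r"!"),
    "slang":    re.compile(r"\b(yeah|gonna|haha)\b", re.IGNORECASE),
}

def style_features(text):
    """Occurrences of each pattern per 1,000 characters of text."""
    scale = 1000.0 / max(len(text), 1)
    return {name: len(p.findall(text)) * scale for name, p in PATTERNS.items()}

print(style_features("Yeah, check [this](https://example.com)... gonna be great!"))
```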

[Figure: Probability distribution of the final model]

[Figure: Alpha for non-matching accounts]

From this distribution, I was able to determine critical values at which the null hypothesis that a user is a non-match can be rejected. Because incorrectly banning a user on a false positive is far more detrimental than missing a true match, alpha should be as small as possible; it should also be chosen with respect to the total number of users, since every additional candidate adds another comparison and another chance of a false positive.
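Given the empirical distribution of cosines between non-matching accounts, a critical value is just a high quantile of that distribution. A sketch, with simulated non-match similarities standing in for the real ones:

```python
import numpy as np

def critical_value(nonmatch_cosines, alpha=0.001):
    """Cosine above which the null hypothesis of a non-match is rejected.

    With n candidate users there are n implicit comparisons, so alpha can be
    tightened accordingly (e.g. a Bonferroni-style alpha / n) to keep the
    overall false-positive rate low.
    """
    return np.quantile(nonmatch_cosines, 1 - alpha)

# Simulated non-match similarities; real values would come from comparing
# pseudo-users to known non-matching accounts.
rng = np.random.default_rng(0)
nonmatch = rng.normal(loc=0.0, scale=0.2, size=100_000)
print(critical_value(nonmatch, alpha=0.001))
```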

Final Application

A hierarchical clustering model was used to determine whether any of the randomly chosen accounts were in fact authored by a common user. The dendrogram below displays the authorship clustering of 40 users. An alpha level of 0.1% was chosen since fairly few users were being compared, keeping the chance of a false positive minimal. Clustering was also run on all 3,000 randomly chosen users, and, unsurprisingly, none of them were a match.
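The post does not name its clustering tooling, so here is one way the setup could look with SciPy's agglomerative clustering; the linkage method and distance threshold are assumptions, with the threshold in practice derived from the chosen alpha:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy data: one standardized 150-dimensional style vector per account.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 150))
X[7] = X[3] + rng.normal(scale=0.05, size=150)   # plant one shared author

# Cosine distance = 1 - cosine similarity, so the critical similarity from
# the hypothesis test maps directly onto a distance threshold.
Z = linkage(pdist(X, metric="cosine"), method="average")

threshold = 0.35                 # illustrative; derived from the chosen alpha
labels = fcluster(Z, t=threshold, criterion="distance")

# Accounts that share a cluster below the threshold are flagged as one author.
for cluster_id in np.unique(labels):
    members = np.where(labels == cluster_id)[0]
    if len(members) > 1:
        print("possible shared author:", members)

# scipy.cluster.hierarchy.dendrogram(Z) renders the tree (needs matplotlib).
```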