How can we use Deep Learning on the massive amount of data humans produce to find patterns of behavior that cannot be seen otherwise?

At some point, you realize everybody is basically a meme

People are unique; this is well agreed upon. Nobody can be you better than you can. However, we all notice that the differences between individuals fall along the same categories: introversion, realism, imagination, and so on. Individuals who score high in a set of categories will usually act, think, and talk in manners similar to other individuals who score high in the same categories. This is especially prevalent on social media, where people usually adopt a stable, understandable image that reflects them in an authentic light while painting them as normal enough to be accepted by societal standards and pressures. A popular approach to classifying individuals into one of sixteen personality types is the Myers-Briggs Type Indicator (MBTI), which “types” an individual based on their four preferred cognitive functions out of the set of all possible cognitive functions. These preferences are expressed in terms of Sensing or Intuition, Thinking or Feeling, and Judging or Perceiving. While scientifically regarded as a pseudoscience, MBTI remains highly popular and widely accepted in popular culture.

The system behind Myers-Briggs

While not entirely scientific, the Myers-Briggs system does manage to categorize people in some manner, and it is interesting to check whether those categories show up in people’s communication patterns as well. If you want to dive deeper into the theory, I suggest checking out this introduction to the cognitive function system: https://www.psychologyjunkie.com/2018/02/23/introduction-cognitive-functions-myers-briggs-theory/.

Reddit has subreddits for every Myers-Briggs type

This is the main reason we’re using MBTI rather than the Big Five or another personality type system. For Deep Learning, the more data we have, the better our model will (usually) be, and Reddit provides us with tens of thousands of posts made by communities of self-typed individuals. These posts tend to be very introspective, but they also cover a wide variety of conversational topics, ranging from favorite movies to shared daily thoughts and feelings. This sheer amount of diversified data will allow us to train a neural network to classify Reddit posts, and Reddit users, by personality type. Most importantly, by looking at the model’s classification error, we can learn whether it is possible to type an individual strictly from some of their written text, whether personality fails to shine through written communication, or whether Myers-Briggs is simply an archaic system.

Tools used

fast.ai — for creating and training neural network models with minimal code

PRAW — Python wrapper for Reddit, used to pull posts from subreddits

Python 3.7

pandas — for data processing

Data Collection and Processing

Downloading Reddit Data

Using the PRAW wrapper for the Reddit API, we pull the thousand most upvoted posts of all time from each of the sixteen type subreddits. We cannot pull more because Reddit’s listing API stops at roughly a thousand items, but a thousand posts per type should be enough.

First, we create a class to pull posts from a subreddit

SubredditPuller class we use to pull the top thousand posts of a subreddit
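
A minimal sketch of such a class, using PRAW’s subreddit.top() listing; the credential strings are placeholders for values from your own registered Reddit API app:

```python
import praw


class SubredditPuller:
    """Pulls the top posts of a subreddit through the Reddit API."""

    def __init__(self, client_id, client_secret, user_agent):
        # Credentials come from a "script" app registered with Reddit.
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent,
        )

    def pull_top_posts(self, subreddit_name, limit=1000):
        # Reddit listings cap out at roughly 1000 items, so limit=1000
        # is effectively "everything the API will give us".
        submissions = self.reddit.subreddit(subreddit_name).top(
            time_filter="all", limit=limit
        )
        return [
            {
                "subreddit": subreddit_name,
                "title": s.title,
                "body": s.selftext,  # empty string for image/link posts
                "score": s.score,
            }
            for s in submissions
        ]
```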

Next, we pull the top thousand posts from each subreddit, delete the posts that have only a title (i.e., image or link posts with no body text), and save the rest into a .csv file for further use.
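
A sketch of that step with pandas, reusing the SubredditPuller above; the CSV filename and credential placeholders are illustrative:

```python
import pandas as pd

MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISTP", "ESTJ", "ESTP", "ISFJ", "ISFP", "ESFJ", "ESFP",
]

puller = SubredditPuller(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="mbti-post-puller",
)

# Pull the top posts of each of the sixteen type subreddits.
posts = []
for mbti_type in MBTI_TYPES:
    posts.extend(puller.pull_top_posts(mbti_type.lower()))

df = pd.DataFrame(posts)

# Title-only submissions (images, links) have an empty selftext,
# so keep only posts with actual body text.
df = df[df["body"].str.strip() != ""]

df.to_csv("mbti_posts.csv", index=False)
```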

Our Neural Network Classifier

Now we can create our neural network classifier. Using fast.ai’s Python library for fine-tuning pre-trained models, we take their pre-trained LSTM language model and fine-tune it twice. First we train the language model on all our Reddit posts (to give it an understanding of Reddit speech patterns); then we train it to classify text from that domain. This technique of transfer learning allows us to create industry-standard neural network models very quickly: we take a neural network that has been pre-trained on a general language-modeling task, train it a little more on the same task but on the text distribution our dataset comes from, and lastly train it as a classification model on our labeled text.
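
A sketch of both fine-tuning stages with the fastai v1 text API, assuming the mbti_posts.csv file from the previous step; the hyperparameters are illustrative rather than the exact ones used:

```python
from fastai.text import *

path = Path('.')

# Stage 1: fine-tune the pre-trained AWD-LSTM language model on our
# Reddit posts so it learns the domain's speech patterns.
data_lm = TextLMDataBunch.from_csv(path, 'mbti_posts.csv', text_cols='body')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(3, 1e-3)
learn_lm.save_encoder('reddit_encoder')

# Stage 2: reuse the fine-tuned encoder in a classifier that predicts
# which type subreddit a post came from.
data_clas = TextClasDataBunch.from_csv(
    path, 'mbti_posts.csv',
    text_cols='body', label_cols='subreddit',
    vocab=data_lm.vocab,
)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('reddit_encoder')
learn_clas.fit_one_cycle(4, 1e-2)
```

The save_encoder/load_encoder pair is what carries the domain-adapted language model into the classifier; only the final classification head starts from scratch.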