How to use machine learning to make predictions on Reddit: multiple logistic regression

Here are four screencaps from four different marketplaces. The common element? Seller ratings.

Ratings from Amazon.com, Google.com, Facebook.com, and eBay.com

Seller ratings are a requirement of every good marketplace. A rating system is a vital tool to facilitate decision-making for buyers, allow them to share their feelings — and, most importantly, to assure them of the transaction’s legitimacy. Of course, it’s never possible to guarantee a safe transaction, but a good seller rating does inspire confidence.

Considering the importance of rating systems, every marketplace has a built-in, transaction-facilitating system for rating sellers. But what about platforms that weren’t meant for selling and buying? For example: Reddit.

Reddit: A Forum-Turned-Marketplace

Reddit.com, ranked as the sixth most visited website in the United States, is a collection of forums where people post and comment on links, pictures, and discussion topics. The individual forums or communities within Reddit are called Subreddits, and there are over 1 million of them. Reddit’s members, called “Redditors,” submit content to the Subreddits, where other members either “upvote” or “downvote” it. As you can imagine, Reddit (whose name is a play on the words “read it”) was meant to be a place for discussion, and although it mostly still is, some communities have become marketplaces. And there are a lot of these as well.

Here’s a list of some of the popular marketplaces on Reddit, along with the number of members who were subscribed to each one at the time of writing.

That’s almost 650,000 people in just 8 communities (Subreddits)!

I believe the marketplaces on Reddit are popular because Reddit’s sense of community usually leads to reasonable prices and successful transactions. However, as Reddit was not initially designed for this purpose, there are many issues with its interface compared to those of traditional marketplaces such as eBay.

For example, here is a typical user post:

screen cap from Reddit.com

The post is offering a BNIB Corsair SF750 Platinum PSU (power supply unit) for $150 + shipping. However, there is no metric for determining if this user has any transaction history within this Subreddit, nor is there any way to determine if the user is legitimate. Additionally, in this post the author writes the original price, but in other cases, prospective buyers would need to research the product themselves to determine its original retail price.

Using Machine Learning to Solve Reddit’s “Rating-less” Problem

Looking at the way in which Reddit’s marketplaces work led me to construct an algorithm to help solve the problems posed by the lack of a dedicated rating system.

I thought this would be an interesting problem to apply Machine Learning and Python automation to. The goal for this project was to see if I could collect enough data to determine how much risk there is in transacting with any given Reddit user. To collect these data, I looked at Reddit users who had been involved in successful transactions and then identified users who had been caught committing fraud. Essentially, I compared two groups of Redditors: scammers and non-scammers. Some of the attributes I compared were:

● Age of the account

● Karma (upvotes)

● Verified Email Address

● Gold (if a paid account)

● Comments (analyzed using Natural Language Processing)

● Moderator (of any Subreddit)

● Subreddits visited

● Trophies

screen cap from Reddit.com

Looking at the two groups’ user activity, I tried to determine if a given account was high-risk or low-risk.

To keep the project’s scope manageable, I focused on the Subreddit r/hardwareswap. HardwareSwap is a community of over 170,000 members who exchange computer-related hardware with one another. In this Subreddit, you can find everything from graphics cards and CPUs to PlayStations and computer peripherals.

Exploratory Data Analysis

When approaching questions like these about Reddit’s users and marketplaces, the first step is an Exploratory Data Analysis. In this analysis, I will be using or looking into:

● PRAW (Python Reddit API Wrapper)

● Transactions

● Users involved in the transactions

● Direction of transactions (buying vs. selling)

● Estimated price of transactions

PRAW (Python Reddit API Wrapper)

In order to access the r/HardwareSwap data, I signed up for a Reddit API key. You can find PRAW’s quick start guide and additional documentation here. To get started, I needed to register an application on Reddit, where I then received a Reddit client ID and client secret.

Screenshot of Reddit Application Sign-Up Form. Screen cap from Reddit.com

Collection of HardwareSwap Data for Exploratory Data Analysis (EDA)

Now that I have my Reddit API keys, I can start collecting the data I need: details of transactions, users, directions, and estimated transaction prices. Since my primary goal for this project is to compare two groups of Redditors, legitimate users and scammers, and learn how to tell them apart, I will start by collecting data on these groups.

Transactions

The subreddit r/HardwareSwap has a confirmation thread where users confirm their successful transactions every month. This makes it possible to collect transaction details for Reddit users.

Here is a screenshot of December 2019’s Confirmation Thread, to give you an idea of its format:

Example confirmation thread. Screen cap from Reddit.com

I first collected all the monthly confirmation thread URLs from the time period I wanted to investigate (72 months). To do this, I used PRAW to write a function that searched the r/HardwareSwap Subreddit for all posts with the words ‘confirmed trade thread’ in their titles.

Github link to the above function.
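The function itself appears only as a screenshot in the original post, but a minimal sketch of the search might look like the following. The helper and parameter names here are mine; only `confirm_urls` is named in the article, and the credentials are placeholders from the app registration above.

```python
def is_confirmation_thread(title):
    """True when a post title looks like a monthly confirmed trade thread."""
    return "confirmed trade thread" in title.lower()

def confirm_urls(client_id, client_secret, user_agent, limit=72):
    """Search r/hardwareswap for monthly confirmed trade threads and
    return their URLs, newest first."""
    import praw  # imported here so the rest of the sketch runs without PRAW

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent)
    urls = []
    for post in reddit.subreddit("hardwareswap").search(
            "confirmed trade thread", sort="new", limit=None):
        if is_confirmation_thread(post.title):
            urls.append(post.url)
        if len(urls) >= limit:  # stop once we have ~6 years of threads
            break
    return urls
```

The title check is kept separate from the API call so the filtering logic can be tested without network access.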

Users Involved in the Transactions

This search yielded a list of the URLs for the past 72 confirmed trade threads, which represent about 6 years of data. Then, I used the Reddit API to pull relevant and specific data from the 72 confirmed trade threads. My main goal was to collect the following:

● Usernames of those involved in the transactions

● Data on who transacted with whom (the buyer and the seller)

● Data on what was bought/sold

To collect these data, I wrote another function that does the following:

● Takes the list of URLs from the confirm_urls function (written in bright blue text in the above screenshot).

● Looks at all the parent comments and collects the

1) Username of the commenter

2) Text of the parent comment

3) Flair of the commenter (number of confirmed trades).

● Adds an entry to a Python dictionary with the author of the comment as the key and a list of lists as the value, where it stores the body of the comment and information on the user transacted with.

Each resulting data point looked like the following:

Example Dictionary Key, Value pair

This is how the keys in the Python dictionary, as described above, were created: by appending the username (child.group()[2:]) and the body_text, as shown in the above code. If a user in the dictionary had more than one transaction, which was often the case, then the function added another entry to the list of lists stored in the value.
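A simplified sketch of that bookkeeping follows. The `u/username` pattern is my guess at the comment format; per the screenshot, the real function strips a similar prefix with `child.group()[2:]`.

```python
import re

def add_confirmation(transactions, author, body):
    """Store one confirmation comment under its author's username.

    Each value is a list of lists: the comment body plus the partner's
    username, extracted here with a guessed "u/username" pattern.
    """
    match = re.search(r"u/([A-Za-z0-9_-]+)", body)
    partner = match.group(1) if match else None
    transactions.setdefault(author, []).append([body, partner])
    return transactions

confirmations = {}
add_confirmation(confirmations, "orevilo",
                 "Sold my SF750 to u/veigs, smooth transaction")
add_confirmation(confirmations, "orevilo",
                 "Bought a GPU from u/someone_else")
# confirmations["orevilo"] now holds two [body, partner] entries
```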

Direction of Transactions

Finally, I wanted to detect each transaction’s direction: that is, whether it was described as buying or selling. During my first run-through of the code, however, I discovered that there was a third category: “traded,” which occurred in instances where two users traded items with each other and exchanged no money.

To identify each transaction’s direction, I looked for common words in the body of comments — for example, “bought,” “buying,” “sold,” or “traded” — put them into a list, and compared them with a string of text. If those common words showed up in a given string, then I would know the direction of the transaction. Lastly, I added an “ambiguous” category for the transactions whose direction my function could not determine. The direction for each transaction was recorded by the following code:

Github link to the above function
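A stripped-down sketch of that logic (the keyword lists here are illustrative and shorter than the author’s; the full function is in the linked Github code):

```python
def transaction_direction(body):
    """Classify a confirmation comment as bought, sold, traded, or ambiguous."""
    text = body.lower()
    directions = {
        "bought": ["bought", "buying", "purchased"],
        "sold": ["sold", "selling"],
        "traded": ["traded", "swapped"],
    }
    # First keyword hit decides the direction; no hit means "ambiguous".
    for direction, keywords in directions.items():
        if any(word in text for word in keywords):
            return direction
    return "ambiguous"
```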

NOTE: Pulling all this data using PRAW took approximately 18 hours. So, if you want to try this yourself, I recommend running the program overnight.

Finally, EDA!

Now, to the Exploratory Data Analysis component. Understanding the total value and activity of this Subreddit, r/HardwareSwap, provides needed context for my Risk Analysis. Using the data whose collection I described above, I created a Pandas data frame that includes the user, transactor, direction, and product for each transaction.

First 5 rows of transaction dataframe

The above screenshot shows the first 5 rows of the transactions dataframe. The direction indicates whether the user bought or sold the product. For example, in row 0, the user orevilo sold to veigs.

The result I got was quite impressive. From these data, I discovered that there have been at least 134,013 transactions on this Subreddit over the past 6 years.

Here is a breakdown of the transactions by type:

These four values add up to 134,013 transactions. This is an impressively large number for a Subreddit marketplace.
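With the transactions in a pandas data frame like the one shown earlier, the breakdown by direction is a one-line `value_counts`. A toy illustration (the real dataframe has 134,013 rows):

```python
import pandas as pd

# Hypothetical mini version of the transactions dataframe.
df = pd.DataFrame({
    "user": ["orevilo", "veigs", "userA", "userB"],
    "direction": ["sold", "bought", "traded", "ambiguous"],
})

counts = df["direction"].value_counts()  # one count per direction category
total = counts.sum()                     # the direction counts sum to the total
```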

Estimated Price of Transactions

Once I saw these numbers, I wanted to gain an idea of the average price of a transaction to get a sense of how much value was moving through r/HardwareSwap. Although the resulting value is an approximation, it is an interesting metric. To approximate the average transaction price, I built a function that looks at the past 1,000 transaction submissions and selects submissions in which the link_flair_css_class is closed, which means a transaction has been made.

In the picture below, you can see the different submission flairs: (SELLING, BUYING AND CLOSED)

link_flair_css_class SELLING (blue), BUYING (green), CLOSED (red). Screen cap from Reddit.com

Then, the function checks the body of the submission for a price listed by the seller; I did this using Regex and looked for numbers following a $ symbol. Additionally, the function returns a Pandas data frame with the Price, Date, and URL so that I can check if the approximations were correct.
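The pattern below is a guess at the kind of Regex involved, not the author’s exact expression: it captures digits (with optional commas and cents) following a $ symbol.

```python
import re

def extract_prices(text):
    """Pull dollar amounts out of a submission body.

    Matches numbers after a $ symbol, allowing thousands separators
    and an optional decimal part.
    """
    return [float(m.replace(",", ""))
            for m in re.findall(r"\$([\d,]+(?:\.\d{1,2})?)", text)]

prices = extract_prices("BNIB Corsair SF750 for $150 shipped, retail $184.99")
```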

Average Price 258

This approach yielded an average sample transaction price of $258 for the period between 12/12/2019 and 12/15/2019. If we multiplied that number by the 134,000 recorded transactions, the total value of products traded on r/HardwareSwap over the last 6 years would be estimated at around $34,500,000. Though this is an extreme estimate, even a quarter of this amount would still be a significant amount of money.

Multiple Logistic Regression

The goal of this project is to create a metric to determine the level of risk in transacting with a given Reddit user. There are not a lot of fraudulent activities on Reddit, but they do exist. This project is aimed at discovering the patterns that are associated with the activities of legitimate Redditors, in contrast to those of fraudulent users: scammers.

Reddit does not keep a transaction history or collect any real identifying information about its users, so fraud-related issues are quite difficult to solve: the metrics Reddit provides are not the ones typically used to measure risk, unlike traditional credit metrics or other social platforms. Because Reddit does not collect the detailed personal information that would be required to create a record specific to each member, its members have a high level of anonymity.

Given this problem, it will be quite interesting to know whether accurate predictions can be made using machine learning and the information that Reddit allows users to pull from it. In this project, I am attempting to measure a user’s risk level based on his or her Reddit activities. My first thought is to use a Multiple Logistic Regression Model, a statistical method used to predict a categorical outcome from several predictor variables. In this domain, it can be referred to as a machine learning model. I will use this model because only a limited number of features are available. The Multiple Logistic Regression Model will also allow me to determine which features are most relevant in predicting a Reddit user’s risk level.

To begin, I will identify the two groups being studied in the project. These are:

1. Reddit Users/Redditors: These are people who have done genuine transactions on Reddit. They have held up their end of the transaction deals.

2. Reddit Scammers: These are scammers on Reddit who have been involved in various fraudulent activities.

The Reddit Scammer data set was gathered from r/hardwareswap’s ban list and the universal ban list. Both are lists of users who were caught acting in a fraudulent manner by moderators. Because this is the only system that any of the marketplace Subreddits offer, I will have to trust that all of these users were fraudulent.

Now I will use the Reddit API to collect data on the relevant features of both Reddit users and Reddit scammers. After that, I will look at the data to see what separates the two groups.

For the first model, the relevant features that I will look into are as follows:

1. Verified Email Address (whether the user has one)

2. Whether the user is a moderator of any Subreddit

3. Reddit Gold

4. Amount of Karma

5. Age of account

6. Most recent comments (up to 100)

Using these 6 features, I hope to find different patterns between the Redditor and Scammer groups.

NOTE: You may want to check out the available PRAW attributes at https://praw.readthedocs.io/en/latest/code_overview/models/redditor.html (for a full list of Redditor object attributes).
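As a sketch of how those attributes can be bundled into model features: the helper and feature names below are mine, but `has_verified_email`, `is_mod`, `is_gold`, `link_karma`, `comment_karma`, and `created_utc` are real PRAW `Redditor` fields. The stand-in object lets the sketch run without network access.

```python
from types import SimpleNamespace

def user_features(redditor, comments):
    """Assemble the six model features from a PRAW Redditor-like object.

    `comments` is a list of (body, created_utc) pairs for that user;
    account age is measured to the most recent comment (see below).
    """
    last_comment = max((t for _, t in comments), default=redditor.created_utc)
    return {
        "email": int(bool(redditor.has_verified_email)),
        "mod": int(bool(redditor.is_mod)),
        "gold": int(bool(redditor.is_gold)),
        "karma": redditor.link_karma + redditor.comment_karma,
        "age": (last_comment - redditor.created_utc) / 86400,  # in days
        "comments": len(comments),
    }

# Quick check with a stand-in object instead of a live PRAW Redditor:
demo = SimpleNamespace(has_verified_email=True, is_mod=False, is_gold=False,
                       link_karma=10, comment_karma=5, created_utc=0.0)
features = user_features(demo, [("hello", 86400.0)])
```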

ASSESSMENT OF THE 6 FEATURES

1. AGE OF ACCOUNT

PRAW attribute. Screen cap from https://praw.readthedocs.io

One of the PRAW attributes of any user’s account is “created_utc,” which refers to the date the account was created. In order to turn this into a useful feature, I decided to create a variable called “age.” The age variable takes the date indicated in the created_utc attribute and subtracts it from the time of the user’s last comment. Usually, after scammers are caught, their accounts sit dormant. So, to determine the age of their accounts at the time when the scams occurred, I looked at the length of time from the date each account was created to the date of the most recent comment.

The above chart compares scammers (orange) and Redditors (blue). It is apparent from the chart that scammers have a higher proportion of accounts with an age of 1 year or less, whereas Redditors have much older accounts on average. This makes sense, as I would think people who are going to scam wouldn’t use a Reddit account that they have been associated with for a long period of time; however, there is also a decently large number of scammer accounts over the 4-year mark. I suspect these are cases where Redditors marked as scammers were in a dispute that the moderators could not resolve, and both buyer and seller were eventually marked as scammers. It is also possible that some scammers got caught but just didn’t care and continued to use their accounts.

2. KARMA

PRAW attribute. Screen cap from https://praw.readthedocs.io

“Karma” refers to the points that users get when their submissions are upvoted.

The distribution for Karma shows that scammers have a much higher proportion of accounts with 30 or fewer Karma points. Redditors, on the other hand, have a much more normal distribution.

3. EMAIL ADDRESS

Praw Attribute. Screen cap from https://praw.readthedocs.io

A much higher proportion of Redditors have a verified email address as compared to scammers.

4. REDDIT GOLD

Praw Attribute. Screen cap from https://praw.readthedocs.io

Reddit Gold signifies that a user has a paid Reddit account (Premium). I was surprised to see that there are scammers who have Reddit Gold; it seems suspicious that some scammers paid for accounts they use for fraud when Reddit is mostly free. This fact could be a result of cases where a genuine user was wrongly marked as a scammer.

5. MODERATOR

Praw Attribute. Screen cap from https://praw.readthedocs.io

A much bigger proportion of Redditors are moderators of Subreddits. This is what I was expecting, as scammers likely wouldn’t put effort into moderating Subreddits using the same account with which they scammed someone.

6. COMMENTS

You can see that there is a spike in the number of scammers with fewer than 10 comments. This is one variable that I would like to revisit in the future. The Reddit API allows up to 1,000 comments per user, but I took only the last 100 as a case study.

The reason I carefully studied these individual variables was to find clear differences between the activities of Redditors and scammers, and the EDA showed that such differences exist: for each variable, the distributions of the two groups were clearly distinct. I strongly believe that if we take all 6 features into account, then we will be able to make accurate predictions about Reddit accounts and their risk levels.

THE MULTIPLE LOGISTIC REGRESSION MODEL

In this project, I am interested in predicting a qualitative outcome using the variables I discussed above. This can be referred to as classifying the observations, since it involves assigning the observed data to a specific category, which in this case is either Redditor or Scammer. Specifically, for this classification project, I intend to use the Multiple Logistic Regression Model to predict the probability that an observation belongs to each category. It would be almost impossible to say for sure that a given Reddit user is a scammer, but the probability that his or her activities match those of a fraudulent user is a more realistic metric.

A Multiple Logistic Regression Equation, where X = (X1, X2, …, Xp) are the p predictors, looks like this:
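Written out, with p(X) the probability that an observation (here, a user) falls in the positive class, the standard form is:

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
            {1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
```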

Distribution of Data

Packages:

NumPy — support for large, multi-dimensional arrays and matrices

pandas — data manipulation and analysis

scikit-learn — machine learning library

statsmodels — explore data, estimate statistical models, and perform statistical tests

PERFORMING THE REGRESSION

For this model, the predictor variables will be Email, Moderator, Gold, Karma, Age, and Comments, while the target variable will be each user’s group label: a “0” represents a Redditor, while a “1” represents a scammer.

Ridge Regression argument highlighted in yellow
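A sketch of the equivalent scikit-learn setup on stand-in data (the real feature matrix comes from the collection steps above; `penalty="l2"` is the ridge argument):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the real feature matrix: columns represent
# email, mod, gold, karma, age, comments; labels 0 = Redditor, 1 = scammer.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 3] + X[:, 4] < 0.8).astype(int)  # synthetic labeling rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# penalty="l2" applies the ridge regularization discussed below.
model = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```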

L2 Regularization — Ridge Regression

Regularization can be used to train models that generalize better on unseen data by preventing the algorithm from overfitting the training dataset. Ridge regression does this by introducing a small amount of bias into the best-fit line and, in return, getting a significant drop in variance. This helps the model make better predictions as new data are introduced.

L2 Equation

Screen cap from wikipedia.com
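Written out for logistic regression, the L2-penalized objective adds a squared term for every coefficient except the intercept:

```latex
\min_{\beta}\; -\sum_{i=1}^{n} \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big]
\;+\; \lambda \sum_{j=1}^{p} \beta_j^2
```

Here λ controls the strength of the penalty; larger values shrink the coefficients β₁, …, βₚ more aggressively.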

Every parameter except the y-intercept is penalized by Ridge.

Now that I have done some model tuning it is time to run the model using Sklearn.

Below is a summary of the result the model gave after analyzing the test variable data.

Overall, the Multiple Logistic Regression Model did pretty well, predicting well above a random 50/50 guess. The model’s accuracy (shown in the yellow rectangle) was 83%. Before continuing with this model, however, I want to take a look at the p-values to make sure that all of the variables are significant. I will make use of the Statsmodel library to interpret the results I get, as this library gives a much better p-value summary than Sklearn.

In the result above (shown in the red rectangle), the variable “Karma” has a very high p-value, which means it is insignificant in this model. This is a bit surprising, as the EDA had shown that Karma was a relevant feature to differentiate the two groups. Here, however, I will remove the variable “Karma” and rerun the model to see its response.

Model, Version 2 (after removing the predictor variable Karma)

with the feature karma

Without Karma (the improved precision level is highlighted in the result below):

without the feature karma

Precision:

This refers to the proportion of positive identifications that were actually correct:

Precision = TP / (TP + FP)

where TP = true positives and FP = false positives.

Of the 310 scammers predicted by the model, 253 were correct, which yields a precision score of 82%.

Recall:

This refers to the proportion of actual positives that were identified correctly:

Recall = TP / (TP + FN)

where TP = true positives and FN = false negatives.

In the testing data, there were 491 scammers and 253 were identified correctly, which yields a recall score of 52%.
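Both scores can be checked directly from the counts above:

```python
tp = 253          # scammers correctly flagged (true positives)
fp = 310 - 253    # flagged users who were actually genuine (false positives)
fn = 491 - 253    # scammers the model missed (false negatives)

precision = tp / (tp + fp)   # 253 / 310, about 82%
recall = tp / (tp + fn)      # 253 / 491, about 52%
```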

When it comes to a real-world application for this project, it would be extremely difficult to determine prior to this procedure whether or not a Reddit user is a scammer. It is possible that a Redditor shows the activities of a scammer, but has good intentions. On the flip side, a user may have had 5 successful transactions and then decide not to hold up his or her end of the 6th one.

To fully assess the efficiency of the model, I looked at both precision and recall. Adding a threshold to the predicted probability adds a lot of value to the model. In the figure below, you can see the model predicting whether a Reddit user is genuine or a scammer based on the predicted probability, using a base threshold of .5 for both groups. For example, if the probability that a user is a scammer is greater than .5, then the model will predict scammer for that user.

This is an example output, with 10 predictions made by the model:

10 predictions made by the model

In the above image you can see 10 predictions made by the model. The probabilities are calculated based on the features email, mod, gold, comments, and age. In row 0, the model predicted a .697 probability that the user was a scammer, which ultimately would have been correct, as the true-result column shows that user 0 was a scammer.

Code for checking the probability threshold at different probabilities:
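A minimal sketch of that check (the function name and example probabilities are mine; the article’s full version applies this to the real test set):

```python
import numpy as np

def predictions_at(probs, threshold):
    """Label an account a scammer (1) only when its predicted
    probability clears the threshold."""
    return (np.asarray(probs) >= threshold).astype(int)

probs = [0.697, 0.41, 0.55, 0.88, 0.30]
default = predictions_at(probs, 0.5)   # base threshold
stricter = predictions_at(probs, 0.6)  # fewer positives, fewer false alarms
```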

After applying different thresholds to the test data, I found that raising the threshold to 60% when predicting scammers reduced the number of false positives, raising the precision to 92%.

Using the model — Application Use

Find a submission with a user with no confirmed trades

screen cap from Reddit.com

Typical submission, but user has 0 confirmed trades

Input the username into a function that pulls the necessary data from the Reddit API. The data is then fed into the multiple logistic regression model, and a rating is returned. To make the application more user-friendly, a rating under 40 shows as risky and a rating above 60 as safe.

As you can see, this user gets a rating of 38.2, which is quite low, suggesting it may be better to find someone else to buy from.
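One plausible mapping from the model’s probability to this 0–100 rating is a simple inversion; this is my reconstruction, not necessarily the author’s exact formula:

```python
def user_rating(p_scammer):
    """Convert a predicted scammer probability into a 0-100 safety rating.

    Higher scammer probability yields a lower rating; the inversion is an
    assumed reconstruction of the article's rating scale.
    """
    return round((1 - p_scammer) * 100, 1)

rating = user_rating(0.618)  # a rating of 38.2, which would display as risky
```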

Using the model — Confirmed User

Find a user with confirmed trades

screen cap from Reddit.com

User with 7 confirmed trades

Because the user has 7 confirmed trades, it is likely they are genuine. To create a type of user rating, the function also outputs the user’s previous transactions.

CONCLUSION

Based on these thresholds and results, it would be safe to say that if a Reddit user’s predicted scammer probability is above .6, it would probably be best to find someone else to transact with. So, by looking at the 6 features/variables (Age, Moderator, Karma, Email, Gold, and Comments) and predicting with all of them except Karma, one can estimate the level of risk in transacting with a given Reddit user.

If you would like to view all of the code it is hosted on my Github repository.