Building a Reddit Bot that Detects Trash - Python Reddit API Wrapper (PRAW) tutorial p.4

Over the last year or so, I have seen a sharp rise in affiliate marketing spam for Udemy courses on Reddit. The main subreddits that I frequent are /r/python and /r/learnpython. At least on those subreddits, these spam links would actually get upvoted, because the title looked appealing and sounded good. It was clear that many people didn't bother looking into what was actually happening.

Many times, the spam would be posted via a submission on Medium, so people thought it was some Medium writeup on the topic, when really it was just a course description and a Udemy affiliate link. After getting annoyed enough, I started commenting on these threads to point out that they were spam. Interestingly enough, this seemed to make a huge impact, and people started paying attention to it. About 6 months ago, I even considered writing a bot to automatically identify posts like these and reply to them, to make it more obvious to people that this was just spam for profit, but I always had things that I was more interested in working on.

...Until I saw my own course being pirated, put up for sale on Udemy, and spam-marketed to Reddit. Oh, you know what, I've got a few minutes.

I had always figured this spam was put forth by course creators themselves, but it's become clear to me that these are massive spam rings, doing referral/affiliate spam for profit. The accounts on Reddit appear to be created, and then sit dormant for ~2 months before going active, as if this will somehow make it appear as though they're legit.

My first step is to manually search Google for the Udemy course name, since I am sure this scammer is also a spammer. I search: Mastery Python 3 Basics Tutorial Series

Immediately, I find the Udemy course, of course, a Medium post linking to the Udemy course, and a bunch of spammy discount sites. Okay, hmm, no surprise. Hey, I've seen all that Reddit spam with Udemy courses, I wonder if it's there. So, another Google search, this time for site:reddit.com Mastery Python 3 Basics Tutorial Series

Jackpot!

Okay, let's click on one of those Reddit posts:

Yep, there it is. Hey, I wonder if we just clicked on the name...

Wow, that's a lot of courses.

Okay, now what? Well, we probably do not want to repeat this process via Python, due to the Google search. We could use something like the google-search package, but, having experience with trying to maintain a program that uses Google search, I would like to avoid this at all costs.

You know what, I bet we could do all of this via the Python Reddit API Wrapper. We can use PRAW to search Reddit for phrases like "Udemy" or "Udemy Free," for example. From here, I am going to wager almost all of these are going to be spam posts, given the current state of Udemy spam. That said, some of these might actually be legitimate, and we'd rather not be wrong. How might we identify spam posts and spam authors?

There are obviously *many* ways to do this, but you can usually tell immediately by just looking at the user's profile. My plan is to just find Udemy related threads, then visit the author's profile, and see how much Udemy junk is in there. If more than, say, half of their posts are about Udemy courses, then we're going to call that a spammer, or at least notify people of the situation by posting on the suspicious threads from that author.
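Before touching the Reddit API, the heuristic above can be sketched in plain Python. This is only an illustration: `udemy_ratio`, `looks_like_spammer`, and the sample titles are hypothetical, and the one-half threshold is just the figure floated above, not a tuned value.

```python
def udemy_ratio(titles, keyword="udemy"):
    """Return the fraction of titles that mention the keyword (case-insensitive)."""
    if not titles:
        return 0.0  # no submissions, nothing to judge
    hits = sum(1 for title in titles if keyword in title.lower())
    return hits / len(titles)


def looks_like_spammer(titles, threshold=0.5):
    """Flag a profile when more than `threshold` of its titles mention Udemy."""
    return udemy_ratio(titles) > threshold


# Hypothetical profile: three of the four titles mention Udemy.
sample_titles = [
    "FREE Udemy course: Learn Python",
    "Udemy coupon: 95% off",
    "My cat did something funny",
    "Best Udemy deals this week",
]
print(udemy_ratio(sample_titles))         # 0.75
print(looks_like_spammer(sample_titles))  # True
```

Keeping the ratio separate from the yes/no decision makes it easy to print the score for every author and eyeball where a sensible cutoff might be.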

Our first step? Well, we need a Reddit account. I've made a new one, calling it Spam_Detector_Bot, a fitting name. Next, we need to go into preferences, then the apps tab. Now, we'll click to create a new app. From here, pick a fitting name, such as Idiot Detector, I don't know, just spit-balling here. Okay, next, we need to pick script, fill in a description, and then add an about and redirect URL. Feel free to use your GitHub, some personal site, or you can use https://pythonprogramming.net. When you're done, create the app.

Now, I will create a new .py file, calling it praw_creds.py. Inside it:

client_id = ''
client_secret = ''
password = ''
user_agent = ''
username = ''

From the API information, fill this out, save, and close it.
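If you would rather not keep secrets in a file that could end up in version control, one option is to have praw_creds.py pull its values from environment variables instead. A minimal sketch of that idea follows; the `REDDIT_*` variable names and the `load_creds` helper are my own invention for illustration, not something Reddit or PRAW prescribes.

```python
import os

# Hypothetical naming scheme: one REDDIT_<FIELD> variable per credential.
FIELDS = ("client_id", "client_secret", "password", "user_agent", "username")


def load_creds(env=None):
    """Build a credentials dict from REDDIT_<FIELD> variables, failing loudly on gaps."""
    env = os.environ if env is None else env
    creds = {field: env.get("REDDIT_" + field.upper(), "") for field in FIELDS}
    missing = [field for field, value in creds.items() if not value]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return creds
```

Inside praw_creds.py you could then write `creds = load_creds()` followed by `client_id = creds["client_id"]` and so on, so the rest of the tutorial's imports keep working unchanged.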

Alright, next, I'll create a new Python file, calling it to_catch_a_spammer.py. To begin, let's work on the functionality to use search. If you haven't yet, do a pip install praw.

to_catch_a_spammer.py

import praw
from praw_creds import client_id, client_secret, password, user_agent, username

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     password=password,
                     user_agent=user_agent,
                     username=username)

What we've done here so far is just import praw and our API credentials, and set up the Reddit instance. Now for searching:

def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors

This will search for a phrase, then return the authors we find from the newest submissions of that phrase. Let's test it:

if __name__ == '__main__':
    authors = find_spam_by_name("Free Udemy")
    for author in authors:
        print(str(author))
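One wrinkle worth knowing about: as far as I know, PRAW reports the author of a submission from a deleted account as `None`, which would end up in the list as the string "None" once printed. The sketch below shows one way to collect unique author names while skipping those. It uses a stand-in `FakeSubmission` type so it runs without Reddit credentials; both that type and `unique_author_names` are hypothetical, not part of the bot.

```python
from collections import namedtuple

# Stand-in for a PRAW submission; only the attribute we need here.
FakeSubmission = namedtuple("FakeSubmission", ["author"])


def unique_author_names(submissions):
    """Return author names in first-seen order, skipping missing authors."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for submission in submissions:
        if submission.author is None:
            continue  # deleted account, no profile to inspect
        seen.setdefault(str(submission.author), None)
    return list(seen)


posts = [FakeSubmission("Deal5star"), FakeSubmission(None),
         FakeSubmission("dfslol"), FakeSubmission("Deal5star")]
print(unique_author_names(posts))  # ['Deal5star', 'dfslol']
```

Storing names as strings up front also saves the repeated `str(author)` calls later on.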

Full code up to now:

import praw
from praw_creds import client_id, client_secret, password, user_agent, username

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     password=password,
                     user_agent=user_agent,
                     username=username)


def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors


if __name__ == '__main__':
    authors = find_spam_by_name("Free Udemy")
    for author in authors:
        print(str(author))

Running this gives us:

FREE Udemy course: MVVM Design Pattern Using Swift in iOS mizaksad https://twitter.com/gamingsaledeals/status/953502567945916418
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakucisf https://twitter.com/gamingsaledeals/status/953502892778024960
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakuciss https://twitter.com/gamingsaledeals/status/953502892778024960
Free Udemy Course - Intro To The Steemit Social Media Blogging Platform — Steemit flaplanet https://steemit.com/steemit/@johnelder-org/free-udemy-course-intro-to-the-steemit-social-media-blogging-platform
Udemy Online Courses for free dfslol https://www.dealnews.com/lw/landing.html?uri=%2FUdemy-Online-Courses-for-free%2F2177081.html%3Firef%3Drss-dealnews-todays-edition
Get 260 Udemy Paid Course Free With Deal5star Deal5star https://www.youtube.com/watch?v=J8TZTj3hbt4&feature=youtu.be
FREE Udemy Course: Podcasting: How To Make Your Own Podcast yellowsnow3000 https://twitter.com/somethingometh/status/951047771787874304
FREE Udemy course: Business Goal-Setting Masterclass KellyfromLeedsUK https://www.reddit.com/r/business/comments/7pfeov/free_udemy_course_business_goalsetting_masterclass/?utm_source=ifttt
9 Udemy Courses taught by Chris M Nemo will be FREE! JANUARY,10 , FROM 6 am GMT to 4 PM GMT Deal5star https://deal5star.com/9-udemy-courses-taught-chris-m-nemo-will-free-january10-6-gmt-4-pm-gmt/
FREE Udemy course: Business Goal-Setting Masterclass dealbawt https://www.reddit.com/r/business/comments/7pfeov/free_udemy_course_business_goalsetting_masterclass/?utm_source=ifttt
My Udemy Course Launch! (Free Coupons) BeyondUsGames https://www.reddit.com/r/gamemaker/comments/7p4rsw/my_udemy_course_launch_free_coupons/
mizaksad
plakucisf
plakuciss
flaplanet
dfslol
Deal5star
yellowsnow3000
KellyfromLeedsUK
dealbawt
BeyondUsGames

Great, we've got some authors. Now, just because someone posts something on Reddit about Udemy or some free course, it doesn't mean they're a spammer. We need to look a bit deeper into these accounts.

Let's change our main block now, starting with:

if __name__ == "__main__":
    while True:
        current_search_query = random.choice(["udemy"])
        spam_content = []
        trashy_users = {}
        smelly_authors = find_spam_by_name(current_search_query)

In the interest of possibly adding new spammy sources/phrases/words, I will have the current_search_query be a random choice of varying words. For now, my main focus is on Udemy spam, so that's the only choice, but let's make this script grow-able in the future! Since we're using random here, let's import it:

import random

Now, with these smelly authors, we need to see how much of their content is trash (spam):

        for author in smelly_authors:
            user_trashy_urls = []
            sub_count = 0
            dirty_count = 0

We'll start with some bookkeeping for each "smelly" author that we're looking into: an empty list to populate with trashy submissions, a submission counter, and a dirty counter. At the top of our script, let's add some common words that are used with spam:

common_spammy_words = ['udemy','course','save','coupon','free','discount']

            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # Same shape as the stored entries, so the duplicate check works.
                            junk = [submit_id, submit_title, str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)
                    if dirty:
                        dirty_count += 1
                    sub_count += 1
            except Exception as e:
                print(str(e))

Above, we begin to iterate through the potentially trashy author's submissions, looking for common_spammy_words in the titles. If we do find them, let's log this, and continue through the author's submissions. Once we've gone through them, let's generate a trashy_score:

            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # Same shape as the stored entries, so the duplicate check works.
                            junk = [submit_id, submit_title, str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)
                    if dirty:
                        dirty_count += 1
                    sub_count += 1

                # Fraction of this author's submissions with a spammy title.
                try:
                    trashy_score = dirty_count/sub_count
                except ZeroDivisionError:
                    trashy_score = 0.0
                print("User {} trashy score is: {}".format(str(author), round(trashy_score,3)))

                if trashy_score >= 0.5:
                    trashy_users[str(author)] = [trashy_score, sub_count]
                    for trash in user_trashy_urls:
                        spam_content.append(trash)

            except Exception as e:
                print(str(e))
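Note that the `if w in submit_title.lower()` check is a substring test, so 'free' also matches 'freelance' and 'save' also matches 'saved'. If that turns out to flag too many innocent titles, matching on whole words is one alternative. The following is a sketch of that idea, not what the bot in this tutorial does: `title_is_dirty` is a hypothetical helper, and the optional trailing "s" is my own addition so that 'courses' still counts.

```python
import re

common_spammy_words = ['udemy', 'course', 'save', 'coupon', 'free', 'discount']

# One pattern for all words: word boundary, any listed word, optional plural "s".
SPAM_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, common_spammy_words)) + r")s?\b",
    re.IGNORECASE,
)


def title_is_dirty(title):
    """True when the title contains one of the spammy words as a whole word."""
    return SPAM_PATTERN.search(title) is not None


print(title_is_dirty("FREE Udemy course: Business Goal-Setting Masterclass"))  # True
print(title_is_dirty("How I landed my first freelance client"))                # False
print(title_is_dirty("Select Courses $0"))                                     # True
```

The trade-off is that stricter matching may miss creative spellings, so it is worth comparing the scores both ways before switching.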

Full code up to this point:

import praw
from praw_creds import client_id, client_secret, password, user_agent, username
import random

common_spammy_words = ['udemy','course','save','coupon','free','discount']

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     password=password,
                     user_agent=user_agent,
                     username=username)


def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors


if __name__ == "__main__":
    while True:
        current_search_query = random.choice(["udemy"])
        spam_content = []
        trashy_users = {}
        smelly_authors = find_spam_by_name(current_search_query)

        for author in smelly_authors:
            user_trashy_urls = []
            sub_count = 0
            dirty_count = 0

            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # Same shape as the stored entries, so the duplicate check works.
                            junk = [submit_id, submit_title, str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)
                    if dirty:
                        dirty_count += 1
                    sub_count += 1

                # Fraction of this author's submissions with a spammy title.
                try:
                    trashy_score = dirty_count/sub_count
                except ZeroDivisionError:
                    trashy_score = 0.0
                print("User {} trashy score is: {}".format(str(author), round(trashy_score,3)))

                if trashy_score >= 0.5:
                    trashy_users[str(author)] = [trashy_score, sub_count]
                    for trash in user_trashy_urls:
                        spam_content.append(trash)

            except Exception as e:
                print(str(e))

Output from one loop of this:

Any Udemy course for $9.99 sepang-moto http://techshippers.com
Udemy : Get Your First SEO Client Using Freelance Sites onlinefreecourses http://offersallin1.com/coupons/udemy-get-your-first-seo-client-using-freelance-sites/
Select Courses $0 at Udemy dfslol https://bensbargains.com/bargain/select-courses-573148/#rss
Udemy best news courses | January 18, 2018 Carolin3 http://mailchi.mp/f56dd2542632/your-daily-best-udemy-courses-selection-1441441
Cisco CCNA 200-125 : Full Course For Networking Basics - Udemy khongbietmatkhau11 http://dlfree24h.com/ebooks/405687-cisco-ccna-200-125-full-course-for-networking-basics.html
FREE Udemy course: MVVM Design Pattern Using Swift in iOS mizaksad https://twitter.com/gamingsaledeals/status/953502567945916418
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakucisf https://twitter.com/gamingsaledeals/status/953502892778024960
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakuciss https://twitter.com/gamingsaledeals/status/953502892778024960
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksaf https://twitter.com/gamingsaledeals/status/953596050627088385
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksad https://twitter.com/gamingsaledeals/status/953596050627088385
Udemy course: Docker Mastery - The Complete Toolset From a Docker Captain is 94% off plakucisf https://www.reddit.com/r/docker/comments/7r0ztj/udemy_course_docker_mastery_the_complete_toolset/
User sepang-moto trashy score is: 0.45
User onlinefreecourses trashy score is: 0.927
User dfslol trashy score is: 0.26
User Carolin3 trashy score is: 1.0
received 404 HTTP response
User mizaksad trashy score is: 0.444
User plakucisf trashy score is: 0.658
User plakuciss trashy score is: 0.592
User mizaksaf trashy score is: 0.333

So we've found at least a few clear spammers, like onlinefreecourses, plakucisf, and plakuciss.

Okay, now what? Well, let's iterate through the spam content, and post some love on it!

Let's go ahead and import time

import time

Then:

        for spam in spam_content:
            spam_id = spam[0]
            spam_user = spam[2]
            submission = reddit.submission(id=spam[0])
            created_time = submission.created_utc
            # Only reply to submissions from the last 24 hours.
            if time.time()-created_time <= 86400:
                link = "https://reddit.com"+submission.permalink
                message = """*Beep boop*

I am a bot that sniffs out spammers, and this smells like spam.

At least {}% out of the {} submissions from /u/{} appear to be for Udemy affiliate links.

Don't let spam take over Reddit! Throw it out!

*Bee bop*""".format(round(trashy_users[spam_user][0]*100,2), trashy_users[spam_user][1], spam_user)

                try:
                    with open("posted_urls.txt","r") as f:
                        already_posted = f.read().split('\n')
                    if link not in already_posted:
                        print(message)
                        submission.reply(message)
                        print("We've posted to {} and now we need to sleep for 12 minutes".format(link))
                        with open("posted_urls.txt","a") as f:
                            f.write(link+'\n')
                        time.sleep(12*60)
                        break
                except Exception as e:
                    print(str(e))
                    time.sleep(12*60)
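One thing to watch for: the snippet above opens posted_urls.txt in "r" mode before anything has written to it, so on a fresh checkout that call raises FileNotFoundError, which the broad except prints before sleeping for 12 minutes. Creating an empty posted_urls.txt by hand avoids that. Alternatively, the bookkeeping could be made more forgiving, roughly as sketched below; `load_posted` and `record_posted` are hypothetical helpers, and the example link is made up.

```python
import os
import tempfile


def load_posted(path):
    """Return the set of links already replied to; empty if the log is missing."""
    try:
        with open(path, "r") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()


def record_posted(path, link):
    """Append one link to the log, creating the file on first use."""
    with open(path, "a") as f:
        f.write(link + "\n")


# Try it against a throwaway file rather than the real log.
demo_path = os.path.join(tempfile.mkdtemp(), "posted_urls.txt")
print(load_posted(demo_path))  # set()
record_posted(demo_path, "https://reddit.com/r/example/comments/abc123/")  # made-up link
print(load_posted(demo_path))  # {'https://reddit.com/r/example/comments/abc123/'}
```

A set also makes the "have we posted here already?" lookup cheap once the log grows.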

Full code at this point:

import praw
from praw_creds import client_id, client_secret, password, user_agent, username
import random
import time

common_spammy_words = ['udemy','course','save','coupon','free','discount']

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     password=password,
                     user_agent=user_agent,
                     username=username)


def find_spam_by_name(search_query):
    authors = []
    for submission in reddit.subreddit("all").search(search_query, sort="new", limit=11):
        print(submission.title, submission.author, submission.url)
        if submission.author not in authors:
            authors.append(submission.author)
    return authors


if __name__ == "__main__":
    while True:
        current_search_query = random.choice(["udemy"])
        spam_content = []
        trashy_users = {}
        smelly_authors = find_spam_by_name(current_search_query)

        for author in smelly_authors:
            user_trashy_urls = []
            sub_count = 0
            dirty_count = 0

            try:
                for sub in reddit.redditor(str(author)).submissions.new():
                    submit_links_to = sub.url
                    submit_id = sub.id
                    submit_subreddit = sub.subreddit
                    submit_title = sub.title
                    dirty = False
                    for w in common_spammy_words:
                        if w in submit_title.lower():
                            dirty = True
                            # Same shape as the stored entries, so the duplicate check works.
                            junk = [submit_id, submit_title, str(author)]
                            if junk not in user_trashy_urls:
                                user_trashy_urls.append(junk)
                    if dirty:
                        dirty_count += 1
                    sub_count += 1

                # Fraction of this author's submissions with a spammy title.
                try:
                    trashy_score = dirty_count/sub_count
                except ZeroDivisionError:
                    trashy_score = 0.0
                print("User {} trashy score is: {}".format(str(author), round(trashy_score,3)))

                if trashy_score >= 0.5:
                    trashy_users[str(author)] = [trashy_score, sub_count]
                    for trash in user_trashy_urls:
                        spam_content.append(trash)

            except Exception as e:
                print(str(e))

        for spam in spam_content:
            spam_id = spam[0]
            spam_user = spam[2]
            submission = reddit.submission(id=spam[0])
            created_time = submission.created_utc
            # Only reply to submissions from the last 24 hours.
            if time.time()-created_time <= 86400:
                link = "https://reddit.com"+submission.permalink
                message = """*Beep boop*

I am a bot that sniffs out spammers, and this smells like spam.

At least {}% out of the {} submissions from /u/{} appear to be for Udemy affiliate links.

Don't let spam take over Reddit! Throw it out!

*Bee bop*""".format(round(trashy_users[spam_user][0]*100,2), trashy_users[spam_user][1], spam_user)

                try:
                    with open("posted_urls.txt","r") as f:
                        already_posted = f.read().split('\n')
                    if link not in already_posted:
                        print(message)
                        submission.reply(message)
                        print("We've posted to {} and now we need to sleep for 12 minutes".format(link))
                        with open("posted_urls.txt","a") as f:
                            f.write(link+'\n')
                        time.sleep(12*60)
                        break
                except Exception as e:
                    print(str(e))
                    time.sleep(12*60)

Running this, we get something like:

Ending with:

Any Udemy course for $9.99 sepang-moto http://techshippers.com
Udemy : Get Your First SEO Client Using Freelance Sites onlinefreecourses http://offersallin1.com/coupons/udemy-get-your-first-seo-client-using-freelance-sites/
Select Courses $0 at Udemy dfslol https://bensbargains.com/bargain/select-courses-573148/#rss
Udemy best news courses | January 18, 2018 Carolin3 http://mailchi.mp/f56dd2542632/your-daily-best-udemy-courses-selection-1441441
Cisco CCNA 200-125 : Full Course For Networking Basics - Udemy khongbietmatkhau11 http://dlfree24h.com/ebooks/405687-cisco-ccna-200-125-full-course-for-networking-basics.html
FREE Udemy course: MVVM Design Pattern Using Swift in iOS mizaksad https://twitter.com/gamingsaledeals/status/953502567945916418
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakucisf https://twitter.com/gamingsaledeals/status/953502892778024960
Unity: Beginner to Advanced - Complete Course is available for FREE at Udemy plakuciss https://twitter.com/gamingsaledeals/status/953502892778024960
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksaf https://twitter.com/gamingsaledeals/status/953596050627088385
The Complete Steemit Cryptocurrency Course is 92% off at Udemy mizaksad https://twitter.com/gamingsaledeals/status/953596050627088385
Udemy course: Docker Mastery - The Complete Toolset From a Docker Captain is 94% off plakucisf https://www.reddit.com/r/docker/comments/7r0ztj/udemy_course_docker_mastery_the_complete_toolset/
User sepang-moto trashy score is: 0.45
User onlinefreecourses trashy score is: 0.927
User dfslol trashy score is: 0.29
User Carolin3 trashy score is: 1.0
received 404 HTTP response
User mizaksad trashy score is: 0.444
User plakucisf trashy score is: 0.658
User plakuciss trashy score is: 0.592
User mizaksaf trashy score is: 0.333
*Beep boop*

I am a bot that sniffs out spammers, and this smells like spam.

At least 92.71% out of the 96 submissions from /u/onlinefreecourses appear to be for Udemy affiliate links.

Don't let spam take over Reddit! Throw it out!

*Bee bop*
We've posted to https://reddit.com/r/udemyfreebies/comments/7r7nno/udemy_get_your_first_seo_client_using_freelance/ and now we need to sleep for 12 minutes

Speaking of which, the subreddit /r/udemyfreebies/ is basically *all* spam and trash. I am going to guess that most users on here know exactly what this content is, but we'll still post for now.
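As a quick sanity check on the numbers in the bot's reply above, the percentage is just the trashy score scaled up. Working backwards, 92.71% of 96 submissions corresponds to 89 flagged titles. That count is my inference from the figures shown, not something the bot printed.

```python
sub_count = 96    # from the bot's reply above
dirty_count = 89  # inferred: the only whole count that rounds to 92.71%

trashy_score = dirty_count / sub_count
print(round(trashy_score, 3))        # 0.927
print(round(trashy_score * 100, 2))  # 92.71
```

The first value matches the "trashy score is: 0.927" line in the log, and the second matches the percentage in the posted message.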

Want to contribute to this project? I've hosted it on GitHub: reddit_spam_detector_bot.