Overview

Below, we'll show you how to scrape Reddit using PRAW (the Python Reddit API Wrapper). For this example, our goal is to scrape the top submissions of the year across a few subreddits, storing the following for each: submission URL, domain (website URL), and submission score. Ultimately, we want to see which domains (URLs) generate the highest-scoring posts in a given subreddit.

1) Import packages, set up PRAW, select subreddits

Here we set up our PRAW credentials and select the list of subreddits we want to analyze.

# packages
import operator
import pandas as pd
import praw

# set up praw - setup guide here: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent')

# list the subreddits you want to include, as comma-separated strings
s_list = ['news', 'datascience']
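As an aside, if you'd rather not hardcode credentials in the script, PRAW can also read them from a praw.ini file. A minimal sketch, assuming your praw.ini has a site section named 'my_bot' (the section name is just a placeholder):

# assumes praw.ini contains a [my_bot] section with
# client_id, client_secret, and user_agent entries
reddit = praw.Reddit('my_bot')
print(reddit.read_only)  # read-only mode is fine for pulling public submissions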

2) Grab the score, domain (URL), and subreddit for each top yearly submission

In this section we loop through our list of subreddits from above and record each submission's score, domain, and subreddit. We'll store each attribute in its own dataframe, then merge the three together using the submission ID.

# set up dictionaries to store submission information
domains_sub = {}
domains = {}
domains_score = {}

# loop through our selected list of subreddits
for i in s_list:
    subreddit = reddit.subreddit(i)

    # --Grab the score for a given submission--
    # pull in the top submissions of the year for the subreddit
    submissions = subreddit.top(time_filter='year', limit=50)
    # sum score across submissions
    for s in submissions:
        domains_score[s.id] = domains_score.get(s.id, 0) + s.score

    # --Grab the domain for a given submission ID--
    submissions = subreddit.top(time_filter='year', limit=50)
    for s in submissions:
        domains[s.id] = s.domain

    # --Grab the subreddit for a given submission ID--
    submissions = subreddit.top(time_filter='year', limit=50)
    for s in submissions:
        domains_sub[s.id] = s.subreddit.display_name

# build a dataframe for each attribute
df_score = pd.DataFrame.from_dict(domains_score, orient='index').reset_index()
df_score.columns = ['id', 'score']

df_domain = pd.DataFrame.from_dict(domains, orient='index').reset_index()
df_domain.columns = ['id', 'domain']

df_subreddit = pd.DataFrame.from_dict(domains_sub, orient='index').reset_index()
df_subreddit.columns = ['id', 'subreddit']
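Before merging, a quick sanity check (not part of the original walkthrough) can confirm the three frames cover the same set of submission IDs:

# each frame was built from the same listings, so the ID sets should match
assert set(df_score['id']) == set(df_domain['id']) == set(df_subreddit['id'])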

Merge dataframes

Now that we have dataframes containing score, domain (URL), and subreddit, we can merge the three tables together, using submission ID as the primary key.

# merge the three tables together, using submission ID as the primary key
df_sub_score = df_subreddit.merge(df_score, how='left', on='id')
df_final = df_sub_score.merge(df_domain, how='left', on='id')

# add in the submission URL using the 'id'
df_final['url'] = 'www.reddit.com/' + df_final['id'].astype(str)
df_final.head()

       id      subreddit  score            domain                    url
0  78tulq  todayilearned  42729  atlasobscura.com  www.reddit.com/78tulq
1  76bn5s        science  25024      ns.umich.edu  www.reddit.com/76bn5s
2  7871xy        science  30642          acsh.org  www.reddit.com/7871xy
3  77pnk6        science  13176      jech.bmj.com  www.reddit.com/77pnk6
4  75eydj         gaming  64510         i.redd.it  www.reddit.com/75eydj
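If you want to hand the finished table off to a spreadsheet (I used a Google Sheet below), one simple route is a CSV export; the file name here is just a placeholder:

# write the merged table out for import into a spreadsheet
df_final.to_csv('top_domains.csv', index=False)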

Done! Explore the output

We now have a nice clean dataframe of the top yearly posts from each chosen subreddit, allowing us to see which domains racked up the highest total scores. I dumped the dataframe into a Google Sheet for you to explore.
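To answer the original question directly (which domains generate the highest-scoring posts in a given subreddit?), a quick aggregation over the final dataframe does the trick; this is a sketch, not part of the original walkthrough:

# total score per (subreddit, domain), highest first
domain_totals = (df_final.groupby(['subreddit', 'domain'])['score']
                 .sum()
                 .sort_values(ascending=False)
                 .reset_index())
print(domain_totals.head(10))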

Interested in practicing for data scientist or analyst interviews? We send 3 questions each week to thousands of data scientists and analysts preparing for interviews or just keeping their skills sharp. You can sign up to receive the questions for free on our home page.
