Today I’m going to walk you through the process of scraping search results from Reddit using Python. We’re going to write a simple program that performs a keyword search and extracts useful information from the search results. Then we’re going to improve our program’s performance by taking advantage of parallel processing.



We’ll be using the following Python 3 libraries to make our job easier:

Beautiful Soup 4,

Requests to access the HTML content,

LXML as the HTML parser,

and Multiprocessing to speed things up.

multiprocessing is part of the Python 3 standard library, but you may need to install the others manually using a package manager such as pip:

pip3 install beautifulsoup4
pip3 install requests
pip3 install lxml



Old Reddit

Before we begin, I want to point out that we’ll be scraping the old Reddit, not the new one. That’s because the new site loads more posts automatically when you scroll down:





The problem is that it’s not possible to simulate this scroll-down action using a simple tool like Requests. We’d need to use something like Selenium for that kind of thing. As a workaround, we’re going to use the old site which is easier to crawl using the links located on the navigation panel:

Scraper v1 - Program Arguments

Let’s start by making our program accept some arguments that will allow us to customize our search. Here are some useful parameters:

keyword to search

subreddit restriction (optional)

date restriction (optional)

Let’s say we want to search for the keyword “web scraping”. In this case, the URL we want to visit is:

https://old.reddit.com/search?q=%22web+scraping%22

If we want to limit our search to a particular subreddit such as “r/Python”, then our URL becomes:

https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on

Finally, the URL is going to look like one of the following if we want to search for the posts submitted in the last year:

https://old.reddit.com/search?q=%22web+scraping%22&t=year

https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year
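These URLs can also be assembled programmatically. Here’s a minimal sketch (the buildSearchUrl helper is my own invention, not part of the scraper we’ll write below) that lets urllib.parse.urlencode take care of the percent-encoding:

```python
from urllib.parse import urlencode

SITE_URL = 'https://old.reddit.com/'

# Hypothetical helper, not part of the scraper itself: builds the same
# search URLs as above, with urlencode handling the quoting.
def buildSearchUrl(keyword, subreddit=None, date=None):
    base = SITE_URL + ('r/' + subreddit + '/search' if subreddit else 'search')
    params = {'q': '"' + keyword + '"'}            # exact-phrase search
    if subreddit:
        params['restrict_sr'] = 'on'
    if date in ('day', 'week', 'month', 'year'):
        params['t'] = date
    return base + '?' + urlencode(params)

print(buildSearchUrl('web scraping', subreddit='Python', date='year'))
# https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year
```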

The following is the initial version of our program that builds and prints the appropriate URL according to the program arguments:

scraper.py (v1)

import argparse

SITE_URL = 'https://old.reddit.com/'

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit is None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date in ('day', 'week', 'month', 'year'):
        searchUrl += '&t=' + args.date
    print('Search URL:', searchUrl)



Now we can run our program as follows:

python3 scraper.py --keyword="dave weckl" --subreddit="drums" --date="month"



Scraper v2 - Collecting Search Results

If you take a look at the page source, you’ll notice that all the post results are stored in <div>s with a search-result-link class. Also note that unless it’s the last page, there will be an <a> tag with a rel attribute equal to nofollow next. That’s how we’ll know when to stop advancing to the next page.
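To see how this lookup behaves, here’s a self-contained sketch against a simplified mock of a results page (real Reddit markup carries many more attributes; the standard-library html.parser stands in for lxml here so the snippet has one less dependency):

```python
from bs4 import BeautifulSoup

# A stripped-down mock of one results page; the real markup is much richer.
html = '''
<div class="search-result-link">first post</div>
<div class="search-result-link">second post</div>
<a rel="nofollow next" href="https://old.reddit.com/search?q=test&amp;after=t3_x">next</a>
'''
page = BeautifulSoup(html, 'html.parser')
posts = page.findAll('div', {'class': 'search-result-link'})
footer = page.findAll('a', {'rel': 'nofollow next'})
print(len(posts))           # 2
print(footer[-1]['href'])   # the URL of the next results page
```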

Therefore using the URL we built from the program arguments, we can collect the post sections from all pages with a simple function that we’ll call getSearchResults . Here’s the second version of our program:

scraper.py (v2)

from bs4 import BeautifulSoup
import argparse
import requests

SITE_URL = 'https://old.reddit.com/'
REQUEST_AGENT = 'Mozilla/5.0 Chrome/47.0.2526.106 Safari/537.36'

def createSoup(url):
    return BeautifulSoup(requests.get(url, headers={'User-Agent': REQUEST_AGENT}).text, 'lxml')

def getSearchResults(searchUrl):
    posts = []
    while True:
        resultPage = createSoup(searchUrl)
        posts += resultPage.findAll('div', {'class': 'search-result-link'})
        footer = resultPage.findAll('a', {'rel': 'nofollow next'})
        if footer:
            searchUrl = footer[-1]['href']
        else:
            return posts

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit is None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date in ('day', 'week', 'month', 'year'):
        searchUrl += '&t=' + args.date
    posts = getSearchResults(searchUrl)
    print('Search URL:', searchUrl, '\nFound', len(posts), 'posts.')



Scraper v3 - Parsing Post Data

Now that we have a bunch of posts in the form of a bs4.element.Tag array, we can extract useful information by parsing each element of this array further. We can extract information such as:

date: datetime attribute of the <time> tag

title: <a> tag with search-title class

score: <span> tag with search-score class

author: <a> tag with author class

subreddit: <a> tag with search-subreddit-link class

URL: href attribute of the <a> tag with search-comments class

# of comments: text field of the <a> tag with search-comments class
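A couple of these fields need light post-processing: the datetime attribute carries a timezone suffix, and the score and comment-count strings carry units. A quick sketch with sample values invented for illustration:

```python
import re
from datetime import datetime

# Sample values, made up for illustration:
time = '2019-05-04T12:30:45+00:00'                          # datetime attribute
date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')    # drop the timezone suffix
score = int(re.match(r'[+-]?\d+', '128 points').group(0))   # strip the unit
numComments = int(re.match(r'\d+', '34 comments').group(0))
print(date.year, score, numComments)  # 2019 128 34
```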

We’re also going to create a container object to store the extracted data and save it as a JSON file (product.json). We’ll load this file at the beginning of our program; it may already contain data from previous keyword searches. When we’re done scraping the current keyword, we’ll append the new content to the existing data. Here’s the third version of our program:
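The load-append-save cycle around product.json can be sketched in isolation like this (writing to a temporary directory so the example is self-contained):

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'product.json')

def loadProduct(path):
    # Return the existing database, or an empty one on the first run.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

product = loadProduct(path)                       # {} on the first run
product['web-scraping'] = [{'title': 'example post', 'score': 42}]
with open(path, 'w', encoding='utf-8') as f:
    json.dump(product, f, indent=4, ensure_ascii=False)

print(loadProduct(path)['web-scraping'][0]['title'])  # example post
```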

scraper.py (v3)

from datetime import datetime
from bs4 import BeautifulSoup
import argparse
import requests
import json
import re

SITE_URL = 'https://old.reddit.com/'
REQUEST_AGENT = 'Mozilla/5.0 Chrome/47.0.2526.106 Safari/537.36'

def createSoup(url):
    return BeautifulSoup(requests.get(url, headers={'User-Agent': REQUEST_AGENT}).text, 'lxml')

def getSearchResults(searchUrl):
    posts = []
    while True:
        resultPage = createSoup(searchUrl)
        posts += resultPage.findAll('div', {'class': 'search-result-link'})
        footer = resultPage.findAll('a', {'rel': 'nofollow next'})
        if footer:
            searchUrl = footer[-1]['href']
        else:
            return posts

def parsePosts(posts, product, keyword):
    for post in posts:
        time = post.find('time')['datetime']
        date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')
        title = post.find('a', {'class': 'search-title'}).text
        score = post.find('span', {'class': 'search-score'}).text
        score = int(re.match(r'[+-]?\d+', score).group(0))
        author = post.find('a', {'class': 'author'}).text
        subreddit = post.find('a', {'class': 'search-subreddit-link'}).text
        commentsTag = post.find('a', {'class': 'search-comments'})
        url = commentsTag['href']
        numComments = int(re.match(r'\d+', commentsTag.text).group(0))
        product[keyword].append({'title': title, 'url': url, 'date': str(date),
                                 'score': score, 'author': author, 'subreddit': subreddit})
    return product

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit is None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date in ('day', 'week', 'month', 'year'):
        searchUrl += '&t=' + args.date
    try:
        product = json.load(open('product.json'))
    except FileNotFoundError:
        print('WARNING: Database file not found. Creating a new one...')
        product = {}
    print('Search URL:', searchUrl)
    posts = getSearchResults(searchUrl)
    print('Started scraping', len(posts), 'posts.')
    keyword = args.keyword.replace(' ', '-')
    product[keyword] = []
    product = parsePosts(posts, product, keyword)
    with open('product.json', 'w', encoding='utf-8') as f:
        json.dump(product, f, indent=4, ensure_ascii=False)



Now we can search for different keywords by running our program multiple times. The extracted data will be appended to the product.json file after each execution.

So far we’ve been able to scrape information from the post results easily, since this information is available on a given results page. But we might also want to scrape comment information, which cannot be accessed from the results page. We must instead parse the comment page of each individual post using the URL that we previously extracted in our parsePosts function.

If you take a close look at the HTML source of a comment page such as this one, you’ll see that the comments are located inside a <div> with a sitetable nestedlisting class. Each comment inside this <div> is stored in another <div> with a data-type attribute equal to comment . From there, we can obtain some useful information such as:

# of replies: data-replies attribute

author: <a> tag with author class inside the <p> tag with tagline class

date: datetime attribute of the <time> tag inside the <p> tag with tagline class

comment ID: name attribute of the <a> tag inside the <p> tag with parent class

parent ID: <a> tag with the data-event-action attribute equal to parent

text: text field of the <div> tag with md class

score: text field of the <span> tag with score unvoted class
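One subtlety worth calling out before the code: when the computed parent ID turns out to equal the comment’s own ID, the scraper blanks it out and treats the comment as top-level. That resolution logic in isolation, with invented IDs and assuming an href of the form '#<id>' (which is what stripping the first character implies):

```python
# Invented IDs; mirrors the parent-resolution logic in parseComments.
def resolveParent(commentId, parentHref):
    # Assumes hrefs look like '#t1_abc'; strip the leading character.
    parentId = parentHref[1:] if parentHref is not None else ''
    return '' if parentId == commentId else parentId

print(resolveParent('t1_def', '#t1_abc'))   # t1_abc  (a reply to t1_abc)
print(resolveParent('t1_abc', '#t1_abc'))   # ''      (points at itself: top-level)
print(resolveParent('t1_xyz', None))        # ''      (no parent link at all)
```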

Let’s create a new function called parseComments and call it from our parsePosts function so that we can get the comment data along with the post data:

scraper.py (v4 - partial)

def parseComments(commentsUrl):
    commentTree = {}
    commentsPage = createSoup(commentsUrl)
    commentsDiv = commentsPage.find('div', {'class': 'sitetable nestedlisting'})
    comments = commentsDiv.findAll('div', {'data-type': 'comment'})
    for comment in comments:
        numReplies = int(comment['data-replies'])
        tagline = comment.find('p', {'class': 'tagline'})
        author = tagline.find('a', {'class': 'author'})
        author = '[deleted]' if author is None else author.text
        date = tagline.find('time')['datetime']
        date = datetime.strptime(date[:19], '%Y-%m-%dT%H:%M:%S')
        commentId = comment.find('p', {'class': 'parent'}).find('a')['name']
        content = comment.find('div', {'class': 'md'}).text.replace('\n', '')
        score = comment.find('span', {'class': 'score unvoted'})
        score = 0 if score is None else int(re.match(r'[+-]?\d+', score.text).group(0))
        parent = comment.find('a', {'data-event-action': 'parent'})
        parentId = parent['href'][1:] if parent is not None else ''
        parentId = '' if parentId == commentId else parentId
        commentTree[commentId] = {'author': author, 'reply-to': parentId, 'text': content,
                                  'score': score, 'num-replies': numReplies, 'date': str(date)}
    return commentTree

def parsePosts(posts, product, keyword):
    for post in posts:
        time = post.find('time')['datetime']
        date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')
        title = post.find('a', {'class': 'search-title'}).text
        score = post.find('span', {'class': 'search-score'}).text
        score = int(re.match(r'[+-]?\d+', score).group(0))
        author = post.find('a', {'class': 'author'}).text
        subreddit = post.find('a', {'class': 'search-subreddit-link'}).text
        commentsTag = post.find('a', {'class': 'search-comments'})
        url = commentsTag['href']
        numComments = int(re.match(r'\d+', commentsTag.text).group(0))
        commentTree = {} if numComments == 0 else parseComments(url)
        product[keyword].append({'title': title, 'url': url, 'date': str(date), 'score': score,
                                 'author': author, 'subreddit': subreddit, 'comments': commentTree})
    return product



Scraper v5 - Multiprocessing

Our program is functionally complete at this point. However, it runs a bit slowly because all the work is done serially by a single process. We can improve performance by distributing the posts across multiple processes using the Process and Manager objects from the multiprocessing library.

The first thing we need to do is rename the parsePosts function and make it handle only a single post. To do that, we’re simply going to remove the for statement. We also need to change the function parameters a little: instead of passing our original product object, we’ll pass a list object to which the current process appends its results.
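The Process/Manager pattern in miniature, with a toy worker that just squares a number standing in for parsePost:

```python
from multiprocessing import Process, Manager

# Toy stand-in for parsePost: each worker appends one result to a shared list.
def work(n, results):
    results.append(n * n)

def runJobs(values):
    results = Manager().list()   # proxy list shared across processes
    jobs = [Process(target=work, args=(n, results)) for n in values]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    return sorted(results)       # completion order is not deterministic

if __name__ == '__main__':
    print(runJobs(range(4)))  # [0, 1, 4, 9]
```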

scraper.py (v5 - partial)

def parsePost(post, results):
    time = post.find('time')['datetime']
    date = datetime.strptime(time[:19], '%Y-%m-%dT%H:%M:%S')
    title = post.find('a', {'class': 'search-title'}).text
    score = post.find('span', {'class': 'search-score'}).text
    score = int(re.match(r'[+-]?\d+', score).group(0))
    author = post.find('a', {'class': 'author'}).text
    subreddit = post.find('a', {'class': 'search-subreddit-link'}).text
    commentsTag = post.find('a', {'class': 'search-comments'})
    url = commentsTag['href']
    numComments = int(re.match(r'\d+', commentsTag.text).group(0))
    commentTree = {} if numComments == 0 else parseComments(url)
    results.append({'title': title, 'url': url, 'date': str(date), 'score': score,
                    'author': author, 'subreddit': subreddit, 'comments': commentTree})



results is actually a multiprocessing.managers.ListProxy object that we can use to accumulate the output generated by all processes. We’ll later convert it to a regular list and save it in our product. After adding from multiprocessing import Process, Manager to our imports, our main script will look as follows:

scraper.py (v5 - partial)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--keyword', type=str, help='keyword to search')
    parser.add_argument('--subreddit', type=str, help='optional subreddit restriction')
    parser.add_argument('--date', type=str, help='optional date restriction (day, week, month or year)')
    args = parser.parse_args()
    if args.subreddit is None:
        searchUrl = SITE_URL + 'search?q="' + args.keyword + '"'
    else:
        searchUrl = SITE_URL + 'r/' + args.subreddit + '/search?q="' + args.keyword + '"&restrict_sr=on'
    if args.date in ('day', 'week', 'month', 'year'):
        searchUrl += '&t=' + args.date
    try:
        product = json.load(open('product.json'))
    except FileNotFoundError:
        print('WARNING: Database file not found. Creating a new one...')
        product = {}
    print('Search URL:', searchUrl)
    posts = getSearchResults(searchUrl)
    print('Started scraping', len(posts), 'posts.')
    keyword = args.keyword.replace(' ', '-')
    results = Manager().list()
    jobs = []
    for post in posts:
        job = Process(target=parsePost, args=(post, results))
        jobs.append(job)
        job.start()
    for job in jobs:
        job.join()
    product[keyword] = list(results)
    with open('product.json', 'w', encoding='utf-8') as f:
        json.dump(product, f, indent=4, ensure_ascii=False)



This simple technique alone greatly speeds up the program. For instance, when I perform a search involving 163 posts on my machine, the serial version takes 150 seconds to execute (roughly 1 post per second), while the parallel version takes only 15 seconds (~10 posts per second), a 10x speedup.

You can check out the complete source code on GitHub. Also, make sure to subscribe to get updates on my future articles.