Last week I was working on a web scraper for a client who needed to get around a million records from a real estate website. After a certain point the scraper stopped working, and the reason was that I had forgotten to put in certain checks, as I was expecting the client would not go that route, but he DID!

A few days back I shared a post about how to write a basic scraper in Python using BeautifulSoup. In this post I am going to discuss how to make your scraper more foolproof and user-friendly for non-technical people.

1- Check 200 status code

It is always good to check the HTTP status code early and proceed accordingly.

This is good:

    if r.status_code == 200:
        # Proceed further
        pass

This is better:

    if r.status_code != 200:
        return False
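To make the difference concrete, here is a minimal sketch of the early-return style inside a function. The function name `extract_body` and the stand-in response objects are hypothetical; any object with `status_code` and `text` attributes, such as a Requests response, should work the same way.

```python
from types import SimpleNamespace


def extract_body(r):
    """Return the response text, or False if the status is not 200."""
    if r.status_code != 200:
        return False
    # From here on we know the request succeeded, with no extra nesting.
    return r.text


# Stand-ins for real responses, so the sketch runs without a network call.
ok = SimpleNamespace(status_code=200, text="<html>listing</html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

print(extract_body(ok))
print(extract_body(missing))
```

Bailing out first keeps the main logic at one indentation level, which helps once more checks pile up.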

2- Never Trust HTML

Yep, especially if you can't control it. Web scraping depends on the HTML DOM; a simple change in an element or class name could break your entire script. The best way to deal with it is to check whether your lookup actually returned something (None or an empty list means it did not).

    page_count = soup.select('.pager-pages > li > a')
    if page_count:
        # do your stuff
        pass
    else:
        # ALERT!! Send notification to Admin
        pass

Here I am checking whether the CSS selector returned something legitimate; if yes, then proceed further.
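One way to package this check is a small helper that alerts someone when the selector comes back empty. This is only a sketch: `get_page_links`, `notify_admin` and `FakeSoup` are hypothetical names. `FakeSoup` stands in for a real BeautifulSoup object so the example runs on its own, and `notify_admin` can be any callable (email, Slack, a log call).

```python
def get_page_links(soup, notify_admin):
    """Return the pager links, or an empty list after alerting the admin."""
    links = soup.select('.pager-pages > li > a')
    if not links:
        notify_admin("Selector '.pager-pages > li > a' matched nothing; "
                     "the page layout may have changed.")
        return []
    return links


class FakeSoup:
    """Stand-in for a BeautifulSoup object; only .select() is needed here."""

    def __init__(self, matches):
        self.matches = matches

    def select(self, css):
        return self.matches


alerts = []  # collects alert messages instead of really notifying anyone
print(get_page_links(FakeSoup(["<a>1</a>", "<a>2</a>"]), alerts.append))
print(get_page_links(FakeSoup([]), alerts.append))
print(alerts)
```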

3- Set headers

Python Requests does not force you to set request headers when sending requests, but there are a few smart websites that will not let you read anything important unless certain headers are set. I once faced a situation where the HTML I was seeing in the browser was different from what I was getting via my script. So it is always good to make your requests look as legitimate as you can. The least you should do is set a User-Agent.

    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    }
    r = requests.get(url, headers=headers, timeout=5)

4- Set timeout

One of the issues with Python Requests is that, if you don't specify a timeout, it will keep waiting for the server till its last breath. This might be fine in certain conditions but not in the majority of cases. Therefore, it's always good to set a timeout value for each request. Here I am setting the timeout to 5 seconds.

    r = requests.get(url, headers=headers, timeout=5)
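Since the headers and the timeout end up on every call, one option is to set them up in one place. The sketch below uses a Requests `Session` for the headers and a small wrapper for the timeout. `make_session`, `fetch`, `DEFAULT_TIMEOUT` and the `my-scraper/1.0` string are hypothetical names chosen for illustration.

```python
import requests

DEFAULT_TIMEOUT = 5  # seconds, the same value used above


def make_session(user_agent):
    """Create a Session that sends the given User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(session, url, timeout=DEFAULT_TIMEOUT):
    """GET a URL through the session, always passing a timeout."""
    # The timeout is passed per call here, so no request can go out without one.
    return session.get(url, timeout=timeout)


session = make_session("my-scraper/1.0")
print(session.headers["User-Agent"])
```

A call would then look like `r = fetch(session, url)`, and the status check from the first section applies to `r` as before.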

5- Exception handling

It is always good to implement exception handling. It not only helps to avoid an unexpected exit of the script but can also help you log errors and informational messages. When using Python Requests I prefer to catch exceptions like this:

    try:
        # your logic is here
        pass
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program")

Check the very last one. It tells the program that if someone wants to terminate it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are storing information in a file and want to dump it all at the time of exit.
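If you would rather log errors than print them, a wrapper along these lines is one possibility. It is a sketch, not the only way to do it: `run_safely` and `flaky` are hypothetical names, and it catches `OSError` as a broad net, relying on the fact that `requests.RequestException` derives from `IOError` (an alias of `OSError` in Python 3). In a real scraper you would probably catch the specific Requests exceptions shown above.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")


def run_safely(task, *args):
    """Run task(*args); log an I/O failure and return None instead of crashing."""
    try:
        return task(*args)
    except OSError as e:
        # Network-style failures land here and go to the log with details.
        logger.error("Request failed: %s", e)
        return None


def flaky(url):
    """Simulates a failing request so the sketch runs offline."""
    raise OSError("connection refused: " + url)


print(run_safely(len, "abc"))
print(run_safely(flaky, "http://example.com"))
```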

6- Efficient File Handling

One of the functions of web scrapers is to store data either in a DB or in flat files like CSV/text. If you are scraping a large amount of data, it is not a good idea to do an I/O operation inside the loop for every record. Let me show you how I do it:

    import requests

    # Records are collected in memory and written once at the end
    property_urls = []
    try:
        property_urls.extend(a_func_return_record())
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program")
    finally:
        print("Total Records = " + str(len(property_urls)))
        try:
            # file to store state based URLs
            with open('records_file.txt', 'a+') as record_file:
                record_file.write("\n".join(property_urls))
        except Exception as ex:
            print("Unable to store records in file. Technical details below.\n")
            print(str(ex))

Here I am calling a function (though you are not bound to do it like that) that appends records to a list. Once it's done, or if the program gets terminated, it dumps the entire list into the file in a single go before exiting. Much better than multiple I/Os.
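To see the Ctrl+C behaviour without a real scraper, here is a self-contained sketch of the same pattern. `collect_and_save` and `interrupted_source` are hypothetical names, the URLs are made up, and the generator raises `KeyboardInterrupt` itself to simulate someone pressing Ctrl+C midway.

```python
import os
import tempfile


def collect_and_save(record_source, path):
    """Gather records in memory, then write them to path in a single go."""
    records = []
    try:
        for record in record_source:
            records.append(record)  # memory only, no disk I/O in the loop
    except KeyboardInterrupt:
        print("Someone closed the program")
    finally:
        # One write for everything gathered so far, even after Ctrl+C.
        with open(path, "a", encoding="utf-8") as record_file:
            record_file.write("".join(r + "\n" for r in records))
    return len(records)


def interrupted_source():
    """Yields two made-up URLs, then simulates Ctrl+C."""
    yield "http://example.com/property/1"
    yield "http://example.com/property/2"
    raise KeyboardInterrupt


out_path = os.path.join(tempfile.mkdtemp(), "records_file.txt")
saved = collect_and_save(interrupted_source(), out_path)
print(saved)
```

The trade-off is that everything stays in memory until the end, so for very large runs you may prefer to flush the list to disk every few thousand records instead of only once.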

I hope you find it useful. Please share your experience of how else one could make a scraper super-efficient.
