In earlier posts (here and here), I discussed how to write a scraper and make it secure and foolproof. Those things are good to implement, but they are not enough to make your scraper fast and efficient.

In this post, I am going to show how changing a few lines of code can speed up your web scraper several times over: in my case, from almost 6 minutes down to about 22 seconds. Keep reading!

If you remember that post, I scraped the detail page of OLX. Usually, you end up on such a page after going through a listing of entries. First, I will write a script without multiprocessing and we will see why it is not good enough; then I will write a scraper with multiprocessing.

OK, the goal is to access the listing page and fetch all the URLs from it. For the sake of simplicity, I am not covering pagination. So, let’s get into the code.

Here’s the gist for accessing the listing at the URL and then parsing and fetching the information of each entry.

View the code on Gist.

I have divided the script into two functions: get_listing(), which accesses the listing page, parses it, saves the links in a list, and returns that list; and parse(url), which takes an individual URL, parses the info, and returns a comma-delimited string.

I am then calling get_listing() to get the list of links, using the mighty list comprehension to get a list of individually parsed entries, and saving ALL the info in a CSV file.
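For reference, the sequential version boils down to something like the sketch below. The OLX-specific scraping is replaced with stand-in stubs (the URL list, the sleep, and the returned fields are all invented for illustration; the real get_listing() and parse() fetch and parse actual pages), so the structure is what matters here, not the selectors:

```python
import time

# Hypothetical stand-in for the links the real listing page would yield.
LISTING = ["https://www.olx.com.pk/item/{}".format(i) for i in range(50)]

def get_listing():
    """Access the listing page, parse it, and return the detail-page URLs."""
    return list(LISTING)

def parse(url):
    """Fetch one detail page and return its fields as a comma-delimited string."""
    time.sleep(0.01)  # stands in for the polite per-request delay
    return ",".join([url, "some-title", "some-price"])

cars_links = get_listing()
records = [parse(url) for url in cars_links]  # one URL at a time, sequentially

# Save ALL the info in a CSV file.
with open("olx_cars.csv", "w") as f:
    f.write("\n".join(records))
```

The list comprehension is the bottleneck: each call to parse() blocks until the previous one has finished.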

Then I executed the script using the time command:

```
Adnans-MBP:~ AdnanAhmad$ time python listing_seq.py
```

which calculates the time a process takes. On my computer it returns:

```
real	5m49.168s
user	0m2.876s
sys	0m0.198s
```

Hmm... almost 6 minutes for 50 records. There is a 2-second delay in each iteration, which accounts for about 100 seconds in total, so the actual scraping still takes well over 4 minutes.

Now I am going to change a few lines and make it run in parallel.

Keep reading!

The first change is using a module from Python’s standard library, multiprocessing:

```python
from multiprocessing import Pool
```

From the documentation:

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

So unlike threads, each worker is a separate process, and the GIL applies within each process rather than across them, which is why the work can truly run in parallel. It reminds me of MapReduce, though it is obviously not the same thing. Now note the following lines:

```python
p = Pool(10)  # Pool size tells how many workers run at a time
records = p.map(parse, cars_links)
p.terminate()
p.join()
```

The Pool plays an important role here: it tells how many subprocesses should be spawned at a time. I passed 10, which means 10 URLs will be processed at a single time.

In the second line, the first argument to map() is the function to run in parallel and the second argument is the list of links. In our case there are 50 links, so there will be 5 rounds: 10 URLs are accessed and parsed in one go, and the results come back as a list.
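To see the batching in action without hitting any website, here is a minimal self-contained sketch: 50 fake jobs that each block for a fixed time, mapped over a pool of 10 workers. The job function and timings are invented for illustration:

```python
import time
from multiprocessing import Pool

def slow_job(n):
    """Stand-in for parse(): blocks like an HTTP round trip would."""
    time.sleep(0.2)
    return n * n

def run(pool_size, jobs):
    """Map slow_job over jobs with pool_size workers; return results and wall time."""
    start = time.time()
    with Pool(pool_size) as p:
        results = p.map(slow_job, jobs)
    return results, time.time() - start

if __name__ == "__main__":
    results, elapsed = run(10, list(range(50)))
    # 50 jobs / 10 workers = 5 rounds of ~0.2 s, so roughly 1 s instead of ~10 s
    print("%d results in %.1f s" % (len(results), elapsed))
```

Note that map() returns the results in the same order as the input list, even though the jobs finish out of order.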

The third line terminates the worker processes: on *nix it sends SIGTERM, and on Windows it uses TerminateProcess().

The last line, join(), in simple words makes sure zombie processes are avoided and all the processes end gracefully.

If you don’t use terminate() and join(), the ONLY issue you’d have is many zombie or defunct processes occupying your machine for no reason. I’m sure you definitely don’t want that.

Or you can use a context manager to make it even simpler. Thanks to KimPeek on /r/Python for this tip!

```python
with Pool(10) as p:
    records = p.map(parse, cars_links)
```

Alright, I ran this script:

```
Adnans-MBP:~ AdnanAhmad$ time python list_parallel.py
```

And the time it took:

```
real	0m22.884s
user	0m2.748s
sys	0m0.363s
```

The same 2-second delay applies here, but since everything is processed in parallel the whole run took around 22 seconds. I also reduced the pool size to 5, and it is still quite good:

```
Pool(5)
real	0m43.695s
user	0m2.829s
sys	0m0.336s
```
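The pool-size numbers follow directly from the batching: halving the workers roughly doubles the wall time, as long as the work is dominated by waiting. A quick sketch with invented jobs and timings shows the same scaling:

```python
import time
from multiprocessing import Pool

def fetch(url):
    """Stand-in for scraping one URL."""
    time.sleep(0.2)
    return url

def timed(pool_size, links):
    """Return the wall time for mapping fetch over links with pool_size workers."""
    start = time.time()
    with Pool(pool_size) as p:
        p.map(fetch, links)
    return time.time() - start

if __name__ == "__main__":
    links = ["url-%d" % i for i in range(50)]
    for size in (5, 10):
        # Pool(10) needs ~5 rounds, Pool(5) needs ~10 rounds of 0.2 s each
        print("Pool(%d): %.1f s" % (size, timed(size, links)))
```

Past a certain point, adding workers stops helping: you become limited by bandwidth, the remote server, or how politely you want to hammer the site.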

I hope you will use multiprocessing to speed up your next web scraper. Give your feedback in the comments and let everyone know how it could be made even better. Thanks.

Writing scrapers is an interesting journey, but you can hit a wall if the site blocks your IP, and as an individual you can’t afford expensive proxies either. Scraper API provides an affordable and easy-to-use API that lets you scrape websites without any hassle. You do not need to worry about getting blocked, because Scraper API uses proxies by default to access websites. On top of that, you do not need to worry about Selenium either, since Scraper API also provides a headless browser. I have also written a post about how to use it.

Click here to sign up with my referral link, or enter the promo code adnan10 to get a 10% discount. If you do not get the discount, just let me know via email on my site and I will surely help you out.

As usual, the code is available on Github.

Planning to write a book about Web Scraping in Python. Click here to give your feedback





