We have all scraped web pages. The HTML returned in the response holds our data, and we parse it to extract what we need. If the page builds its content with JavaScript, though, the real data only appears after the page has been rendered. When we use the plain requests package in that situation, the responses come back without the data in them. Browsers know how to render and display the final result, but how can a program do the same? So I came up with a power-packed solution for scraping JavaScript-rendered websites easily.

Note: For my other tech articles on JS and software development, visit https://medium.com/dev-bits

Many of us use the libraries below for scraping.

1) lxml

2) BeautifulSoup

I don't mention the Scrapy or Dragline frameworks here, since their underlying basic scraper is lxml. My favorite one is lxml. Why? It has element traversal methods and XPath support, so there is no need to lean on regular expressions. Here I am going to take a very interesting example. I was amazed to find that my article appeared in a recent issue of PyCoders Weekly (issue 147), so I am taking PyCoders Weekly as the example and will scrape all the useful links from its archives. The link to the PyCoders Weekly archives is here:

http://pycoders.com/archive/

It is a fully JavaScript-rendered website. I want all the links to those archives, and then all the links inside each archive post. How do we do that? First, I will show that the plain HTTP approach returned almost nothing.

    import requests
    from lxml import html

    # Store the response
    response = requests.get('http://pycoders.com/archive/')

    # Build an lxml tree from the response body
    tree = html.fromstring(response.text)

    # Find the archive anchor tags in the response
    print(tree.xpath('//div[@class="campaign"]/a/@href'))

When I ran this, I got back only 3 links. How is that possible, when there are nearly 133 archived issues of PyCoders Weekly? The response is missing almost everything the browser shows. Now let's think about how to tackle the problem.
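To make the problem concrete, here is a minimal sketch of why the raw response falls short. The page source below is hypothetical (it is not the real PyCoders markup), and it uses the standard-library html.parser module on Python 3 instead of lxml so it runs without extra installs. The parser only sees the markup that was sent over the wire; the links the script would create never show up.

```python
from html.parser import HTMLParser

# Hypothetical page source: one link is in the static markup, the others
# would only be added to the DOM when a browser runs the script.
PAGE_SOURCE = """
<html><body>
  <div class="campaign"><a href="/archive/static-issue">Static issue</a></div>
  <div id="archive-list"></div>
  <script>
    var list = document.getElementById("archive-list");
    for (var i = 1; i < 4; i++) {
      var div = document.createElement("div");
      div.className = "campaign";
      var a = document.createElement("a");
      a.href = "/archive/issue-" + i;
      div.appendChild(a);
      list.appendChild(div);
    }
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed(PAGE_SOURCE)
static_links = collector.links
print(static_links)  # ['/archive/static-issue']
```

The three links built inside the script are invisible here because nothing executed the JavaScript, which is the same situation the requests call above runs into.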

How can we get the content?

There is one approach to getting data from JS-rendered web pages: using the WebKit library. WebKit is a browser rendering engine, so it can load a page and run its JavaScript much as a browser does. Some browsers use WebKit as the underlying engine for rendering web pages. Qt ships WebKit as its QtWebKit module, so once you have installed the Qt library and PyQt4 you are ready to go.

You can install it on Debian or Ubuntu with this command:

sudo apt-get install python-qt4

Now the setup is finished. We retry the fetching process, but with a different approach.

Here comes the solution

We first send the request through WebKit. We wait until everything has loaded and then store the completed HTML in a variable. Then we scrape that HTML content using lxml and obtain the results. This process is a little slow, but you will be surprised to see the content fetched in full.

Let us take this code for granted

    import sys
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *
    from lxml import html

    class Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            # Call _loadFinished once the page has finished loading
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            # Start the Qt event loop; this blocks until quit() is called
            self.app.exec_()

        def _loadFinished(self, result):
            # Keep the rendered frame and stop the event loop
            self.frame = self.mainFrame()
            self.app.quit()

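The Render class renders the web page. It subclasses QWebPage, Qt's web page class, and takes the URL of the page to scrape as input. Internally it starts a Qt event loop, loads the URL, and quits the loop once loading has finished; don't bother about the details. Just remember that when we create a Render object, it loads everything and keeps a frame containing all the information about the web page.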

    url = 'http://pycoders.com/archive/'
    # This does the magic: it loads everything
    r = Render(url)
    # The result is a QString
    result = r.frame.toHtml()

We store the resulting HTML in the variable result. It is a QString, not a Python string, so we need to convert it before lxml can process the content.

    # A QString must be converted to a string before lxml can process it
    formatted_result = str(result.toAscii())
    # Next, build an lxml tree from formatted_result
    tree = html.fromstring(formatted_result)
    # Now, using the correct XPath, fetch the URLs of the archives
    archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
    print(archive_links)

It gives us all the links to the archives, and the output is a long list.
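If you want to see what that XPath expression selects without installing Qt, here is a small sketch. It swaps lxml for the standard-library xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so treat it as an illustration rather than a replacement. The rendered markup is a hypothetical sample, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what the rendered HTML might contain
RENDERED = """
<html><body>
  <div class="campaign"><a href="/archive/issue-1">Issue 1</a></div>
  <div class="campaign"><a href="/archive/issue-2">Issue 2</a></div>
  <div class="footer"><a href="/about">About</a></div>
</body></html>
"""

root = ET.fromstring(RENDERED)

# Same idea as //div[@class="campaign"]/a/@href: find the anchors that sit
# directly inside a div with class "campaign", then read their href.
anchors = root.findall(".//div[@class='campaign']/a")
archive_links = [a.get("href") for a in anchors]
print(archive_links)  # ['/archive/issue-1', '/archive/issue-2']
```

The link in the footer div is skipped because its parent does not have the class "campaign", which is why the expression returns only archive entries.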

Next, create Render objects with these links as URLs and extract the required content. The power of WebKit lets us render a web page programmatically and then fetch the data. So use this technique to get data from JavaScript-rendered web pages.
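Before feeding the links back in, it may help to normalize them, since href values can be relative. The helper below is a hypothetical sketch (to_absolute and the sample hrefs are my own names and data, not part of the original code). It is written for Python 3; on Python 2, urljoin lives in the urlparse module instead.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve hrefs against base_url, dropping duplicates but keeping order."""
    seen = set()
    absolute = []
    for href in hrefs:
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            absolute.append(full)
    return absolute


# Hypothetical hrefs standing in for the scraped archive links
sample_links = [
    '/archive/issue-1',
    'issue-2',
    '/archive/issue-1',
    'http://example.com/issue-3',
]
urls = to_absolute('http://pycoders.com/archive/', sample_links)
print(urls)
```

Each resulting URL could then be rendered in turn. Keep in mind that Qt expects a single QApplication per process, so the Render class as written may need adjusting to reuse one application object when you render many pages in one run.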

The complete code looks like this.

    import sys
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *
    from lxml import html

    # Take this class for granted. Just use the result of rendering.
    class Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            self.frame = self.mainFrame()
            self.app.quit()

    url = 'http://pycoders.com/archive/'
    r = Render(url)
    result = r.frame.toHtml()

    # This step is important: convert the QString to ASCII for lxml to process
    tree = html.fromstring(str(result.toAscii()))

    # Extract the archive URLs from the rendered tree
    archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
    print(archive_links)

I have shown you a fully working way to scrape a JavaScript-rendered web page. Apply this technique to automate any number of steps, or integrate it into a scraping framework by overriding the framework's default behavior. It is slow but dependable. I hope you enjoyed the post. Try it now on any website you think is tricky to scrape.

All the best.