
This article by Javier Collado expands on the web scraping techniques shown in his previous article, Web Scraping with Python, by looking closely at a more complex problem that cannot be solved with the tools explained there.

This article will show how to extract the desired information using the same three steps when the web page is not written directly using HTML, but is auto-generated using JavaScript to update the DOM tree.

As you may remember from that article, web scraping is the ability to automatically extract information from a set of web pages that were designed only to display information nicely to humans, and that might not be suitable when a machine needs to retrieve that information. The three basic steps recommended for a scraping task were the following:

Explore the website to find out where the desired information is located in the HTML DOM tree

Download as many web pages as needed

Parse downloaded web pages and extract the information from the places found in the exploration step

What should be taken into account when the content is not directly coded in the HTML DOM tree? The main difference, as you have probably already noted, is that the downloading methods suggested in the previous article (urllib2 or mechanize) simply don’t work. This is because they generate an HTTP request to get the web page and deliver the received HTML directly to the scraping script. However, the pieces of information that are generated by the JavaScript code are not yet in the HTML file, because the code is not executed by a JavaScript engine as it is when the page is displayed in a web browser.
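The problem can be reproduced without any network access. The following is a minimal sketch in Python 3 using only the standard library; the markup in RAW_HTML is a hypothetical stand-in for what an HTTP library would receive, not the real NASA page. It shows that an image added by a script is not an element of the downloaded HTML:

```python
from html.parser import HTMLParser

# Hypothetical markup: what an HTTP library would receive for a page whose
# gallery is filled in later by JavaScript. This is not the real NASA page.
RAW_HTML = """
<html><body>
  <div id="gallery_image_area"></div>
  <script>
    document.getElementById("gallery_image_area").innerHTML =
        "<img src='full_resolution.jpg'>";
  </script>
</body></html>
"""


class ImageCounter(HTMLParser):
    """Count <img> tags that exist as real elements in the markup."""

    def __init__(self):
        super().__init__()
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1


counter = ImageCounter()
counter.feed(RAW_HTML)

# The <img> only appears inside the script source, which an HTML parser
# treats as opaque text, so no image element is found.
print("img elements found:", counter.images)  # prints: img elements found: 0
```

Only after a browser has run the script does the img element exist in the DOM tree, which is why a tool that drives a real browser is needed.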

Hence, instead of relying on a library that generates HTTP requests, we need a library that behaves as a real web browser or, even better, a library that interacts with a real web browser, so that we are sure to obtain the same data that we see when manually opening a page in a web browser. Please remember that the aim of web scraping is to parse the data that a human user sees, so interacting with a real web browser is a really nice feature.

Is there any tool out there to do that? Fortunately, the answer is yes. In particular, there are a couple of tools used for web testing automation that can be used to solve the JavaScript execution problem: Selenium and Windmill. For the code samples in the sections below, Windmill is used. Either choice would be fine, as both are well-documented and stable tools that are ready to be used in production.

Let’s now follow the same three steps that were suggested in the previous article to solve the scraping of the contents of a web page that is partly generated using JavaScript code.

Explore

Imagine that you are a fan of the NASA Image of the Day gallery. You want to get a list of the names of all the images in the gallery, together with the link to the full-resolution picture, just in case you decide to download it later to use as a desktop wallpaper.

The first thing to do is to locate the data that has to be extracted on the desired web page. In the case of the Image of the day gallery (see screenshot below), there are three elements that are important to note:

Title of the image that is being currently displayed

Link to the full-resolution image file

Next link, which makes it possible to navigate through all the images

To find out the location of each piece of interesting information, as suggested in the previous article, it’s best to use a tool such as Firebug, whose inspect functionality can be really useful. The following picture, for example, shows the location of the image title inside an h3 tag:

The other two fields can be located as easily as the title, so no further explanation will be given here. Please refer to the previous article for further information.
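The outcome of the exploration step is one locator per piece of information. As a rough illustration, the following Python 3 sketch applies such locators to a simplified fragment using the standard library. The fragment itself and the full_res class name are assumptions made up for this example; only the h3 tag, the gallery_image_area id and the btn_image_next class appear in this article, and the real page structure has to be confirmed with the inspector:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for illustration only. The real page
# structure has to be checked with an inspector such as Firebug.
SAMPLE = """
<div id="gallery">
  <div id="gallery_image_area"><img src="small.jpg" /></div>
  <h3>Example Image Title</h3>
  <a class="full_res" href="http://example.com/full.jpg">Full size</a>
  <div class="btn_image_next">Next</div>
</div>
"""

root = ET.fromstring(SAMPLE)

# One locator per element identified in the exploration step
title = root.find(".//h3").text
link = root.find(".//a[@class='full_res']").get("href")  # assumed class name
next_button = root.find(".//div[@class='btn_image_next']")

print("Title:", title)
print("Link:", link)
print("Next button found:", next_button is not None)
```

ElementTree only accepts well-formed XML and a subset of XPath, so it is used here purely to show the idea of locators, not as a replacement for the parsing tools from the previous article.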

Download

As explained in the introduction, to download the content of the web page, we will use Windmill as it allows the JavaScript code to execute in the web browser before getting the page content.

Because Windmill is mostly a testing library, instead of writing a script that calls the Windmill API, I will write a test case for Windmill to navigate through all the image web pages. The code for the test should be as follows:

 1 def test_scrape_iotd_gallery():
 2     """
 3     Scrape NASA Image of the Day Gallery
 4     """
 5     # Extra data massage for BeautifulSoup
 6     my_massage = get_massage()
 7
 8     # Open main gallery page
 9     client = WindmillTestClient(__name__)
10     client.open(url='http://www.nasa.gov/multimedia/imagegallery/iotd.html')
11
12     # Page isn't completely loaded until image gallery data
13     # has been updated by javascript code
14     client.waits.forElement(xpath=u"//div[@id='gallery_image_area']/img",
15                             timeout=30000)
16
17     # Scrape all images information
18     images_info = {}
19     while True:
20         image_info = get_image_info(client, my_massage)
21
22         # Break if image has been already scraped
23         # (that means that all images have been parsed
24         # since they are ordered in a circular ring)
25         if image_info['link'] in images_info:
26             break
27
28         images_info[image_info['link']] = image_info
29
30         # Click to get the information for the next image
31         client.click(xpath=u"//div[@class='btn_image_next']")
32
33     # Print results to stdout ordered by image name
34     for image_info in sorted(images_info.values(),
35                              key=lambda image_info: image_info['name']):
36         print ("Name: %(name)s\n"
37                "Link: %(link)s\n" % image_info)

As can be seen, using Windmill is similar to using other libraries such as mechanize. First, a client object has to be created to interact with the browser (line 9), and then the main web page, which is used to navigate through all the information, has to be opened (line 10). However, Windmill also includes some facilities that take JavaScript code into account, as shown at line 14. There, the waits.forElement method is used to wait for a DOM element that is filled in by the JavaScript code; once that element (in this case, the big image in the image gallery) is displayed, the rest of the script can proceed. It is important to note that web page processing doesn’t start when the page is downloaded (this happens after line 10), but when there is some evidence that the JavaScript code has finished manipulating the DOM tree.
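The idea behind waiting for an element can be sketched as a simple polling loop. The Python 3 code below is not how Windmill implements waits.forElement; wait_for and image_is_present are hypothetical names used only to illustrate the "proceed once there is evidence the page is ready, or give up after a timeout" pattern:

```python
import time


def wait_for(condition, timeout_ms=30000, poll_ms=100):
    """Poll condition() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %d ms" % timeout_ms)
        time.sleep(poll_ms / 1000.0)


# Stand-in for a DOM element that JavaScript fills in after a few polls
state = {"polls": 0}


def image_is_present():
    state["polls"] += 1
    return state["polls"] >= 3


wait_for(image_is_present, timeout_ms=2000, poll_ms=10)
print("element appeared after", state["polls"], "polls")
```

The timeout matters as much as the condition: without it, a page whose script fails would block the scraper forever.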

Navigating through all the pages that contain the needed information is just a matter of clicking the next arrow (line 31). As the images are ordered in a circular buffer, the script stops when an image link is encountered for the second time (line 25).
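The stop condition can be tried out without a browser. In this Python 3 sketch, the GALLERY data and the scrape_ring helper are made up for illustration, and itertools.cycle stands in for repeatedly clicking the next arrow:

```python
from itertools import cycle

# Hypothetical gallery: pressing "next" after the last image wraps around
GALLERY = [
    {"name": "Image C", "link": "http://example.com/c.jpg"},
    {"name": "Image A", "link": "http://example.com/a.jpg"},
    {"name": "Image B", "link": "http://example.com/b.jpg"},
]


def scrape_ring(pages):
    """Collect items from an endless circular iterator, stopping at the
    first link that has already been seen."""
    images_info = {}
    for image_info in pages:
        if image_info["link"] in images_info:
            break
        images_info[image_info["link"]] = image_info
    return images_info


images_info = scrape_ring(cycle(GALLERY))

# Print results ordered by image name, as in the test case
for info in sorted(images_info.values(), key=lambda info: info["name"]):
    print("Name: %(name)s\nLink: %(link)s\n" % info)
```

Keying the dictionary by link rather than by name makes the check robust even if two images happen to share a title.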

To execute the script, instead of launching it as we would normally do for a Python script, we should call it through the Windmill script so that the environment is properly initialized:

$ windmill firefox test=nasa_iotd.py

As can be seen in the following screenshot, Windmill takes care of opening a browser window (Firefox in this case) and a controller window in which it’s possible to see the commands that the script is executing (several clicks on next in the example):

The controller window is really interesting because it not only displays the progress of the test cases, but also lets you enter and record actions interactively, which is a nice feature when trying things out. In particular, the recording may be used in some situations to replace Firebug in the exploration step, because the captured actions can be stored in a script without spending much time on XPath expressions.

For more information about how to use Windmill and the complete API, please refer to the Windmill documentation.