Iterating through Dynamic Select Options with Selenium

In this post I’ll use Selenium to show how to iterate through dropdown menus in a form that uses SELECT elements whose option values are dynamically generated. I’ll provide a technique that can be used to determine when the option values have loaded. I’ll wrap up the post by refactoring the code into a more generic solution, which is useful since this use-case arises frequently in scraping.

Background

The site I’ll use in today’s example is at the following URL:

http://icds-wcd.nic.in/icds/icdsawc.aspx

Click on the link and you’ll be presented with the following form:

Our object is to iterate through all of the State, District and Project dropdowns and print out their option values.

Here’s the catch though: if you try selecting an option from the Select District dropdown, you’ll see that it’s empty.

You can inspect the Select District element to confirm this. Note that there is only one option: “-Select-“

In this form, the options in the Select District dropdown aren’t populated until a state has been chosen. This means that the range of possible values in the district SELECT element will always depend on the current value of the state selection.

The Select Project dropdown works the same way:

Its list of options is only filled after a district has been chosen.

To summarize how this form operates: first we select a state. That triggers an update which causes the district options to load. Then we select a district, which triggers another update to dynamically load the project values.

With that in mind, here’s the pseudocode sketch of the solution we will develop.

for each state select state wait for district options to load for each district select district wait for project options to load for each project print state, district, project

Implementation

Let’s begin. Here’s the boilerplate for our scraper. It assumes that you have Selenium and PhantomJS installed:

#!/usr/bin/env python import sys import signal from selenium import webdriver from selenium.webdriver.support.ui import Select from selenium.webdriver.support.ui import WebDriverWait from selenium.common.exceptions import StaleElementReferenceException def sigint ( signal , frame ): sys . exit ( 0 ) class Scraper ( object ): def __init__ ( self ): self . url = 'http://icds-wcd.nic.in/icds/icdsawc.aspx' self . driver = webdriver . PhantomJS () self . driver . set_window_size ( 1120 , 550 ) if __name__ == '__main__' : signal . signal ( signal . SIGINT , sigint ) scraper = Scraper () scraper . scrape ()

First we create a load_page method which retrieves the form page and waits until the state SELECT element is rendered before returning:

def load_page ( self ): self . driver . get ( self . url ) def page_loaded ( driver ): path = '//select[@id="ctl00_ContentPlaceHolder1_dropstate"]' return driver . find_element_by_xpath ( path ) wait = WebDriverWait ( self . driver , 10 ) wait . until ( page_loaded )

Next, let’s write a scrape method to implement the pseudocode presented above:

def scrape ( self ): self . load_page () for state in states (): print state for district in districts (): print 2 * ' ' , district for project in projects (): print 4 * ' ' , project

The scrape method uses generators to iterate through all of the option values for the state, district and project SELECT elements.

Let’s take a look at the states generator:

def states (): state_select = self . get_state_select () state_select_option_values = [ '%s' % o . get_attribute ( 'value' ) for o in state_select . options [ 1 :] ] for v in state_select_option_values : state_select = self . select_state_option ( v ) yield state_select . first_selected_option . text

The states function generates a list of all the option values contained in the state SELECT element. It uses yield to allow the caller to iterate through that list, selecting the next option each time states gets called.

The districts and projects generators are implemented the same way:

def districts (): district_select = self . get_district_select () district_select_option_values = [ '%s' % o . get_attribute ( 'value' ) for o in district_select . options if o . text != '-Select-' ] for v in district_select_option_values : district_select = self . select_district_option ( v ) yield district_select . first_selected_option . text def projects (): project_select = self . get_project_select () project_select_option_values = [ '%s' % o . get_attribute ( 'value' ) for o in project_select . options [ 1 :] ] for v in project_select_option_values : project_select = self . select_project_option ( v ) yield project_select . first_selected_option . text

There are two types of helper methods used by the generators. They follow the same patterns:

Get a reference to a SELECT element

Select one of the options

Let’s look at the first type: methods used to get a reference to a SELECT element. Here’s the code for get_state_select which returns a reference to the state SELECT element:

def get_state_select ( self ): path = '//select[@id="ctl00_ContentPlaceHolder1_dropstate"]' state_select_elem = self . driver . find_element_by_xpath ( path ) state_select = Select ( state_select_elem ) return state_select

It looks up a reference to an element given its xpath. Then it uses the Select constructor to create an instance of the WebDriver Select support class which is used to interact with SELECT elements.

We get references to the district and project SELECT elements the same way:

def get_district_select ( self ): path = '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' district_select_elem = self . driver . find_element_by_xpath ( path ) district_select = Select ( district_select_elem ) return district_select def get_project_select ( self ): path = '//select[@id="ctl00_ContentPlaceHolder1_dropproject"]' project_select_elem = self . driver . find_element_by_xpath ( path ) project_select = Select ( project_select_elem ) return project_select

Now let’s take a look at the methods for selecting option values. The first method we’ll look at, select_state_option, is used to select a value from the state SELECT dropdown:

def select_state_option ( self , value , dowait = True ): ''' Select state value from dropdown. Wait until district dropdown has loaded before returning. ''' path = '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' district_select_elem = self . driver . find_element_by_xpath ( path ) def district_select_updated ( driver ): try : district_select_elem . text except StaleElementReferenceException : return True except : pass return False state_select = self . get_state_select () state_select . select_by_value ( value ) if dowait : wait = WebDriverWait ( self . driver , 10 ) wait . until ( district_select_updated ) return self . get_state_select ()

This method selects a state value and then waits for the district options to load. It determines when the district options have loaded by:

Getting a reference to the district SELECT element Selecting an option in the state select dropdown Waiting for the district SELECT element from step 1 to return a StaleElementReferenceException when we reference its text attribute.

When we select a state, the DOM gets updated with new district options corresponding to the chosen state. That causes any old references to the district SELECT to become “stale.”

That means there’s an easy litmus test to determine when a select dropdown has fininshed loading: wait for existing references to it to become stale.

We’ll repeat this same pattern for the district and project SELECT elements:

def select_district_option ( self , value , dowait = True ): ''' Select district value from dropdown. Wait until district dropdown has loaded before returning. ''' path = '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' district_select_elem = self . driver . find_element_by_xpath ( path ) def district_select_updated ( driver ): try : district_select_elem . text except StaleElementReferenceException : return True except : pass return False district_select = self . get_district_select () district_select . select_by_value ( value ) if dowait : wait = WebDriverWait ( self . driver , 10 ) wait . until ( district_select_updated ) return self . get_district_select ()

The project dropdown doesn’t trigger any updates when we select a value, so its implementation is very simple:

def select_project_option ( self , value , dowait = True ): project_select = self . get_project_select () project_select . select_by_value ( value ) return self . get_project_select ()

Now we have a working implementation. This is version 1 of our scraper. You can view its source at:

https://github.com/thayton/icds/blob/master/v1.scraper.py

Let’s try running it:

$ ./v1.scraper.py Andaman & Nicobar Islands Nicobars North & Middle Andaman South Andaman Andhra Pradesh Adilabad Adilabad U Asifabad Bellampally Boath Chennur JAINOOR Kaghaznagar Khanapur ...

Refactoring

In this section, I’m going to show how we can refactor the code we just developed into a more concise implementation.

First, the get_state_select, get_district_select and get_state_select methods can all be replaced with a generic get_select method that takes the xpath of a SELECT element as an argument.

def get_select ( self , xpath ): select_elem = self . driver . find_element_by_xpath ( xpath ) select = Select ( select_elem ) return select

Next, if you take a look at the code for selecting the state and district option values (select_state_option and select_district_option) you might notice that both methods are repeating the following pattern:

Select an option

Wait for some other SELECT element’s options to load

This pattern appears frequently in forms with SELECT elements whose option values are dynamically generated. We can refactor this pattern into something more generic.

We’ll replace the following methods:

select_state_option

select_district_option

select_project_option

with a new generic select_option method.

select_option will take the xpath of a SELECT element, an option value to choose, and the xpath of another SELECT element that we’ll wait to see updated before returning:

def select_option ( self , xpath , value , waitfor_elem_xpath = None ): if waitfor_elem_xpath : func = make_waitfor_elem_updated_predicate ( self . driver , waitfor_elem_xpath ) select = self . get_select ( xpath ) select . select_by_value ( value ) if waitfor_elem_xpath : wait = WebDriverWait ( self . driver , 10 ) wait . until ( func ) return self . get_select ( xpath )

The select_option method relies on make_waitfor_elem_updated_predicate to create a function that can be passed as an argument to Selenium’s WebDriverWait until method.

def make_waitfor_elem_updated_predicate ( driver , waitfor_elem_xpath ): elem = driver . find_element_by_xpath ( waitfor_elem_xpath ) def elem_updated ( driver ): try : elem . text except StaleElementReferenceException : return True except : pass return False return lambda driver : elem_updated ( driver )

The function returned by make_waitfor_elem_updated_predicate determines whether an element has been updated in the DOM by checking if the reference to that element is stale. It gets called over and over by wait until it either returns True or a timeout occurs.

With select_option we can now use the following code to select a state and have it wait for the district SELECT element to get udpated before returning:

# '35' is the option value of the state "Andaman & Nicobar Islands" self . select_option ( '//select[@id="ctl00_ContentPlaceHolder1_dropstate"]' , '35' , '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' )

Now let’s revisit the states, districts and projects generators. They all follow the pattern:

Get a reference to a SELECT element

Generate its list of values

Iterate through those values selecting each option as we go

We encapsulate this logic in a new method make_select_option_iterator that returns a generator function. All we need to supply are the xpath of the SELECT element whose options we want to iterate over, and the xpath of the SELECT whose options get updated every time a value from the first SELECT gets chosen:

def make_select_option_iterator ( self , xpath , waitfor_elem_xpath ): def next_option ( xpath , waitfor_elem_xpath ): select = self . get_select ( xpath ) select_option_values = [ '%s' % o . get_attribute ( 'value' ) for o in select . options if o . text != '-Select-' ] for v in select_option_values : select = self . select_option ( xpath , v , waitfor_elem_xpath ) yield select . first_selected_option . text return lambda : next_option ( xpath , waitfor_elem_xpath )

next_option is the generator that iterates through the option values. We use a lambda function to generate a closure around the xpath arguments we pass to make_select_option_iterator.

Now we can make generators for the states, districts and projects with just a few lines of code:

states = self . make_select_option_iterator ( '//select[@id="ctl00_ContentPlaceHolder1_dropstate"]' , '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' ) districts = self . make_select_option_iterator ( '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' , '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' ) projects = self . make_select_option_iterator ( '//select[@id="ctl00_ContentPlaceHolder1_dropproject"]' , None )

Here’s our final implementation. It’s much more concise than our earlier version:

#!/usr/bin/env python import sys import signal from selenium import webdriver from selenium.webdriver.support.ui import Select from selenium.webdriver.support.ui import WebDriverWait from selenium.common.exceptions import StaleElementReferenceException def sigint ( signal , frame ): sys . exit ( 0 ) def make_waitfor_elem_updated_predicate ( driver , waitfor_elem_xpath ): elem = driver . find_element_by_xpath ( waitfor_elem_xpath ) def elem_updated ( driver ): try : elem . text except StaleElementReferenceException : return True except : pass return False return lambda driver : elem_updated ( driver ) class Scraper ( object ): def __init__ ( self ): self . url = 'http://icds-wcd.nic.in/icds/icdsawc.aspx' self . driver = webdriver . PhantomJS () self . driver . set_window_size ( 1120 , 550 ) def get_select ( self , xpath ): select_elem = self . driver . find_element_by_xpath ( xpath ) select = Select ( select_elem ) return select def select_option ( self , xpath , value , waitfor_elem_xpath = None ): if waitfor_elem_xpath : func = make_waitfor_elem_updated_predicate ( self . driver , waitfor_elem_xpath ) select = self . get_select ( xpath ) select . select_by_value ( value ) if waitfor_elem_xpath : wait = WebDriverWait ( self . driver , 10 ) wait . until ( func ) return self . get_select ( xpath ) def make_select_option_iterator ( self , xpath , waitfor_elem_xpath ): def next_option ( xpath , waitfor_elem_xpath ): select = self . get_select ( xpath ) select_option_values = [ '%s' % o . get_attribute ( 'value' ) for o in select . options if o . text != '-Select-' ] for v in select_option_values : select = self . select_option ( xpath , v , waitfor_elem_xpath ) yield select . first_selected_option . text return lambda : next_option ( xpath , waitfor_elem_xpath ) def load_page ( self ): self . driver . get ( self . url ) def page_loaded ( driver ): path = '//select[@id="ctl00_ContentPlaceHolder1_dropstate"]' return driver . find_element_by_xpath ( path ) wait = WebDriverWait ( self . driver , 10 ) wait . until ( page_loaded ) def scrape ( self ): states = self . make_select_option_iterator ( '//select[@id="ctl00_ContentPlaceHolder1_dropstate"]' , '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' ) districts = self . make_select_option_iterator ( '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' , '//select[@id="ctl00_ContentPlaceHolder1_dropdistrict"]' ) projects = self . make_select_option_iterator ( '//select[@id="ctl00_ContentPlaceHolder1_dropproject"]' , None ) self . load_page () for state in states (): print state for district in districts (): print 2 * ' ' , district for project in projects (): print 4 * ' ' , project if __name__ == '__main__' : signal . signal ( signal . SIGINT , sigint ) scraper = Scraper () scraper . scrape ()

You can see all of the code developed in this article on GitHub at the repository:

https://github.com/thayton/icds

Shameless Plug

Have a scraping project you’d like done? I’m available for hire. Contact me for a free quote.