Today, I suddenly thought of an idea, to extract the performance of individual players from a scorecard displayed on a website. This was conceived as part of improving the cricket simulation web application I developed (‘Freepl’) but that”s on a different context

So I downloaded a scorecard from Cricinfo and started analyzing the page and it was pretty nicely structured with each data being stored in tags having proper CSS classes and ids.

e.g. the name of a batsman batting in the 1st innings could be found in this hierarchy: “#inningsBat1 .inningsRow .playerName” (for those who do not have much idea about CSS selectors, a ‘.’ represents a class and a ‘#’ represents an id These selectors are applied to different HTML tags for styling and javascript accessibility.) .

I first started testing out the patterns using jquery and it was really easy getting all the information I needed. But my objective was to do this in a back end technology which would also update databases on the go. For this I turned to django/python and while I was searching for a suitable HTML/XML parser I came across the HTMLparser module. It was good but didn’t offer the power of the jquery selectors. That’s when I came across Pyquery. Its a module which offers the functionality of jquery in python. Furthermore it was created with the same mindset as mine which is missing the power of Jquery on python.

It offers a similar kind of functions like jquery, not all of them yet, but a lot of them. The complete api reference is available here. e.g. the following python code would open the scorecard available here and extract the name and runs scored of the batsman batting at number 5 in the first innings .

from pyquery import PyQuery as pq

d = pq(url = ‘http://www.espncricinfo.com/champions-league-twenty20-2012/engine/match/574265.html‘)

name = d(“#inningsBat1 .inningsRow”).eq(4).find(“.playerName”).find(“a”).html()

runs = d(“#inningsBat1 .inningsRow”).eq(4).find(“.battingRuns”).html()

print “Name: ” + name + “

Runs Scored: “+runs

which would print

Name: S Badrinath

Runs Scored: 2

which is the case.

Now for a little explanation for the code:

d = pq(url = ‘….. line opens the url and defines a PyQuery type object d, in this case. Now once we have defined that , d can be used just as we use ‘$’ in jquery.

the name looks for the 5th entry(indexed from 0) found using the “#inningsBat1 .inningsRow” and finds the “playerName” class within it and gets the html content of the anchor(<a>) tag using the html() method.

Similarly for the runs scored.

This would greatly help in extracting data from sites later on and in many other tasks , as the best of both worlds have combined , the power of Jquery selectors within Django