Implementation notes¶

At this point we've used pattern to extract the table we want from the HTML we scraped from the Pro Football Reference website.

We now want to build a dataframe with the year as the row label and the table's column headers as the column labels.

Here is what's going on with this code. When the pattern module parsed the HTML, it returned the table that matched the id we provided. Our next step is to get all of the elements of type ('tr'), which are the rows of the table. Keep in mind that the result is returned as a list. Since we're only interested in the headers, which are in the first row, we pull out just the first row.

table[0]('tr')[0]

Now we want the actual text in each cell (i.e. the actual column labels). Selecting ('th') returns a list of all the ('th') elements.

th_s = table[0]('tr')[0]('th')

That gives us a list of elements that look like this:

<th><span style="font-size: small;">PLAYER</span></th>
<th><span style="font-size: small;">COLLEGE</span></th>
<th><span style="font-size: small;">POS</span></th>

Now that we have a list of the ('th') elements, we extract the actual text. Concretely, if we have the element

<th><span style="font-size: small;">PLAYER</span></th>

then we want to extract the text "PLAYER", which we accomplish with the following code. Any cell that has no ('span') falls back to u"n/a".

[th('span')[0].content if len(th('span')) != 0 else u"n/a" for th in th_s]
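To make the extraction rule concrete without needing pattern installed, here is a minimal sketch of the same idea using Python's standard-library `html.parser` as a stand-in. It is not the pattern API used in these notes: the class `HeaderLabelParser`, the helper `header_labels`, and the empty fourth header cell in the sample are all hypothetical names and data added for illustration. It keeps the text of the first span in each header cell and falls back to "n/a" when there is none.

```python
from html.parser import HTMLParser


class HeaderLabelParser(HTMLParser):
    """Collect the text of the first <span> in each <th>, or "n/a" if none."""

    def __init__(self):
        super().__init__()
        self.labels = []          # one label per <th> encountered
        self._in_th = False
        self._in_span = False
        self._span_text = None    # stays None until a <span> opens in this <th>

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self._span_text = None
        elif tag == "span" and self._in_th and self._span_text is None:
            # Only the first span in the cell is recorded.
            self._in_span = True
            self._span_text = ""

    def handle_data(self, data):
        if self._in_span:
            self._span_text += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "th" and self._in_th:
            # Same fallback as the list comprehension: no span means "n/a".
            label = self._span_text if self._span_text is not None else "n/a"
            self.labels.append(label)
            self._in_th = False


def header_labels(row_html):
    """Return the list of column labels found in a header row's HTML."""
    parser = HeaderLabelParser()
    parser.feed(row_html)
    parser.close()
    return parser.labels


# Sample header row; the last, empty <th> is an assumption added to
# exercise the "n/a" fallback.
sample_row = (
    '<tr>'
    '<th><span style="font-size: small;">PLAYER</span></th>'
    '<th><span style="font-size: small;">COLLEGE</span></th>'
    '<th><span style="font-size: small;">POS</span></th>'
    '<th></th>'
    '</tr>'
)
labels = header_labels(sample_row)
print(labels)
```

The one-line pattern version above is shorter because pattern has already built the element tree; this sketch only shows what that line is doing for each cell.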

Implementation note: Going through the rows¶

We're going to build a dictionary of dictionaries. The outer key is the player name, the inner key is the column label, and each value is the combine result for that category.
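As a rough sketch of that structure, the snippet below builds the nested dictionary from a list of column labels and a list of rows that have already been reduced to plain text. The function name `rows_to_nested_dict`, its `key_label` parameter, and the placeholder rows ("Player A", "College A", and so on) are assumptions made for illustration; they are not real combine data and not the code used in these notes.

```python
def rows_to_nested_dict(labels, rows, key_label="PLAYER"):
    """Map each player name to a dict of {column label: cell value}."""
    key_index = labels.index(key_label)
    result = {}
    for cells in rows:
        player = cells[key_index]
        # Inner dictionary: every column except the one used as the outer key.
        result[player] = {
            label: value
            for label, value in zip(labels, cells)
            if label != key_label
        }
    return result


# Column labels as extracted from the header row.
labels = ["PLAYER", "COLLEGE", "POS"]

# Placeholder rows (hypothetical values, one list of cell texts per player).
rows = [
    ["Player A", "College A", "QB"],
    ["Player B", "College B", "WR"],
]

combine = rows_to_nested_dict(labels, rows)
print(combine["Player A"]["POS"])
```

A lookup then reads as `combine[player][column]`. If pandas is available, `pandas.DataFrame.from_dict(combine, orient="index")` should turn a dictionary shaped like this into a frame with one row per outer key.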

Extracting data and filling it into a dictionary¶