While working on a side project, I needed to scrape some data from Wikipedia into CSVs. As I wrote the script, I ran into three hurdles: handling multiple tables in an article, data cells that span multiple rows, and removing footnotes from cells.

You can find the finished script on Github.

Handling multiple tables in an article

Not every table on a Wikipedia page has valuable data for analysis. For example, the last table on the volcano height article only has 3 data points, one of which is ‘over 9,000 meters’.

Luckily, there are two CSS selectors Wikipedia uses only on larger tables – sortable and plainrowheaders . Beautiful Soup makes it easy to filter for only these tables in Python. You can test the below code using Python 3 after installing the bs4 and requests packages from pip.

import requests from bs4 import BeautifulSoup WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation" req = requests . get ( WIKI_URL ) soup = BeautifulSoup ( req . content , 'lxml' ) table_classes = { "class" : [ "sortable" , "plainrowheaders" ]} wikitables = soup . findAll ( "table" , table_classes ) print ( wikitables )

Data cells that span multiple rows

Sometimes a data cell spans multiple rows in a table. In theory this is a presentation-level detail, and the underlying tabular data would have the value duplicated across those rows. It’s a nice theory, but HTML decided to go a different route.

Examine this table in the article on highest-grossing films by year. The year 1925 has data from two or more films, and the Year column is stretched across two rows to reflect this. This is the underlying HTML table structure of those two rows:

<tr> <th scope= "row" rowspan= "2" > ... </th> <td> ... </td> <td align= "right" > ... </td> <td align= "right" > ... </td> <td align= "text-align:center" > ... </td> </tr> <tr> <td> ... </td> <td align= "right" > ... </td> <td align= "right" > ... </td> <td align= "text-align:center" > ... </td> </tr>

So when you parse this table, some rows have 5 columns, and some rows have 4 columns. Great.

To solve this, create a list for each table being parsed to track these pesky rowspan elements. Use the first row of the table to define the number of columns.

saved_rowspans = [] for row in table . findAll ( "tr" ): cells = row . findAll ([ "th" , "td" ]) if len ( saved_rowspans ) == 0 : saved_rowspans = [ None for _ in cells ]

As the script parses each row, check the row for any elements with a rowspan attribute. If such an element exists, save the value of the cell and the number of rows it spans to a dictionary. Save this dictionary to the saved_rowspans list at its current position.

(I chose a dictionary because the rows_left value decrements over time, so it needs to be mutable. A Counter-like class would also be a good choice.)

for index , cell in enumerate ( cells ): if cell . has_key ( "rowspan" ): rowspan_data = { 'rows_left' : int ( cell [ "rowspan" ]), 'value' : cell , } saved_rowspans [ index ] = rowspan_data

To use the saved rowspan elements, the loop checks if the number of cells in the current row matches the total number of columns in the table. If it doesn’t, it inserts the value from the same cell in saved_rowspans , and decrements its rows_left . If this is the last row the cell spans into, it sets the saved_rowspans for that cell back to None .

elif len ( cells ) != len ( saved_rowspans ): for index , rowspan_data in enumerate ( saved_rowspans ): if rowspan_data is not None : # Insert the data from previous row; decrement rows left cells . insert ( index , rowspan_data [ 'value' ]) if saved_rowspans [ index ][ 'rows_left' ] == 1 : saved_rowspans [ index ] = None else : saved_rowspans [ index ][ 'rows_left' ] -= 1

Now our table actually has the proper number of columns in each row, with values in the previously blank cells. Stupid HTML.

The final step was to remove footnotes, references, and other garbage from the text. These are important pieces of information for the broader analysis, but do not belong in the tabular data.

I passed each cell of HTML parsed with BeautifulSoup through the following steps to remove these items.

# Strip references from the cell references = cell . findAll ( "sup" , { "class" : "reference" }) if references : for ref in references : ref . extract () # Strip sortkeys from the cell sortkeys = cell . findAll ( "span" , { "class" : "sortkey" }) if sortkeys : for ref in sortkeys : ref . extract () # Strip footnotes from text and join into a single string text_items = cell . findAll ( text = True ) no_footnotes = [ text for text in text_items if text [ 0 ] != '[' ] puretext = '' . join ( no_footnotes ) # Replace non-breaking spaces with regular spaces and add quotes puretext = puretext . replace ( ' \xa0 ' , ' ' )

After that, each cell is surrounded with double quotes and joined into a comma-delimited string, then written to the CSV file.

Final product

The final script can be found on Github, along with the CSV outputs from four articles that I used for testing (e.g. top-grossing films). It’s been reliable for my needs so far, but I’ll update it as I run into new bugs.

If you enjoyed this article, please consider contributing to or starring my project on Github.