Packages required

These are the packages I used

import csv
from bs4 import BeautifulSoup
import pandas as pd
import requests

csv allows you to create and manipulate CSV files.

BeautifulSoup is the library used to parse the HTML and pull out the data, i.e. the web scraping part.

Pandas will be used to create a dataframe to put our results into a table.

Requests is used to send HTTP requests: it fetches a web page and returns its contents.
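
If you don’t already have them, beautifulsoup4, pandas and requests (plus lxml, the parser BeautifulSoup is pointed at below) can all be installed with pip; csv is part of Python’s standard library.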

Finding all the PMQ records

First I needed to collect all the Hansard debate records for PMQs.



hansardurls = []

# loop through the 19 pages of search results
for i in range(1,20):
    url = 'https://hansard.parliament.uk/search/Debates?endDate=2019-10-28&house=Commons&searchTerm=%22Prime+Minister%22&startDate=2009-06-23&page={}&partial=true'.format(i)
    rall = requests.get(url)
    r = rall.content
    soup = BeautifulSoup(r,"lxml")
    # each search result title is an <a class="no-underline"> element
    titles = soup.find_all('a',class_="no-underline")
    for t in titles:
        # keep only the debates titled "Prime Minister", i.e. PMQs
        if t['title'].lower() == "prime minister [house of commons]":
            # the href is relative, so prepend the domain to make a full URL
            hurl = 'https://hansard.parliament.uk'+t['href']
            hansardurls.append(hurl)

print(len(hansardurls))

I went onto https://hansard.parliament.uk and clicked Find Debates, then searched for “Prime Minister” within the dates John Bercow has been Speaker. This gave me a set of search results 19 pages long.

Results set from Hansard

The code starts by creating an empty list called ‘hansardurls’, where I put the URLs for PMQ records.

Then the code loops through the 19 pages of search results. Each loop does the following:

Requests the URL and returns the content of the page requested

Processes the HTML using BeautifulSoup

Searches through the HTML to find the links to each debate on that page and assigns them to the variable ‘titles’. I found the links by going onto a results page, right-clicking on a debate title and selecting ‘Inspect’. This takes you to the console, which shows you the HTML of that page element (a sketch of the kind of markup involved follows below).

Keeps only those debates whose title is exactly “Prime Minister [House of Commons]”. There were some debate titles which mentioned the Prime Minister but were not PMQs, so these have been excluded.

Some of the titles were in upper case, so all of the titles are put into lower case using .lower() in order to catch all relevant debate titles

The ‘if’ statement says that if the title of the debate is “prime minister [house of commons]”, create a full URL for that debate and add it to the list called ‘hansardurls’. The links in the title element do not have “hansard.parliament.uk” in front of them, so this needs to be added first to make them usable links.

The last line prints out the number of PMQ URLs found and added to the list: 327.
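
For illustration, this is roughly what one of those result links looks like in the Inspect panel and how BeautifulSoup reads it. The markup below is invented for the example; only the tag name, class and attributes match what the selector above looks for.

from bs4 import BeautifulSoup

# hypothetical markup for a single search result link
sample_html = '''
<a class="no-underline" title="Prime Minister [House of Commons]"
   href="/Commons/example-debate/PrimeMinister">Prime Minister</a>
'''

link = BeautifulSoup(sample_html, "lxml").find('a', class_="no-underline")
print(link['title'])   # "Prime Minister [House of Commons]"
print(link['href'])    # a relative path, hence prepending the domain above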

Export the PMQ URLs

I wanted the PMQ URLs in a separate CSV, the reason for which I will go into later.

with open('hansardurls.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(hansardurls)

This code creates a file called ‘hansardurls.csv’ and writes the list ‘hansardurls’ to it as a single row, with each URL in its own cell.
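
If you would rather have one URL per line, a small variation along these lines should do it (a sketch, not the code I ran):

with open('hansardurls.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerows([u] for u in hansardurls)   # one single-cell row per URL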

Find the Speaker Contributions in each PMQ session

Now that I had links for all the PMQ sessions in John Bercow’s speakership, I wanted to look at each one and see whether a Speaker made a contribution and, if so, whether it was John Bercow or a deputy.

Speakercontrib = []

Speakingtime = []

urloftime = []

First I created 3 separate lists: one to hold the name of the Speaker speaking in the question time, one to hold the date on which the Member was speaking and one to hold the URL of the PMQs in which they were speaking.

for h in hansardurls:
    rall = requests.get(h)
    r = rall.content
    soup = BeautifulSoup(r,"lxml")
    # the date of the debate, shown at the top of the page
    time = soup.find('div',class_="col-xs-12 debate-date").text
    # each contribution heading sits in an <h2 class="memberLink"> element
    contributors = soup.find_all('h2',class_="memberLink")



I then wrote a for loop to request each PMQ URL, return the HTML of that page, then use BeautifulSoup to find the elements on the page that correspond to the names of Members speaking. I found this by opening a PMQs page, right-clicking on a Member’s name and finding what sort of element that name is and the class of that element.

The screen you see after clicking inspect on a page element.
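
As a rough illustration of what that panel shows for a contribution heading (the markup here is made up; only the tag and class match what the script searches for):

from bs4 import BeautifulSoup

# hypothetical markup for one contribution heading on a debate page
sample = '''
<h2 class="memberLink">
  <a href="/example-member-link">Example Member</a>
</h2>
'''

c = BeautifulSoup(sample, "lxml").find('h2', class_="memberLink")
print(c.find('a').text)   # "Example Member"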

# (this loop sits inside the 'for h in hansardurls:' loop above)
for c in contributors:
    link = c.find('a')
    try:
        member = link.text
    except:
        # contributions with no named Member, e.g. "Hon. Members rose"
        print(c)
        continue
    if "Speaker" in member:
        Speakercontrib.append(member)
        Speakingtime.append(time)
        urloftime.append(h)

I then ran a for loop across each Member’s name on the PMQs page. This loop does the following:

Finds the link within the Member’s name page element. The actual link which contains the Member’s name text and ID is nested inside the h2 tag as seen in the screenshot above.

Tries to get the text of the link, which contains the Member’s name. I’ve used a try statement here because there are contributions in Hansard which have no name attached. These are usually when several Members rose or called out at the same time. Without the try statement, this script would stop whenever it came across a “Hon. Members rose” contribution. The except clause prints out the contribution element, so I can see whether it was a “Members rose” contribution or some other error occurring, and then moves on to the next contribution.

An example of the phantom contribution, “Members rose — “

Runs an if statement that says if the member’s name contains “Speaker”, add the name, the date of the contribution and the URL of the page to their respective lists.

Create a table

All the data has been collected from Hansard, but it is in 3 separate lists. I used Pandas to create a dataframe and added the 3 lists into the dataframe to create a usable table of results.

speakersdf = pd.DataFrame(
    {'Date': Speakingtime,
     'Speaker': Speakercontrib,
     'url': urloftime
    })

This code creates a dataframe called ‘speakersdf’ and adds in the three lists as columns.

I then exported the dataframe as a CSV file:

speakersdf.to_csv('speakerspmqdf.csv')

I was able to go through the spreadsheet and find that John Bercow had spoken in 319 PMQs and Lindsay Hoyle, Deputy Speaker, had spoken in 1 PMQ session.
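
For reference, the same per-session counts can be pulled straight from the dataframe rather than from the spreadsheet (a sketch, assuming the column names above):

# count how many distinct PMQ pages each matched name appears on
print(speakersdf.groupby('Speaker')['url'].nunique())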

But wait… there were 327 PMQ URLs… that leaves 7 PMQs missing.

This is why I exported the PMQ URLs in a separate document. I took the URLs from the speakersdf table and compared them to the URLs in the hansardurls list (I did this in Excel), and found the 7 missing PMQ URLs.
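
The same comparison could also be done in Python with a set difference, assuming the lists from earlier are still in memory (a sketch):

# URLs that never appear in the Speaker-contribution results
missing = set(hansardurls) - set(urloftime)
print(len(missing))
for m in sorted(missing):
    print(m)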

When I went to these 7 webpages, I found that the Speaker had not made a contribution in those sessions, so they were not picked up by the script. In these cases I manually went through the Hansard record and found which Speaker was in the chair before PMQs started.

The final result:

John Bercow chaired 326 PMQ sessions in his tenure (to 28/10/2019)*

*This is subject to the data I scraped from Hansard being complete and correct and my workflow for collecting this data being accurate. Please don’t take it as gospel truth.