Actuarial Articles

Scraping a directory of PDF files with Python

Making a data set with the names of people who passed their actuarial exams

Data Set and Use Case

In the actuarial world you take exams for pay raises and career progression. The professional organization that administers these exams publishes PDF files listing the candidates who passed each sitting. Someone is scraping these files and running a service where you can look up actuaries and see which exams they have passed. If you type my name into http://www.actuarial-lookup.com/ you will see that I have passed an actuarial exam.

The website only lets you search by name. It would be more useful to some people (like recruiters) to have the underlying data in CSV format, so that they can find individuals whose exam progress and pace fit a given job opening. The data used in the actuarial lookup website was collected from the Society of Actuaries website.

Unfortunately, the data isn't posted as CSV files. Each downloadable zip contains a number of folders, and within each folder are PDF files with the names and candidate numbers for each exam sitting.

Making a Single Data Set

Here is what the data looks like in the PDFs containing exam passer names.

We can use PyPDF2 to extract text from the PDF and regular expressions to parse out the names. Here’s a function that takes in the path to a PDF file and returns a list containing the name of each exam passer.
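Since the original code isn't shown, here is a minimal sketch of that function. It assumes the PyPDF2 3.x API (`PdfReader`, `extract_text`) and that each passer appears in the extracted text as a numbered entry like "1. Abadie, Christopher", as the sample output below suggests; the function and regex names are my own.

```python
import re

try:
    from PyPDF2 import PdfReader  # assumes the PyPDF2 3.x API
except ImportError:               # allows the parsing helper to be used alone
    PdfReader = None

def parse_passer_entries(text):
    """Pull numbered name entries like '1. Abadie, Christopher\n' out of raw text."""
    return re.findall(r"\d+\.\s*[A-Za-z'\- ]+,\s*[A-Za-z'\- ]+\n", text)

def extract_passer_names(pdf_path):
    """Return the raw numbered name entries from one exam results PDF."""
    text = "".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return parse_passer_entries(text)
```

Splitting the regex into its own helper makes the parsing testable without a PDF on hand.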

This step yields an array of exam-taker names; here is what the first element looks like:

'1. Abadie, Christopher\n'

Now we are going to write a function that strips out everything but the first and last name.
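A sketch of that formatting function, assuming entries shaped like the sample above (a leading number, then "Last, First" with a trailing newline); the function name is my own.

```python
import re

def format_name(raw_entry):
    """Turn a raw entry like '1. Abadie, Christopher\n' into ['Abadie', 'Christopher']."""
    cleaned = re.sub(r"^\d+\.\s*", "", raw_entry).strip()  # drop the list number
    return [part.strip() for part in cleaned.split(",")]   # split 'Last, First'
```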

Here is the first element of the array of formatted names.

['Abadie', 'Christopher']

Now we just need to scrape the names from every single PDF. First we create a list of the paths to the files.
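One way to build that list is with the standard library's glob module. The directory layout here ("edu-names-2018-modified" with one folder per exam) is an assumption based on the sample path shown below.

```python
import glob

# Collect every results PDF one level down inside the unzipped directory,
# e.g. 'edu-names-2018-modified/Exam C/edu-2018-02-c-names-ajl65e.pdf'.
examFiles = glob.glob("edu-names-2018-modified/*/*.pdf")
```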

This makes an array called examFiles that contains the paths to the PDFs containing names of exam passers. Here is the first element of the examFiles array.

'edu-names-2018-modified/Exam C/edu-2018-02-c-names-ajl65e.pdf'

We then scrape every name from every file.
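A sketch of that loop. It assumes the exam name can be read from the folder and the sitting date (like "2018-02") from the file name, matching the sample path above; the PDF-reading step is passed in as a function so the sketch stays self-contained, and all helper names are my own.

```python
import os
import re

def exam_metadata(pdf_path):
    """Derive the exam name and sitting date from a results-file path,
    assuming layout like 'edu-names-2018-modified/Exam C/edu-2018-02-c-names-....pdf'."""
    exam = os.path.basename(os.path.dirname(pdf_path))               # e.g. 'Exam C'
    sitting = re.search(r"\d{4}-\d{2}", os.path.basename(pdf_path)).group()
    return exam, sitting

def scrape_all(examFiles, names_from_pdf):
    """Build rows of [last, first, exam, sitting] for every file.

    names_from_pdf is a caller-supplied function that returns
    [last, first] pairs for a single PDF."""
    allNames = []
    for path in examFiles:
        exam, sitting = exam_metadata(path)
        for last, first in names_from_pdf(path):
            allNames.append([last, first, exam, sitting])
    return allNames
```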

The first entry in the allNames array is:

['Ab Manan', 'Muhd Azman Firdaus', 'Exam C', '2018-02']

We can convert this list of lists into a pandas DataFrame, which makes it easy to aggregate and analyze. The following code revealed that 4793 students passed Exam P, more than any other exam.
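A minimal pandas sketch of that aggregation. The two-row sample stands in for the full allNames list, and the column names are my own; on the full data this count is what puts Exam P on top with 4793.

```python
import pandas as pd

# Illustrative sample only; the real allNames comes from scraping every PDF.
allNames = [
    ["Abadie", "Christopher", "Exam C", "2018-02"],
    ["Ab Manan", "Muhd Azman Firdaus", "Exam C", "2018-02"],
]

df = pd.DataFrame(allNames, columns=["last", "first", "exam", "sitting"])
passes_per_exam = df["exam"].value_counts()  # passer count per exam, descending
```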

I have always heard that a big part of a data scientist's job is getting data. This surprised me: did these data scientists not have a database to query? Now I understand that someone has to pull the data from somewhere, and that getting data into a reasonable format is often a project of its own. The next steps in this project are to perform data quality checks, include more years, and make the CSV data available to everyone, so that interested parties can run their own analyses without too much extra work.