A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try Tika, a Python port of the Apache Tika library. Tika parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I decided to write a blog post about parsing PDFs with it.

California Budget PDFs

To demonstrate parsing PDFs with Tika, I needed some PDFs. While thinking about which ones to use, I remembered a blog post I’d read on scraping budget data from a government website. Governments also publish data in PDF format, so I decided to demonstrate how to parse data from PDFs available on a government website. With these two blog posts, you have examples of acquiring government data whether it’s embedded in HTML or in PDFs. The three PDFs we’ll parse in this post are:

2015-16 State of California Enacted Budget Summary Charts

2014-15 State of California Enacted Budget Summary Charts

2013-14 State of California Enacted Budget Summary Charts

Each of these PDFs contains several tables that summarize total revenues and expenditures, general fund revenues and expenditures, expenditures by agency, and revenue sources. For this post, let’s extract the data on expenditures by agency and revenue sources. In the 2015-16 Budget PDF, the titles for these two tables are:

2015-16 Total State Expenditures by Agency

2015-16 Revenue Sources

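To follow along with the rest of this tutorial you’ll need to download the three PDFs and install Tika. All three files download as SummaryCharts.pdf, so give each one a distinct name as you save it; the examples below refer to the 2015-16 file as 2015-16CABudgetSummaryCharts.pdf. You can download the three PDFs here: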

http://www.ebudget.ca.gov/2015-16/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

http://www.ebudget.ca.gov/2014-15/pdf/Enacted/BudgetSummary/SummaryCharts.pdf

http://www.ebudget.ca.gov/2013-14/pdf/Enacted/BudgetSummary/SummaryCharts.pdf
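
If you’d rather script the download, here is a minimal sketch that fetches the three files and gives each a year-prefixed name. It assumes the URLs above still resolve and that you’re on Python 3 (with a fallback import for Python 2). The helper names local_name and download_budget_pdfs are my own, and the names for the 2014-15 and 2013-14 files simply follow the pattern of the 2015-16 file name used later in this post.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# URL pattern shared by the three budget PDFs listed above
URL_TEMPLATE = ("http://www.ebudget.ca.gov/{year}/pdf/Enacted/"
                "BudgetSummary/SummaryCharts.pdf")
BUDGET_YEARS = ["2015-16", "2014-15", "2013-14"]


def local_name(year):
    """Build a distinct local file name (hypothetical naming scheme)."""
    return "{}CABudgetSummaryCharts.pdf".format(year)


def download_budget_pdfs(folder="."):
    """Download each PDF into folder unless a copy is already there."""
    for year in BUDGET_YEARS:
        target = os.path.join(folder, local_name(year))
        if not os.path.exists(target):
            urlretrieve(URL_TEMPLATE.format(year=year), target)
```

Calling download_budget_pdfs() from the folder where you plan to run the rest of the examples should leave three distinctly named PDFs there.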

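You can install Tika by running the following command in a Terminal window. The Python package runs Apache Tika on Java behind the scenes, so you’ll also need a Java runtime installed: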

pip install --user tika

IPython

Before we dive into parsing all of the PDFs, let’s use one of the PDFs, 2015-16CABudgetSummaryCharts.pdf, to become familiar with Tika and its output. We can use IPython to explore Tika’s output interactively:

ipython

from tika import parser

parsedPDF = parser.from_file("2015-16CABudgetSummaryCharts.pdf")

You can type the name of the variable, a period, and then hit tab to view a list of all of the methods available to you:

parsedPDF.

There are many options related to keys and values, so it appears the variable contains a dictionary. Let’s view the dictionary’s keys:

parsedPDF.keys()

The dictionary’s keys are metadata and content (newer releases of the tika package may also include a status key). Let’s take a look at the values associated with these keys:

parsedPDF["metadata"]

The value associated with the key “metadata” is another dictionary. As you’d expect based on the name of the key, its key-value pairs provide metadata about the parsed PDF.

Now let’s take a look at the value associated with “content”.

parsedPDF["content"]

The value associated with the key “content” is a string. As you’d expect, the string contains the PDF’s text content.
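
To make that exploration repeatable outside IPython, here is a small sketch that summarizes a Tika-style result dictionary. The describe_parsed helper is hypothetical, and fake_parsed is an invented stand-in for the output of parser.from_file so the example runs without Tika or a PDF; the real metadata and content will differ. It’s written for Python 3, where the content is a str.

```python
def describe_parsed(parsed, preview_chars=40):
    """Return a short description of each value in a Tika-style dict."""
    summary = {}
    for key, value in parsed.items():
        if isinstance(value, dict):
            summary[key] = "dict with {} keys".format(len(value))
        elif isinstance(value, str):
            # Collapse whitespace so the preview fits on one line
            preview = " ".join(value.split())[:preview_chars]
            summary[key] = "str of {} chars: {!r}".format(len(value), preview)
        else:
            summary[key] = type(value).__name__
    return summary


# Invented stand-in for parser.from_file("2015-16CABudgetSummaryCharts.pdf")
fake_parsed = {
    "metadata": {"Content-Type": "application/pdf", "xmpTPg:NPages": "7"},
    "content": "\n\n2015-16 Revenue Sources\n(Dollars in Millions)\n",
}

for key, description in sorted(describe_parsed(fake_parsed).items()):
    print(key, "->", description)
```

With a real PDF you would pass the dictionary returned by parser.from_file instead of fake_parsed.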

Now that we know the types of objects and values Tika provides, let’s write a Python script to parse all three PDFs. The script iterates over the PDF files in a folder. For each one, it parses the text from the file, selects the lines of text associated with the expenditures by agency and revenue sources tables, converts the selected lines into Pandas DataFrames, displays each DataFrame, and creates and saves a horizontal bar plot of the totals column for the expenditures and the revenues. After you run the script, you’ll have six new plots (one for revenues and one for expenditures for each of the three PDF files) in the folder in which you ran the script.

Python Script

To parse the three PDFs, create a new Python script named parse_pdfs_with_tika.py and add the following lines of code:

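#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import os
import re
import sys

import pandas as pd
import matplotlib
# Select the non-interactive AGG backend before importing pyplot
matplotlib.use('AGG')
import matplotlib.pyplot as plt
# The pd.options.display.mpl_style option was removed from pandas;
# matplotlib's 'ggplot' style gives a comparable look
plt.style.use('ggplot')
from tika import parser

# Folder that contains the PDFs, passed as the first command-line argument
input_path = sys.argv[1]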

def create_df(pdf_content, content_pattern, line_pattern, column_headings):
    """Create a Pandas DataFrame from lines of text in a PDF.

    Arguments:
    pdf_content -- all of the text Tika parses from the PDF
    content_pattern -- a pattern that identifies the set of lines
        that will become rows in the DataFrame
    line_pattern -- a pattern that separates the agency name or revenue source
        from the dollar values in the line
    column_headings -- the list of column headings for the DataFrame
    """
    list_of_line_items = []
    # Grab all of the lines of text that match the pattern in content_pattern
    content_match = re.search(content_pattern, pdf_content, re.DOTALL)
    # group(1): only keep the lines between the parentheses in the pattern
    content_match = content_match.group(1)
    # Split on newlines to create a sequence of strings
    content_match = content_match.split('\n')
    # Iterate over each line
    for item in content_match:
        # Create a list to hold the values in the line we want to retain
        line_items = []
        # Use line_pattern to separate the agency name or revenue source
        # from the dollar values in the line
        line_match = re.search(line_pattern, item, re.I)
        # Skip lines (for example, blank ones) that do not match line_pattern
        if line_match is None:
            continue
        # Grab the agency name or revenue source, strip whitespace, and remove commas
        # group(1): the value inside the first set of parentheses in line_pattern
        agency = line_match.group(1).strip().replace(',', '')
        # Grab the dollar values, strip whitespace, replace dashes with 0.0, and remove $s and commas
        # group(2): the value inside the second set of parentheses in line_pattern
        values_string = line_match.group(2).strip().\
            replace('- ', '0.0 ').replace('$', '').replace(',', '')
        # Split on whitespace and convert to float to create a list of floating-point numbers
        values = [float(value) for value in values_string.split()]
        # Append the agency name or revenue source into line_items
        line_items.append(agency)
        # Extend the floating-point numbers into line_items so line_items remains one list
        line_items.extend(values)
        # Append line_items into list_of_line_items to generate a list of lists;
        # all of the lines that will become rows in the DataFrame
        list_of_line_items.append(line_items)
    # Convert the list of lists into a Pandas DataFrame and specify the column headings
    df = pd.DataFrame(list_of_line_items, columns=column_headings)
    return df

def create_plot(df, column_to_sort, x_val, y_val, type_of_plot, plot_size, the_title):
    """Create a plot from data in a Pandas DataFrame.

    Arguments:
    df -- A Pandas DataFrame
    column_to_sort -- The column of values to sort
    x_val -- The variable displayed on the x-axis
    y_val -- The variable displayed on the y-axis
    type_of_plot -- A string that specifies the type of plot to create
    plot_size -- A list of 2 numbers that specifies the plot's size
    the_title -- A string to serve as the plot's title
    """
    # Create a figure and an axis for the plot
    fig, ax = plt.subplots()
    # Sort the values in the column_to_sort column in the DataFrame
    df = df.sort_values(by=column_to_sort)
    # Create a plot with x_val on the x-axis and y_val on the y-axis
    # type_of_plot specifies the type of plot to create, plot_size
    # specifies the size of the plot, and the_title specifies the title
    df.plot(ax=ax, x=x_val, y=y_val, kind=type_of_plot, figsize=plot_size, title=the_title)
    # Adjust the plot's parameters so everything fits in the figure area
    plt.tight_layout()
    # Create a PNG filename based on the plot's title, replace spaces with underscores
    pngfile = the_title.replace(' ', '_') + '.png'
    # Save the plot in the current folder
    plt.savefig(pngfile)

# In the Expenditures table, grab all of the lines between Totals and General Government
expenditures_pattern = r'Totals\n+(Legislative, Judicial, Executive.*?)\nGeneral Government:'

# In the Revenues table, grab all of the lines between the fiscal-year heading
# (for example, 2015-16) and either Subtotal or Total
# (?:Subtotal|Total) is a group; square brackets would match single characters instead
revenues_pattern = r'\d{4}-\d{2}\n(Personal Income Tax.*?)\n+(?:Subtotal|Total)'

# For the expenditures, grab the agency name in the first set of parentheses
# and grab the dollar values in the second set of parentheses
expense_pattern = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'

# For the revenues, grab the revenue source in the first set of parentheses
# and grab the dollar values in the second set of parentheses
revenue_pattern = r'([a-z, ]+)([$,0-9 -]+)'

# Column headings for the Expenditures DataFrames
expense_columns = ['Agency', 'General', 'Special', 'Bond', 'Totals']

# Column headings for the Revenues DataFrames
revenue_columns = ['Source', 'General', 'Special', 'Total', 'Change']

# Iterate over all PDF files in the folder and process each one in turn
for input_file in glob.glob(os.path.join(input_path, '*.pdf')):
    # Grab the PDF's file name
    filename = os.path.basename(input_file)
    print(filename)
    # Remove the .pdf extension so we can use the rest as the name of the plot and PNG
    # (splitext drops only the extension, never other characters in the name)
    plotname = os.path.splitext(filename)[0]
    # Use Tika to parse the PDF
    parsedPDF = parser.from_file(input_file)
    # Extract the text content from the parsed PDF
    pdf = parsedPDF["content"]
    # Convert double newlines into single newlines
    pdf = pdf.replace('\n\n', '\n')
    # Create a Pandas DataFrame from the lines of text in the Expenditures table in the PDF
    expense_df = create_df(pdf, expenditures_pattern, expense_pattern, expense_columns)
    # Create a Pandas DataFrame from the lines of text in the Revenues table in the PDF
    revenue_df = create_df(pdf, revenues_pattern, revenue_pattern, revenue_columns)
    print(expense_df)
    print(revenue_df)
    # Print the total expenditures and total revenues in the budget to the screen
    print("Total Expenditures: {}".format(expense_df["Totals"].sum()))
    print("Total Revenues: {}\n".format(revenue_df["Total"].sum()))
    # Create and save a horizontal bar plot based on the data in the Expenditures table
    # (x and y are passed as single column labels)
    create_plot(expense_df, "Totals", "Agency", "Totals", 'barh', [20, 10],
                plotname + "Expenditures")
    # Create and save a horizontal bar plot based on the data in the Revenues table
    create_plot(revenue_df, "Total", "Source", "Total", 'barh', [20, 10],
                plotname + "Revenues")

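Save this code in a file named parse_pdfs_with_tika.py in the same folder as the three CA Budget PDFs. Make the script executable (chmod +x parse_pdfs_with_tika.py) and then run it on the command line with the following command. The trailing period tells the script to look for PDFs in the current folder: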

./parse_pdfs_with_tika.py .

I added docstrings to the two functions, create_df and create_plot, and comments above nearly every line of code in an effort to make the code as self-explanatory as possible. I created the two functions to avoid duplicating code because we perform these operations twice for each file, once for revenues and once for expenditures. We use a for loop to iterate over the PDFs and for each one we extract the lines of text we care about, convert the text into a Pandas DataFrame, display some of the DataFrame’s information, and save plots of the total values in the revenues and expenditures tables.
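
To see the two-stage pattern matching in isolation, here is a stripped-down sketch that needs only the standard library. SAMPLE_TEXT is invented for illustration (the agency names are realistic, but the dollar figures are made up and the real table has more rows), and extract_rows is a hypothetical helper that mirrors the parsing steps in create_df without building a DataFrame.

```python
import re

# Invented stand-in for the text Tika extracts; figures are NOT real budget data
SAMPLE_TEXT = (
    "2015-16 Total State Expenditures by Agency\n"
    "General Fund Special Funds Bond Funds Totals\n"
    "Legislative, Judicial, Executive $3,210 $2,850 $240 $6,300\n"
    "Transportation 260 8,900 - 9,160\n"
    "K-12 Education 49,000 100 300 49,400\n"
    "General Government:\n"
    "Non-Agency Departments 700 1,500 5 2,205\n"
)

# Stage 1: select the block of lines that make up the table body
CONTENT_PATTERN = r'Totals\n+(Legislative, Judicial, Executive.*?)\nGeneral Government:'
# Stage 2: split each line into a label and a run of dollar values
LINE_PATTERN = r'(K-12 Education|[a-z,& -]+)([$,0-9 -]+)'


def extract_rows(text, content_pattern, line_pattern):
    """Return one [label, value, value, ...] list per matching table line."""
    block = re.search(content_pattern, text, re.DOTALL).group(1)
    rows = []
    for line in block.split('\n'):
        match = re.search(line_pattern, line, re.I)
        if match is None:
            continue  # skip lines that do not look like table rows
        label = match.group(1).strip().replace(',', '')
        cleaned = (match.group(2).strip()
                   .replace('- ', '0.0 ')   # a lone dash means zero
                   .replace('$', '')
                   .replace(',', ''))
        rows.append([label] + [float(value) for value in cleaned.split()])
    return rows


for row in extract_rows(SAMPLE_TEXT, CONTENT_PATTERN, LINE_PATTERN):
    print(row)
```

Each printed row is a label followed by four floats, which is the list-of-lists shape that create_df hands to pd.DataFrame along with the column headings.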

Results

[Figure: Terminal output (1 of 3 pairs of DataFrames)]

[Figure: PNG file, Expenditures by Agency 2015-16 (1 of 6 PNG files)]

In this post I’ve tried to convey that Tika is a great resource for parsing PDFs by demonstrating how you can use it to parse budget data from PDF documents provided by a government agency. As my friend’s experience illustrates, there may be other situations in which you need to extract data from PDFs. With Tika, PDFs become another rich source of data for your analysis.