Today, I released tabula-py 0.3.0, which extracts table from PDF into Python pandas’s DataFrame.

It is simple wrapper of tabula-java and it enables you to extract table into DataFrame or JSON with Python. You also can extract tables from PDF into CSV, TSV or JSON file.

tabula is a tool to extract tables from PDFs. It is GUI based software, but tabula-java is a tool based on CUI. Though there were Ruby, R, and Node.js bindings of tabula-java, before tabula-py there isn’t any Python binding of it. I believe PyData is a great ecosystem for data analysis and that’s why I created tabula-py. If you are familiar with R, I highly recommend to use tabulizer, which has the most richest bindings including rich GUI.

You can install tabula-py via pip:

pip install tabula-py

With tabula-py, you can get DataFrame with read_pdf() method.

example of read_pdf()

You can also extract tables as JSON format:

example of JSON