Downloading datasets - Introducing PDL - Python Download Library

By Jan Van de Poel on Mar 15, 2018

It seems I am spending more and more of my days in Jupyter Notebooks lately. While still following the fast.ai course - sometimes life gets in the way of your plans - I noticed datasets are either included or there is a link to a .zip file, which you still need to download and extract by hand. After some manual repetitions, I started dabbling with a small script to make that part easier, so I could just focus on running experiments in the future.

Initially, I created a Bash script, but decided to create a Python version as well. Both are detailed below.

Download and extract using Bash

I have been writing Bash scripts for as long as I can remember, but I still do not consider myself proficient. There is something about the language that does not resonate with me. It took some searching to see it was even possible, but it seems to be working quite nicely.

Running a Bash script in a Jupyter Notebook

In a previous post, I detailed how to run a line of Bash code in your Notebook:

!echo "Running a bash line in my notebook"

In order to run a complete script, however, we cannot add ! before every line. To fix that, Jupyter Notebooks let you run a multiline script in a single cell. All you have to do is start the cell with %%bash:

%%bash
# Allows you to run a multiline Bash script in your notebook.

To pass parameters from the Python Notebook to your script, define the variable in your Python code and pass it to your Bash script using -s "$your_variable":

In Python:

python_variable = "some value you want to use"
some_other_variable = "a second value you are using in your Python"

In Bash:

%%bash -s "$python_variable" "$some_other_variable"
echo $1 # Prints: "some value you want to use"
echo $2 # Prints: "a second value you are using in your Python"

As you can see from the example above, the variables are passed with quotes and prefixed with $. Once passed, they will become available as $1, $2,… where the numbers are assigned based on their position when passed to the Bash script.
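Outside a notebook, the same positional mapping can be reproduced with plain Python. The snippet below is only a sketch of the idea, not a description of how Jupyter implements the cell magic: it assumes a bash executable is on your PATH and feeds a two-line script to bash -s over standard input.

```python
import subprocess

python_variable = "some value you want to use"
some_other_variable = "a second value you are using in your Python"

# The script only echoes its first two positional parameters
script = 'echo "$1"\necho "$2"\n'

# `bash -s` reads the script from stdin; the arguments that follow
# become $1, $2, ... in the order they are listed
result = subprocess.run(
    ["bash", "-s", python_variable, some_other_variable],
    input=script,
    capture_output=True,
    text=True,
    check=True,
)

lines = result.stdout.splitlines()
print(lines)
```

Swapping the two arguments in the list swaps what $1 and $2 contain, which is all the "position" rule amounts to.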

Important: The %%bash definition should be the very first entry in your cell. If you want to add a comment, add it after the initial %%bash line, or you will get a syntax error when you run the cell.

The script

%%bash -s "$download_dir" "$url" "$file" "$delete_download" "$path"
# download_dir: $1
# url: $2
# file: $3
# delete_download: $4
# path: $5

# download the archive only if it is not already present
if [ ! -f "$1$3" ]; then
    wget -P "$1" "$2$3"
else
    echo "file already exists, skipping download"
fi

# unzip the downloaded file to the download directory
unzip "$1$3" -d "$1"
chmod -R 755 "$5"

# [ $4 ] succeeds for any non-empty string, including "False",
# so compare against the string "True" explicitly
if [ "$4" = "True" ]; then
    echo "deleting"
    rm "$1$3"
else
    echo "not deleting"
fi

The script above checks whether the requested file is already present in the download directory, downloads it from the url if it is not, and then extracts it. It is, however, fairly rudimentary, with limited options and no .tar file support.

Download and extract using Python

After writing the Bash script, I decided to write a similar script in Python, just to see how easy it would be. Below you find a (slightly modified) Python version of the Bash download script. An important thing to note is the stream parameter in the get method: together with writing the response to disk in chunks via iter_content, it makes sure the download is streamed and not kept in memory in its entirety.

import os
import zipfile

import requests

# download file in chunks, so the whole archive is never held in memory
filename = f"{download_dir}{file}"
with requests.get(f"{url}{file}", allow_redirects=True, stream=True) as resp:
    resp.raise_for_status()
    with open(filename, 'wb') as zfile:
        for chunk in resp.iter_content(chunk_size=8192):
            zfile.write(chunk)

# extract the archive into the download directory
with zipfile.ZipFile(filename, 'r') as zipf:
    zipf.extractall(download_dir)

# remove the downloaded archive once it has been extracted
os.remove(filename)

After finishing these scripts, I started playing around with the idea of a simple library that would do two things:

Easily download a dataset from a given url: Import the library, pass the dataset url and the library would take care of the rest, while giving you a set of parameters to control the process.

Easily discover and download any public dataset: There are a lot of public datasets out there, but discovering them is not always easy. Adding helper methods for public datasets would allow developers to discover and easily add public datasets to their Python code.

Having never written a Python library, I figured it would be an interesting way to learn that as well. It turns out it is fairly simple to publish a Python module.

Introducing PDL - Python Download Library

Based on the scripts above, the first idea would be fairly easy to implement, but I needed to add some more (completely optional) forms of control:

data_dir: specify a directory for the dataset to be stored in;

keep_download: keep the downloaded archive after it has been extracted;

overwrite_download: re-download if the file already exists;

verbose: verbosity for debugging.

Finally, I wanted to add support for other types of archives (.tar, .tgz, .tar.gz) and non-archive files.
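To make these options concrete, here is a rough sketch of how they could fit together using only the standard library. It is not PDL's actual implementation: the function name fetch_dataset, its return values, and the choice of urllib over requests are all assumptions made for illustration.

```python
import os
import shutil
import tarfile
import urllib.request
import zipfile
from urllib.parse import urlparse


def fetch_dataset(url, data_dir="data/", keep_download=False,
                  overwrite_download=False, verbose=False):
    """Hypothetical sketch: download `url` into `data_dir` and extract it."""
    os.makedirs(data_dir, exist_ok=True)
    filename = os.path.basename(urlparse(url).path)
    target = os.path.join(data_dir, filename)

    # Skip the download when the file is already there, unless told otherwise
    if overwrite_download or not os.path.exists(target):
        if verbose:
            print(f"downloading {url} to {target}")
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)  # copies in chunks
    elif verbose:
        print(f"{target} already exists, skipping download")

    # Pick an extractor based on the file contents
    if zipfile.is_zipfile(target):
        with zipfile.ZipFile(target) as archive:
            archive.extractall(data_dir)
    elif tarfile.is_tarfile(target):
        # tarfile.open detects compression, so .tar, .tgz and .tar.gz all work
        with tarfile.open(target) as archive:
            archive.extractall(data_dir)
    else:
        # Not an archive: leave the downloaded file in place
        return target

    if not keep_download:
        os.remove(target)
    return data_dir
```

Checking the file contents rather than the extension is one possible design choice; it means a mislabeled archive is still extracted. Only extract archives from sources you trust, since extractall writes whatever paths the archive contains.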

Installation

You can easily install the library via:

$ pip install pdl

Core

At its core, the library contains one important method, which only requires a url to download the specified dataset:

from pdl import pdl

# Download a file (zip, tar, tgz, tar.gz)
pdl.download(url, data_dir="data/", keep_download=False, overwrite_download=False, verbose=False)

You can adjust the default parameters to suit your needs, but in the simplest case, you only need the url to get started.

Datasets

The second part of the API offers shorthand methods for public datasets: a one-liner to download and extract the data. Besides being super easy to use, it lets you explore and use publicly available datasets.

Currently a limited number of datasets is supported, but the list will be expanded over the coming days/weeks.

Below you can find the currently supported datasets with their simplest invocation. Of course, you can still specify the parameters from the core (data_dir, keep_download, overwrite_download, verbose):

# Download cifar-10 (http://www.cs.utoronto.ca/~kriz/cifar.html)
pdl.cifar_10()

# Example of more control, which can also be applied to the datasets below:
pdl.cifar_10(data_dir="my-data-dir/")
pdl.cifar_10(data_dir="my-data-dir/", verbose=True)
pdl.cifar_10(data_dir="my-data-dir/", overwrite_download=True, verbose=True)
pdl.cifar_10(data_dir="my-data-dir/", keep_download=True, verbose=True)
pdl.cifar_10(data_dir="my-data-dir/", keep_download=True, overwrite_download=True, verbose=True)
pdl.cifar_10("my-data-dir/", True, True, True)

# Download cifar-100 (http://www.cs.utoronto.ca/~kriz/cifar.html)
pdl.cifar_100()

# Download the Google Street View House (GSVH) numbers (http://ufldl.stanford.edu/housenumbers/)
pdl.gsvh_cropped()

# Download the Google Street View House (GSVH) numbers (http://ufldl.stanford.edu/housenumbers/)
pdl.gsvh_full()

# Download MNIST (http://yann.lecun.com/exdb/mnist/)
pdl.mnist()

# Download movie lens dataset (http://files.grouplens.org/datasets/movielens/)
pdl.movie_lens_latest()

Helper methods

At the moment, the library also exposes two helper methods, which are used in the core, but might be useful when using the library:

from pdl import pdl

# Get the file name from a url
pdl.get_filename(url)

# Get the location of a file
pdl.get_file_location(data_dir, filename)
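For readers curious what helpers like these typically do, the sketch below shows one plausible way to derive a file name from a url and to build a file location. The functions filename_from_url and file_location are hypothetical stand-ins written for this post, not PDL's code, and PDL's own helpers may behave differently (for example around query strings).

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of `url`, ignoring query and fragment."""
    return os.path.basename(urlparse(url).path)


def file_location(data_dir, filename):
    """Join the data directory and the file name into one path."""
    return os.path.join(data_dir, filename)


name = filename_from_url("http://example.com/datasets/sample.tar.gz?download=1")
location = file_location("data/", name)
print(name, location)
```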

Feel like adding a dataset?

At the moment, the number of datasets is limited, but it will be expanded over time. If you feel like adding a dataset yourself, you can open a pull request on GitHub with the following:

Add a helper method in the PDL core

Add a helper method in alphabetical order, and with a minimum of documentation. Be sure to run ./pylint.sh before creating the pull request.

Update the Readme

Add your helper method to the Readme, so people can easily find and start using the new dataset.

Where can I find it?

The library is open source and can be found on GitHub: Python Download Library - PDL.

Looking forward to hearing your feedback!

PDL is currently in preview and as always, your feedback is very welcome and valued. I’d love to hear from you on @zero2singularit with comments, improvements, or bugs…

PS: Kaggle Datasets

A lot of datasets are available on Kaggle, and to access them I am happily using the Kaggle-cli tool, which is super handy.