Published: Sat 08 April 2017 In Research. tags: English Spiderpig

During my Ph.D. I have been doing a lot of data analysis, mainly exploratory. The data I have been working with are small enough to fit in memory (16 GB), so I can easily use the standard Python stack: Pandas, SciPy, Seaborn, Matplotlib, ... On the other hand, the data are big enough to make my analysis time-intensive (from minutes to days). I am also forced to run the same analysis on several different data sets (I process data from A/B experiments and from several educational systems developed within our research group) and with several different parameter settings. Based on these needs I have developed a small library which helps me with these issues. The library is called spiderpig, and I have spent a few days making it accessible to people other than me. Both source code and documentation are now available.

Configuration

You probably know the situation when you have data and you must decide how to filter them (e.g., which users you will exclude because of insufficient data) and how to set parameters for your analysis (e.g., number of iterations for bootstrapping or meta-parameters of your predictive model). If you have a hierarchy of functions (load → filter → preprocess ... → analyze), it is a real pain to pass all these parameters through the code. Spiderpig allows you to store the configuration in one place and use it automatically.

Imagine we have the following configuration stored in config.yaml:

    who: world

Spiderpig can inject the configuration property who into all functions decorated with spiderpig.configured. This configuration can be overridden locally with the spiderpig.configuration context. If a parameter is passed explicitly when the function is invoked, the passed value also overrides the global configuration for all nested invocations of other spiderpig functions.

    import spiderpig as sp


    @sp.configured()
    def hello(who=None):
        print('Hello', who)


    @sp.configured()
    def interview(interviewer, who=None):
        print(interviewer, end=': ')
        hello()


    # you can also use spiderpig.init function without "with"
    with sp.spiderpig(config_file='config.yaml'):
        hello()
        with sp.configuration(who='universe'):
            hello()
        hello('everybody')
        interview('God', 'Jesus')

Resulting output:

    Hello world
    Hello universe
    Hello everybody
    God: Hello Jesus

Instead of specifying the configuration file, the parameters can be passed directly to spiderpig.spiderpig (or spiderpig.init):

    with sp.spiderpig(who='world'):
        ...
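To make the injection and override rules described above more concrete, here is a minimal standard-library sketch of how such a mechanism could work. It is not spiderpig's actual implementation: the names _CONFIG_STACK, current_config, configuration and configured are hypothetical, and the sketch only handles plain named parameters (no *args or positional-only parameters).

```python
import functools
import inspect
from contextlib import contextmanager

# Hypothetical store: a stack of configuration layers, later layers win.
_CONFIG_STACK = [{}]


def current_config():
    """Merge all layers, letting the most recently pushed one override."""
    merged = {}
    for layer in _CONFIG_STACK:
        merged.update(layer)
    return merged


@contextmanager
def configuration(**overrides):
    """Temporarily override configuration values inside a with block."""
    _CONFIG_STACK.append(overrides)
    try:
        yield
    finally:
        _CONFIG_STACK.pop()


def configured(func):
    """Fill in parameters that were not passed from the configuration."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Arguments the caller passed explicitly, keyed by parameter name.
        explicit = dict(signature.bind_partial(*args, **kwargs).arguments)
        config = current_config()
        injected = {
            name: config[name]
            for name in signature.parameters
            if name in config and name not in explicit
        }
        # Explicit arguments also override the configuration for nested calls.
        with configuration(**explicit):
            return func(**explicit, **injected)

    return wrapper


@configured
def hello(who=None):
    return 'Hello {}'.format(who)


@configured
def interview(interviewer, who=None):
    return '{}: {}'.format(interviewer, hello())


with configuration(who='world'):
    print(hello())
    with configuration(who='universe'):
        print(hello())
    print(hello('everybody'))
    print(interview('God', 'Jesus'))
```

The stack of layers is what lets an explicit argument to interview reach the nested hello call without being passed along by hand.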

Caching

I assume that everybody who analyzes data caches preprocessed data or intermediate results somehow. In academia, I have seen people implement caching from scratch every time they needed it. Spiderpig provides a simple caching mechanism compatible with the global configuration feature. Of course, it assumes that functions decorated to be cached have no side effects.

If you want to use caching, you have to specify the working directory. Imagine the following scenario with a recursive function (it is named fibonacci, although the recurrence it implements, n * f(n - 1), actually yields factorials):

    import spiderpig as sp


    @sp.cached()
    def fibonacci(n=0):
        print('Computing fibonacci({})'.format(n))
        if n <= 1:
            return 1
        return n * fibonacci(n - 1)


    sp.init(n=5)
    print(fibonacci())
    print(fibonacci(6))

Resulting output:

    Computing fibonacci(5)
    Computing fibonacci(4)
    Computing fibonacci(3)
    Computing fibonacci(2)
    Computing fibonacci(1)
    120
    Computing fibonacci(6)
    720
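For readers curious what a disk-backed cache of this kind involves, the following is a rough sketch using only the standard library. It is an assumption about the general technique, not spiderpig's code: the cached decorator here takes a cache directory explicitly, and it builds the cache key from the repr of the arguments, which only works for arguments with a stable repr.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def cached(cache_dir):
    """Cache results on disk, keyed by function name and arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumption: repr() of the arguments identifies the call.
            key_source = repr(
                (func.__module__, func.__qualname__, args, sorted(kwargs.items()))
            )
            key = hashlib.sha1(key_source.encode('utf-8')).hexdigest()
            path = os.path.join(cache_dir, key + '.pkl')
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


calls = []  # records which inputs were really computed


@cached(tempfile.mkdtemp())
def factorial(n):
    calls.append(n)
    if n <= 1:
        return 1
    return n * factorial(n - 1)


print(factorial(5), calls)
print(factorial(6), calls)
```

As in the spiderpig example, the second call only computes the new outermost step, because the recursive call for 5 is served from the cache.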

Command-line Interface

The last feature of spiderpig relates to how the analysis is executed. I really like Jupyter notebooks, but this technology has many limitations. When you use notebooks together with git, they produce ugly and unclear diffs (yes, I look and want to look at diffs in the command line). Also, you cannot run the analysis with several different parameter settings in batch. For this reason, I usually prefer a command-line interface. Spiderpig helps you structure the source code of your analysis and automatically generates a command-line interface for it.

Imagine you want to create a crawler (the full example is available on GitHub). Let's start with a command which downloads and prints an HTML page from its URL. For the crawler we will build the following directory structure:

    .
    ├── crawler.py
    ├── general
    │   ├── commands
    │   │   ├── __init__.py
    │   │   └── url_html.py
    │   ├── __init__.py
    │   └─── model.py
    └── wikipedia
        ├── commands
        │   ├── __init__.py
        │   └── intro.py
        └─── __init__.py

For the time being, let's completely ignore the wikipedia package. Firstly, we create a function which downloads HTML as plain text, and then we pass this plain text to BeautifulSoup to make the output more beautiful. We will put this code into the general/model.py file.

    from bs4 import BeautifulSoup
    from spiderpig.msg import Verbosity, print_debug
    from urllib.request import urlopen
    import spiderpig as sp


    @sp.cached()
    def load_page_content(url, verbosity=Verbosity.INFO):
        if verbosity > Verbosity.INFO:
            print_debug('Downloading {}'.format(url))
        return urlopen(url).read()


    def load_html(url):
        return BeautifulSoup(load_page_content(url), 'html.parser')

Please notice that the load_page_content function is decorated with spiderpig.cached.

Secondly, we create the command itself. For spiderpig, commands are all modules from the specified package that contain an execute function. We put the source code of the command into the general/commands/url_html.py file.

    """
    Download HTML web page from the specified URL and print it on the standard
    output or to the specified file.
    """
    from .. import model
    import os


    def execute(url, output=None):
        if not url.startswith('http'):
            url = 'http://' + url
        html = model.load_html(url).prettify()
        if output:
            directory = os.path.dirname(output)
            # dirname is empty for a bare file name; makedirs('') would fail
            if directory and not os.path.exists(directory):
                os.makedirs(directory)
            with open(output, 'w') as f:
                f.write(html)
        else:
            print(html)

Finally, we create an executable file crawler.py:

    #!/usr/bin/env python
    from spiderpig import run_cli
    import general.commands
    import wikipedia.commands

    run_cli(
        command_packages=[general.commands],
    )

If you have more packages with commands, it is useful to prefix them with a namespace:

    run_cli(
        command_packages=[general.commands],
        namespaced_command_packages={'wiki': wikipedia.commands}
    )

Spiderpig automatically loads your commands and makes them accessible to you.

    $ ./crawler.py --help
    usage: crawler.py [-h] [--cache-dir CACHE_DIR] [--override-cache]
                      [--verbosity {0,1,2,3}]
                      [--max-in-memory-entries MAX_IN_MEMORY_ENTRIES]
                      {url-html,wiki-intro,spiderpig-executions} ...

    positional arguments:
      {url-html,wiki-intro,spiderpig-executions}
        url-html            Download HTML web page from the specified URL and
                            print it on the standard output or to the
                            specified file.
        wiki-intro          Download and print the first paragraph from
                            Wikipedia for the given keyword.
        spiderpig-executions

    optional arguments:
      -h, --help            show this help message and exit
      --cache-dir CACHE_DIR
      --override-cache
      --verbosity {0,1,2,3}
      --max-in-memory-entries MAX_IN_MEMORY_ENTRIES
    $
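The rule that every module in a command package with an execute function becomes a command can be sketched with pkgutil and importlib. This is only an illustration of the idea, not how spiderpig does it: discover_commands is a hypothetical helper, and the underscore-to-dash naming is inferred from the url_html.py / url-html pair shown above.

```python
import importlib
import pkgutil


def discover_commands(package):
    """Map command names to the execute functions found in a package.

    Modules without a callable ``execute`` attribute are skipped.
    """
    commands = {}
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(
            '{}.{}'.format(package.__name__, module_info.name)
        )
        execute = getattr(module, 'execute', None)
        if callable(execute):
            # Assumption: url_html.py is exposed as the url-html command.
            commands[module_info.name.replace('_', '-')] = execute
    return commands
```

Called on a package laid out like general.commands, such a helper would return a dictionary with a single url-html entry, which a CLI front end could then turn into subcommands.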

Spiderpig also creates the argparse parser for each command automatically from the parameters of its execute function:

    $ ./crawler.py url-html --help
    usage: crawler.py url-html [-h] [--output OUTPUT] --url URL

    optional arguments:
      -h, --help       show this help message and exit
      --output OUTPUT  default: None
      --url URL
    $
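Deriving a parser from a function signature can be approximated with inspect and argparse. The sketch below is an assumption about the general approach rather than spiderpig's implementation: build_parser and run are hypothetical helpers, parameters without a default become required options, and all values are parsed as plain strings.

```python
import argparse
import inspect


def build_parser(func, prog=None):
    """Create an argparse parser whose options mirror func's parameters."""
    parser = argparse.ArgumentParser(prog=prog, description=inspect.getdoc(func))
    for name, param in inspect.signature(func).parameters.items():
        flag = '--' + name.replace('_', '-')
        if param.default is inspect.Parameter.empty:
            # No default in the signature: the option must be given.
            parser.add_argument(flag, dest=name, required=True)
        else:
            parser.add_argument(
                flag,
                dest=name,
                default=param.default,
                help='default: {}'.format(param.default),
            )
    return parser


def run(func, argv):
    """Parse argv and call func with the parsed values as keyword arguments."""
    args = build_parser(func).parse_args(argv)
    return func(**vars(args))


def execute(url, output=None):
    """Hypothetical command that just echoes its arguments."""
    return url, output


print(run(execute, ['--url', 'google.com']))
print(run(execute, ['--url', 'google.com', '--output', 'page.html']))
```

A real implementation would also need type conversion (for example, from annotations or from the type of the default value), which this sketch leaves out.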

Using debugging prints, we can easily check that the caching works as we expect:

    $ ./crawler.py --verbosity 1 url-html --url google.com --output /dev/null
    Downloading http://google.com
    $ ./crawler.py --verbosity 1 url-html --url google.com --output /dev/null
    $

Conclusion

If you find the spiderpig library useful, but you run into a bug or have an idea how to improve it, do not hesitate to create an issue on GitHub. If you know of any library solving similar problems, it would also be great if you contacted me (via Twitter or e-mail).