Note: This post/example has not been updated for the latest Pachyderm versions (1.4+). Please contact us (via email support@pachyderm.io or chat on our public Slack) to learn about 1.4+ Jupyter integrations.

Jupyter (and increasingly nteract) notebooks are ubiquitous in data science. They are shared between team members, referenced in blog posts, used to generate visualizations, and used to teach various data-related concepts. No doubt, these combinations of textual notes, pictures, and live code snippets are useful. However, as a friend once expressed to me:

“in some ways, Jupyter notebooks leave out one of the best attributes of a ‘scientific’ lab notebook: a [theoretically] permanent chronological record of work — preserving that record, in logical as well as chronological order, is a big step towards making [data] science more like science.”

In other words, the multi-format, exploratory functionality of Jupyter could be that much more powerful if it were paired with a system that enabled Jupyter notebooks to interact with a chronological record of work and/or be versioned themselves. Such a system would go a long way toward enabling true scientific collaboration in both commercial and academic settings.

… enter Pachyderm! Pachyderm, with its data versioning plus data pipelining functionality, can expand the possibilities and increase the significance of applications like Jupyter and nteract by providing:

1. A logically and chronologically ordered record of analyses with which notebooks can interact (via Pachyderm’s data versioning and provenance functionality), and
2. A way to version work done within notebooks themselves (i.e., to save the state of notebooks over time) along with all of the corresponding input/output data.

In addition, for those building a DAG of processing steps, implementing ETL pipelines, deploying machine learning models, etc., a Jupyter + Pachyderm system allows engineers/scientists to attach interactive notebooks anywhere within a data flow, with access to any input/output data. They can then utilize the exploratory data analysis and visualization capabilities of Jupyter to debug complex pipelines or easily develop additional pipeline stages.

In this post, we will explore how Jupyter + Pachyderm can be used to explore and understand historical data analyses, which relates to point (1) above. In a follow-up post, we will explore point (2), versioned notebooks.

An example chronologically ordered record of analyses via a Pachyderm data pipeline

In this post, we are going to imagine that we are working for a bike-sharing company, like citibike. We track how many bike trips are taken each day on our service. Then we calculate our daily sales by multiplying that trip count by a per-trip price (e.g., $5.00). I’m sure this is NOT how citibike, or similar companies, calculate their sales, but it will give us a simple chronological data processing pipeline for this post.

Further, we are going to imagine that we are gathering some weather data for our company dashboard or some other internal service. That is, we are gathering this weather data for NYC daily, but not necessarily using it in our pipeline that calculates sales.

We will handle our data storage and processing with Pachyderm’s file system and pipelining system. The tracked counts of bike trips can be versioned using Pachyderm’s data versioning in a data repository called trips, and the daily weather data can be versioned in a data repository called weather. As we commit daily files into these data repositories, we create a versioned, chronologically ordered record of the trips and weather on any given day in the history of our analyses. Further, we can trigger a Pachyderm pipeline on new commits to trips, where the pipeline calculates our sales (revenue) numbers and outputs results to another data repository called sales.
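As a rough sketch of what such a pipeline step might look like (the actual program is linked below; the file layout, column format, and $5.00 price here are illustrative assumptions), a Pachyderm pipeline container reads its input from a mounted repo directory (in Pachyderm, /pfs/trips) and writes results to an output directory (/pfs/out):

```python
import csv
import os

TRIP_PRICE = 5.00  # assumed per-trip price; illustrative only


def calculate_sales(trips_dir, out_dir):
    """Read daily trip-count files and write corresponding sales figures.

    In a real Pachyderm pipeline, trips_dir would be /pfs/trips and
    out_dir would be /pfs/out; here they are parameters so the sketch
    is easy to run locally. Each input file is assumed to hold
    date,count rows with no header.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "sales.csv"), "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "sales"])
        for name in sorted(os.listdir(trips_dir)):
            with open(os.path.join(trips_dir, name)) as f:
                for date, count in csv.reader(f):
                    writer.writerow([date, int(count) * TRIP_PRICE])
```

Pointing trips_dir at any directory of per-day CSV files is enough to try this locally, which is exactly the kind of quick experiment the notebook workflow below makes easy.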

Altogether, the data repositories and processing steps look like this:

The pipeline specification defining the above processing, along with the actual program (and corresponding Docker image) used to calculate the sales, can be found here and are further explained here. The daily counts of bike trips were retrieved from citibike’s public data sets, and the weather data was gathered from the forecast.io (now Dark Sky) weather API.

A historical analysis problem we should investigate

After committing the trip and weather data into their respective data repositories and running our Pachyderm pipeline, we can plot our sales over time. When we do this, we find the following behavior (e.g., by manually plotting a sales.csv file generated by the pipeline with pandas):
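The pandas side of that plot might look something like the following sketch. The column names and the sample rows here are hypothetical stand-ins for the pipeline’s sales.csv (in practice you would read the real file with pd.read_csv("sales.csv")):

```python
import io

import pandas as pd

# A few hypothetical rows standing in for the pipeline's sales.csv;
# the column names and values are illustrative assumptions.
sample = io.StringIO(
    "date,sales\n"
    "2016-07-28,4200.0\n"
    "2016-07-29,4350.0\n"
    "2016-07-30,950.0\n"
    "2016-07-31,870.0\n"
)

sales = pd.read_csv(sample, parse_dates=["date"]).set_index("date")

# In a notebook, this renders the sales-over-time plot inline:
# sales.plot(y="sales", title="Daily sales")

# Flag days that fall well below typical sales, e.g. under half the median.
low_days = sales[sales["sales"] < 0.5 * sales["sales"].median()]
print(low_days.index.strftime("%Y-%m-%d").tolist())  # → ['2016-07-30', '2016-07-31']
```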

There were a couple of days at the end of July (July 30th and 31st) that had particularly poor sales. How do we explain this behavior? How can we look back into our historical record of analyses and explore the situation on those days? Do the poor sales reflect an error in our processing or is there some more natural explanation?

Well, we can investigate all of these questions quite elegantly by attaching a Jupyter notebook to our data repositories. This will allow us to interactively explore, visualize, and manipulate the data at any state in history and at any points in our processing DAG.

Attaching a Jupyter notebook to the DAG at a certain point in history

Specifically, we can attach a Jupyter notebook to our versioned data using a Pachyderm service. This “service” will allow us to embed an application, in this case Jupyter, into Pachyderm. The embedded application will then have access to versioned data on particular commits of that data and can be accessed from outside of Pachyderm (i.e., in a browser).

We are going to attach to the sales and trips repos to try to diagnose why we are seeing low sales on July 30th and 31st. In addition, let’s attach to the weather repo, because we might suspect that the weather had something to do with the poor bike-sharing sales on those days. Note that we could attach anywhere within a complex DAG using these methods, without needing any pre-existing connection between the pieces of the DAG to which we are attaching.

The job specification to launch the Jupyter service is as follows:

```json
{
  "service": {
    "internal_port": 8888,
    "external_port": 30888
  },
  "transform": {
    "image": "dwhitena/pachyderm_jupyter",
    "cmd": [ "sh" ],
    "stdin": [ "/opt/conda/bin/jupyter notebook" ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 1
  },
  "inputs": [
    {
      "commit": {
        "repo": {
          "name": "trips"
        },
        "id": "master/30"
      }
    },
    {
      "commit": {
        "repo": {
          "name": "weather"
        },
        "id": "master/30"
      }
    },
    {
      "commit": {
        "repo": {
          "name": "sales"
        },
        "id": "<output-commitid>/0"
      }
    }
  ]
}
```

We are attaching to the trips and weather repos on the master branch at commit number 30, which is the commit corresponding to July 31st (the last of the days with notably poor sales). Also, <output-commitid> should be replaced by the output commit ID in the sales repo that corresponds to commit 30 on the input repo trips. In other words, <output-commitid> identifies the sales results that have the “provenance” of the July 31st trips data (this commit ID can be found with pachctl flush-commit). With this configuration, we are viewing a snapshot of the input/output data on our DAG, along with the weather data, at the point in time corresponding to the days of poor sales (July 30th and 31st).
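Once you have the output commit ID in hand, filling in the placeholder before submitting the job can be scripted. A small helper like the following (purely illustrative; the helper name and the "a1b2c3d4" commit ID are made up) does the substitution and re-parses the result, which also catches any JSON typos:

```python
import json


def fill_output_commit(spec_text, commit_id):
    """Replace the <output-commitid> placeholder in a Pachyderm job spec
    and return the parsed JSON, so the substitution can be validated."""
    filled = spec_text.replace("<output-commitid>", commit_id)
    return json.loads(filled)


# A trimmed-down spec fragment; the full spec is shown above.
spec = """
{
  "inputs": [
    {"commit": {"repo": {"name": "sales"}, "id": "<output-commitid>/0"}}
  ]
}
"""

job = fill_output_commit(spec, "a1b2c3d4")  # hypothetical commit ID
print(job["inputs"][0]["commit"]["id"])  # → a1b2c3d4/0
```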