Pre-Fetched Input Data

Pachyderm 1.3 pipelines no longer use FUSE for job execution. Rather, they download input data directly to disk and write output data directly to disk (before uploading it). This provides a significant speedup for pipelines. Using this benchmark, job execution times improved by about 4x!
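
Because input and output are now ordinary files on disk, job code can use plain file I/O. The following is a minimal sketch, not Pachyderm's own code: it assumes the usual layout where an input repo appears under /pfs/<repo> and results are written to /pfs/out. The function name `count_lines` and the `.count` suffix are made up for illustration.

```python
import os
import tempfile


def count_lines(input_dir, output_dir):
    """Write one '<name>.count' file per input file, holding its line count."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue  # keep the sketch flat: sub-directories are skipped
        with open(src, "r", encoding="utf-8") as f:
            n = sum(1 for _ in f)
        dst = os.path.join(output_dir, name + ".count")
        with open(dst, "w", encoding="utf-8") as f:
            f.write(f"{n}\n")
        written.append(dst)
    return written


if __name__ == "__main__":
    # Local demo with temporary directories. Inside a job, the arguments
    # would be something like "/pfs/data" (hypothetical repo name) and "/pfs/out".
    with tempfile.TemporaryDirectory() as tmp:
        in_dir = os.path.join(tmp, "in")
        out_dir = os.path.join(tmp, "out")
        os.makedirs(in_dir)
        with open(os.path.join(in_dir, "a.txt"), "w", encoding="utf-8") as f:
            f.write("first\nsecond\n")
        print(count_lines(in_dir, out_dir))
```

Nothing in the function is Pachyderm-specific, so the same code can be exercised locally against any pair of directories before it is put into a pipeline.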

Put Files via Object Store URLs

The pachctl CLI now supports putting files into Pachyderm's data versioning system (PFS) via object store URLs. Pachyderm 1.3 supports s3://, gcs://, and as:// URLs. For example, to put a file directly from S3, you could use:

pachctl put-file <repo> <branch> -f s3://url_path

This way you don’t have to worry about downloading S3 files locally or creating some service that serves your files out of S3 via HTTP.
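
If you script your data loading, a small wrapper can check the URL scheme before calling pachctl. This is only a sketch: the helper name `object_store_put_file_command` and the repo, branch, and bucket names in the demo are hypothetical. The command it builds is the put-file form shown above.

```python
from urllib.parse import urlparse

# Object store schemes named in this post for Pachyderm 1.3.
OBJECT_STORE_SCHEMES = {"s3", "gcs", "as"}


def object_store_put_file_command(repo, branch, url):
    """Return the pachctl argv for putting an object store URL into PFS."""
    scheme = urlparse(url).scheme
    if scheme not in OBJECT_STORE_SCHEMES:
        raise ValueError(f"not an object store URL: {url!r}")
    return ["pachctl", "put-file", repo, branch, "-f", url]


if __name__ == "__main__":
    # Hypothetical repo, branch, and bucket names.
    cmd = object_store_put_file_command("images", "master", "s3://my-bucket/cat.png")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

The helper only builds the argument list, so it can be tested without a running cluster.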

Push Images when Creating or Updating a Pipeline

In many cases, especially during development, our users want to update their code (and thus their Docker images) and re-run their pipeline with the new code. Pachyderm 1.3 makes this quite a bit easier.

To create or update a pipeline for which Pachyderm has already pulled images, you just need to build your new Docker image and then call “create-pipeline” or “update-pipeline” with the --push-images flag. For example:

pachctl update-pipeline -f pipeline.json --push-images

When this is called, Pachyderm will tag the newly built image, update the pipeline spec on the server with the newly tagged image name, and re-run the pipeline with the new image.
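
The build-then-update loop is easy to wrap in a script. The sketch below only assembles the two commands. It assumes a Dockerfile in the current directory and an image name that matches the one referenced in pipeline.json. The function name `redeploy_commands` and the image name in the demo are hypothetical.

```python
import shlex


def redeploy_commands(image, spec_path="pipeline.json", context="."):
    """Return the two commands for rebuilding an image and updating a pipeline."""
    return [
        # Rebuild the image locally from the Dockerfile in `context`.
        ["docker", "build", "-t", image, context],
        # Ask Pachyderm to tag and push the image and re-run the pipeline.
        ["pachctl", "update-pipeline", "-f", spec_path, "--push-images"],
    ]


if __name__ == "__main__":
    # Hypothetical image name; print the commands instead of running them.
    for cmd in redeploy_commands("my-org/my-analysis"):
        print(shlex.join(cmd))
        # To actually run each step: subprocess.run(cmd, check=True)
```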

Support for all Docker Images

We are very happy to announce that you no longer have to ensure that your custom images inherit Pachyderm’s “job-shim” functionality, or any Pachyderm-specific functionality for that matter. You can use your favorite Docker images without modification as long as they include cp and sh (so basically any Linux-based, non-scratch image), and even these two requirements will be removed soon.

With this enhancement, a valid Docker image for use with Pachyderm can simply look like this:

FROM ubuntu

# set up pip, vim, etc.
RUN apt-get -y update
RUN apt-get install -y python-pip python-dev libev4 libev-dev gcc libxslt-dev libxml2-dev libffi-dev vim curl
RUN pip install --upgrade pip

# get numpy, scipy, and scikit-learn
RUN apt-get install -y python-numpy python-scipy
RUN pip install pandas
RUN pip install scikit-learn

# add our project
ADD . /

without any explicit indications or modifications specific to Pachyderm. You no longer need to use FROM pachyderm/job-shim:latest or explicitly include the “job-shim” binary. You can build directly from ubuntu, alpine, or your own custom data science base image.

Of course, if you already have images inheriting “job-shim” they will continue to work with Pachyderm 1.3.
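
If you are unsure whether an existing image meets the cp and sh requirement, you can probe it with Docker before using it in a pipeline. The sketch below is one possible check and is not part of Pachyderm. The function names are hypothetical, and it assumes the docker CLI is installed locally.

```python
import subprocess


def compat_check_command(image):
    """Return a docker command that succeeds only if the image has sh and cp."""
    # Overriding the entrypoint with sh fails if sh is missing;
    # `command -v cp` exits non-zero if cp is missing.
    return ["docker", "run", "--rm", "--entrypoint", "sh", image, "-c", "command -v cp"]


def image_is_compatible(image, run=subprocess.run):
    """Run the check; `run` is injectable so the logic can be tested without Docker."""
    result = run(
        compat_check_command(image),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Only print the command here; calling image_is_compatible("ubuntu")
    # would raise FileNotFoundError if docker is not installed.
    print(" ".join(compat_check_command("ubuntu")))
```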

Install Pachyderm 1.3 Today

For more details, check out the changelog. To try the new release for yourself, install it now or migrate your existing Pachyderm deployment.

Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows, and, of course, all the contributors who helped us realize 1.3!