With the Pachyderm 1.5 UI, or “dashboard,” you can:

Explore your versioned data: interactively explore the “data repositories” that organize and manage versions of the data flowing through your pipelines.

Visualize your DAG: automatically visualize the structure of your declared DAG pipeline and analyze it interactively.

Track your pipelines: investigate pipeline statuses, runs, and details (e.g., Docker images and commands associated with pipelines).

The Pachyderm UI is one of the features that make Pachyderm better suited to enterprise usage. As such, the UI will be part of a new Pachyderm Enterprise Edition that focuses on production use cases. For more information on Pachyderm Enterprise Edition, please email us at support@pachyderm.io or chat with us on our public Slack.

Resource Specification, Including GPU Support

Pachyderm 1.5 allows you to accelerate your model training and better schedule compute-intensive pipelines. For example, a machine learning pipeline might have a training stage, a scoring or inference stage, a visualization stage, and so on. With Pachyderm 1.5, you can optionally offload the training stage of that ML pipeline to a GPU node for big performance gains.

More generally, you can specify exact CPU, GPU, and/or memory resources for any Pachyderm 1.5 pipeline. This ensures that pipelines are scheduled efficiently and with enough resources, which is particularly important as your data science/engineering organization grows and must share resources across a cluster.
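To make this concrete, here is a minimal sketch of attaching a resource request to a pipeline definition, written in Python for illustration. The `resource_spec` field and its `cpu`, `memory`, and `gpu` keys are assumptions about the pipeline spec format and should be checked against the pipeline specification reference; the helper `build_pipeline_spec`, the image name, and the command are hypothetical.

```python
import json


def build_pipeline_spec(name, image, cmd, cpu=None, memory=None, gpu=None):
    """Build a pipeline spec dict with an optional resource request.

    The "resource_spec" key and its fields are assumed names, not a
    verified copy of the Pachyderm 1.5 spec.
    """
    spec = {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
    }
    # Only include the resources that were actually requested.
    requested = (("cpu", cpu), ("memory", memory), ("gpu", gpu))
    resources = {key: value for key, value in requested if value is not None}
    if resources:
        spec["resource_spec"] = resources
    return spec


# A training stage that asks for 2 CPUs, 4G of memory, and 1 GPU.
training = build_pipeline_spec(
    name="train",
    image="example/train:latest",  # hypothetical image
    cmd=["python", "train.py"],    # hypothetical command
    cpu=2,
    memory="4G",
    gpu=1,
)
print(json.dumps(training, indent=2))
```

Stages that do not need a GPU, such as visualization, would simply omit the `gpu` value, so only the training stage competes for GPU nodes.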

Expanded Data Combinations and Management

Pachyderm 1.5 makes combining data sources easier and minimizes inefficient data transfers.

Pachyderm 1.5 allows you to combine data from various sources using the flexible and familiar primitives cross and union. For example, if you need to test ML models across a huge number of parameters, you could “cross” your training data with your parameters and distribute the testing for all combinations of those parameters. This reduces the time needed to set up distributed processing of various data sources (e.g., for parameter tuning) and lets data scientists focus their time on model development.
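The semantics of the two primitives can be sketched in a few lines of Python. This is a conceptual illustration only, not Pachyderm's implementation: the functions `cross` and `union` and the file names are hypothetical, and the sketch assumes each input is just a list of file paths.

```python
from itertools import chain, product


def cross(*inputs):
    """Every combination of one item from each input."""
    return list(product(*inputs))


def union(*inputs):
    """Every item from every input, each processed on its own."""
    return [(item,) for item in chain(*inputs)]


training_data = ["train/fold-0.csv", "train/fold-1.csv"]
parameters = ["params/a.json", "params/b.json", "params/c.json"]

# Crossing 2 data files with 3 parameter files yields 6 independent
# units of work that can be distributed across workers.
jobs = cross(training_data, parameters)
for data_file, param_file in jobs:
    print(data_file, param_file)

# A union simply pools the inputs: 2 + 3 = 5 units of work.
pooled = union(training_data, parameters)
```

Each tuple in `jobs` pairs one data file with one parameter file, which is exactly the shape needed for a parameter sweep.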

In addition, Pachyderm 1.5 takes space-efficient data management to a whole new level. For workflows that require you to shuffle data (e.g., arranging it into time-windowed buckets) or copy data from one repository to another, Pachyderm 1.5 lets you perform those shuffles or copies without creating any duplicate data. This minimizes network traffic and reduces inefficient data transfers. Pachyderm 1.5 also gives you explicit control over garbage collecting deleted files, data repositories, commits, etc.

Auto-scaling

Pachyderm 1.5 reduces the cost of and contention for cluster resources.

Pachyderm 1.5 adds full support for auto-scaling at the Pachyderm worker level, which can complement cloud auto-scaling. You specify a threshold, and Pachyderm scales down workers that have been idle for that period of time.

Scaling down idle workers can dramatically reduce the cost of resources when you are processing bursts of data or running large distributed batch jobs once a day, once a month, etc. You can scale up Pachyderm workers automatically when you need them and scale them down when they are idle.
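The idle-threshold idea can be sketched as follows. This is a simplified model under stated assumptions, not how Pachyderm implements scale-down: the function `idle_workers`, the worker names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def idle_workers(last_active, now, threshold):
    """Return the names of workers idle for at least `threshold`."""
    return sorted(
        name for name, seen in last_active.items() if now - seen >= threshold
    )


now = datetime(2017, 6, 1, 12, 0, 0)
last_active = {
    "worker-0": now - timedelta(minutes=2),   # still busy
    "worker-1": now - timedelta(minutes=45),  # idle past the threshold
    "worker-2": now - timedelta(hours=3),     # idle past the threshold
}

# With a 30-minute threshold, only the two long-idle workers are stopped.
to_stop = idle_workers(last_active, now, threshold=timedelta(minutes=30))
print(to_stop)
```

A shorter threshold frees resources sooner but means workers must be started again more often when data arrives in quick bursts, so the right value depends on how frequently your pipelines receive new input.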

Install Pachyderm 1.5 Today

For more details, check out the changelog. To try the new release for yourself, install it now or migrate your existing Pachyderm deployment.

Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows, and, of course, all the contributors who helped us realize 1.5!