Implementing a more realistic workflow

As you progress through this task, new concepts of Metaflow are introduced and explained where applicable.

The Task

In this flow, we will implement a workflow which:

1. Ingests a CSV into a pandas DataFrame
2. Computes quartile statistics for various genres in parallel
3. Saves a dictionary of genre-specific statistics

—

The Flow

Below is a skeleton class which helps you see the general flow of things.

A few concepts introduced in the skeleton class above

In the start step, notice the foreach parameter? foreach executes parallel copies of the compute_statistics step, one for every entry in the genres list.

The @catch(var='compute_failed') decorator will catch any exception that occurs in the compute_statistics step and assign it to a compute_failed variable (which can be read by its successor).

The @retry(times=1) decorator does what it implies: it retries the step in case any errors arise.

In compute_statistics, where does self.input magically appear from? input is a class variable provided by Metaflow which contains the data applicable to this instance of compute_statistics (since there are multiple copies of the function running in parallel). It is only added by Metaflow when a node branches out into multiple parallel processes, or when multiple nodes are merged into one.

This example only shows multiple parallel runs of the same compute_statistics function. But for those curious, it is possible to kick off completely different and unrelated functions in parallel: change the self.next call in start to self.next(self.func1, self.func2, self.func3). Of course, you would have to update the join step accordingly to handle that case.

Here’s a visual representation of the skeleton class.

Visual Flow
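In text form, the graph is roughly:

```
start ─(foreach genre)─▶ compute_statistics ×N ─▶ join ─▶ end
```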

—

Reading in a data file and custom args

Download this CSV file of movie data prepared by Metaflow.

Now we want to support dynamically passing in the movie_data file path and a max_genres value to our workflow as external arguments. Metaflow lets you pass arguments by appending additional flags to your run command, e.g. python3 tutorial_flow.py run --movie_data=path/to/movies.csv --max_genres=5

To read in such custom inputs within the workflow, Metaflow provides IncludeFile and Parameter objects. We access the passed-in arguments by assigning an IncludeFile or Parameter object to a class variable, depending on whether we are reading a file or a regular value.

Reading in custom parameters passed through CLI

—

Injecting Conda to the flow

Complete the steps outlined in the Conda setup section above.

Then add the @conda_base decorator provided by Metaflow to your flow's class. It expects a Python version to be passed in, which can either be hardcoded or provided through a function, as done below.

Injecting Conda to the Flow

Now you can add the @conda decorator to any step in your flow. It expects the step's dependencies to be passed through the libraries parameter. Metaflow takes care of preparing a container with those dependencies before executing the step. It is fine to use different versions of a package in different steps, since Metaflow runs each step in a separate container.

New run command: python3 tutorial_flow.py --environment=conda run

—

Implementing the ‘start’ step

Implementation of `start` step
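A standalone sketch of what the step's body might do, runnable outside Metaflow (the column names genre and gross, and the inline CSV, are assumptions; in the flow, the CSV contents come from the movie_data IncludeFile):

```python
from io import StringIO

import pandas as pd

# Stand-in for the IncludeFile contents (the real flow reads movies.csv).
movie_data = """movie_title,title_year,genre,gross
Avatar,2009,sci-fi,760505847
Titanic,1997,drama,658672302
The Hangover,2009,comedy,277322503
Ted,2012,comedy,218665740
"""

# Mirrors the start step: parse the CSV and collect the genres to fan out over.
dataframe = pd.read_csv(StringIO(movie_data))
genres = [g.lower() for g in dataframe['genre'].unique()]
print(genres)  # ['sci-fi', 'drama', 'comedy']
```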

A few things to observe here:

Notice how the pandas import statement exists within the step function? That’s because it is injected by conda only within the scope of this step.

However, the variables defined here (dataframe and genres) are accessible even by the steps that run after this one. This is because Metaflow operates on the principle of isolating the run environment while letting data flow naturally between steps.

—

Implementing the ‘compute_statistics’ step

Implementation of `compute_statistics` step
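A standalone sketch of the quartile computation, runnable outside Metaflow (the toy DataFrame, the gross column, and the stand-in for self.input are assumptions):

```python
import pandas as pd

# Toy stand-in for the dataframe carried over from the start step.
dataframe = pd.DataFrame({
    'genre': ['comedy', 'comedy', 'sci-fi'],
    'gross': [100, 200, 300],
})

genre = 'comedy'  # in the real step this comes from self.input

# Filter to this branch's genre and compute quartiles of the gross column.
genre_df = dataframe[dataframe['genre'] == genre]
quartiles = genre_df['gross'].quantile([0.25, 0.5, 0.75]).tolist()
print(quartiles)  # [125.0, 150.0, 175.0]
```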

Notice that this step accesses and modifies the dataframe variable defined in the earlier start step. The modified dataframe is what the subsequent steps will see.

—

Implementing the ‘join’ step

Implementation of `join` step

Two things to observe here:

We are using a completely different version of the pandas library on this step.

Each index in the inputs array represents one copy of compute_statistics that ran before this step. It contains the state of that run, i.e. the values of its variables. So inputs[0].quartiles can contain the quartiles for the comedy genre, whereas inputs[1].quartiles can contain the quartiles for the sci-fi genre.
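To make the merging logic concrete, here is a standalone sketch that simulates the inputs list with plain objects (the genre names and quartile values are made up for illustration):

```python
from types import SimpleNamespace

# Stand-ins for the states of the parallel compute_statistics branches.
inputs = [
    SimpleNamespace(genre='comedy', quartiles=[125.0, 150.0, 175.0]),
    SimpleNamespace(genre='sci-fi', quartiles=[300.0, 300.0, 300.0]),
]

# Mirrors the join step: collect each branch's artifacts into one dict.
genre_stats = {inp.genre: inp.quartiles for inp in inputs}
print(genre_stats['comedy'])  # [125.0, 150.0, 175.0]
```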

—

Final Code Artifacts

The flow developed in this demo is available in my Tutorials Repo.

In order to see its flow design:

python3 tutorial_flow.py --environment=conda show

To run it:

python3 tutorial_flow.py --environment=conda run --movie_data=path/to/movies.csv --max_genres=7

—

Inspecting runs through Client API

You can leverage the Client API that Metaflow provides to inspect the data and state snapshots of your past runs. It is ideal for exploring your historical runs in a notebook.

Below is a simple snippet where I print the genre_stats variable from the last successful run of GenreStatsFlow .