By: Selvakumar Ulaganathan, PhD





As a data scientist, I often find it hard to share an entire 100 GB project with my colleagues. But, hang on, that is the least of my many problems when working on a machine learning project! Yes, that's true. I have many other problems, such as the following, to name a few:

How do I connect versions of source code with versions of large data files?

How do I recover a model from weeks earlier without retraining it?

How do I run inference using a model that I built weeks ago?

How do I keep track of the model parameters of various ML experiments?

The list keeps growing!

However, I recently came across a very interesting tool called Data Version Control (DVC), which positions itself as a version control system for machine learning projects. Although several commercial tools (e.g., the Azure pipeline feature) address some of the problems listed above, they are, well, commercial. DVC, on the other hand, is completely open source and can solve most of my problems, and yours too.

In this article, I will briefly demonstrate how you can use this tool to manage your machine learning laboratory effectively. For detailed information, you can visit the official DVC site.





GitHub project

You can find an example project in my personal GitHub or DAGsHub repository to try on your own.





Dataset

The whole framework is very generic and can be applied to data from any sector, so don't let the dataset used in this article discourage you from employing the framework for your own case. For the purpose of demonstration, I have used a dataset that represents the aerodynamic polar data of the RAE2822 airfoil.





Prerequisite

Basic working knowledge of any version control system such as Git.

Working instance of DVC (official installation guide).





Workflow

For the purpose of demonstration, I will use a project that follows a simple machine learning pipeline:

I have the following files which correspond to each stage:

getData.py

processData.py

trainModel.py

scoreModel.py
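Each stage script is an ordinary Python program that reads its inputs from fixed paths and writes its outputs to fixed paths; it is exactly this convention that lets DVC wire the stages together. As an illustration only (the real scripts live in the linked repository, and the column names and model here are invented placeholders), a minimal trainModel.py-style stage might look like this:

```python
import csv
import pickle

# Hypothetical training stage: fit a simple least-squares line
# y = a*x + b to a training CSV and pickle the resulting model.
# Paths and column names are placeholders, not the ones from the
# actual repository.

def fit_line(xs, ys):
    """Closed-form least-squares fit for a 1-D linear model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return {"slope": a, "intercept": b}

def train(train_csv, model_path):
    with open(train_csv) as f:
        rows = list(csv.DictReader(f))
    xs = [float(r["alpha"]) for r in rows]
    ys = [float(r["cl"]) for r in rows]
    model = fit_line(xs, ys)
    # The pickled model is the stage output (-o) that DVC tracks.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model
```

Because the script's dependency (the training CSV) and its output (the pickle file) are explicit file paths, DVC can detect when either changes and decide whether the stage needs to be rerun.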

Let us create a Git repository and add all the source files to the Git host (GitHub in our case). Afterwards, we can perform any Git operation as usual.

git init
git add <your-source-files>
git commit -m "source files are added"
git remote add origin <your-repository-url>
git push -u origin master

Now, let us create a DVC repository for the data files, model artifacts, and any other intermediate results. DVC supports various types of remotes, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, etc. In our case, it will be a regular folder on local storage.

dvc init
dvc remote add -d dvcRemoteLocal .../dvcStorage/MLPipelineDVC

Let us add the raw input data to our DVC repository and create a machine learning pipeline as simple as the one shown below. The stages of the pipeline are connected by tying together the dependencies (-d), the command that runs the script, and the outputs (-o) of each stage via the 'dvc run' command.

dvc add .../data/data_raw.csv
dvc run -f .../src/getData.dvc -d .../src/getData.py -d .../data/data_raw.csv -o .../data/data_actual python .../src/getData.py
dvc run -f .../src/processData.dvc -d .../src/processData.py -d .../data/data_actual -o .../data/data_processed python .../src/processData.py
dvc run -f .../src/trainModel.dvc -d .../src/trainModel.py -d .../data/data_processed/dataTrain.csv -o .../outputs/model.pkl python .../src/trainModel.py
dvc run -f .../src/scoreModel.dvc -d .../src/scoreModel.py -d .../outputs/model.pkl -d .../data/data_processed/dataTest.csv -M .../outputs/score.json python .../src/scoreModel.py
git add .
git commit -m "DVC pipeline files are now under GIT"
git push origin
dvc push
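Note the -M flag on the last stage: it declares the score file as a metrics file rather than a plain output, which is what makes metric comparison across experiments possible later. A metrics file is just a small JSON document that the scoring script writes itself. As a hedged sketch (the metric name, the model format, and the paths are invented for illustration, not taken from the actual repository), a scoreModel.py-style stage could do something like this:

```python
import json
import pickle

# Hypothetical scoring stage: load a pickled linear model of the form
# {"slope": a, "intercept": b}, evaluate it on held-out (x, y) pairs,
# and write the metric to the JSON file that DVC tracks via -M.

def score(model_path, test_pairs, score_path):
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    # Mean absolute error of the linear model y = a*x + b.
    errors = [abs(model["slope"] * x + model["intercept"] - y)
              for x, y in test_pairs]
    metrics = {"mae": sum(errors) / len(errors)}
    # This JSON file is the metrics artifact declared with -M.
    with open(score_path, "w") as f:
        json.dump(metrics, f)
    return metrics
```

Because the metrics file is plain JSON tracked per branch, comparing experiments later amounts to reading one small file per branch.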

In the background, DVC makes the necessary changes to '.gitignore' to prevent large files (e.g., data, model artifacts, etc.) from being copied to the Git host. At the same time, it writes a '*.dvc' file for each 'dvc add/run' command, so that all the files needed to reproduce the machine learning pipeline are effectively tracked by Git without actually being hosted in its repository. Instead, they are hosted in the DVC remote we set up earlier (which could be local storage, a cloud bucket, or any other DVC-supported storage).
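For reference, the stage files produced by 'dvc run' are small, human-readable YAML documents. In the DVC versions of that era, a stage file looks roughly like the following (the hashes are shortened and the paths are illustrative, not copied from the actual repository):

```yaml
# trainModel.dvc -- illustrative sketch of a DVC stage file
md5: 3f2a...            # checksum of this stage file
cmd: python src/trainModel.py
deps:
- md5: 9b1c...
  path: src/trainModel.py
- md5: 7d4e...
  path: data/data_processed/dataTrain.csv
outs:
- md5: a5f0...
  path: outputs/model.pkl
  cache: true           # stored in the DVC cache/remote, not in Git
```

Git versions this tiny file, while the multi-gigabyte artifacts it points to live in the DVC remote.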

Now that the pipeline is set up, the effect of a change in any one of its stages can be reproduced with a single command.

dvc repro .../src/scoreModel.dvc

Furthermore, creating another machine learning experiment without disturbing the existing one is as simple as the following (note: most of these commands simply create a Git branch and push it to the remote):

git checkout -b GBoost
dvc checkout
<make necessary changes in any stage; in this case, in the training stage>
dvc repro .../src/scoreModel.dvc
git add .
git commit -m "training Gradient Boosting"
git push -u origin GBoost
dvc push

That is all. Only the stages that need to be executed (training and scoring, in this case) are rerun, without unnecessarily executing every stage of the pipeline. You can easily compare the score values across branches:

dvc metrics show -a





You can also visualize the pipeline in various ways. Below is the pipeline view generated by my DAGsHub repository.





Furthermore, sharing and reproducing the whole pipeline with your colleagues becomes extremely easy. Your colleagues can access the project without performing any model training, as shown below.

<get the git repository>
dvc pull
dvc repro .../src/scoreModel.dvc

<to switch between branches>
git checkout GBoost
dvc checkout
dvc repro .../src/scoreModel.dvc

Although a lot more could be discussed (e.g., sharing and collaboration), I will end here. This tool can definitely help you maintain your machine learning pipelines efficiently.





About the Author:

The author, Selvakumar Ulaganathan, holds a PhD in Engineering with a specialized focus on machine learning for engineering problems. He also specializes in machine-learning-based engineering design, statistical modeling, descriptive analysis, and optimization. His interests include creating end-to-end machine learning and data science pipelines to solve real-life problems.

LinkedIn: https://be.linkedin.com/in/selvakumar-ulaganathan-93694826

GitHub: https://github.com/selvaHome

DAGsHub: https://dagshub.com/selvaHome

Google Scholar: https://scholar.google.be/citations?user=aNGpmfkAAAAJ&hl=en