DVC or Data Version Control tool — its idea is to track files/data dependencies during model development in order to facilitate reproducibility and track data files versioning. Most of the DVC tutorials provide good examples of using DVC with Python language. However, I realized that DVC is a language agnostic tool and can be used with any programming language. In this blog post, we will see how to use DVC in R projects.

R coding — keep it simple and readable

Each development is always a combination of following steps presented below:

Model development process

Because of the specificity of the process — iterative development, it is very important to improve some coding and organizational skills. For example, instead of having one big R file with code it is better to split code in several logical files — each responsible for one small piece of work. It is smart to track history development with git tool. Writing “reusable code” is nice skill to have. Put comments in a code can make our life easier.

Beside git, next step in further improvements is to try out and work with DVC. Every time when a change/commit in some of the codes and data sets is made, DVC will reproduce new results with just one bash command on a linux (or Win environment). It memorizes dependencies among files and codes so it can easily repeat all necessary steps/codes instead of us worrying about the order.

R example — data and code clarification

We’ll take an Python example from DVC tutorial (written by Dmitry Petrov) and rewrite that code in R. With an example we’ll show how can DVC help during development and what are its possibilities.

Firstly, let’s initialize git and dvc on mentioned example and run our codes for the first time. After that we will simulate some changes in the codes and see how DVC works on reproducibility.

R codes can be downloaded from the Github repository. A brief explanation of the codes is presented below:

parsingxml.R — it takes xml that we downloaded from the web and creates appropriate csv file.

traintestspliting.R — stratified sampling by target variable (here we are creating test and train data set)

featurization.R — text mining and tf-idf matrix creation. In this part we are creating predictive variables.

train_model.R — with created variables we are building logistic regression (LASSO).

evaluate.R — with trained model we are predicting target on test data set. AUC is final output which is used as evaluation metric.

Firstly, codes from above we will download into the new folder (clone the repository):

$ mkdir R_DVC_GITHUB_CODE $ cd R_DVC_GITHUB_CODE $ git clone https://github.com/Zoldin/R_AND_DVC

DVC installation and initialization

On the first site it seemed that DVC will not be compatible to work with R because of the fact that DVC is written in Python and as that needs/requires Python packages and pip package manager. Nevertheless, the tool can be used with any programming language, it is language agnostic and as such is excellent for working with R.

Dvc installation:

$ pip3 install dvc $ dvc init

With code below 5 R scripts with dvc run are executed. Each script is started with some arguments — input and output file names and other parameters (seed, splitting ratio etc). It is important to use dvc run — with this command R script are entering pipeline (DAG graph).

$ dvc import https://s3-us-west-2.amazonaws.com/dvc-public/data/tutorial/nlp/25K/Posts.xml.zip \ data/ $ dvc run tar zxf data/Posts.xml.tgz -C data/ $ dvc run Rscript code/parsingxml.R \ data/Posts.xml \ data/Posts.csv $ dvc run Rscript code/train_test_spliting.R \ data/Posts.csv 0.33 20170426 \ data/train_post.csv \ data/test_post.csv $ dvc run Rscript code/featurization.R \ data/train_post.csv \ data/test_post.csv \ data/matrix_train.txt \ data/matrix_test.txt $ dvc run Rscript code/train_model.R \ data/matrix_train.txt 20170426 \ data/glmnet.Rdata $ dvc run Rscript code/evaluate.R \ data/glmnet.Rdata \ data/matrix_test.txt \ data/evaluation.txt $ cat data/evaluation.txt

Dependency flow graph on R example

Dependency graph is shown on picture below:

Dependency graph

DVC memorizes this dependencies and helps us in each moment to reproduce results.

For example, lets say that we are changing our training model — using ridge penalty instead of lasso penalty (changing alpha parameter to 0 ). In that case will change/modify train_model.R job and if we want to repeat model development with this algorithm we don’t need to repeat all steps from above, only steps marked red on a picture below:

DVC knows based on DAG graph that changed train_model.R file will only change following files: Glmnet.RData and Evaluation.txt . If we want to see our new results we need to execute only train_model.R and evaluate.R job . It is cool that we don’t have to think all the time what we need to repeat (which steps). dvc repro command will do that instead of us. Here is a code example :

$ vi train_model.R $ git commit -am "Ridge penalty instead of lasso" $ dvc repro data/evaluation.txt Reproducing run command for data item data/glmnet.Rdata. Args: Rscript code/train_model.R data/matrix_train.txt 20170426 data/glmnet.Rdata Reproducing run command for data item data/evaluation.txt. Args: Rscript code/evaluate.R data/glmnet.Rdata data/matrix_test.txt data/evaluation.txt $ cat data/evaluation.txt "AUC for the test file is : 0.947697381983095"

dvc repro always re executes steps which are affected with the latest developer changes. It knows what needs to be reproduced.

DVC can also work in an "multi-user environment” . Pipelines (dependency graphs) are visible to others colleagues if we are working in a team and using git as our version control tool. Data files can be shared if we set up a cloud and with dvc sync we specify which data can be shared and used for other users. In that case other users can see the shared data and reproduce results with those data and their code changes.

Summary

DVC tool improves and accelerates iterative development and helps to keep track of ML processes and file dependencies in the simple form. On the R example we saw how DVC memorizes dependency graph and based on that graph re executes only jobs that are related to the latest changes. It can also work in multi-user environment where dependency graphs, codes and data can be shared among multiple users. Because it is language agnostic, DVC allows us to work with multiple programming languages within a single data science project.