Let’s get started!

Installation

Installation is straightforward with the command below:

pip install dvc

To verify the installation, type dvc in the terminal. If you see a list of DVC command options, you are on the right track.
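If you would rather run this check from a script (for example, as part of a project setup step), a minimal sketch is to look for the executables on the PATH. The helper name is_tool_installed is hypothetical and not part of DVC; it only reports whether an executable with that name can be found, not whether it runs correctly.

```python
import shutil


def is_tool_installed(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # Both tools are needed for the rest of this walkthrough.
    for tool in ("git", "dvc"):
        status = "found" if is_tool_installed(tool) else "NOT found"
        print(f"{tool}: {status}")
```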

For the demonstration, I will be using the repository dvc-sample with the following project structure:

dvc-sample
├── artifacts
│   ├── dataset.csv
│   └── model.model
└── src
    ├── preprocessor.py
    └── trainer.py

The repository has a simple structure: a src folder holding the Python scripts (version controlled by git) and an artifacts folder holding the datasets, model files and the rest of the artifacts (which are bigger and need to be version controlled by dvc).
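If you want to follow along without the original repository, the layout can be recreated with empty placeholder files. This is only a sketch: the function name create_skeleton is made up for illustration, and the example writes into a temporary directory so that it does not touch your real projects.

```python
import tempfile
from pathlib import Path
from typing import List


def create_skeleton(root: Path) -> List[str]:
    """Create the sample layout under `root` and return the relative file paths."""
    files = [
        "artifacts/dataset.csv",
        "artifacts/model.model",
        "src/preprocessor.py",
        "src/trainer.py",
    ]
    for rel in files:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()  # empty placeholder file
    # Walk the tree again so the result reflects what is actually on disk.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for rel in create_skeleton(Path(tmp) / "dvc-sample"):
            print(rel)
```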

Initializing dvc

The first thing we have to do is to initialize dvc in the root of the project. We do it with the command below:

dvc init

(This is very similar to git init; we only have to do it once while setting up the project.)

At this point, we have added dvc support to the project. But we still have to specify the folders which we want to version control using dvc. In this example, we will be versioning the artifacts folder. We do it using the command below:

dvc add artifacts

The above command did two things:

1. Specified which folder we want to track using dvc (by creating a metafile, artifacts.dvc)

2. Added the same folder to .gitignore (as we don’t want to track the folder with git anymore)

After executing the above command, dvc tells us to add these two files (artifacts.dvc and .gitignore) to git.

Now we add these files to git using the commands below:

git add .

git commit -m 'Added dvc files'

Note: the meta-files of the artifacts folder are tracked by git, while the actual artifact files are tracked by dvc. In this case, artifacts.dvc is tracked by git, and the contents of the artifacts folder are tracked by dvc.
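To build some intuition for this split, here is a conceptual sketch in Python. It is not how DVC is implemented: the function name fingerprint_folder and the choice of SHA-256 are assumptions made purely for illustration. The idea it shows is that a small text value derived from the folder's contents changes whenever the large files change, so committing that small value to git is enough to pin down a version of the data.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint_folder(folder: Path) -> str:
    """Return one hex digest covering every file name and content in `folder`."""
    digest = hashlib.sha256()  # hash choice is arbitrary for this illustration
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        artifacts = Path(tmp) / "artifacts"
        artifacts.mkdir()

        (artifacts / "dataset.csv").write_text("version 1")
        before = fingerprint_folder(artifacts)

        (artifacts / "dataset.csv").write_text("version 2")
        after = fingerprint_folder(artifacts)

        # The small fingerprint differs once the data has changed.
        print("fingerprint changed:", before != after)
```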

It's okay if this is not very clear now; we will look at it closely later on.

At this point, we have added dvc to our project along with git and have also added the folder which we want to track using dvc.

Now let’s look at a typical Machine Learning workflow (simplified version):

1. We have a dataset

2. We do some preprocessing on the above dataset using a Python script

3. We train a model using a Python script

4. We have a model file which is the output of step #3

The above is a repetitive process: we use multiple datasets, with different preprocessing pipelines, to build and test various Machine Learning models. This is what we want to version control, so that we can easily reproduce previous versions whenever required.

For the above scenario, we track #2 and #3 using git, as these are small code files, and #1 and #4 using dvc, as these can be large (up to a few GB).

Have a look at the directory structure again for more clarity:

dvc-sample
├── artifacts
│   ├── dataset.csv       #1
│   └── model.model       #4
└── src
    ├── preprocessor.py   #2
    └── trainer.py        #3

For simplicity, at any given point the content of each of the 4 files above will simply be the version it belongs to.

Let’s say we have written the first version of the preprocessor and training scripts, which were used on a dataset to build the model. The 4 files look like this right now:

State of files at Version 1

Tracking large files

Now we have to commit our code and the artifacts (dataset and model files). We do it in 3 steps:

1. We track changes in artifacts using dvc

dvc add artifacts/

(This tracks the latest version of the files inside the artifacts folder and modifies the meta-file artifacts.dvc)

2. We track changes in the code scripts and the updated meta-file (artifacts.dvc) using git

git add .

git commit -m 'Version 1: scripts and dvc files'

3. We tag this state of the project as experiment01 using git

(This will help us to roll back to a version later)

git tag -a experiment01 -m 'experiment01'
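Since these three steps are repeated for every experiment, it can help to bundle them. The sketch below is one possible way to do that in Python, under a few assumptions: the function names version_commands and snapshot are hypothetical, the commands are exactly the ones shown above, and by default the script only prints them (dry_run=True) so that nothing is changed by accident.

```python
import shlex
import subprocess
from typing import List


def version_commands(tag: str, message: str) -> List[List[str]]:
    """Build the dvc/git commands for the three steps, in order."""
    return [
        ["dvc", "add", "artifacts/"],          # step 1: track artifacts with dvc
        ["git", "add", "."],                   # step 2: stage scripts and artifacts.dvc
        ["git", "commit", "-m", message],      # step 2: commit them
        ["git", "tag", "-a", tag, "-m", tag],  # step 3: tag this state
    ]


def snapshot(tag: str, message: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in version_commands(tag, message):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    snapshot("experiment01", "Version 1: scripts and dvc files")
```

Calling snapshot(..., dry_run=False) from the root of the project would run the commands for real; it assumes both git and dvc are installed and already initialized there.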