We agree that not versioning data files at all was not a solution. We also concede that versioning these files using git wasn’t efficient enough because it would drastically increase the amount of data in the repository.

Solutions like git-lfs avoid this problem by using pointers and remote storage. DVC follows the same principle, and it also handles several common remote storage solutions (S3, Azure, GCP …). We’ll see that DVC also makes it easier to work on data projects through stages and pipelines, resulting in a significant gain in productivity and collaboration.

Challenging our initial setup

Most of our data projects are based on huge files containing the input data used to build our codebase and train our learning algorithms. While we version our codebase using git, we don’t want to embed the input data in our git repository, so we choose to leave it untracked for the following reasons:

the data weighs several gigabytes

git adds little value for binary files (non-textual files such as PDFs or images)

we need to share the data with non-git users for labeling purposes.

For collaboration purposes, we build a zip version of the input data available on our cloud storage service and refer to its location in the README of the project. Also, we keep track of our performance metrics in a spreadsheet, linking each new result to its associated git hash.

This setup is efficient enough for relatively small projects but suffers from missing data versioning when the project gets bigger and more complex.

1. Structure the project

Say at first our project is just a single process taking inputs and producing outputs. Then we might want to split this process into several sub-processes to separate concerns and work on one part of the project or the other.

Input Data → Process → Results

becomes

Input Data → Process1 → Intermediate Data → Process2 → Results

Without data versioning, it becomes harder to identify which input data produced a given intermediate file, and which intermediate file in turn produced a given result.

2. Reproduce a previous state of the project

At some point, we need to re-compute evaluation metrics for previous versions of the codebase. There is no record of the state of the data at a previous milestone of the project. Indeed, we added new training data during the development phase without keeping any trace of this change.

3. Keep track of metrics files

For each experiment we run, the results are stored in a new metrics file prefixed by its timestamp. When an experiment is satisfying, we log the results in our project tracking spreadsheet. While running several experiments a day, we sometimes got lost trying to match a metrics file with its corresponding code improvement. To double check, we had to rerun the same experiments.

Our use case

To make it more concrete, we illustrate this article with a project we worked on: VAT auto-detection from receipts. In short, it consists in automatically retrieving the value-added tax amount from a receipt document in order to simplify accounting work.

At the start, we developed using the setup described above. The input data comprises a folder of documents and a spreadsheet containing ground-truth values. We used a single script to perform the whole VAT detection task.

Then at some point, our development iterations were too slow. Adopting a divide and conquer approach, we identified that the processing splits into two independent parts:

extract the text content from a raw file into html files

retrieve the global VAT amount from the text content.

However, storing html files added complexity in the way we handle data. We needed to gather all these files (raw and pre-computed ones) and to adapt CLI (Command Line Interface) workflows. We decided to keep a single script and added options so that we could either:

compute the whole pipeline (both extraction and VAT retrieval) on any receipt

perform extraction on receipts only and store the intermediate html

compute VAT detection from intermediate documents

You can imagine that it made things a bit more complicated. We also faced exactly what we described in points 2 and 3 above.

Fortunately, we kept a low complexity throughout the project development. It helped us deal with these shortcomings as we managed to reach the target performance. 🎉

However, we ended up with a clumsy procedure which would not fit a larger project where a spreadsheet could not glue it all. Thus we knew we needed a more rigorous way of versioning the input data and the intermediate files, and of associating the result metrics with the code, in order to create stability for upcoming projects.

Using DVC to track project’s data (and increase productivity)

Broadly speaking, DVC (Data Version Control) acts as a layer over git which produces versioned pointers to the files instead of the files themselves. The files themselves are stored in a local cache, and this cache can be synchronized with a remote storage. In the next paragraphs, we elaborate on the main features of DVC and explain how we applied it to the VAT auto-detection project.

DVC is a python package which can be installed with pip.

pip install dvc

Then, at the project root path, execute dvc init the same way you init git. It will create a .dvc/ folder containing the dvc cache and some other files. We won’t focus on the role of the cache but just keep in mind that

it helps recover file versions faster using reflinks

the files contained in the cache are uploaded when syncing with a remote storage.

Dvc flow for a file model.pkl and its associated pointer model.pkl.dvc : the pointer is versioned using git while model.pkl is synced with a remote storage. (source:https://github.com/iterative/dvc)

Versioning data files

First we expect dvc to track input data. The command dvc add data/dataset.csv makes git ignore this file and creates a pointer data/dataset.csv.dvc which contains the checksum of the current version of the dataset. You can find a clear list of the actions performed by the command in the dvc documentation. We then need to track this pointer file under git by executing git add data/dataset.csv.dvc . This pointer effectively links the current state of the dataset with the codebase.
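To make the pointer idea concrete, here is a minimal Python sketch of the principle: compute a checksum of the data file and store it in a small text file next to it. It is an illustration only. The helper names file_md5 and write_pointer and the pointer layout are assumptions, not DVC’s actual implementation or file format.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks to handle large data."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small pointer file next to the data file (hypothetical layout)."""
    data_path = Path(data_path)
    # dataset.csv -> dataset.csv.dvc
    pointer = data_path.with_name(data_path.name + ".dvc")
    pointer.write_text(f"md5: {file_md5(data_path)}\npath: {data_path.name}\n")
    return pointer
```

The pointer is a few bytes long whatever the size of the dataset, which is why it can live in git. Any change to the dataset changes the checksum, so the pointer committed alongside the code identifies exactly one version of the data.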

Data directory structure after tracking both dataset.csv and documents/ . It created a pointer for each of the resources and a .gitignore file.

Define project steps as stages

One of the main features of DVC is the definition of stages. A stage is a single command that has dependencies and produces outputs.

For example, we use the script train.py below to train a learning algorithm predicting if the document contains a VAT amount:
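As a hypothetical stand-in, a stage script of this kind can be sketched with the standard library only. The file paths are the ones used in the dvc run command of this article. The column names of dataset.csv, the assumption that the dataset stores document file names, and the toy token-based "model" are all assumptions made for illustration, not our actual training code.

```python
import csv
import json
import pickle
from collections import Counter
from pathlib import Path

# Inputs and outputs are hard-coded: they are the stage's dependencies and outputs.
DATASET = Path("data/dataset.csv")
DOCUMENTS = Path("data/documents")
MODEL = Path("vat_detection/has_vat_amount/assets/model.pkl")
METRICS = Path("metrics/has_vat_amount.json")


def collect_document(root, name):
    """Callback returning the text of one document (hypothetical layout)."""
    return (root / DOCUMENTS / name).read_text(encoding="utf-8")


def train(samples):
    """Toy stand-in for a learning algorithm: keep the tokens seen more often
    in documents with a VAT amount than in documents without one."""
    positive, negative = Counter(), Counter()
    for text, label in samples:
        (positive if label else negative).update(set(text.lower().split()))
    return {tok for tok, n in positive.items() if n > negative[tok]}


def predict(model, text):
    """Predict True when the text contains at least one token kept by train."""
    return any(tok in model for tok in text.lower().split())


def main(root="."):
    root = Path(root)
    with open(root / DATASET, newline="", encoding="utf-8") as f:
        # Assumed columns: "document" (file name) and "has_vat_amount" (0 or 1).
        rows = list(csv.DictReader(f))
    samples = [
        (collect_document(root, row["document"]), row["has_vat_amount"] == "1")
        for row in rows
    ]
    model = train(samples)
    accuracy = sum(predict(model, text) == label for text, label in samples) / len(samples)

    # Write the two outputs of the stage: the model and its metrics.
    (root / MODEL).parent.mkdir(parents=True, exist_ok=True)
    (root / METRICS).parent.mkdir(parents=True, exist_ok=True)
    with open(root / MODEL, "wb") as f:
        pickle.dump(model, f)
    (root / METRICS).write_text(json.dumps({"train_accuracy": accuracy}))
    return accuracy
```

Saved as train.py with a final call to main(), the script takes no command line options: everything it reads and writes is fixed, which is what lets a stage declare its dependencies and outputs once.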

This training example has two dependencies: the dataset, which is a .csv file, and the documents, which are accessed through the callback collect_document . The outputs are also twofold: the trained model as well as the model’s metrics.

Stages are run using dvc run [options] [command]. Among the options, we use:

-d for dependency: specify an input file

-o for output: specify an output file ignored by git and tracked by dvc

-M for metric: specify an output file tracked by git

-f for file: specify the name of the dvc file

command: a bash command, mostly a python script invocation

dvc run \
-d data/dataset.csv \
-d data/documents/ \
-o vat_detection/has_vat_amount/assets/model.pkl \
-M metrics/has_vat_amount.json \
-f train.dvc \
python train.py

Running the above command will execute our python script and create a pointer file which basically looks like the following:

deps: dependencies of the stage ; outs: outputs including metrics, if cached then untracked by git ; md5: checksum of stage

Finally we just have to version the pointer file with git

git add train.dvc

Before adopting DVC, we performed the different computations using a single script with many options. Every experiment produced numerous timestamped files.

Each step of the project now has its dedicated script with versioned inputs and outputs. This allows us to hard-code inputs and outputs in the script: we no longer need to pass options at execution. The result is much more modular.

Bundle stages into a pipeline

Outputs of one dvc stage become the dependencies of another dvc stage. A group of dvc stages sharing dependencies is called a pipeline. DVC tracks these dependencies:

dvc status indicates which stages have updated dependencies and thus need to be run again.

dvc repro re-runs all stages whose initial dependencies have changed.

Pipeline example for the project. Each blue box represents a stage ; extraction.dvc: pre-compute html version of documents ; split_dataset.dvc: split train and test data ; train.dvc: produce a learned model ; evaluate.dvc: assess performance on the test set.
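The dependency tracking behind dvc status and dvc repro can be pictured with a short Python sketch. It is a simplified illustration, not DVC’s implementation: the wiring between the four stages and the intermediate file names are assumptions, and the sketch considers that re-running a stage always changes its outputs.

```python
# Hypothetical wiring of the pipeline: stage name -> (dependencies, outputs).
STAGES = {
    "extraction.dvc": (["data/documents/"], ["data/html/"]),
    "split_dataset.dvc": (["data/dataset.csv"], ["data/train.csv", "data/test.csv"]),
    "train.dvc": (["data/train.csv", "data/html/"], ["model.pkl"]),
    "evaluate.dvc": (["data/test.csv", "data/html/", "model.pkl"], ["metrics.json"]),
}


def stages_to_rerun(stages, changed_files):
    """Return the stages affected, directly or transitively, by changed files."""
    dirty = set(changed_files)
    to_run = set()
    changed = True
    while changed:  # iterate until no new stage is marked
        changed = False
        for name, (deps, outs) in stages.items():
            if name not in to_run and dirty.intersection(deps):
                to_run.add(name)
                dirty.update(outs)  # outputs of a re-run stage count as changed
                changed = True
    return to_run
```

With this wiring, a change in data/documents/ marks extraction.dvc, train.dvc and evaluate.dvc, while split_dataset.dvc stays up to date. A changed file that no stage declares as a dependency, such as a source file, marks nothing at all.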

Usage Limits

DVC provides a command which reruns each stage of the whole pipeline whose dependencies changed:

dvc repro evaluate.dvc

However most of the changes come from the codebase which, in our case, is not tracked by dvc, thus we chose not to use this feature.

For instance, if we update the document extraction procedure’s code, we will need to run the stage again. However, the only dependency of this stage is data/documents/ and it has not changed. So dvc does not detect that we changed the code associated with this stage. To make dvc rerun the stage, we would have to add the codebase to the stage dependencies.

To deal with this issue, we chose to define aliases to dvc run commands in a Makefile, because the commands are quite long and a Makefile centralizes them. We also use the option --ignore-build-cache to force re-running stages even if dependencies are up to date.

It allows us to keep control of what’s been updated or not. Additionally we can still take advantage of the dvc status command to figure out the current state of the pipeline when some stages have been updated.

Example of Makefile for the VAT auto-detection project

In other words, we chose not to use the top-level abstraction provided by the command dvc repro but rather to work one level below in applying per-stage commands.

An extra pinch of DVC features

One cool aspect of having all this data versioned alongside the code is that you can return to any state of the project.

git checkout old_state

dvc checkout

For storage and collaboration purposes, we configured AWS S3 as a remote dvc storage. Then we synchronize the remote using DVC’s push and pull commands.

The cache isn’t cleared automatically. When its footprint gets too large, we use dvc gc to remove the files that are no longer used. Make sure to run a dvc push beforehand to prevent any data loss.

Packaging as a library

Once we reach the performance target, we move on to packaging our codebase as a library. For a smooth integration into our production system we use setuptools .

A word about our production system

A micro-service comprises a synchronous endpoint receiving documents which dispatches jobs to an asynchronous worker. These jobs then import the VAT detection library to perform their tasks.

The library needs to be installed via pip. We add the following to our micro-service’s requirements.txt file.

git+https://{host}/vat_detection@{reference}#egg=vat_detection{version}
# host is the name of the git remote server
# reference can be a git hash or a tag or a branch name
# version is the version of the package as defined in setup.py

However assets are not versioned with git: they will not be embedded while downloading the package. Here, assets are outputs of dvc stages, like a trained model of some kind of learning algorithm.

The build will still succeed but as soon as we instantiate an object depending on the missing assets, we will face a legitimate FileNotFoundError .

To bypass this issue, we analyzed the way pip install works. It successively:

downloads the package into a temporary directory

builds the package

installs the package in the current environment

We need the assets to be downloaded before the package is built. We chose to wrap setup.py build_py to first pull dvc assets as follows:
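A minimal sketch of such a wrapper, assuming setuptools is installed and the dvc command is available on the PATH. The names pull_dvc_assets and BuildPyWithAssets are hypothetical and chosen for illustration.

```python
import subprocess

from setuptools.command.build_py import build_py


def pull_dvc_assets(run=subprocess.check_call):
    """Fetch the dvc-tracked assets; needs dvc installed and a reachable remote."""
    run(["dvc", "pull"])


class BuildPyWithAssets(build_py):
    """build_py variant that pulls dvc assets before the package files are copied."""

    def run(self):
        pull_dvc_assets()
        super().run()


# In setup.py, register the command so that pip uses it when building:
#     setup(..., cmdclass={"build_py": BuildPyWithAssets})
```

Because the pull happens inside build_py, a failing dvc pull raises an error and stops the build, rather than producing a package that installs without its assets.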

This is a bit hacky as it requires dvc to be installed (and dvc remote to be reachable) before building the vat_detection package. Still, it works well for our use case.

Takeaways

DVC brought versioning for inputs, intermediate files and algorithm models to the VAT auto-detection project and this drastically increased our productivity.

Moreover, it also forces us to work with a clean framework to manage data in our projects. Indeed, DVC stages offer an effortless way to split a project into atomic steps.

Finally, we no longer need to think about how to store our data for collaboration purposes as we just have to define a remote storage url in the dvc configuration file.