Docker for Data Science

Docker is a tool that simplifies the software installation process. Coming from a statistics background, I used to care very little about how software got installed, yet would occasionally lose a few days trying to resolve system configuration issues. Enter the god-send Docker almighty.

Think of Docker as a lightweight virtual machine (I apologise to the Docker gurus for using that term). Generally, someone writes a *Dockerfile* that builds a *Docker image* containing most of the tools and libraries you need for a project. You can use this as a base and add whatever other dependencies your project requires. Its underlying philosophy is: if it works on my machine, it will work on yours.

What’s in it for Data Scientists?

Time: The amount of time you save by not installing packages makes this framework worth it on its own.

Reproducible research: I think of Docker as akin to setting the random number seed in a report. The same dependencies and versions of libraries that were used on your machine are used on the other person's machine, so the analysis you generate will run on any other analyst's machine.

Distribution: You are not only distributing your code, but also the environment in which that code was run.

How Does it Work?

Docker employs the concept of (reusable) layers: each instruction you write inside the Dockerfile creates a layer. For example, you would usually start with:

FROM ubuntu

RUN apt-get update && apt-get install -y python3

This Dockerfile would install python3 (as a layer) on top of the Ubuntu layer. The apt-get update is needed because the base image ships without package lists, and -y answers the install prompt so the build does not hang waiting for input.

Essentially, for each project you write all of the apt-get install, pip install, etc. commands into your Dockerfile instead of executing them locally.
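To sketch how those layers accumulate, here is a minimal Dockerfile for a hypothetical Python analysis project (the package choices and the analysis.py filename are illustrative assumptions, not a prescribed setup):

```dockerfile
# Each instruction below becomes its own cached, reusable layer.
FROM ubuntu

# System packages: refresh the package lists, install non-interactively.
RUN apt-get update && apt-get install -y python3 python3-pip

# Python libraries -- pin exact versions for stricter reproducibility.
RUN pip3 install numpy pandas scikit-learn

# Copy the analysis code into the image and set the working directory.
COPY . /app
WORKDIR /app

# Default command when a container is started from this image.
CMD ["python3", "analysis.py"]
```

Because layers are cached, editing analysis.py and rebuilding only re-runs the COPY step onwards; the slow apt-get and pip layers are reused.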

I recommend working through the tutorial at https://docs.docker.com/get-started/ to get started with Docker. The learning curve is minimal (two days' work at most) and the gains are enormous.

Dockerhub

Lastly, Docker Hub deserves a special mention. Personally, I think Docker Hub is what makes Docker truly powerful. It is to Docker what GitHub is to git: an open platform for sharing your Docker images. You can always build a Docker image locally using docker build, but it is good practice to push that image to Docker Hub so that the next person simply has to pull it.
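The build-and-share workflow described above looks roughly like this on the command line (your-username/ds-image is a placeholder image name, and pushing assumes you have run docker login first):

```shell
# Build an image from the Dockerfile in the current directory.
docker build -t your-username/ds-image .

# Share it on Docker Hub.
docker push your-username/ds-image

# On any other machine, the next person just pulls and runs it.
docker pull your-username/ds-image
docker run -it your-username/ds-image
```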

My Docker image for Machine Learning and data science is available here, along with its source file.

Concluding Thoughts

Personally, I have started including a Dockerfile in most if not all of my GitHub repos, especially since it means I never have to deal with installation issues again.

Docker is one of the tools that every software engineer (and now every data scientist and analyst) should have in their repertoire, held in almost the same regard as git. For a long time statisticians and data scientists have ignored the software aspect of data analysis. Considering how simple and intuitive Docker has become to use, there really is no excuse for not making it part of your software development pipeline.

Edit 1

If you are after a more substantial tutorial than the quick tips above, see this video (jump to around 4:30):

Edit 2 (A quick note on virtualenvs for python, packrat for R etc.):

Personally, I have not used any of these environment-management tools, but it should be noted that Docker is independent of Python and R, and goes beyond managing dependencies for a specific programming language: it captures the entire system environment.