The world is rushing to embrace Docker containers as the new, easy way to package applications. As cool as it is to package applications with containers, the biggest challenge companies face, as I have written before, is actually deploying their Docker containers – especially at scale.

Well, nothing spells scale in computing quite like CERN, the particle and nuclear physics research institute outside Geneva, Switzerland. It’s the home of the Large Hadron Collider and some 8,000 scientists from 500 universities. CERN and Mesosphere, whose pioneering concept of a data center operating system (DCOS) I wrote about in 2014, are working on a clever approach to solving the problem of scaling Docker containers in production.

I recently spoke with Mesosphere co-founder and Apache Mesos creator Benjamin Hindman to learn more.

ReadWrite: What’s so challenging about storing and shipping Docker containers in production?

Benjamin Hindman: One of the beauties of containers as a way to package applications is that you don’t need to include an operating system, unlike with a virtual machine image, which means you can keep them small. The smaller the image, the less that you need to store and the less that you need to send around the network.

But it’s easy for container images to get huge. As you add all the dependent libraries and supporting files to the container’s file system, the images can get very big. And sometimes, without even knowing it, you end up adding things you don’t actually need.

Docker uses what it calls layers to help reuse parts of the file system between containers. Using layers can help the containers stay small, assuming all of the containers properly build on top of preexisting layers. But accomplishing this takes diligence, and from what I’ve seen in practice, this rarely is the case.

RW: Explain to me why Docker’s layers don’t work in production.

BH: From what I’ve seen in practice it’s pretty rare that developers diligently build on top of each other’s layers. In fact, it’s almost too easy to create a layer that diverges in such a way that another developer won’t want to build on top of your layer, because doing so would bring in stuff their application doesn’t need.

The consequences of this can be pretty severe, though.

For example, consider two cases: one where everyone ends up adding some library to their container image independently, and another where they build on top of a layer that includes the library. In the world where the library is contained within the layer, you’ll only have to download the layer the first time you launch a container that uses it, and all subsequent containers will get to reuse it.

In the world where each container image includes the library independently, you’ll have to re-download the bits for the library every single time. This can be extremely wasteful of both repository storage and network bandwidth. Repository sizes explode, gigabytes of Docker downloads choke the network, and storage requirements go through the roof.
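To put rough numbers on the two cases, here is a minimal Python sketch. The image sizes, the library size, and the function name `bytes_downloaded` are all made up for illustration; they do not come from Mesosphere or Docker.

```python
def bytes_downloaded(app_sizes, lib_size, shared_layer):
    """Total bytes one host pulls to run every image in app_sizes."""
    if shared_layer:
        # The library layer is fetched once and reused by every later image.
        return lib_size + sum(app_sizes)
    # Each image carries its own private copy of the library.
    return sum(size + lib_size for size in app_sizes)


MB = 1024 * 1024
apps = [20 * MB] * 10   # ten hypothetical images with 20 MB of app code each
lib = 200 * MB          # one hypothetical 200 MB library

shared = bytes_downloaded(apps, lib, shared_layer=True)
independent = bytes_downloaded(apps, lib, shared_layer=False)
print(shared // MB, independent // MB)  # prints "400 2200"
```

Under these assumed sizes, the host pulls 400 MB when the library sits in a shared layer and 2,200 MB when every image bundles its own copy.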

At Mesosphere we see customers struggling with this problem in production. We tested a number of alternative solutions.

One approach was to push containers out to a few nodes and then let them self-propagate using peer-to-peer technologies. That led to the insight that we should really just be looking inside the container image and shipping only the data that we haven’t already shipped. That is to say, we should focus on and address the content that we need within the container rather than the layers that contain the content.
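The idea of shipping only content that a node has not already received can be sketched in a few lines of Python. This is an illustration of the general technique, not Mesosphere's implementation; the function names, file paths, and file contents are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def digest(data):
    """Identify a blob by the hash of its content, not by its path or layer."""
    return hashlib.sha256(data).hexdigest()


def content_to_ship(image_files, digests_on_node):
    """Return {digest: bytes} for content the node does not hold yet."""
    to_ship = {}
    for data in image_files.values():
        d = digest(data)
        if d not in digests_on_node:
            to_ship[d] = data
    return to_ship


# Two hypothetical images that bundle the same library under different paths.
image_a = {"/lib/libfoo.so": b"library bits", "/bin/a": b"app a code"}
image_b = {"/usr/lib/libfoo.so": b"library bits", "/bin/b": b"app b code"}

# The node has already received everything in image_a.
on_node = {digest(data) for data in image_a.values()}

delta = content_to_ship(image_b, on_node)
print(len(delta))  # prints "1": only the app b code needs to travel
```

Because the library is identified by its content, it is recognized as already present even though the two images were built independently and store it at different paths.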

RW: How did integrating CVMFS from CERN come about to solve this problem?

BH: I was in Switzerland giving a talk at CERN and I got to meet with some of the team that had built CernVM-FS (CVMFS), a technology originally developed by CERN back in 2008. At the time, CERN was looking into hardware virtualization in much the same way that people are trying out containers today, as a way to work out how best to deploy applications. Instead of creating images or packages, CERN wanted to use a globally distributed file system. This would allow scientists to install their software once on a web server, and then access it from anywhere in the world. When I was in Geneva their team gave me a demo of CVMFS and I could immediately see that it was a perfect match for containers and would solve our problem.

CVMFS is perfect for propagating containers because it uses a combination of extensive indexing, de-duplication, caching and geographic distribution to minimize the number of components associated with individual downloads, and it’s all automated. This significantly reduces the amount of duplicate data that needs to be transferred and greatly speeds up the transfer of files that share data.

We realized that if we integrated CVMFS with Apache Mesos and the Mesosphere DCOS we could massively reduce the redundant data transfers and make container distribution very fast. That was our aha moment!

RW: How do Apache Mesos and the Mesosphere DCOS deal with containers?

BH: Mesos and the DCOS rely on what we call “containerizers.” Containerizers are responsible for isolating running tasks from one another, as well as for limiting the resources (such as CPU, memory, disk and network) available to each task. A containerizer also provides a runtime environment for the task, which we call a container. A container itself is not a primitive or first-class object in Linux; it’s more of an abstraction built from control groups (cgroups) and namespaces. The Mesos containerizer supports all the different image formats that exist today, including Docker and appc.

RW: How far do you think this integration of Apache Mesos and CVMFS can scale?

BH: Theoretically it should scale to millions of containers. We’re testing now. The good news is that we already know that Mesos and DCOS can scale and that CVMFS can scale.

And the actual integration work was straightforward. The way it works is that instead of downloading the entire image up front, our integration uses the CVMFS client to mount the remote image root directory locally. It takes as its input the name of the CVMFS repository (which internally is mapped to the URL of the CVMFS server) as well as a path within the repository that is to be used as the container image root.

So now you can have multiple container images published within the same CVMFS repository. From the point of view of the containerizer, nothing changes. It is still dealing with the local directory that contains the image directory tree, on top of which it needs to start the container.
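As a rough illustration of how a repository name plus a path can resolve to a local image root, here is a small Python sketch. It assumes the repository appears under `/cvmfs/<repository>`, the mount point conventionally used by the CVMFS client; the interview does not say where the Mesos integration mounts it. The function `image_root`, the repository name `images.example.org`, and the image paths are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: conventional CVMFS client mount point, not confirmed by the source.
CVMFS_MOUNT_ROOT = PurePosixPath("/cvmfs")


def image_root(repository, path_in_repo):
    """Map (repository name, path inside it) to the local image root directory."""
    rel = PurePosixPath(path_in_repo.lstrip("/"))
    if ".." in rel.parts:
        raise ValueError("image path must stay inside the repository")
    return CVMFS_MOUNT_ROOT / repository / rel


# Two hypothetical images published in the same repository.
ubuntu_root = image_root("images.example.org", "ubuntu/14.04")
centos_root = image_root("images.example.org", "/centos/7")
print(ubuntu_root)
print(centos_root)
```

Whatever the real mount point is, the containerizer only ever sees an ordinary local directory path like these, which is why nothing changes from its point of view.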

The big advantage, however, is that the fine-grained deduplication of CVMFS (based on files or chunks rather than on layers, as with Docker) means we now can start a container without actually having to download the entire image. Once the container starts, CVMFS downloads the files necessary to run the task on the fly. Because CVMFS uses content-addressable storage, we never need to download the same file twice. The end result is a much more efficient way to deploy Docker containers at massive scale without blowing up storage capacity and choking network bandwidth.
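The on-demand behavior described here can be modeled with a toy Python class. This is a simplified sketch of the general technique, not CVMFS code: the class `LazyImage`, the in-memory `store` standing in for the server, and the file names are all hypothetical, and real CVMFS also works with chunks, catalogs, and proxies that are left out.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).hexdigest()


class LazyImage:
    """Toy model of an image whose files are fetched only when first read."""

    def __init__(self, catalog, remote_store, cache):
        self.catalog = catalog            # path -> content digest
        self.remote_store = remote_store  # digest -> bytes, stands in for the server
        self.cache = cache                # digest -> bytes, shared by images on a node
        self.downloads = 0

    def read(self, path):
        d = self.catalog[path]
        if d not in self.cache:
            # Content is transferred only on first access to this digest.
            self.cache[d] = self.remote_store[d]
            self.downloads += 1
        return self.cache[d]


blobs = (b"library bits", b"app a code", b"app b code")
store = {digest(blob): blob for blob in blobs}
node_cache = {}

image_a = LazyImage({"/lib/libfoo.so": digest(b"library bits"),
                     "/bin/a": digest(b"app a code")}, store, node_cache)
image_b = LazyImage({"/lib/libfoo.so": digest(b"library bits"),
                     "/bin/b": digest(b"app b code")}, store, node_cache)

image_a.read("/lib/libfoo.so")  # first access: fetched from the store
image_b.read("/lib/libfoo.so")  # same digest: served from the node cache
print(image_a.downloads, image_b.downloads)  # prints "1 0"
```

Creating the two images transfers nothing; data moves only when a file is read, and the second image gets the shared library from the cache because its content hash is already present.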