The Hadoop community is working on patches that will bring the popular app-containerization technology Docker into the data management system, and independent benchmarks show the tech delivering a big speedup over traditional virtualization approaches.

Docker is an open source Linux containerization technology that uses underlying kernel features like namespaces and cgroups, along with the LXC userspace tools, to let an admin run multiple apps with all their dependencies in secure sandboxes on the same underlying Linux OS. That makes it an attractive alternative to typical virtualization, which bundles a copy of the OS with each app.

In a set of benchmarks an IBM employee released on Thursday, the company showed that Docker containerization has some huge advantages over the KVM hypervisor from a performance perspective.

Alongside this, El Reg has discovered some fascinating work by the Hadoop community to bring the tech into the eponymous data analysis and management engine.

Combined, these crumbs of news lend weight to the idea that Docker could eventually replace traditional virtualization approaches, granting organizations big benefits from an open source tech.

To start with, benchmarks conducted by IBM show that Docker has a number of performance advantages over the KVM hypervisor when running on the open source cloud infrastructure tool OpenStack.

In an informative post published on Thursday, IBM chap Boden Russell goes into further detail about the results.

"From an OpenStack Cloudy operational time perspective (boot, reboot, delete, snapshot, etc.) docker LXC outperformed KVM ranging from 1.09x (delete) to 49x (reboot)," Russell wrote. "Based on the compute node resource usage metrics during the serial VM packing test: Docker LXC CPU growth is approximately 26x lower than KVM. On the surface this indicates a 26x density potential increase from a CPU point of view using docker LXC vs a traditional hypervisor. Docker LXC memory growth is approximately 3x lower than KVM. On the surface this indicates a 3x density potential increase from a memory point of view using docker LXC vs a traditional hypervisor."

Impressive stuff, indeed.
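
Why "on the surface"? A node fills up when its scarcest resource runs out, so a 26x CPU figure and a 3x memory figure cannot both apply at once. The Python sketch below illustrates that arithmetic under a toy model. The per-instance costs and node capacities are placeholder numbers chosen only to reproduce the 26x and 3x ratios quoted above, not IBM's measurements, and the function names are our own.

```python
def density_ratio(kvm_growth, docker_growth):
    """Ratio of per-instance resource growth: how many more instances
    could fit if this resource were the only limit."""
    return kvm_growth / docker_growth


def max_instances(cpu_capacity, mem_capacity, cpu_per_instance, mem_per_instance):
    """Instances that fit on one node; the scarcest resource sets the limit."""
    return min(cpu_capacity // cpu_per_instance, mem_capacity // mem_per_instance)


# Placeholder figures (assumptions, NOT measured data): arbitrary CPU units
# and megabytes, picked so the ratios match the quoted 26x and 3x.
KVM_CPU, DOCKER_CPU = 26, 1
KVM_MEM, DOCKER_MEM = 300, 100
NODE_CPU, NODE_MEM = 2600, 30000

cpu_ratio = density_ratio(KVM_CPU, DOCKER_CPU)
mem_ratio = density_ratio(KVM_MEM, DOCKER_MEM)

kvm_fit = max_instances(NODE_CPU, NODE_MEM, KVM_CPU, KVM_MEM)
docker_fit = max_instances(NODE_CPU, NODE_MEM, DOCKER_CPU, DOCKER_MEM)

print("CPU-only density ratio:", cpu_ratio)
print("Memory-only density ratio:", mem_ratio)
print("Instances per node (KVM, Docker):", kvm_fit, docker_fit)
print("Combined density ratio:", docker_fit / kvm_fit)
```

With these placeholder numbers the combined ratio comes out at 3, not 26: memory runs out long before CPU does. Real-world density would depend on the actual workload and hardware.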

Altiscale wants to spin a Docker YARN

Not only does Docker have desirable resource-usage characteristics, but the way it allows devs to package up applications has attracted attention from the open source Hadoop community.

Recently we learned that some people are diligently working to add Docker support into a crucial component of Apache Hadoop 2.0 named YARN, with the goal of increasing the usefulness of both techs.

YARN was introduced in version two of Apache Hadoop. It lets the software run multiple applications within Hadoop rather than purely MapReduce jobs. Thanks to this, YARN is helping to transform Hadoop from a batch processing and storage system into a more general tool for manipulating and storing data.

By combining YARN with Docker, the community hopes it can make it trivial for developers to package up an application in a Docker container, then sling it onto the YARN tech as part of a larger Hadoop installation.

Altiscale, the company behind the code contributions that make this possible, was kind enough to answer some of our questions about why this could be useful.

"As a company building a Hadoop as a Service platform, we are particularly interested in YARN as it allows Hadoop to move beyond map-reduce to a much more diverse variety of applications," explained the company's chief executive Raymie Stata to El Reg via email. "One of the key components of YARN that make this possible are containers. The existing YARN container implementation does not adequately provide all the types of isolation required to address a scenario we are noticing with our larger customers – multiple, independent groups in the same organization with different software requirements."

By adding in Docker support, Altiscale hopes it can flatten some of the barriers that lie between enterprise developers and a greater use of Hadoop.

"A common struggle for users is software dependency management," Stata explained. "Docker provides an intriguing approach to solving that problem by allowing users to upload prepackaged environments (or images) into repositories which can then easily be downloaded and run in isolation. For example, there are public repositories in the Docker community called Docker registries which provide a variety of language environments such as Java and Ruby. There is also support for private repositories where containers with more specialized environments can be placed."

Other members of the Hadoop community are keen on the addition of Docker as well.

"Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the *entire* unix filesystem image for any YARN container," explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.

"This way, instead of forcing the user to deal with individual files or binaries (as today) we can allow the application to package up the *entire* Unix filesystem image it needs as Docker image and then get perfect predictability, from an environment perspective, at runtime. This is where Docker has the most amount of interest to the YARN/Hadoop community - particularly for people packaging up complex applications which need their own version of perl, python, java, libc etc. etc. ... that is hard to manage on YARN currently."
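
To make Murthy's point concrete, here is a rough Python sketch of how a launcher could wrap a task in a Docker image instead of shipping individual files. It is not the Altiscale patch or any YARN API: the function name, image name, and paths are hypothetical, and the only real interface assumed is the docker command line's run subcommand with its --rm and -v options.

```python
import shlex


def build_docker_command(image, task_argv, mounts=None):
    """Build the argv for running a task inside a Docker image.

    image     -- image holding the task's whole filesystem (perl, python, libc...)
    task_argv -- the command to run inside the container, as a list
    mounts    -- optional {host_path: container_path} directories to share
    """
    argv = ["docker", "run", "--rm"]  # --rm: remove the container on exit
    for host_path, container_path in (mounts or {}).items():
        argv += ["-v", f"{host_path}:{container_path}"]  # bind-mount a host dir
    argv.append(image)
    argv += list(task_argv)
    return argv


# Hypothetical example: the image name and paths are made up for illustration.
command = build_docker_command(
    "example/analytics-env:1.0",
    ["python", "job.py", "--input", "/work/in"],
    mounts={"/tmp/task-scratch": "/work"},
)
print(shlex.join(command))
```

The task never touches the host's own interpreters or libraries; everything it needs travels inside the image, which is the predictability Murthy describes.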

Adding Docker to YARN looks potentially useful, and is another example of the enthusiasm with which Silicon Valley has adopted the young open source technology.

This follows Red Hat announcing broad support for Docker in its eponymous Linux distribution, and launching a project named "Atomic" built around the tech.

Amazon also recently added Docker support to its "Elastic Beanstalk" platform-as-a-service cloud.

These moves back up an earlier assertion by a Red Hat employee that: "Docker as a packaging tool for shipping software may be a game changer". ®