As Hadoop becomes more and more mainstream, many development leaders want to speed up and reduce errors in their development and deployment processes (i.e., devops) by using platforms like PaaS and lightweight runtime containers. One of the most interesting recent stats in the devops arena is that companies with high-performing devops processes can ship code 30x more frequently and complete deployment processes 8,000 times faster.

To this end, Docker is a new but fast-rising lightweight virtualization solution (more precisely, a lightweight Linux isolation container). Basically, Docker allows you to package and configure a runtime and deploy it on Linux machines—it’s build-once-run-anywhere, isolated like a virtual machine, and runs faster and lighter than traditional VMs. Today, I will show you how two components of the Pivotal Big Data Suite—our Hadoop distribution, Pivotal HD, and our SQL interface on Hadoop, HAWQ—can be quickly and easily set up to run on a developer laptop with Docker.

With the Docker model, we can literally turn heavyweight app environments on and off like a light switch! The steps below typically take less than 30 minutes: 1) Download and Import the Docker Images, 2) Run the Docker Containers, 3) SSH in to the Environment to Start Pivotal HD, 4) Test Hadoop’s HDFS and MapReduce, 5) Start HAWQ—SQL on Hadoop, and 6) Test HAWQ.

If you would prefer to see a video, here is a demonstration of Hadoop on Docker.

Hadoop on Docker—Architecture

This diagram explains the overall deployment of Pivotal HD and HAWQ across several Docker containers. Basically, the workloads run on a Hadoop master node (e.g., the NameNode and related services), some Hadoop worker nodes (e.g., DataNodes), and a HAWQ master with two segment servers.

There are a few other components worth mentioning:

tar files – These are the Docker image files. In the future, we plan to upload these to Docker’s repository so that you can pull them directly from Docker. Currently, you need to download a gzipped file from our repository.

Containers – These are the Docker containers that contain Pivotal Command Center (our Pivotal HD cluster orchestration tool) and the deployed Pivotal HD and HAWQ components. You will NOT have to install and deploy Pivotal HD. It is already built as part of the Docker files!

Other libraries – DNS and SSH servers are set up to work for the cluster.

That’s it. You don’t need any other files—the tar images contain everything you need to set this up on your own laptop or development environment.



Hadoop on Docker—Environments and Prerequisites

Currently, I run this entire environment on my development laptop, and the specs are below. It’s a decent set-up in terms of compute and memory:

Ubuntu 13.10 (on Windows 7 using VirtualBox)
Intel i5 2.60GHz, 2 CPUs allocated
10GB of memory allocated

In addition, I run this Hadoop on Docker environment on an Amazon Web Services Ubuntu 13.10 64-bit m2.xlarge virtual machine. If you are using Amazon, make sure that your root directory has plenty of space, since /var/lib/docker will be used for image extraction. The AMI is ubuntu-saucy-13.10-amd64-server-20140212.
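A quick way to check how much space is available on the filesystem backing Docker’s image store (a sketch; /var/lib/docker is Docker’s default root, so adjust the path if yours differs):

```shell
# Show free space on the filesystem that holds Docker's image store.
# /var/lib/docker is the default location; fall back to / if it does not exist yet.
df -h /var/lib/docker 2>/dev/null || df -h /
```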

There aren’t many other software prerequisites. Basically, you need Docker 0.9.1 installed, and the Docker team has a good setup document.

In theory, this Hadoop on Docker install should work on any Linux system, but I have only tested it on Ubuntu 13.10 64-bit. Also, make sure that you aren’t running any other containers (e.g., the docker ps command does not return any container IDs). There are some hardcoded values and limitations, which I will fix in the future.



6 Simple Steps to Start Hadoop with SQL on Docker

1. Download and Import the Hadoop on Docker Images

First, you are going to download the tarball, extract it, and import the images into Docker. Remember to verify that the MD5 checksum of the download is correct.
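For example, the checksum can be verified before extracting (a sketch; compare the output against the MD5 value published alongside the download):

```shell
# Compute the MD5 checksum of the downloaded tarball and compare it
# against the published value before extracting.
md5sum phd-docker.tar.gz
```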

tar -xvf phd-docker.tar.gz

sudo su # Make sure you run the docker commands as root.

cd images

cat phd-master.tar | docker import - phd:master

cat phd-slave1.tar | docker import - phd:slave1

cat phd-slave2.tar | docker import - phd:slave2

2. Run the Hadoop on Docker Containers

Once the images are imported, we can run the Hadoop on Docker containers. This part of the process requires some parameters, but don’t be afraid. It should work in your environment without any problem.

# Set a variable

DOCKER_OPTION="--privileged --dns 172.17.0.2 --dns 8.8.8.8 -d -t -i"

# Start master container

docker run ${DOCKER_OPTION} -p 5443 --name phd-master -h master.mydomain.com phd:master bash -c "/tmp/phd-docker/bin/start_pcc.sh"

# Start slave containers

for x in {1..2} ; do docker run ${DOCKER_OPTION} --name phd-slave${x} -h slave${x}.mydomain.com phd:slave${x} /tmp/phd-docker/bin/start_slave.sh ; done

Wow, within a second or two, you have three nodes (VMs, if you wish to call them that) running on your machine. Pivotal Command Center, Pivotal HD, and HAWQ are all deployed.

Isn’t that lightning fast?

3. SSH and Start the Pivotal Hadoop on Docker Cluster

Now we can log in (ssh) to the master container and start the Pivotal HD cluster.

ssh root@172.17.0.2 # Password: changeme

# Make sure all services are running. Wait a few moments if any of them are not running yet.

service commander status

# Login as PHD admin user

su - gpadmin

# Start the cluster

icm_client start -l test

4. Test HDFS and MapReduce for Hadoop on Docker

At this point, Pivotal Hadoop is running, and HAWQ is not started yet. Before we start HAWQ, we should test to see if Hadoop is running.

First, we go to the web-based user interface for the Hadoop status:

Go to http://172.17.0.2:50070/dfshealth.jsp – This shows the HDFS status. If you see 2 Live Nodes, it is a good sign! Your data nodes on the Docker containers are connected to your master container.

Go to http://172.17.0.2:8088/cluster – This shows the MapReduce status. You can check job status while you are trying mapreduce in the next section.

Now, we can run a simple word count test within MapReduce and check the web UI while the job is running.

# Simple ls command

hadoop fs -ls /

# Make an input directory

hadoop fs -mkdir /tmp/test_input

# Copy a text file

hadoop fs -copyFromLocal /usr/lib/gphd/hadoop/CHANGES.txt /tmp/test_input

# Run a wordcount MapReduce!

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /tmp/test_input /tmp/test_output

# Check the result

hadoop fs -cat /tmp/test_output/part*
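The wordcount output is one word and its count per line, tab-separated. If CHANGES.txt is large, it can be handier to look at just the most frequent words (a sketch, assuming the default wordcount output format):

```shell
# Sort the wordcount output numerically by count (second column, descending)
# and show only the ten most frequent words.
hadoop fs -cat /tmp/test_output/part* | sort -k2 -nr | head -n 10
```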

5. Start HAWQ—SQL on Hadoop

Now that we have confirmed Hadoop is running fine on Docker, we can start the HAWQ cluster. In this example, the HAWQ master runs on the slave1 machine, so remember to log into slave1.

ssh slave1 # which is hawq master

su - gpadmin

# Source the environment variables for HAWQ.

source /usr/lib/gphd/hawq/greenplum_path.sh

# SSH keys should be set among HAWQ cluster nodes.

echo -e "slave1\nslave2" > HAWQ_HOSTS.txt

gpssh-exkeys -f HAWQ_HOSTS.txt # gpadmin's password is gpadmin

# Initialize HAWQ cluster.

/etc/init.d/hawq init

6. Test HAWQ—SQL on Hadoop

Now that HAWQ is initialized and running, let’s test it.

# On slave1, log in as gpadmin and source the environment variables if you haven't already.

su - gpadmin

source /usr/lib/gphd/hawq/greenplum_path.sh

# Postgres shell

psql -p 5432

# Create a table

create table test1 (a int, b text);

# Insert a couple of records

insert into test1 values (1, 'text value1');

insert into test1 values (2, 'text value2');

# See it returns the rows.

select * from test1;

# Exit the shell and find how they are stored in HDFS.

hadoop fs -cat /hawq_data/gpseg*/*/*/* # This shows a raw hawq file on hdfs

Well done! Pivotal HD and HAWQ are running on your laptop within the Docker container.

Cleaning Up Your Mess

In my opinion, clean up is the beauty of container solutions like Docker. You can make any mess you like, and, then, you can just kill it—everything is gone. Of course, VMs are a good solution, but stop and start commands take much longer than with a container like Docker. Here is how to clean up your environment:

# docker ps lists all container IDs, and docker rm removes them

docker ps -a -q | xargs docker rm -f

After running this, none of the services, web UI links, or commands should work. Similarly, you can no longer ssh to the nodes we’ve created. Everything is clean.

Thank you for reading and watching. I hope you enjoyed it!

For more information: