Here I (1) give a very brief overview of Apache Hadoop, (2) show how to set up a simple single-node cluster, and (3) list some of the items that a system administrator should study in order to work as a Hadoop systems administrator.

The Hadoop Distributed File System

Hadoop is an open source framework whose file system, HDFS, enables you to store files across multiple machines. The idea is that it is faster and more feasible to process lots of data in parallel than it is to load all of it onto one machine and process it there. Much Hadoop data is too large to put on one machine anyway, which is one reason they call it ‘big data’.

Hadoop can store any kind of data, which is why Hadoop data is often called ‘unstructured’. This is somewhat misleading, since there are various tools that apply a schema to logs or other text files in Hadoop, giving them enough structure to be queried from different programming languages. Hadoop can also store images, video, and audio.

A Hadoop file can span multiple machines (physical or virtual), which is why the file system is called the ‘Hadoop Distributed File System’. Hadoop files are stored in blocks on those machines, using their attached storage as opposed to some kind of storage array. The idea is to use low-cost hardware, which is of course one of the main principles behind cloud computing.

Hadoop has replication built into it. By default it keeps three copies of each block of a file. That is why you can use it to store data instead of buying more expensive storage arrays that rely on RAID schemes to protect data. Plus each storage node is a computer, so it can participate in querying and writing the data.

Hadoop works in a master-slave mode. The Hadoop namenode keeps track of what data is located on which slave (data) node. The namenode's data should be kept on a server with reliable storage, since by default there is only one copy of it, although a second copy can be kept up to date.
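To make the block, replica, and namenode ideas concrete, here is a minimal Python sketch. It is a toy model, not how HDFS is implemented: the tiny block size, the datanode names, and the round-robin placement are all assumptions for illustration (real HDFS placement also takes racks into account).

```python
BLOCK_SIZE = 64   # bytes, an assumption for the demo; real HDFS blocks are tens of megabytes
REPLICATION = 3   # the default number of copies described above


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Build a namenode-style map: block index -> datanodes holding a copy.

    Uses simple round-robin placement (hypothetical policy).
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    block_map = {}
    for block in range(num_blocks):
        # Start on the next node for each block and take consecutive nodes.
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


def blocks_lost_if_node_fails(block_map, failed_node):
    """Return the blocks that would have no surviving copy if one node failed."""
    return [
        block for block, nodes in block_map.items()
        if all(node == failed_node for node in nodes)
    ]


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 200)
    nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]  # hypothetical names
    block_map = place_blocks(len(blocks), nodes)
    for block, holders in block_map.items():
        print(block, holders)
    print("lost if datanode1 fails:", blocks_lost_if_node_fails(block_map, "datanode1"))
```

Because every block lives on three different nodes, losing any single node in this model leaves every block readable, which is the point of replicating on cheap hardware.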

Further definition of Hadoop and object storage

Hadoop is not a database, although there are some open source Hadoop projects that add update, read, and random access to Hadoop, thus giving it database-like capabilities. Hadoop does not give record-level access to lines or rows in a file, as a database does. Instead it is a storage system designed to store individual files. You can think of Hadoop as nothing more than the shared file folders in a LAN where you can store any type of file, except it spans multiple machines.

The alternative to Hadoop storage for big data is object storage, like Amazon S3 or Cleversafe.

With object storage, each file has a unique object ID and no file path. Streaming video is one example of data that is stored in this format (Netflix uses Amazon S3).

The other difference is metadata. The Hadoop file metadata is fixed. All that the metadata can tell you about a Hadoop file is things like:

date created;

file name;

file size;

file suffix (implies type of file, like .PDF);

file path (you can use this to further describe the file, e.g. "/customers/important_customers/cash_balance.dat"); and

date last modified.

In other words, you cannot add your own metadata to a Hadoop file. The most you can do is use the file suffix, file path, and file name to describe what the file is, and that is limited.

With an object file system you have these attributes, as well as any of your own metadata that you choose to add, such as:

codec encoding scheme;

encryption type;

processing instructions;

date to archive; and

anything else.

You write data to an object file system using web services, most commonly REST web services built on the HTTP methods GET, PUT, and DELETE, which is the same protocol used for web pages. Notice there is no UPDATE: an object is replaced as a whole rather than modified in place.
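The object storage model described above can be sketched in a few lines of Python. This is an in-memory toy, not the API of Amazon S3 or any other product: the class name, method names, and metadata keys are all hypothetical. It only illustrates the flat namespace of object IDs, the free-form metadata, and the put/get/delete operations with no in-place update.

```python
import uuid


class ToyObjectStore:
    """In-memory stand-in for an object store: object IDs only, no file paths."""

    def __init__(self):
        self._objects = {}

    def put(self, data, metadata=None, object_id=None):
        """Store an object whole. Reusing an object_id replaces the entire object."""
        object_id = object_id or uuid.uuid4().hex
        self._objects[object_id] = {"data": data, "metadata": dict(metadata or {})}
        return object_id

    def get(self, object_id):
        """Return (data, metadata) for an object ID."""
        obj = self._objects[object_id]
        return obj["data"], obj["metadata"]

    def delete(self, object_id):
        """Remove an object entirely."""
        del self._objects[object_id]


if __name__ == "__main__":
    store = ToyObjectStore()
    # The metadata keys below are made-up examples of user-defined attributes.
    oid = store.put(
        b"video bytes",
        {"codec": "h264", "encryption": "aes-256", "archive-after": "2030-01-01"},
    )
    data, metadata = store.get(oid)
    print(oid, metadata)
    store.delete(oid)
```

Changing an object means calling put again with the same ID and the full new contents, which mirrors the absence of an UPDATE operation.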

Batch system

Hadoop is a batch system, although HBase is a database that provides real-time access to Hadoop data.

You query Hadoop data files by using the MapReduce programming technique. Writing MapReduce jobs requires knowledge of Java, so there are other tools that the administrator and other persons can use. These tools, like Hive and Pig, are front-ends that create MapReduce jobs that crawl across the nodes to retrieve data.

MapReduce creates a new set of data from existing data. For instance, if you have a file with all employees and a file with all members of the football team, a MapReduce job could create a new file with all the members of the football team who live in London. So you can say that MapReduce processes queries. The ‘map’ part reads the data and emits it as {<key>, <value>} pairs, much like the entries of a hash map, and the ‘reduce’ part combines all the values that share a key into the new set. The reduce step is also where duplicate records can be eliminated.
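The football team example can be sketched in plain Python to show the shape of a MapReduce job. This runs on one machine and does not use Hadoop at all; the sample records and function names are hypothetical, and the grouping step stands in for the shuffle that Hadoop performs between map and reduce.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the two input files.
EMPLOYEES = [("Alice", "London"), ("Bob", "Leeds"), ("Carol", "London"), ("Dave", "London")]
FOOTBALL_TEAM = ["Alice", "Bob", "Carol", "Carol"]  # Carol appears twice on purpose


def map_employee(record):
    """Emit (name, tagged city) for one employee record."""
    name, city = record
    yield name, ("city", city)


def map_team_member(name):
    """Emit (name, team marker) for one football team record."""
    yield name, ("team", True)


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_londoners(name, values):
    """Emit the name once if the person is on the team and lives in London."""
    cities = {value for tag, value in values if tag == "city"}
    on_team = any(tag == "team" for tag, _ in values)
    if on_team and "London" in cities:
        yield name


def run_job(employees, team):
    pairs = []
    for record in employees:
        pairs.extend(map_employee(record))
    for name in team:
        pairs.extend(map_team_member(name))
    result = []
    for name, values in sorted(shuffle(pairs).items()):
        result.extend(reduce_londoners(name, values))
    return result


if __name__ == "__main__":
    print(run_job(EMPLOYEES, FOOTBALL_TEAM))
```

Because the reducer sees every value for a name at once, the duplicate team entry for Carol collapses into a single output record.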

Hadoop is still so new that you should expect new versions to differ from previous versions. For example, MapReduce version 2 is built on a new layer called YARN. The name changed because the design changed completely.

Why is Hadoop popular?

The thing that makes Hadoop so useful is that you can store data there that has no structure (big data). In other words you do not have to format it to load it into a row-and-column format as you would with an Oracle Database. Instead you can create your own schema to create some structure around the data using tools like Apache Hive.
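The idea of wrapping a schema around data that was stored without one can be illustrated with a short Python sketch. This is not Hive and does not use Hive syntax; the log format, column names, and function name are assumptions made up for the example. The point is only that the raw lines stay untouched and the schema is applied when the data is read.

```python
from datetime import datetime

# Hypothetical raw log lines, stored exactly as they arrived.
RAW_LINES = [
    "2013-06-01 12:00:05,alice,login,200",
    "2013-06-01 12:00:09,bob,download,404",
]

# The schema lives outside the data: a column name plus a converter per field.
SCHEMA = [
    ("timestamp", lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")),
    ("user", str),
    ("action", str),
    ("status", int),
]


def apply_schema(line, schema=SCHEMA, delimiter=","):
    """Turn one raw text line into a typed record using the given schema."""
    fields = line.split(delimiter)
    if len(fields) != len(schema):
        raise ValueError("line does not match schema: %r" % line)
    return {name: convert(value) for (name, convert), value in zip(schema, fields)}


if __name__ == "__main__":
    for line in RAW_LINES:
        print(apply_schema(line))
```

A different schema could be applied to the same lines later without reloading or reformatting the stored data.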

Install and start Hadoop

Here is a very simple installation of Hadoop version 1.2.0.

In this example I create a single node HDFS (Hadoop Distributed File System). I will set it up to run as root.

First, download Hadoop from the Apache Hadoop website and unpack it to any folder. In my case I put it here: /root/hadoop-1.2.0.

The default is a single node cluster as you can see when you look at the masters and slaves files:

root@ubuntu12040-001:~/hadoop-1.2.0/conf# cat masters
localhost
root@ubuntu12040-001:~/hadoop-1.2.0/conf# cat slaves
localhost

Hadoop requires ssh to work. You need to create an RSA key on the machine where you install Hadoop. Alternatively you can use your existing ssh key.

If you try to start Hadoop without an RSA key you get this message:

root@localhost's password:
localhost: Permission denied, please try again.
root@ubuntu12040-001:~/hadoop-1.2.0/bin# localhost: Permission denied (publickey, password).

Here I generate an RSA key for SSH. This one has no passphrase, which is a security violation, but this is OK for a lab environment. (It saves me from typing the passphrase when I want to execute Hadoop functions.)

ssh-keygen -t rsa -P ""

Then append the public key to the end of the authorized_keys file:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now make sure you can ssh to localhost without being prompted for a password:

ssh localhost

You need a Java Development Kit (JDK). Enter the location of that in:

conf/hadoop-env.sh

Where it says:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

You then need to format the Hadoop file system. (This command only works if you enter capital ‘Y’ when prompted.)

bin/hadoop namenode -format

Then you start it up:

bin/start-all.sh

Fix any error message you see. (I did not see any.)

Hadoop config files

The main configuration file is conf/core-site.xml. The core-site.xml from this simple installation is shown below.

Note that the HDFS has an address of walker:8020 (walker is the name of my machine), which is used to refer to the HDFS, using a similar syntax to the one you would use to access a website.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://walker:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
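As a quick way to check what a core-site.xml actually says, the following Python sketch reads the properties and pulls the host and port out of the HDFS address. It uses only the Python standard library, not any Hadoop tool, and the function names are my own. The XML is embedded as a string here so the example is self-contained; in practice you would read it from conf/core-site.xml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Same content as the core-site.xml shown above.
CORE_SITE_XML = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://walker:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return the <property> entries as a {name: value} dictionary."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


def namenode_address(properties):
    """Split the fs.default.name URI into (scheme, host, port)."""
    uri = urlparse(properties["fs.default.name"])
    return uri.scheme, uri.hostname, uri.port


if __name__ == "__main__":
    props = read_properties(CORE_SITE_XML)
    print(props)
    print(namenode_address(props))
```

The address splits the same way a website URL does, which is the similarity to web syntax noted above.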

The system administrator’s role for Hadoop

Here are the main tasks you would need to study to become a Hadoop administrator:

Configure the job trackers and the task trackers.

Install and use HBase.

Configure the Fair Scheduler to control who is running what job when.

Manage the cluster.

Plan Hadoop cluster hardware.

Learn something about Pig, Hive, and Impala. The Hadoop administrator would probably be expected to know these basic Hadoop query languages.

Configure HDFS for high availability.

Load data into HDFS.

Hadoop Security (Apache Use).

So that is a basic overview of Hadoop, how to set it up, and some of the tasks that an administrator would likely be required to perform.