In this blog post I will walk you through three ways of setting up Hadoop. Each method has its pros and cons, and in some cases a fairly narrow use case.

Installing Hadoop via Debian Packages

If you are building MapReduce jobs packaged as JAR files and want to run them locally, then a single-node setup on your local system can be a quick way of getting everything going.
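
To give a sense of where this ends up: once a local instance is running, submitting one of those JARs is a single command. The JAR name, driver class and HDFS paths below are placeholders for whatever your own job uses, not files created in this guide:

$ hadoop jar my-job.jar com.example.MyDriver /user/me/input /user/me/output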

Ubuntu does not ship a Hadoop package that can be installed via apt out of the box, but there is a PPA with various Hadoop tools and boilerplate configurations bundled together. As of this writing only Ubuntu 11, 12 and 14 are supported. The installation below was run on Ubuntu 14.04 LTS, which will be supported until April 2019; by contrast, Ubuntu 15.10's support will end in July 2016.

To start, add the repository for hadoop-ubuntu:

$ sudo add-apt-repository ppa:hadoop-ubuntu/stable

Hadoop Stable packages
 These packages are based on Apache Bigtop with appropriate patches to enable native integration on Ubuntu Oneiric onwards and for ARM based archictectures.

Please report bugs here - https://bugs.launchpad.net/hadoop-ubuntu-packages/+filebug
 More info: https://launchpad.net/~hadoop-ubuntu/+archive/ubuntu/stable
Press [ENTER] to continue or ctrl-c to cancel adding it

gpg: keyring `/tmp/tmpe8tsd3jf/secring.gpg' created
gpg: keyring `/tmp/tmpe8tsd3jf/pubring.gpg' created
gpg: requesting key 84FBAFF0 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpe8tsd3jf/trustdb.gpg: trustdb created
gpg: key 84FBAFF0: public key "Launchpad PPA for Hadoop Ubuntu Packagers" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK

Retrieve the new package lists associated with hadoop-ubuntu and then install the OpenJDK 7 development kit and the Hadoop package:

$ sudo apt update
$ sudo apt install \
    openjdk-7-jdk \
    hadoop

Adjusting the Boilerplate Configuration

There are four configuration files that will need to be adjusted before you can format your name node. Three of these are XML files and the remaining file is a shell script with Hadoop's environment variables. These will be used when you run any Hadoop jobs.

Core Site Configuration:

$ sudo vi /etc/hadoop/conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system. A URI whose scheme
        and authority determine the FileSystem implementation. The uri's
        scheme determines the config property (fs.SCHEME.impl) naming the
        FileSystem implementation class. The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Map Reduce Site Configuration:

$ sudo vi /etc/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If "local", then jobs are run in-process as a single map and
        reduce task.</description>
    </property>
</configuration>

HDFS Site Configuration:

$ sudo vi /etc/hadoop/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication. The actual number of
        replications can be specified when the file is created. The default
        is used if replication is not specified at create time.</description>
    </property>
</configuration>

Hadoop Environment Configuration:

$ sudo vi /etc/hadoop/conf/hadoop-env.sh

export JAVA_HOME=/usr
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
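
Before formatting anything, it's worth a quick sanity check that the three XML files are still well-formed after editing. One lightweight way to do this, assuming you don't mind pulling in the libxml2-utils package for the xmllint tool, is:

$ sudo apt install libxml2-utils
$ xmllint --noout \
    /etc/hadoop/conf/core-site.xml \
    /etc/hadoop/conf/mapred-site.xml \
    /etc/hadoop/conf/hdfs-site.xml

If the files parse cleanly xmllint prints nothing; any output points at a malformed tag.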

Format the Hadoop File System

The following will create the storage system Hadoop will use to read and write files. If this is done on more than one machine the storage system can be referred to as distributed.

$ sudo hadoop namenode -format

... namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
************************************************************/
... util.GSet: VM type       = 64-bit
... util.GSet: 2% max memory = 19.33375 MB
... util.GSet: capacity      = 2^21 = 2097152 entries
... util.GSet: recommended=2097152, actual=2097152
... namenode.FSNamesystem: fsOwner=root
... namenode.FSNamesystem: supergroup=supergroup
... namenode.FSNamesystem: isPermissionEnabled=true
... namenode.FSNamesystem: dfs.block.invalidate.limit=100
... namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
... namenode.NameNode: Caching file names occuring more than 10 times
... common.Storage: Image file of size 110 saved in 0 seconds.
... common.Storage: Storage directory /tmp/dfs/name has been successfully formatted.
... namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

The key piece of information you're looking for is:

Storage directory /tmp/dfs/name has been successfully formatted.

If you don't see this then something has gone wrong.
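
If you want to double-check beyond the log message, the storage directory itself should now exist and contain the freshly written name node image. Assuming the /tmp base directory configured in core-site.xml above:

$ sudo ls -R /tmp/dfs/name

The listing should not be empty; an empty or missing directory means the format didn't take.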

Setting up Hadoop's SSH access

Even though this is a single-node instance, Hadoop expects to SSH into each machine, whether it's logical or physical, to conduct its operations. When doing so it'll need full permissions and expects to connect as the root user. The following confirms that sshd allows root to shell into the machine without a password.

$ grep PermitRootLogin /etc/ssh/sshd_config
PermitRootLogin without-password

Then, assuming root doesn't yet have a public/private key pair, one needs to be generated. The public key can then be added to the authorized keys list so that root can SSH into the server using its own account.

$ sudo su
root@ubuntu:~# ssh-keygen
root@ubuntu:~# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
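
Before moving on, a minimal check that the key-based login actually works is to SSH into localhost as root and run a trivial command. StrictHostKeyChecking=no simply accepts localhost's host key on this first connection; if you get prompted for a password here the key setup isn't right:

root@ubuntu:~# ssh -o StrictHostKeyChecking=no localhost 'echo ssh access ok'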

Launching Hadoop's Processes

With the above all in place you can run a shell script that will launch the various services needed for a functioning Hadoop instance.

$ sudo /usr/lib/hadoop/bin/start-all.sh

starting namenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-tasktracker-ubuntu.out

If you run the JVM process status tool you should see the various nodes and trackers up and running:

$ sudo jps

19892 TaskTracker
19724 JobTracker
19295 NameNode
19456 DataNode
19627 SecondaryNameNode
20035 Jps
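
As a final smoke test you can push a small file into HDFS and run the word count program from the examples JAR that ships with Hadoop. The exact JAR path is an assumption on my part; the find command will show where your package actually put it, so adjust the third command accordingly:

$ find /usr/lib/hadoop -name 'hadoop-examples*.jar'
$ sudo hadoop fs -mkdir /wordcount-input
$ sudo hadoop fs -put /etc/hadoop/conf/core-site.xml /wordcount-input/
$ sudo hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.3.jar wordcount /wordcount-input /wordcount-output
$ sudo hadoop fs -cat /wordcount-output/part-*

If the job completes and the final command prints word counts, the name node, data node, job tracker and task tracker are all doing their jobs.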