A Hadoop cluster is a special type of computational cluster designed for storing and analysing huge amounts of unstructured data in a distributed computing environment.

Such clusters run Hadoop’s open source distributed processing software on low-cost commodity computers.

Hadoop Cluster Architecture:

A Hadoop cluster has 3 components:

a. Client

b. Master

c. Slave

Client:

A client is neither a master nor a slave. Its job is to submit MapReduce jobs that describe how the data should be processed, and then to retrieve the results once the job has completed.
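To make the idea of a "job" concrete, here is a minimal in-memory sketch in Python of what a MapReduce job describes: a map step, a reduce step, and the input to run them on. It does not use Hadoop's API. The names run_job, map_words and reduce_counts are made up for illustration, and everything runs in a single process instead of on a cluster.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines, mapper, reducer):
    """Toy stand-in for the cluster: map, group values by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

# The "client" only describes the work and collects the result.
result = run_job(["big data", "big cluster"], map_words, reduce_counts)
print(result)
```

On a real cluster the client would hand the same two functions to the masters, and the grouping and execution would be spread across the slave nodes.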

Masters:

The master consists of 3 components, namely, NameNode, JobTracker, and Secondary NameNode.

a. NameNode: The NameNode does not store the actual files; it stores the metadata of the files. It also oversees the health of the DataNodes and coordinates access to the data.

b. JobTracker: JobTracker coordinates the parallel processing of data using MapReduce. To know more about JobTracker, please read the article All You Want to Know about MapReduce (The Heart of Hadoop).

c. Secondary NameNode: The job of the Secondary NameNode is to contact the NameNode periodically, fetch a copy of the filesystem metadata, save it as a clean checkpoint, and send it back to the NameNode. Essentially, the Secondary NameNode does the housekeeping. If the NameNode fails, the metadata it held in RAM can be rebuilt from the copy saved by the Secondary NameNode.
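The split between metadata and data, and the housekeeping done by the Secondary NameNode, can be sketched in a few lines of Python. In Hadoop, the checkpoint is produced by merging the NameNode's log of recent changes into its saved filesystem image. The sketch below models that with plain dictionaries. The paths, block IDs, DataNode names and the functions locate and checkpoint are all hypothetical, and the real on-disk formats are far richer than this.

```python
# Hypothetical model of NameNode metadata: file paths mapped to block IDs,
# and block IDs mapped to the DataNodes holding them. No file contents here.
image = {"/logs/day1.txt": ["blk_1", "blk_2"]}
block_locations = {
    "blk_1": ["datanode-a", "datanode-b"],
    "blk_2": ["datanode-b", "datanode-c"],
    "blk_3": ["datanode-a", "datanode-c"],
}

def locate(metadata, path):
    """Return (block, DataNodes) pairs a client would need to read a file."""
    return [(block, block_locations[block]) for block in metadata[path]]

def checkpoint(image, edit_log):
    """Apply the logged changes to a copy of the saved image."""
    merged = dict(image)  # leave the old image untouched
    for op, path, blocks in edit_log:
        if op == "add":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged

# Changes recorded since the last checkpoint.
edit_log = [
    ("add", "/logs/day2.txt", ["blk_3"]),
    ("delete", "/logs/day1.txt", None),
]

new_image = checkpoint(image, edit_log)
print(new_image)
print(locate(new_image, "/logs/day2.txt"))
```

Keeping the merged image small and current is what makes a restart of the NameNode quick: it loads one compact snapshot instead of replaying a long history of changes.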

Slaves:

Slave nodes make up the majority of the machines in a Hadoop cluster and are responsible for storing the data and performing the computation.

Why use Hadoop Clusters:

Hadoop clusters are known for boosting the speed and scalability of data analysis applications. If a cluster’s processing power comes under strain from growing volumes of data, throughput can be increased by adding more nodes to the cluster. Hadoop clusters are also highly resistant to failure, because each block of data is copied onto other nodes, ensuring that the data is not lost if a single node fails.
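The fault tolerance argument is easy to check with a small simulation. The Python sketch below places each block on several distinct nodes and then asks which blocks still have a copy after one node fails. The functions place_replicas and surviving_blocks are hypothetical, and the round-robin placement is a simplifying assumption, not the placement policy Hadoop itself uses.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one copy after one node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }

nodes = ["node-1", "node-2", "node-3", "node-4"]
blocks = ["blk_1", "blk_2", "blk_3"]

placement = place_replicas(blocks, nodes, replication=2)
print(placement)
print(surviving_blocks(placement, "node-2"))
```

With two or more copies of every block, losing any single node leaves every block readable. With a single copy, the blocks on the failed node would be lost.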