1. Objective

This article compares Hadoop 2.x with Hadoop 3.x. It covers the new features added in Hadoop 3, whether Hadoop 2 programs are compatible with Hadoop 3, and the main differences between the two versions. The feature-wise comparison below should help answer these questions.

2. Feature-wise Comparison Between Hadoop 2.x and Hadoop 3.x

This section covers 22 differences between Hadoop 2.x and Hadoop 3.x. Let us discuss each feature one by one.

i. License

Hadoop 2.x – Apache 2.0, Open Source

Hadoop 3.x – Apache 2.0, Open Source

ii. Minimum supported version of Java

Hadoop 2.x – The minimum supported version of Java is Java 7.

Hadoop 3.x – The minimum supported version of Java is Java 8.

iii. Fault Tolerance

Hadoop 2.x – Fault tolerance is handled by replication, which wastes storage space.

Hadoop 3.x – Fault tolerance can be handled by erasure coding, which needs less storage space than replication.
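
To illustrate the idea behind erasure coding, here is a minimal sketch that uses a single XOR parity block. This is a deliberate simplification: HDFS erasure coding is usually described with Reed-Solomon policies (such as 6 data blocks plus 3 parity blocks), and the function names below are hypothetical, not part of any Hadoop API.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


# Three illustrative data blocks of equal size (assumed values).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data_blocks)

# Simulate losing the second block, then rebuild it from the rest.
rebuilt = recover_missing([data_blocks[0], data_blocks[2]], parity)
print(rebuilt == data_blocks[1])
```

With replication, tolerating one lost block would require a full extra copy of every block. Here one parity block protects three data blocks, which is the space saving that erasure coding aims for.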

iv. Data Balancing

Hadoop 2.x – Uses the HDFS balancer for data balancing.

Hadoop 3.x – Adds an intra-DataNode balancer, which balances data across the disks within a DataNode and is invoked via the HDFS disk balancer CLI.

v. Storage Scheme

Hadoop 2.x – Uses a 3x replication scheme.

Hadoop 3.x – Supports erasure coding in HDFS.

vi. Storage Overhead

Hadoop 2.x – HDFS has 200% storage overhead with 3x replication.

Hadoop 3.x – Storage overhead is only 50% with erasure coding.

vii. Storage Overhead Example

Hadoop 2.x – If there are 6 blocks, they occupy the space of 18 blocks because of the replication scheme.

Hadoop 3.x – If there are 6 blocks, they occupy the space of 9 blocks: 6 data blocks and 3 parity blocks.
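
The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the erasure coding layout is assumed to be 6 data units plus 3 parity units per stripe, matching the example above.

```python
def replication_footprint(data_blocks, replication_factor=3):
    """Return (total blocks stored, overhead in percent) under replication."""
    total = data_blocks * replication_factor
    overhead_pct = (total - data_blocks) / data_blocks * 100
    return total, overhead_pct


def erasure_coding_footprint(data_blocks, data_units=6, parity_units=3):
    """Return (total blocks stored, overhead in percent) under erasure coding.

    Assumes every stripe, including a partially filled one, carries the
    full set of parity units.
    """
    stripes = -(-data_blocks // data_units)  # ceiling division
    parity_blocks = stripes * parity_units
    total = data_blocks + parity_blocks
    overhead_pct = parity_blocks / data_blocks * 100
    return total, overhead_pct


print(replication_footprint(6))     # (18, 200.0)
print(erasure_coding_footprint(6))  # (9, 50.0)
```

For the same 6 blocks of data, replication stores 12 extra blocks while the assumed 6+3 erasure coding layout stores only 3.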

viii. YARN Timeline Service

Hadoop 2.x – Uses the old timeline service, which has scalability issues.

Hadoop 3.x – Introduces timeline service v2, which improves the scalability and reliability of the timeline service.

ix. Default Ports Range

Hadoop 2.x – Some default ports fall within the Linux ephemeral port range, so a service can fail to bind at startup if another application has already taken the port.

Hadoop 3.x – These ports have been moved out of the ephemeral range.
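
As a rough illustration, the sketch below checks a few default ports against the ephemeral range. It assumes the common Linux default range of 32768 to 60999 (configurable through `/proc/sys/net/ipv4/ip_local_port_range`), and the port numbers are the commonly documented defaults as recalled here; verify them against the `hdfs-default.xml` of your release before relying on them.

```python
# Assumed Linux default ephemeral range: 32768..60999 inclusive.
EPHEMERAL_RANGE = range(32768, 61000)

# service: (Hadoop 2.x default port, Hadoop 3.x default port)
# Values are commonly documented defaults; treat them as assumptions.
DEFAULT_PORTS = {
    "NameNode HTTP UI": (50070, 9870),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP UI": (50075, 9864),
}


def in_ephemeral_range(port):
    """Return True if the port lies inside the assumed ephemeral range."""
    return port in EPHEMERAL_RANGE


for service, (old_port, new_port) in DEFAULT_PORTS.items():
    print(
        f"{service}: {old_port} (ephemeral: {in_ephemeral_range(old_port)}) "
        f"-> {new_port} (ephemeral: {in_ephemeral_range(new_port)})"
    )
```

A port inside the ephemeral range may be handed out by the operating system to any outgoing connection, which is why a daemon that expects that port can fail to bind at startup.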

x. Tools

Hadoop 2.x – Works with Hive, Pig, Tez, Hama, Giraph and other Hadoop tools.

Hadoop 3.x – Hive, Pig, Tez, Hama, Giraph and other Hadoop tools are also available.

xi. Compatible File System

Hadoop 2.x – HDFS (the default file system); the FTP file system, which stores all its data on remotely accessible FTP servers; the Amazon S3 (Simple Storage Service) file system; and the Windows Azure Storage Blobs (WASB) file system.

Hadoop 3.x – Supports all of the above as well as the Microsoft Azure Data Lake file system.

xii. Datanode Resources

Hadoop 2.x – DataNode resources are not dedicated to MapReduce; they can be used for other applications.

Hadoop 3.x – DataNode resources can likewise be used for other applications.

xiii. MR API Compatibility

Hadoop 2.x – The MR API is compatible with Hadoop 1.x, so Hadoop 1.x programs can execute on Hadoop 2.x.

Hadoop 3.x – The MR API is likewise compatible, so Hadoop 1.x programs can execute on Hadoop 3.x.

xiv. Support for Microsoft Windows

Hadoop 2.x – Can be deployed on Windows.

Hadoop 3.x – Also supports Microsoft Windows.

xv. Slots/Container

Hadoop 2.x – Hadoop 1 works on the concept of slots, but Hadoop 2.x works on the concept of containers. A container can run generic tasks.

Hadoop 3.x – Also works on the concept of containers.

xvi. Single Point of Failure

Hadoop 2.x – Has features to overcome the SPOF, so whenever the NameNode fails it recovers automatically.

Hadoop 3.x – Has features to overcome the SPOF, so whenever the NameNode fails it recovers automatically, with no manual intervention needed.

xvii. HDFS Federation

Hadoop 2.x – Hadoop 1.0 has only a single NameNode to manage the entire namespace, but Hadoop 2.0 supports multiple NameNodes for multiple namespaces.

Hadoop 3.x – Also supports multiple NameNodes for multiple namespaces.

xviii. Scalability

Hadoop 2.x – Can scale up to 10,000 nodes per cluster.

Hadoop 3.x – Better scalability; can scale to more than 10,000 nodes per cluster.

xix. Faster Access to Data

Hadoop 2.x – DataNode caching provides fast access to data.

Hadoop 3.x – DataNode caching likewise provides fast access to data.

xx. HDFS Snapshot

Hadoop 2.x – Adds support for snapshots, which provide disaster recovery and protection against user error.

Hadoop 3.x – Hadoop 3 also supports the snapshot feature.

xxi. Platform

Hadoop 2.x – Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming, and real-time operations.

Hadoop 3.x – It is likewise possible to run event processing, streaming, and real-time operations on top of YARN.

xxii. Cluster Resource Management

Hadoop 2.x – Uses YARN for cluster resource management. It improves scalability, high availability, and multi-tenancy.

Hadoop 3.x – Uses YARN for cluster resource management, with all of the same features.