Need help with your Big Data project or simply need data scientists, data engineers and visualizers to augment your existing team? Post your project in the Experfy Marketplace for on-demand help!

Need Big Data Analytics and Tech Training to Advance Your Career? Browse all courses developed by industry thought leaders and Experfy in Harvard Innovation Lab.

For all those looking to harness the potential of big data, Hadoop is the platform of choice. This open source software framework enables processing of huge data sets by distributing them across commodity servers. It thus eliminates the dependency on high-end hardware and makes the entire process economical for businesses to implement. Most big data enterprises today use Apache Hadoop in one way or another. To simplify working with Hadoop, enterprise distributions from Cloudera, MapR and Hortonworks have sprung up.

In its original version, Hadoop was designed as a simple write-once storage infrastructure, but it has evolved over the years well beyond its origins in web indexing. Based on Google's MapReduce model, Hadoop is designed to store and process large volumes and varieties of data that may reside on multiple servers.

While Hadoops distributed file system (HDFS) helps break down all incoming data and store them across multiple nodes, the MapReduce component facilitates the simultaneous processing of data across multiple nodes.

Hadoop is by no means an out-of-the-box solution. To build a truly information-driven enterprise, where decisions are based on data rather than guesswork, companies need a data management solution that not only offers robust data governance but is also easy to manage and integrates seamlessly with existing enterprise infrastructure.

The flexible, modular architecture of Hadoop allows new functionality to be added for diverse Big Data tasks. A number of vendors have taken advantage of Hadoop's open-ended framework and modified its code to change or enhance its functionality. In the process, they have been able to fix some of the inherent drawbacks of Apache Hadoop. As far as Hadoop distributions are concerned, the three companies that really stand out in the competition are Cloudera, MapR and Hortonworks.

Comparing top three Hadoop distributions: Cloudera vs Hortonworks vs MapR

Cloudera has been around the longest since the creation of Hadoop; Hortonworks came later. While Cloudera's and Hortonworks' core distributions are open source, most versions of MapR come with proprietary modules. Each vendor's distribution has its own strengths and weaknesses, and they share certain overlapping features as well. If you are looking to make the most of Hadoop's immense data processing power, it makes sense to compare the top three Hadoop distributions.

Cloudera

Cloudera Inc. was founded in 2008 by big data geniuses from Facebook, Google, Oracle and Yahoo. It was the first company to develop and distribute Apache Hadoop-based software and still has the largest user base. Although the core of the distribution is based on Apache Hadoop, it also provides the proprietary Cloudera Management Suite to automate installation and offer other conveniences, such as reducing deployment time and displaying real-time node counts.

Hortonworks

Hortonworks, founded in 2011, has quickly emerged as one of the leading vendors of Hadoop. The distribution provides an open source platform based on Apache Hadoop for analysing, storing and managing big data. Hortonworks is the only commercial vendor to distribute complete open source Apache Hadoop without additional proprietary software. Hortonworks' distribution, HDP 2.0, can be downloaded directly from its website free of cost and is easy to install. Hortonworks engineers are behind most of Hadoop's recent innovations, including YARN, which goes beyond MapReduce in that it enables the inclusion of more data processing frameworks.

MapR

In its standard, open source edition, Apache Hadoop comes with a number of limitations. Vendor distributions aim to overcome the issues that users typically encounter in the standard edition. Under the free Apache license, all three distributions provide users with updates to the core Hadoop software. But when it comes to picking one of them, one should look at the additional value it provides to customers in terms of improving the reliability of the system (detecting and fixing bugs, etc.), providing technical assistance and expanding functionality.

All three top Hadoop distributions, Cloudera, MapR and Hortonworks, offer consulting, training and technical assistance. But unlike its two rivals, Hortonworks' distribution is claimed to be 100 percent open source. Cloudera incorporates an array of proprietary elements in its Enterprise 4.0 version, adding layers of administrative and management capabilities to the core Hadoop software.

Going a step further, MapR replaces the HDFS component with its own proprietary file system, called MapRFS. MapRFS helps incorporate enterprise-grade features into Hadoop, enabling more efficient data management, greater reliability and, most importantly, ease of use. In other words, MapR is positioned as more production-ready than its two competitors.

Through a recent partnership with Canonical, the creator of the Ubuntu operating system, MapR is offering Hadoop as a default component of Ubuntu. Under the terms of the partnership, MapR's M3 Edition for Apache Hadoop will be integrated into the Ubuntu operating system.

Up to its M3 edition, MapR is free, but the free version lacks some of its proprietary features, namely JobTracker HA, NameNode HA, NFS-HA, Mirroring, Snapshot and a few more.

Cloudera and Hortonworks: The Similarities

Cloudera and Hortonworks are both built upon the same core of Apache Hadoop. As such, they have more similarities than differences.

Both offer enterprise-ready Hadoop distributions. The distributions have stood the test of time and of customer use, ensuring security and stability. In addition, both provide paid training and services to familiarize newcomers with Big Data and analytics.

Both have established communities that actively participate, helping with problems users face and providing demonstrations where needed.

Both distributions have a master-slave architecture.

Both have a shared-nothing computing framework.

Both support MapReduce as well as YARN.

Cloudera vs. Hortonworks: The Differences

That being said, it is the differences that play the deciding role in choosing one vendor over the other. Broadly, Cloudera and Hortonworks differ in the following aspects:

Cloudera has announced that its long-term goal is to become an enterprise data hub, thus diminishing the need for a data warehouse. Hortonworks, on the other hand, remains firmly a provider of a Hadoop distro, and has partnered with the data warehousing company Teradata.

While Cloudera CDH can be run on Windows Server, HDP is available as a native component on Windows Server. A Windows-based Hadoop cluster can be deployed on Windows Azure through the HDInsight service.

Cloudera has proprietary management software, Cloudera Manager; a SQL query interface, Impala; and Cloudera Search for easy, real-time access to data. Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries and Apache Solr for data search.

Cloudera has a commercial license, while Hortonworks has an open source license. Cloudera also allows the use of its open source projects free of cost, but the package doesn't include the management suite Cloudera Manager or any other proprietary software.

Cloudera has a free 60-day trial, whereas Hortonworks is completely free.

Cloudera is the oldest player in the market, with more than 350 customers. But Hortonworks is catching up fast and has made more innovations in the Hadoop ecosystem in the recent past. Cloudera overlays several pieces of enterprise software on its open source distribution to aid customers, whereas Hortonworks strives to provide a framework consisting only of open source projects.

Table Source: Robert D. Schneider, Hadoop Buyer's Guide, Ubuntu, 2014

Cloudera vs. Hortonworks vs MapR: An Analyst Perspective at TUGG Boston

Wikibons Kelly on the Hadoop Horserace Between Cloudera, Hortonworks and MapR
