What Is Hadoop?

Hadoop is an open-source software framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines.

Hadoop grew out of an open-source search engine called Nutch, developed by Doug Cutting and Mike Cafarella. Back in the early days of the Internet, the pair wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be executed at the same time.

The distributed computing and processing portion of Nutch was eventually split off and named Hadoop (after Cutting’s son’s toy elephant). Yahoo released Hadoop as an open-source project in 2008, and today the Hadoop ecosystem is managed and maintained by the non-profit Apache Software Foundation (ASF), an international community of software developers and contributors.

Four core modules are included in the ASF’s basic framework:

Hadoop Common consists of the common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput access to application data.

Hadoop YARN is a framework for job scheduling and cluster resource management.

Hadoop MapReduce is a YARN-based system for parallel processing of large data sets.
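The MapReduce model itself is simple enough to sketch outside Hadoop. Below is a minimal, single-process Python illustration of the map, shuffle, and reduce phases for a word count; it mimics the programming model only and uses none of Hadoop's actual APIs.

```python
from collections import defaultdict

# Map phase: each "mapper" turns one line of input into (word, 1) pairs.
def map_phase(line):
    return [(word.lower(), 1) for word in line.split()]

# Shuffle phase: group all values by key, as Hadoop does between map and reduce.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each "reducer" combines the values for one key.
def reduce_phase(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # prints 3
```

On a real cluster, each phase runs in parallel across many nodes, with YARN scheduling the work and HDFS supplying the input splits.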

Other elements of the Hadoop ecosystem of technologies include the following:

Pig is a high-level data-flow language and execution framework for parallel computation. It allows users to perform data extractions, transformations, and basic analysis without having to write MapReduce programs.

Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying. It was initially developed by Facebook.

HBase is a scalable, distributed database that supports structured data storage for large tables.

Ambari is a web interface for provisioning, managing, and monitoring Hadoop services and components.

Cassandra is a scalable multi-master database system.

Chukwa is a data collection system for monitoring large distributed systems.

Oozie is a Hadoop job scheduler.

Sqoop is a connection and transfer mechanism that moves data between Hadoop and relational databases.

Spark is an open-source cluster computing framework with in-memory analytics.

ZooKeeper is a high-performance coordination service for distributed applications.

Why Use Hadoop?

Hadoop has a lot to offer. SAS Institute identifies the following five benefits:

Computing power: Hadoop’s distributed computing model allows it to process huge amounts of data. The more nodes you use, the more processing power you have.

Flexibility: Hadoop stores data without requiring any preprocessing. Store data—even unstructured data such as text, images, and video—now; decide what to do with it later.

Fault tolerance: Hadoop automatically stores multiple copies of all data, and if one node fails during data processing, jobs are redirected to other nodes and distributed computing continues.

Low cost: The open-source framework is free, and data is stored on commodity hardware.

Scalability: You can easily grow your Hadoop system simply by adding more nodes.
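The fault-tolerance point rests on replication: HDFS stores each block on several nodes (three by default), so any surviving replica can serve a read after a failure. A hypothetical Python sketch of the idea (not Hadoop's actual implementation):

```python
import random

# Hypothetical sketch: each block is written to REPLICATION distinct nodes,
# so losing one node does not lose any block.
REPLICATION = 3
nodes = {f"node{i}": set() for i in range(5)}

def store_block(block_id):
    # Pick 3 distinct nodes to hold a replica of this block.
    for node in random.sample(sorted(nodes), REPLICATION):
        nodes[node].add(block_id)

def read_block(block_id, failed):
    # Any surviving replica can serve the read.
    for name, blocks in nodes.items():
        if name not in failed and block_id in blocks:
            return name
    raise IOError("block lost")

for b in range(10):
    store_block(b)

# Even if one node holding block 0 fails, the block is still readable.
nodes_with_0 = [n for n in nodes if 0 in nodes[n]]
print(read_block(0, failed={nodes_with_0[0]}))
```

With three replicas and one failed node, at least two copies of every block remain, which is why a single node failure merely redirects work rather than losing data.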

Although the development of Hadoop was motivated by the need to search millions of webpages and return relevant results, it today serves a variety of purposes. Hadoop’s low-cost storage makes it an appealing option for storing information that is not currently critical but that might be analyzed later. Hadoop storage is unencumbered by the schema-related constraints commonly found in SQL-based systems. Organizations are using Hadoop to stage large amounts of raw, sometimes unstructured data for loading into enterprise data warehouses. Many of Hadoop’s largest adopters use it for the real-time data analysis that enables web-based recommendation systems.
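This freedom from schema-related constraints is often described as schema-on-read: instead of forcing data into a schema before it is written, as an SQL database does, structure is applied only when the data is read. A small Python sketch of the idea (hypothetical data and field names, purely illustrative):

```python
import json

# Raw records are stored as-is; no schema is enforced at write time.
raw_log = [
    '{"user": "amy", "page": "/home", "ms": 120}',
    '{"user": "bo", "page": "/search"}',  # missing a field: stored anyway
    'not even json',                      # malformed: stored anyway
]

# Structure is imposed only at read time, and only for the fields this
# particular analysis needs; unusable records are skipped, not rejected upfront.
def read_with_schema(lines):
    for line in lines:
        try:
            record = json.loads(line)
            yield record["user"], record.get("ms", 0)
        except (json.JSONDecodeError, KeyError):
            continue

print(list(read_with_schema(raw_log)))  # prints [('amy', 120), ('bo', 0)]
```

A different analysis could read the same raw files with a different "schema," which is what makes cheap Hadoop storage useful for data whose value is not yet known.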

Who Uses Hadoop?

The Apache Software Foundation maintains a list of companies using Hadoop, and usage goes beyond powering search engines or analyzing customer behavior to better target ads. Here’s how some big names are using Hadoop:

eBay uses Hadoop for search optimization.

Hadoop powers Etsy’s Taste Test feature, which helps the site determine what products best suit a customer’s taste.

At Facebook, Hadoop is used to store copies of internal log and dimension data sources and as a source for not only reporting and analytics but also machine learning.

Hadoop powers LinkedIn’s People You May Know feature.

Hadoop enables Opower to suggest ways for consumers to save money on energy bills.

To determine user preferences, Orbitz uses Hadoop to analyze every aspect of visitors’ sessions on its sites.

Spotify uses Hadoop for content generation and for data aggregation, reporting, and analysis.

Twitter uses Hadoop to store and process tweets and log files.

Yahoo! has more than 40,000 computers running Hadoop to support research for Ad Systems and Web Search.

Learn Hadoop

You’ve got lots of choices for online Hadoop training. Here are some options to consider: