The term Big Data has generated plenty of hype already. Executives know that marketing strategies are most likely to succeed when planned around big data analytics: it improves business intelligence, boosts lead generation efforts, and makes it possible to offer personalized experiences that turn customers into loyal ones. However, making sense of vast amounts of data that exist in multi-structured formats such as images, videos, weblogs, and sensor data is a challenging task.

To store, process, and analyze terabytes or even petabytes of such information, one needs a big data framework. In this blog, I compare two very popular big data technologies: Apache Hadoop and Apache Spark.

First, What Are Hadoop and Spark?

Hadoop: Hadoop, an Apache Software Foundation project, was the first big data framework to become popular in the open source community. Being both a software library and a big data framework, Hadoop enables distributed storage and processing of large datasets across clusters of computers using simple programming models. It is composed of modules designed to handle common hardware failures automatically.

The four primary modules that comprise Hadoop’s core are:

Hadoop Common: The collection of common utilities and libraries that support other Hadoop modules.

Hadoop Distributed File System (HDFS): The primary storage system used by Hadoop applications.

Hadoop MapReduce: A software framework for processing large datasets in parallel.

Hadoop YARN (Yet Another Resource Negotiator): A cluster management technology.

Beyond Hadoop’s core modules, the wider ecosystem includes projects such as Cassandra, Hive, Pig, Ambari, Avro, Oozie, Sqoop, and Flume. These projects complement Hadoop in building big data applications and processing large datasets.

Hadoop was originally designed to crawl billions of web pages and collect their information into a database. That need gave birth to HDFS and Hadoop’s distributed processing engine, MapReduce. Hadoop is a great help for companies that have no effective way to handle large, complex datasets in a reasonable amount of time.
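To make the MapReduce model concrete, here is a toy word count in plain Python. This is only a sketch of the data flow between the map, shuffle, and reduce phases; real Hadoop jobs implement Mapper and Reducer classes (typically in Java) and the framework distributes the work across the cluster:

```python
from collections import defaultdict

# Toy illustration of MapReduce's three phases, using word count as the
# classic example. Everything here runs locally; in Hadoop each phase is
# distributed across many machines.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values -- here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["Hadoop stores data", "Spark and Hadoop process data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

Running this yields `{"hadoop": 2, "data": 2, ...}`; the same pattern scales because each phase can be split across machines.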

Apache Spark: Spark, also an open-source framework for general data analytics on a distributed computing cluster, was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. Spark’s in-memory processing, which enables near-real-time analytics, gives it a substantial performance lead over Hadoop’s MapReduce.

Although Spark can run as a standalone cluster, it can also run inside existing Hadoop clusters through YARN, which makes it versatile. An interesting point to note here is that Spark has no distributed filesystem of its own; for distributed storage it must rely on HDFS or alternatives such as the MapR File System, Cassandra, OpenStack Swift, Amazon S3, or Kudu. Spark also ships with an ecosystem of libraries for machine learning, interactive queries, and more.
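Spark expresses a job as a chain of transformations on in-memory datasets (RDDs or DataFrames). The sketch below mimics that chaining style with a toy class in plain Python, so it runs anywhere; in real PySpark the same word count would read roughly `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`, with the data partitioned across the cluster instead of held in one local list:

```python
# Toy stand-in for Spark's RDD API, to show the chained, in-memory
# transformation style. This is an illustration, not the PySpark library.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)  # held in memory, like a cached Spark partition

    def flat_map(self, fn):
        """Apply fn to each element and flatten the results."""
        return ToyRDD(x for item in self.data for x in fn(item))

    def map(self, fn):
        """Apply fn to each element."""
        return ToyRDD(fn(x) for x in self.data)

    def reduce_by_key(self, fn):
        """Combine values that share a key, pairwise, using fn."""
        out = {}
        for k, v in self.data:
            out[k] = fn(out[k], v) if k in out else v
        return ToyRDD(out.items())

    def collect(self):
        """Materialize the result, as Spark's collect() does."""
        return dict(self.data)

lines = ToyRDD(["spark runs in memory", "spark can use hdfs"])
word_counts = (lines
               .flat_map(str.split)
               .map(lambda w: (w, 1))
               .reduce_by_key(lambda a, b: a + b)
               .collect())
```

Because each step returns a new dataset rather than writing to disk, iterative workloads (such as machine learning) avoid the per-stage disk I/O that MapReduce incurs.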

Now that we have a glimpse of Hadoop and Spark, it’s time to look at the different types of data processing they perform.
