The volume of data generated every day is hard to pin down because it keeps growing at a rapid rate. Data is everywhere, but what matters more is the intelligence we can glean from it. These large volumes of data are what we call "Big Data." Organizations generate and gather huge volumes of data in the belief that it will help them advance their products and improve their services. For example, a shop may hold customer information, stock details, purchase history, and records of website visits.

Organizations often store this data for regular business activities but fail to use it for further analytics and business relationships. Data that is left unanalyzed and unused is what we call "Dark Data."

"Big Data is indeed a buzzword, but it is one that is frankly under-hyped." - Ginni Rometty

The problem of untangling insights from data obtained from multiple sources has been around since software applications were first created. The process is normally time-consuming, and because data moves so fast, the results are often obsolete before they can inform a decision. The main aim of this blog series is to show how to make effective use of big data and extend business intelligence to decipher insights quickly and accurately from raw enterprise data on Alibaba Cloud.

What Is Big Data?

In the simplest terms, when the data you have is too large to be stored and analyzed by traditional databases and processing tools, it is "Big Data." The definition is easier to understand through the 3Vs of big data:

Volume - Massive amounts of data from various sources.

Variety - The range of non-standard formats in which data is generated by both humans and machines.

Velocity - The high speed at which data is generated, stored, processed, and retrieved.

Big Data and Analytics

Every individual and organization has data in one form or another, which they have traditionally managed using spreadsheets, Word documents, and databases. With emerging technologies, the size and variety of data are increasing day by day, and it is no longer possible to analyze the data through traditional means.

The most important aspect of big data analytics is understanding your data. A good way to do this is to ask yourself these questions:

Where do I get this data?

What sort of data is it?

How do I bring this data into a big data environment?

Where do I store this received data?

How do I process and analyze the stored data?

What insights can I get out of it?

How can these insights transform my business?

Before exploring Alibaba Cloud's E-MapReduce, this article answers the questions above to help you get started with big data.

Data Sources and Types

Data is typically generated when a user interacts with a physical device, software, or system. The resulting data can be classified into three types:

Transactional Data - The most important data to consider. It is the data recorded by large retailers and B2B companies on a daily basis. It is collected for every event that occurs, for example, number of products, products purchased, stock modified, customer information, distributor details, and much more.

Social Data - Common or public data that can provide remarkable insights to companies. For example, a customer may tweet about a product, or like and comment on a purchase. This can help companies predict consumer behavior, purchasing patterns, and sentiment, and is typically a kind of CRM data.

Machine Data - A major source of real-time data, generated by electronic devices such as sensors and machines, and even by web logs.

For most enterprises, data can be categorized into the following types.

Structured Data - When you can place data in a relational database with an enforced schema, the data is called "structured." Analysis becomes easier because the structures and relations between data are pre-defined. A common type of structured data is a table.

Unstructured Data - Though big data is a collection of many kinds of data, about 90% of big data is unstructured by common estimates. Data that has its own internal structure but does not clearly fit into a database is "unstructured." This includes text documents, audio, video, image files, emails, presentations, web content, and streaming data.

Semi-structured Data - This type of data cannot be accommodated in a relational database but can be tagged, which makes analysis easier. XML documents, JSON documents, and the records stored in NoSQL databases are considered semi-structured.
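To make the three categories concrete, here is a minimal Python sketch. It is only an illustration: the `purchases` table, the JSON event, and the review text are hypothetical samples, and the in-memory SQLite database merely stands in for any relational database with an enforced schema.

```python
import json
import sqlite3

# Structured: every row must match the schema enforced by the database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "customer TEXT NOT NULL, item TEXT NOT NULL, qty INTEGER NOT NULL)"
)
conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", ("alice", "kettle", 2))
total_qty = conn.execute("SELECT SUM(qty) FROM purchases").fetchone()[0]

# Semi-structured: no fixed schema, but keys (tags) label each value.
event = json.loads(
    '{"user": "alice", "action": "like", "tags": ["kettle", "kitchen"]}'
)
liked_tags = event.get("tags", [])

# Unstructured: free text; any structure has to be inferred during analysis.
review = "Love my new kettle, it boils water really fast!"
mentions_kettle = "kettle" in review.lower()

print(total_qty, liked_tags, mentions_kettle)
```

The structured case can be queried directly with SQL, the semi-structured case is navigated through its keys, and the unstructured case needs some form of text analysis before it yields anything, here a simple keyword check.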

Big Data Ecosystem

Hadoop

Whenever we talk about big data, it is not uncommon to hear the word Hadoop.

"Hadoop is an open source framework that manages distributed storage and data processing for big data applications running in clusters." It is mainly used for batch processing. The core parts of Apache Hadoop are:

Hadoop Distributed File System (HDFS) - used for storage.

MapReduce - used for processing.

Since the data is large, Hadoop splits files into blocks and distributes them across the nodes in a cluster. Each block is replicated on several nodes (three by default), so the data survives the failure of any single node.
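The splitting and distribution of blocks can be sketched in a few lines of Python. This is a simplified illustration, not how HDFS is implemented: the function names and node names are hypothetical, the block size is a toy value (recent Hadoop versions default to 128 MB), and the round-robin placement is an assumption made for brevity, whereas real HDFS placement is rack-aware. The replication factor of 3 matches the HDFS default.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)  # block lengths: 4, 4, 2
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Each block ends up on three of the four nodes, so losing any one node still leaves at least two copies of every block.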

HDFS - The primary storage system used by Hadoop applications. HDFS is a distributed file system that stores files as data blocks and replicates them across other nodes.

MapReduce - MapReduce receives data from HDFS and splits the input data. Processing can then be done on all parts of the data simultaneously, which we call distributed processing.
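The MapReduce idea can be illustrated with a word count in plain Python. This is a minimal sketch, not Hadoop's API: the function names and input splits are hypothetical, and a thread pool stands in for the cluster nodes that would process the splits in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key across every split."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: collapse each key's values into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}


# Each string plays the role of one input split read from storage.
splits = ["big data big insights", "data moves fast", "big clusters"]

# The map step runs on all splits at the same time.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, splits))

counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each split is mapped independently, adding more workers lets the map step scale with the data; only the shuffle needs to see the output of every split.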

How to Get Data Into a Big Data Environment

Sqoop - The name Sqoop is derived from "SQL + Hadoop," which reflects its purpose: transferring data between Hadoop and relational database servers. When the data is structured and arrives in batches, you can use Sqoop as a loading tool to push it into Hadoop.

Apache Flume - A data flow service used for efficiently collecting, aggregating, and pushing large amounts of streaming data into Hadoop.

Kafka - Used on real-time streaming data to support real-time analysis. When data is unstructured and streaming, Kafka and Flume together form the processing pipeline.

Where to Store the Data

HDFS - As mentioned earlier, HDFS is the primary storage system for Hadoop applications.

Apache HBase - A column-oriented data store built to run on top of HDFS. It is a non-relational Hadoop database.

Apache Cassandra - A highly scalable, distributed NoSQL database designed to handle large amounts of data with no single point of failure.

How to Process the Data

Spark - Apache Spark has overtaken MapReduce in popularity because it can process data in memory, while MapReduce has to read from and write to disk. As a result, Spark can be up to 100 times faster for some in-memory workloads, and it allows data workers to efficiently execute streaming, machine learning, or SQL workloads. There are also emerging tools such as Storm, Samza, and Flink.

Hive - Hive makes work easier for SQL developers because it provides a SQL-like interface for interacting with stored data. Apache Hive is a data warehouse software project built on top of Apache Hadoop for querying and analysis.

Impala - Impala is similar to Hive: it is a distributed SQL query engine for Apache Hadoop.

Apache Pig - Processing tools like MapReduce and Spark require programming knowledge that many data analysts do not have. Apache Pig, originally developed at Yahoo, addresses this with a language called Pig Latin for analyzing massive datasets.
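The appeal of a SQL-like interface is easier to see side by side with hand-written code. The sketch below is only an analogy: it uses Python's built-in SQLite as a stand-in for a SQL engine such as Hive or Impala (whose dialects and execution models differ), and the `purchases` data is a hypothetical sample.

```python
import sqlite3
from collections import defaultdict

purchases = [("alice", "kettle", 2), ("bob", "toaster", 1), ("alice", "toaster", 3)]

# Programmatic style: the analyst writes the aggregation logic by hand.
totals_by_hand: dict[str, int] = defaultdict(int)
for customer, _item, qty in purchases:
    totals_by_hand[customer] += qty

# Declarative style: the same question expressed as a SQL query.
# SQLite is an assumption here, standing in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", purchases)
totals_by_sql = dict(
    conn.execute("SELECT customer, SUM(qty) FROM purchases GROUP BY customer")
)

print(dict(totals_by_hand), totals_by_sql)
```

Both approaches give the same totals, but the query states what is wanted and leaves the engine to decide how to compute it, which is what lets SQL developers work with big data without writing processing jobs themselves.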

Data Analytics and Business Intelligence Tools

Now that we have figured out how to collect, store, and process the data, we need a tool for visualizing it to make business intelligence possible. Various business intelligence tools can add value to big data, such as Alibaba Cloud's DataV and QuickBI.

Resource Management and Scheduling

Apart from this main cycle, we will also look at some resource management tools, such as:

YARN - Yet Another Resource Negotiator

ZooKeeper

Other scheduling tools like Oozie, Azkaban, Cron, and Luigi play a major role in scheduling Hadoop and Sqoop jobs when you have a large number of tasks to run.

Big Data in Today’s Business

At the end of the day, it's up to organizations to use all their data to create valuable insights and transform their business. Every organization holds huge volumes of its own data; the more efficiently that data is used, the more potential the company has to grow. Organizations can use the business insights produced by this entire process to increase their efficiency and make better decisions, and so outperform their peers and competitors in the market.

In the next article, we will show you how to build a big data environment on Alibaba Cloud with Object Storage Service and E-MapReduce.