What is Big Data?

What is Big Data?: Before we start with this section on what Big Data is, you first need to understand what data is. So, what is data? Data can be defined as the figures or facts that can be stored in or can be used by a Read More

Hadoop Architecture Overview

How does Hadoop work?: Apache Hadoop was developed with the goal of having an inexpensive, redundant data store that would enable organizations to leverage Big Data analytics economically and increase business profitability. A Hadoop architectural design needs to account for several design factors in terms of networking, computing power, and storage. Hadoop provides a reliable, scalable, flexible, and distributed computing Read More

Hadoop Installation

Hadoop Installation: In this section of the Hadoop tutorial, we will be talking about the Hadoop installation process. Hadoop is basically supported by the Linux platform and its facilities. If you are working on Windows, you can use Cloudera VMware that has preinstalled Hadoop, or you can use Oracle VirtualBox or the VMware Workstation. In this tutorial, I will be Read More

Introduction to Hadoop

What is Apache Hadoop?: Apache Hadoop was born to enhance the usage of Big Data and solve its major issues. The web was generating loads of information on a daily basis, and it was becoming very difficult to manage the data of around one billion pages of content. As a revolutionary step, Google invented a new methodology of processing data Read More

Hadoop Ecosystem

What is Hadoop Ecosystem?: The core Hadoop ecosystem is nothing but the different components that are built directly on the Hadoop platform. However, there are a lot of complex interdependencies between these systems. Before starting this Hadoop ecosystem tutorial, let's see what we Read More

What is HDFS?

HDFS in Hadoop: So, what is HDFS? HDFS or Hadoop Distributed File System, which is completely written in Java programming language, is based on the Google File System (GFS). Google had only presented a white paper on this, without providing any particular implementation. It is interesting that around 90 percent of the GFS architecture has been implemented in HDFS. HDFS Read More
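Like GFS, HDFS stores a file as a sequence of fixed-size blocks replicated across DataNodes (128 MB blocks and a replication factor of 3 are the HDFS defaults). The sketch below is a conceptual illustration of that layout, not the real HDFS API; the node names and round-robin placement are simplifications (real HDFS placement is rack-aware):

```python
# Conceptual sketch (not the real HDFS API): how a file is divided into
# fixed-size blocks and how each block is assigned replicas on DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128 MB)
REPLICATION = 3                 # HDFS default replication factor

def plan_blocks(file_size, datanodes):
    """Return a list of (block_index, block_length, replica_nodes)."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        # Round-robin replica placement; real HDFS is rack-aware.
        replicas = [datanodes[(index + r) % len(datanodes)]
                    for r in range(min(REPLICATION, len(datanodes)))]
        blocks.append((index, length, replicas))
        offset += length
        index += 1
    return blocks

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB,
# each replicated on three of the four DataNodes.
layout = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```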

HDFS Operations

Starting HDFS: Format the configured HDFS file system, then open the NameNode (HDFS server) and execute the following command: $ hadoop namenode -format Start the distributed file system and follow the command listed below to start the NameNode as well as the DataNodes in the cluster: $ start-dfs.sh Read More

MapReduce in Hadoop

What is MapReduce in Hadoop?: Now that you know about HDFS, it is time to talk about MapReduce. So, in this section, we're going to learn the basic concepts of MapReduce. We will learn MapReduce in Hadoop using a fun example! MapReduce in Hadoop is nothing but the processing model in Hadoop. The programming model of MapReduce is designed to Read More
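The processing model can be illustrated with the classic word-count example. The toy Python sketch below runs the three MapReduce phases — map, shuffle (group by key), and reduce — in a single process; real Hadoop distributes the same phases across a cluster:

```python
# Toy word count illustrating MapReduce's three phases: map, shuffle
# (group values by key), and reduce. Real Hadoop runs these in parallel
# across many machines; the logic per phase is the same.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop Mapper.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group values by key, like Hadoop's shuffle/sort step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word, like a Hadoop Reducer.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
# counts["the"] == 3, counts["fox"] == 2
```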

What is YARN

What is YARN in Hadoop?: So, what is YARN in Hadoop? Apache YARN (Yet Another Resource Negotiator) is a resource management layer in Hadoop. YARN came into the picture with the introduction of Hadoop 2.x. It allows various data processing engines such as interactive processing, graph processing, batch processing, and stream processing to run and process data stored in HDFS Read More
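Conceptually, YARN's ResourceManager grants container requests from applications against the capacity reported by NodeManagers. The sketch below is a deliberately simplified illustration of that negotiation — the class names, numbers, and first-fit policy are assumptions for the example, not the real YARN API or scheduler:

```python
# Conceptual sketch of YARN resource negotiation: applications request
# containers, and the ResourceManager grants them against NodeManager
# capacity. Names and the first-fit policy are illustrative only.
class Node:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

def allocate(nodes, request_mb):
    """Grant a container on the first node with enough free memory."""
    for node in nodes:
        if node.free_mb >= request_mb:
            node.free_mb -= request_mb
            return node.name
    return None  # request waits until resources free up

cluster = [Node("nm1", 4096), Node("nm2", 4096)]
granted = [allocate(cluster, 2048) for _ in range(5)]
# Four 2 GB containers fit (two per node); the fifth request is queued.
```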

What is Pig in Hadoop?

What is Pig in Hadoop?: Pig Hadoop is basically a high-level programming language that is helpful for the analysis of huge datasets. Pig Hadoop was developed by Yahoo! and is generally used with Hadoop to perform a lot of data administration operations. For writing Read More

Hadoop Hive: An In-depth Hive Tutorial for Beginners

Hadoop Hive: Apache Hive is an open-source data warehouse system that has been built on top of Hadoop. You can use Hive for analyzing and querying large datasets that are stored in Hadoop files. Processing structured and semi-structured data can be done by using Hive. Let’s look at the agenda for this section first: What is Hive in Read More

Streaming

Introduction to Hadoop Streaming: Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your program, so you can write MapReduce programs in any language that can read from standard input and write to standard output. Hadoop offers several mechanisms to support non-Java development. The primary mechanisms are Hadoop Pipes, which gives a native C++ interface to Read More
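The stdin/stdout contract means a Streaming mapper is just an ordinary script. Here is a minimal word-count mapper in Python: it reads raw lines on standard input and emits tab-separated `word<TAB>1` pairs on standard output, which Hadoop then sorts by key before handing them to the reducer:

```python
#!/usr/bin/env python3
# A minimal Hadoop Streaming mapper. Hadoop pipes each input split
# through this script's standard input; every emitted line is a
# tab-separated key/value pair ("word\t1") on standard output.
import sys

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")

if __name__ == "__main__":
    run_mapper()
```

A matching reducer would read the sorted pairs from its standard input and sum the counts per word. The exact submission command depends on your installation's streaming jar path, roughly: `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in/ -output out/`.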

Multi-Node Cluster

Setting Up a Multi-Node Cluster in Hadoop: Installing Java Syntax of the java version command: $ java -version The following output is presented: java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b13) Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode) Creating User Account Read More

HBase

Architecture of HBase Cluster: HBase: The Hadoop Database. HBase is an open-source platform and is horizontally scalable. It is a distributed, column-oriented database built on top of the Hadoop file system, and it is a non-relational (NoSQL) database system. HBase is a faithful, open-source implementation of Google's Bigtable. Read More
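The column-oriented, Bigtable-style data model can be pictured as a map of maps: a row key points to column families, each holding qualifier-to-value cells. The sketch below is only an illustration of that layout — it is not the HBase client API, and the table and column names are made up for the example:

```python
# Toy sketch of HBase's data model: row key -> column family ->
# column qualifier -> cell value. This mirrors the Bigtable layout;
# it is an illustration, not the real HBase client API.
table = {}

def put(row_key, family, qualifier, value):
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

def get(row_key, family, qualifier):
    return table.get(row_key, {}).get(family, {}).get(qualifier)

put("user1", "info", "name", "Alice")
put("user1", "info", "email", "alice@example.com")
put("user1", "stats", "logins", "42")
# get("user1", "info", "name") == "Alice"
```

Because each column family is its own map, rows need not share the same columns — which is why HBase handles sparse, semi-structured data well.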

Sqoop and Impala

Sqoop: Sqoop is an automated bulk data transfer tool that allows simple import and export of data between structured data stores — relational databases, enterprise data warehouses, and NoSQL systems — and the Hadoop ecosystem. Key features Read More

Oozie Tutorial

What is Oozie in Hadoop?: Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed environment. Oozie supports combining multiple complex jobs that run in a particular order for accomplishing a more significant task. With Oozie, within a particular set of tasks, two or more jobs can be programmed to run in parallel. Let’s Read More
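Running two jobs in parallel is expressed in an Oozie workflow with a fork/join pair. The fragment below is a hypothetical workflow definition sketching that shape — the action bodies (e.g. a map-reduce or shell action element) are elided, and all names are illustrative:

```xml
<!-- Hypothetical Oozie workflow: the fork starts two actions that run
     in parallel; the join waits for both before the workflow ends. -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="parallel-demo">
    <start to="fork-jobs"/>
    <fork name="fork-jobs">
        <path start="job-a"/>
        <path start="job-b"/>
    </fork>
    <action name="job-a">
        <!-- action body (e.g. a map-reduce element) elided -->
        <ok to="join-jobs"/>
        <error to="fail"/>
    </action>
    <action name="job-b">
        <!-- action body elided -->
        <ok to="join-jobs"/>
        <error to="fail"/>
    </action>
    <join name="join-jobs" to="end"/>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```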

Apache Flume Tutorial

What is Apache Flume in Hadoop?: Apache Flume is basically a tool or a data ingestion mechanism responsible for collecting and transporting huge amounts of data such as events, log files, etc. from several sources to one central data store. Apache Flume is a unique tool designed to copy log data or streaming data from various different web servers to Read More

Zookeeper and Hue

Zookeeper: ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers. The ZooKeeper service is replicated over a set of machines, and all machines keep a copy of the data in memory. A leader is chosen at service startup. Each client connects to exactly one ZooKeeper Read More

Hive cheat sheet

Introduction: All industries deal with Big Data — that is, large amounts of data — and Hive is a tool used for the analysis of this Big Data. Apache Hive is a tool where data is stored for analysis and querying. This cheat sheet guides you through the basic concepts and commands required to get started with it. You Read More

PIG Basics Cheat Sheet

Pig Basics User Handbook: Are you a developer looking for a high-level scripting language to work on Hadoop? If yes, then you must take Apache Pig into consideration. This Pig cheat sheet is designed for those who have already started learning scripting languages like SQL and are using Pig as a tool; this sheet will be Read More

Big Data Solutions

Differentiation between Operational vs. Analytical Systems:

                 Operational            Analytical
Latency          1 ms to 100 ms         1 min to 100 min
Concurrency      1,000 to 100,000       1 to 10
Access Pattern   Writes and Reads       Reads
Queries          Selective              Unselective
Data Scope       Operational            Retrospective
End User         Customer               Data Scientist
Technology       NoSQL Database         MapReduce, MPP Database

Traditional Enterprise Approach: This enterprise approach will use a computer Read More

PIG Built-in Functions Cheat Sheet

Pig Built-in Functions User Handbook: Are you a developer looking for a high-level scripting language to work on Hadoop? If yes, then you must take Apache Pig into consideration. This Pig cheat sheet is designed for those who have already started learning scripting languages like SQL and are using Pig as a tool; this sheet will Read More