Hadoop Training Syllabus

Module 1 (Duration: 06:00:00)

Introduction to Big Data & Hadoop Fundamentals

Goal: In this module, you will understand Big Data, the limitations of existing solutions to the Big Data problem, how Hadoop solves that problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS, the anatomy of a file write and read, and how the MapReduce framework works.

Objectives – Upon completing this module, you should be able to understand that Big Data is a term applied to data sets that cannot be captured, managed, and processed within a tolerable elapsed time by commonly used software tools.

Big Data is characterized by three dimensions of processing: volume, velocity, and variety.

Data can be divided into three types—unstructured data, semi-structured data, and structured data.

Big Data technology understands and navigates big data sources, analyzes unstructured data, and ingests data at a high speed.

Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

Topics: Apache Hadoop

Introduction to Big Data & Hadoop Fundamentals

Dimensions of Big data

Type of Data generation

Apache ecosystem & its projects

Hadoop distributors

HDFS core concepts

Modes of Hadoop deployment (standalone, pseudo-distributed, fully distributed)

HDFS Flow architecture

Hadoop MRv1 vs. MRv2 (YARN) architecture

Types of Data compression techniques

Rack topology

HDFS utility commands (see the sketch at the end of this module)

Minimum hardware requirements for a cluster & property file changes
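As a taste of the HDFS utility commands topic above, here is a minimal shell session. It is a sketch only: the directory /user/train/input and the file access.log are illustrative names, not part of the course material.

    # Create a directory in HDFS and upload a local file into it
    hdfs dfs -mkdir -p /user/train/input
    hdfs dfs -put access.log /user/train/input/

    # List the directory and peek at the file's contents
    hdfs dfs -ls /user/train/input
    hdfs dfs -cat /user/train/input/access.log | head

    # Report how the file is split into blocks and replicated across DataNodes
    hdfs fsck /user/train/input/access.log -files -blocks

The same file operations are also available through the older hadoop fs entry point.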

Module 2 (Duration: 03:00:00)

MapReduce Framework

Goal: In this module, you will understand the Hadoop MapReduce framework and how MapReduce works on data stored in HDFS. You will understand concepts like input splits in MapReduce, the Combiner and the Partitioner, with demos on MapReduce using different data sets.

Objectives – Upon completing this module, you should be able to understand that MapReduce processes jobs using the batch processing technique.

MapReduce programs can be written in Java.

Hadoop ships with a hadoop-examples JAR file, which administrators and programmers commonly use to test MapReduce applications (a sample run appears at the end of this module).

MapReduce contains steps like splitting, mapping, combining, reducing, and output.

Topics: Introduction to MapReduce

MapReduce Design flow

MapReduce Program (Job) execution

Types of Input formats & Output Formats

MapReduce Datatypes

Performance tuning of MapReduce jobs

Counters techniques
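To illustrate the hadoop-examples JAR mentioned in the objectives, here is a hedged sketch of a WordCount run. The JAR's exact name and path vary by distribution and version, and the input/output paths are illustrative.

    # Path to the bundled examples JAR (shown for a typical Hadoop 2.x/3.x layout)
    EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar

    # Run the WordCount job: input is split, mapped, combined, reduced, and written out
    hadoop jar $EXAMPLES_JAR wordcount /user/train/input /user/train/output

    # Each reducer writes one part-r-* file to the output directory
    hdfs dfs -cat /user/train/output/part-r-00000 | head

Note that the output directory must not already exist; MapReduce refuses to overwrite it.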

Module 3 (Duration: 03:00:00)

Apache Hive

Goal: This module will help you understand Hive concepts, Hive data types, loading and querying data in Hive, running Hive scripts, and Hive UDFs.

Objectives – Upon completing this module, you should be able to understand that Hive is a system for managing and querying data in Hadoop by projecting a structured, table-like format onto it.

The various components of the Hive architecture are the metastore, driver, execution engine, and so on.

Metastore is a component that stores the system catalog and metadata about tables, columns, partitions, and so on.

Hive installation starts with locating the latest version of the tar file and downloading it on an Ubuntu system using the wget command.

While programming in Hive, use the SHOW TABLES command to list the tables in the current database.

Topics: Introduction to Hive & features

Hive architecture flow

Types of Hive tables

DML/DDL commands explanation

Partitioning logic

Bucketing logic (see the sketch after this list)

Hive script execution in shell & HUE
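The sketch below ties together the partitioning, bucketing, and SHOW TABLES points from this module. It assumes the hive CLI is on the PATH; the table names (sales, users) and their columns are illustrative.

    hive -e "
    -- Partitioned table: HDFS stores one subdirectory per country value
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Bucketed table: rows are hashed on id into 4 buckets (files)
    CREATE TABLE IF NOT EXISTS users (id INT, name STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS;

    -- List the tables in the current database
    SHOW TABLES;
    "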

Module 4 (Duration: 03:00:00)

Apache Pig

Goal: In this module, you will learn Pig, the types of use cases where Pig is appropriate, the tight coupling between Pig and MapReduce, and Pig Latin scripting, as well as Pig running modes, Pig UDFs, Pig streaming, and testing Pig scripts, with a demo on a healthcare dataset.

Objectives – Upon completing this module, you should be able to understand that Pig is a high-level data flow scripting language with two major components: the runtime engine and the Pig Latin language.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be written in two modes: interactive mode and batch mode.

The Pig engine can be installed by downloading a release from the mirrors linked at pig.apache.org.

Topics:

Introduction to Pig concepts

Pig modes of execution/storage concepts

Pig program logics explanation

Pig basic commands

Pig script execution in shell/HUE
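As a minimal sketch of Pig Latin scripting in batch mode, the script below (call it wordcount.pig; the name and input path are illustrative) counts word occurrences:

    -- wordcount.pig: load lines, split into words, group, and count
    lines   = LOAD '/user/train/input/access.log' USING TextLoader() AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group, COUNT(words);
    DUMP counts;

The two execution modes from the objectives correspond to two launch commands; in local mode the input path resolves on the local filesystem, in MapReduce mode on HDFS:

    pig -x local wordcount.pig       # local mode: quick testing on one machine
    pig -x mapreduce wordcount.pig   # MapReduce mode: runs as jobs on the cluster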

Module 5 (Duration: 03:00:00)

Goal: This module will cover advanced HBase concepts. We will see demos on bulk loading and filters. You will also learn what ZooKeeper is all about, how it helps in monitoring a cluster, and why HBase uses ZooKeeper.

Objectives – Upon completing this module, you should be able to understand that HBase has two types of nodes, Master and RegionServer. Only one Master node runs at a time, but there can be multiple RegionServers running at a time.

The data model of HBase comprises tables that are sorted by row key. Column families must be defined at the time of table creation.

There are eight steps that should be followed for the installation of HBase.

Some of the commands available in the HBase shell are create, drop, list, count, get, and scan (see the sketch at the end of this module).

Topics: Apache HBase

Introduction to HBase concepts

Introduction to NoSQL/CAP theorem concepts

HBase design/architecture flow

HBase table commands

Hive + HBase integration module/JARs deployment

HBase execution in shell/HUE
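As a sketch of the HBase shell commands listed in the objectives, the session below uses an illustrative table named employee with one column family, personal:

    hbase shell
    # (the following commands are typed at the hbase> prompt)
    create 'employee', 'personal'
    put 'employee', 'row1', 'personal:name', 'Asha'
    get 'employee', 'row1'
    scan 'employee'
    count 'employee'
    list
    # a table must be disabled before it can be dropped
    disable 'employee'
    drop 'employee'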

Module 6 (Duration: 02:00:00)

Goal: Sqoop is an Apache Hadoop ecosystem project for import and export operations between Hadoop and relational databases. Some reasons to use Sqoop are as follows:

SQL servers are deployed worldwide

Nightly processing is done on SQL servers

Allows moving selected parts of data from a traditional SQL database to Hadoop

Transferring data using ad-hoc scripts is inefficient and time-consuming

To handle large volumes of data through the Hadoop ecosystem

To bring processed data from Hadoop to the applications

Objectives – Upon completing this module, you should be able to understand that Sqoop is a tool designed to transfer data between Hadoop and relational databases such as MySQL, Microsoft SQL Server, PostgreSQL, and Oracle.

Sqoop allows importing data from a relational database, such as MySQL, SQL Server, or Oracle, into HDFS.

Topics: Apache Sqoop

Introduction to Sqoop concepts

Sqoop internal design/architecture

Sqoop Import statements concepts

Sqoop Export Statements concepts

Quest Data connectors flow

Incremental updating concepts

Creating a database in MySQL for importing to HDFS

Sqoop commands execution in shell/HUE
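The following sketch walks through creating a MySQL source database, importing it into HDFS, and running an incremental import. The host, credentials, database, and table names (localhost, shop, orders) are illustrative.

    # Create an illustrative source database and table in MySQL
    mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS shop;
      CREATE TABLE IF NOT EXISTS shop.orders (id INT PRIMARY KEY, total DOUBLE);"

    # Full import of the table into HDFS over JDBC
    sqoop import \
      --connect jdbc:mysql://localhost/shop \
      --username root -P \
      --table orders \
      --target-dir /user/train/orders

    # Incremental import: append only rows whose id exceeds the last imported value
    sqoop import \
      --connect jdbc:mysql://localhost/shop \
      --username root -P \
      --table orders \
      --target-dir /user/train/orders \
      --incremental append --check-column id --last-value 100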

Module 7 (Duration: 02:00:00)

Goal: Apache Flume is a distributed data collection service that gathers data as it flows from its sources and aggregates it where it needs to be processed.

Objectives – Upon completing this module, you should be able to understand that Apache Flume is a distributed data collection service that gathers data from its sources and aggregates it at a sink.

Flume provides a reliable and scalable agent mode to ingest data into HDFS.

Topics: Apache Flume

Introduction to Flume & features

Flume topology & core concepts

Property file parameters logic
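To make the property file parameters topic concrete, below is a minimal agent configuration in the style of the Flume user guide. The agent name a1, the file name netcat-agent.conf, and the netcat source on port 44444 are illustrative; a production setup would typically use an hdfs sink instead of the logger sink.

    # netcat-agent.conf: one source, one in-memory channel, one sink
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Source: listen for lines of text on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory

    # Sink: log events to the console (swap for an hdfs sink in practice)
    a1.sinks.k1.type = logger

    # Wire the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel    = c1

The agent is then started with:

    flume-ng agent --conf-file netcat-agent.conf --name a1 -Dflume.root.logger=INFO,console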

Module 8 (Duration: 02:00:00)

Goal: Hue is a web front end to Apache Hadoop, offered in the Cloudera VM.

Objectives – Upon completing this module, you should be able to understand how to use Hue for Hive, Pig, and Oozie.

Topics: Apache HUE

Introduction to Hue design

Hue architecture flow/UI interface

Module 9 (Duration: 02:00:00)

Goal: Following are the goals of ZooKeeper:

Serialization ensures the avoidance of delay in read and write operations.

Reliability: once a user applies an update in the cluster, it persists until it is overwritten.

Atomicity does not allow partial results; any user update either succeeds or fails.

Simple Application Programming Interface or API provides an interface for development and implementation.

Objectives – Upon completing this module, you should be able to understand that ZooKeeper provides a simple and high-performance kernel for building more complex clients.

ZooKeeper has three basic entities—Leader, Follower, and Observer.

A watch is used to notify clients, including followers and observers, when data changes.

Topics: Apache Zookeeper

Introduction to ZooKeeper concepts

ZooKeeper principles & usage in the Hadoop framework

Basics of ZooKeeper
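As a small sketch of working with znodes, the session below uses the CLI bundled with ZooKeeper. The host and port are the illustrative defaults, and the znode /demo is made up; exact watch syntax varies between ZooKeeper releases.

    # Connect to a ZooKeeper server with the bundled CLI
    zkCli.sh -server localhost:2181

    # (typed at the zkCli prompt) create, read, update, and delete a znode
    create /demo "hello"
    ls /
    get /demo
    set /demo "updated"
    delete /demo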

Module 10 (Duration: 05:00:00)

Goal: Explain different configurations of the Hadoop cluster

Identify different parameters for performance monitoring and performance tuning

Explain configuration of security parameters in Hadoop.

Objectives – Upon completing this module, you should be able to understand that Hadoop can be optimized based on the infrastructure and available resources.

Hadoop is an open-source project, and the support available for complicated optimization is limited.

Optimization is performed through XML configuration files such as core-site.xml and hdfs-site.xml (see the sketch at the end of this module).

Logs are the best medium through which an administrator can understand a problem and troubleshoot it.

Hadoop relies on a Kerberos-based security mechanism.

Topics: Administration concepts
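To ground the optimization and troubleshooting objectives above, here is a short, hedged sketch of everyday administration commands; the log file path is illustrative and varies by installation.

    # Read an effective configuration value without opening the XML files
    hdfs getconf -confKey dfs.replication

    # Cluster-wide report: capacity, live/dead DataNodes, per-node usage
    hdfs dfsadmin -report

    # Daemon logs are the first stop when troubleshooting
    tail -n 100 $HADOOP_HOME/logs/hadoop-*-namenode-*.log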