Apache Pig

Apache Pig is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs together with the infrastructure to evaluate those programs. Pig's key advantage is that its programs parallelize easily, which lets it handle very large amounts of data. Programming on this platform is done in the textual language Pig Latin.

Pig Latin comes with the following features:

Simple programming: it is easy to code, execute, and manage

Better optimization: the system can automatically optimize execution

Extensibility: users can write their own functions for highly specific processing tasks

Pig can be used for the following purposes:

ETL data pipeline

Research on raw data

Iterative processing.

The scalar data types in Pig are int, long, float, double, chararray, and bytearray. The complex data types in Pig are map, tuple, and bag.

Map: a set of key-value pairs. The keys are of type chararray, while the values can be of any Pig data type, including the complex types.

Example: ['city'#'bang', 'pin'#560001]

Here city and pin are keys mapping to the values bang and 560001.

Tuple: an ordered, fixed-length collection of fields, where each field can hold any data type.

Example: ('Bangalore', 560001)

Bag: an unordered collection of tuples, separated by commas.

Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}

LOAD function:

The LOAD function loads data from the file system. It is a relational operator. The first step in a data-flow language is to specify the input, which is done with the LOAD keyword.

The LOAD syntax is

LOAD 'mydata' [USING function] [AS schema];

Example: A = LOAD 'bigdatapath.txt';

A = LOAD 'bigdatapath.txt' USING PigStorage('\t');
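A LOAD statement can also attach a schema with AS, so that later operators can refer to fields by name. A short sketch (the file name and field names are illustrative):

```pig
-- Hypothetical tab-delimited input with two fields per line
A = LOAD 'bigdatapath.txt' USING PigStorage('\t')
    AS (city:chararray, pin:int);
DUMP A;  -- prints each record as a tuple, e.g. (Bangalore,560001)
```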

The relational operators in Pig include:

foreach, order by, filter, group, distinct, join, limit.

foreach: takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.

A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);

B = foreach A generate emp_name, emp_id;

filter: takes a predicate and selects which records are retained in the data pipeline.

Syntax: alias = FILTER alias BY expression;

Here alias is the name of a relation, BY is a required keyword, and the expression must evaluate to a Boolean.

Example: M = FILTER N BY F5 == 4;
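The other operators in the list work the same way, consuming one relation and producing another. A hedged sketch chaining several of them (the file, relation, and field names are made up for illustration):

```pig
A = LOAD 'emp.txt' USING PigStorage(',')
    AS (emp_name:chararray, dept:chararray, salary:long);
B = GROUP A BY dept;                     -- one record per department, holding a bag of employees
C = FOREACH B GENERATE group, COUNT(A);  -- department name and employee count
D = ORDER C BY $1 DESC;                  -- sort departments by count, largest first
E = LIMIT D 10;                          -- keep only the top ten
DUMP E;
```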

Apache Sqoop

Apache Sqoop is a tool used extensively to transfer large amounts of data between Hadoop and relational database servers. Sqoop can import various types of data from databases such as Oracle and MySQL.

Important Sqoop control commands to import RDBMS data

Append: append data to an existing dataset in HDFS. --append

Columns: columns to import from the table. --columns <col,col,...>

Where: WHERE clause to use during import. --where <where clause>

The common large object types in Sqoop are BLOB and CLOB. If an object is less than 16 MB, it is stored inline with the rest of the data. Larger objects are temporarily stored in a subdirectory named _lobs, and their data is materialized in memory for processing. If the inline LOB limit is set to zero (0), large objects are always stored in external storage.
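The inline threshold can be tuned with Sqoop's --inline-lob-limit option (a value in bytes). A sketch of forcing all large objects into external storage, as described above (the connection string and table name are illustrative):

```shell
sqoop import --connect jdbc:mysql://db.one.com/corp \
    --table DOCUMENTS \
    --inline-lob-limit 0   # 0 = always store LOBs externally
```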

Sqoop can export and import data from a table based on a WHERE clause. The relevant options are:

--columns <col1,col2,...>

--where <condition>

--query <SQL query>

Example:

sqoop import --connect jdbc:mysql://db.one.com/corp --table BIGDATAPATH_EMP --where "start_date > '2016-07-20'"

sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM bigdatapath_emp LIMIT 20"

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"

Sqoop supports importing data into the following services:

HDFS

Hive

HBase

HCatalog

Accumulo

Sqoop needs a connector to connect to the different relational databases. Almost all database vendors make a JDBC driver available that is specific to their database, and Sqoop needs that JDBC driver to interact with it.

The JDBC driver alone is not sufficient: Sqoop needs both the JDBC driver and a connector to connect to a database.

Sqoop command to control the number of mappers

We can control the number of mappers by passing the --num-mappers parameter in the Sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase gradually, because choosing too many mappers can degrade performance on the database side.

Syntax: -m, --num-mappers <n>
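For example, a sketch of an import limited to four map tasks (the connection details and table name are illustrative):

```shell
sqoop import --connect jdbc:mysql://db.one.com/corp \
    --table BIGDATAPATH_EMP \
    --num-mappers 4   # equivalent to -m 4; four parallel map tasks
```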

Sqoop command to show all the databases in MySQL server

$ sqoop list-databases --connect jdbc:mysql://database.test.com/

Sqoop metastore

It is a tool that hosts a shared metadata repository. Multiple local and remote users can define and execute saved jobs stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
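For example, a saved job can be created against a shared metastore roughly like this (the metastore host and job details are illustrative):

```shell
# Create a saved job in the shared metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
    --create emp_import -- import \
    --connect jdbc:mysql://db.one.com/corp --table BIGDATAPATH_EMP

# Any user pointing at the same metastore can then run it
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
    --exec emp_import
```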

The purpose of sqoop-merge is:

This tool combines two datasets: entries in the newer dataset overwrite entries in the older dataset, preserving only the newest version of each record across the two datasets.
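A sketch of the merge invocation (the HDFS paths, jar, class name, and key column are illustrative; the jar and class would come from an earlier import or codegen run):

```shell
sqoop merge --new-data /user/hadoop/emp_new \
    --onto /user/hadoop/emp_old \
    --target-dir /user/hadoop/emp_merged \
    --jar-file emp.jar --class-name emp \
    --merge-key emp_id   # rows sharing an emp_id keep only the newer version
```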

Apache Hive

Apache Hive is data warehouse software that lets you read, write, and manage huge datasets stored in a distributed environment using SQL. It can project structure onto data already in storage. Users can connect to Hive using a JDBC driver or a command-line tool.

Hive is an open-source system. We can use Hive for analyzing and querying large datasets in Hadoop files, using a language similar to SQL. (The current version at the time of writing was 0.13.1.)

Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID semantics are provided at the row level through INSERT, UPDATE, and DELETE operations.
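Row-level transactions require a bucketed ORC table declared transactional. A hedged HiveQL sketch (the table and column names are illustrative):

```sql
CREATE TABLE emp_txn (emp_id BIGINT, emp_name STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level ACID operations on the transactional table
UPDATE emp_txn SET emp_name = 'Ravi' WHERE emp_id = 101;
DELETE FROM emp_txn WHERE emp_id = 102;
```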

Hive is not considered a full database. The design rules and constraints of Hadoop and HDFS place restrictions on what Hive can do.

Hive is most suitable for data warehouse applications with the following characteristics:

Analyzing relatively static data

No requirement for fast response times

No rapid changes in the data

Hive does not provide the fundamental features required for OLTP (Online Transaction Processing). It is suited to data warehouse applications over large data sets.

The two types of tables in Hive

Managed table

External table
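The difference is mainly in who owns the data: dropping a managed table deletes its data, while dropping an external table leaves the underlying files in place. A sketch (the names and path are illustrative):

```sql
-- Managed: Hive owns both the metadata and the data
CREATE TABLE emp_managed (emp_id BIGINT, emp_name STRING);

-- External: Hive owns only the metadata; data stays at the given location
CREATE EXTERNAL TABLE emp_external (emp_id BIGINT, emp_name STRING)
LOCATION '/user/hadoop/emp_data';
```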

We can change settings within a Hive session using the SET command, which changes Hive job settings for a particular query.

Example: the following command ensures that buckets are populated according to the table definition.

hive> SET hive.enforce.bucketing=true;

We can see the current value of any property by using SET with the property name. SET alone will list all the properties whose values have been set by Hive.

hive> SET hive.enforce.bucketing;

hive.enforce.bucketing=true

This list will not include Hadoop defaults, so we should use:

SET -v

It will list all the properties in the system, including the Hadoop defaults.

A new node can be added with the following steps:

1) Take a new system and create a new username and password

2) Install SSH and set up SSH connections with the master node

3) Add the SSH public key (id_rsa.pub) to the authorized_keys file

4) Add the new DataNode's hostname, IP address, and other details to /etc/hosts and to the slaves file, e.g.

192.168.1.102 slave3.in slave3

5) Start the DataNode on the new node

6) Log in to the new node, e.g. su hadoop or ssh -X hadoop@192.168.1.103

7) Start HDFS on the newly added slave node by using the following command

./bin/hadoop-daemon.sh start datanode

8) Check the output of the jps command on the new node.
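The steps above can be sketched as a short command sequence (the hostnames, IP address, and Hadoop install path are illustrative, and the slaves file location varies by Hadoop version):

```shell
# On the master: allow passwordless SSH to the new node
ssh-copy-id hadoop@192.168.1.102

# On the master: register the new node
echo "192.168.1.102 slave3.in slave3" | sudo tee -a /etc/hosts
echo "slave3.in" >> $HADOOP_HOME/etc/hadoop/slaves

# On the new node: start the DataNode and verify it is running
ssh hadoop@192.168.1.102
./bin/hadoop-daemon.sh start datanode
jps   # should list a DataNode process
```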