Big data analytic services provider explains how to load data into Hive partition column using Pig. The experts are introducing Pig technology as well and also telling the way to use it for loading data into Hive. Read further to know more what they say. What exactly is Big Data? As per one of the analysis we create approximately 2.5 quintillion bytes of data everyday – the pace at which data is growing can be summed up by saying that 80% - 85% of the whole data has been created in the last 2 years. It has volume, velocity and variety and IBM adds the latest V – Veracity to it, this much data is process for useful insight and decision making. Today, we gather data from everywhere, whether it can be sensor data, cell phone data, stock exchange data, social media data and transactional data to name a few. Above all stuff can be named as Big Data. And to process this much amount of data we have Hadoop framework in place where we can ingest the data, process the data and schedule it. Processing of data which includes transformation/cleaning/profiling of the raw data leads to the refined data which can be useful for further analytics to extract meaningful information for the organizations, for processing of data there are 2 main tools in Hadoop framework – Hive and Pig.

Technology: Pig is mainly used to clean/filter the raw or unstructured data and Hive is used to query the data which we have stored in DWH or Data Lake, both (Hive and Pig) of them go hand on hand and interact with each other whenever required. Partitioning in Hive and loading data into Hive table after cleaning or profiling data from Pig are very common trends but we still have few limitations when we try to merge both the methods. Use-case: Below is the sample data screen shot which we will use to load into Hive table.

Create Hive table named emp_tab.

Load the data into the Hive table which we have created above.

Created another Hive table but with partitioning column name – ‘part’.

Setting Dynamic partitioning conditions to TRUE.

Below is the output of the partitioning table, one can see data partition column also.

We have taken sample data to load it into Pig, which would be further used to move into Hive table.

Enter into Pig with HCatalog option.

Load the data into Pig relation ‘A’ from the HDFS path.

Appending the above stored data from Pig to the Hive table – emp_tab (non-partitioned).

Below screenshot shows that data has been successfully added in Hive table with 107 and 108 as id’s.

Issue: After successfully loading data from Pig relation to Hive table (non-partition), we thought it would behave the same for the non-partition table. So wetried to load the data into partition table – emp_tab_part from Pig. To achieve the above requirement we have to add another column in Pig relation. We will create another relation – ‘B’ and add the column name – ‘part’ into it and store the data into ‘emp_tab_part’ table traditionally.

But it failed with below error.

Resolution: After few research, we found that we cannot load data into Hive partition column from pig relation dynamically. If we still have to load the data statically then we have to follow the below steps. Below are the partitions sub-directories under the table directory ‘emp_tab_part’.

We have separate out the pig data according to the partition column placed in Hive table. In our example, partition column in based on year so we will put record with year 1987 in one relation (B_1987) and record with year 1988 under another relation (B_1988). Then generated 4th column with name ‘part’ with the year column.

Now use the STORE command to load the data into 1st partition (part=1987) from B_part1 but make sure to add the name of the partition directory (in HDFS) in which you want to load the data.

Follow same steps for the other partition – 1988.

Data has been successfully loaded into partition Hive table.