Migrating data from Hadoop to Google BigQuery is a fairly straightforward process. DistCP is usually leveraged to push or pull data from the on premises Hadoop cluster to Cloud Storage. While this works for the vast majority of situations, special care must be taken when migrating data from Hive transactional tables. The way Hive manages the underlying data is fundamentally different than traditional immutable tables and special care must be exercised. This blog will explore the impact the Hive transactional tables have on the migration process and identify the steps to take to avoid them.

Hive ACID in action, and why it’s problematic for BigQuery migrations

Before Hive version 0.14.0, Hive tables were immutable. Any updates or deletes that needed to be applied to tables typically involved recreating the tables with the changes applied, a time-consuming and computationally expensive process. Starting with version 0.14.0, Hive added the ability to support transactions that accept updates and delete DML operations with ACID support. This opened the door to many new use cases and dramatically simplified the process of updating data in Hive.

In the following example, the TPC-H dataset has been generated and stored in HDFS and will be used to demonstrate nuances of ACID tables. The data has been loaded into Hive and is stored as an ORC file with ACID transactions enabled. The below image displays the row counts for the lineitem table in Hive, containing around 60 million records.

A quick examination of the underlying directory structure for the table shows the following layout in HDFS.

There is a single delta directory holding 12 different ORC files, along with a file that describes the ORC ACID version (file _orc_acid_version) that was used to load the data. When DML operations are applied to the table, additional delta file directories will be created to manage the insert, update, and delete operations.

To demonstrate how these operations change the underlying directory structure and generate new delta files, we’ll apply a simple SQL update to 15,000,562 rows, like this:

update lineitem set l_shipinstruct=’N/A’ where l_shipinstruct=’NONE’;

After the update, there are now three distinct delta directories: one for newly inserted data, one for deletes (insert and delete operation is an update), and the original directory. Looking closely, the original directory has the same file sizes as it did prior to the update, so no data was modified in place.

In order for Hive to read data in this table, it must read the data in all three of these directories. It scans the delete_delta directory to identify the delete records, then scans the base directory, skipping any records that match with deletes. Finally, it looks at the recent inserts and merges the two to give a consistent view of the data. As more and more updates are made to the table, the more overhead there is to collate the current view of the data.

BigQuery only knows how to process data loaded to it with a compatible file format. If this data were loaded into BigQuery, the result would be unusable. Below is a simple count(*) query looking at the number of records loaded into BigQuery after a DistCP job is used to copy data from from HDFS into Cloud Storage.

BQ Count

The original lineitem table in Hive had around 60 million records. Now, that number has ballooned to nearly 90 million records! Present in this count are the deletes as well as the inserts for the update operation. The table originally was loaded with 59,986,052 records and 15,000,562 were updated. If we account for the number of deletes being equivalent to the number of inserts for the update operation, there are 89,987,176 total records. This matches the record count present in the Hive table.

A simple query on the table reveals some interesting results. There are several fields that weren’t originally present in our Hive DDL. Columns like operation, originalTransaction, bucket, rowid, and currentTransaction are metadata fields that Hive Metastore uses to keep track of the DML operations. Think of it as Hive’s internal primary key for transactional tables. These fields are hidden in Hive but still present in the physical ORC file schema. Since BigQuery uses the embedded ORC schema to build the table DDL, these metadata fields will be visible.

The simple select statement also shows many fields with null for all the table’s columns. This is because these records indicate a delete operation. In fact, the operation field identifies which DML operation created the record. 0 identifies an insert, 1 identifies an update, and 2 identifies a delete operation. Hive uses all these metadata fields to construct what records to include/exclude when assembling the current view of a table.

To see the full effect of this, the following query will identify a single record that had an update applied, along with the resulting output of that operation.

select operation, originalTransaction, bucket,rowid, currentTransaction, row.l_shipinstruct,row.l_orderkey, row.l_partkey from tpch.lineitem10g where bucket=537395200 and rowid=78644 order by originalTransaction asc, bucket asc, rowid asc, currentTransaction desc

A single update yielded three records! How can BigQuery identify the current record among all this unnecessary data? You could write an OLAP function to get the most recent record, but that would be complex and require additional DML operations to clean up the data. This problem is best solved at the source using native hive tools and is the recommended approach when migrating from Hive ACID tables.

Using Hive compactions

Over time, it will become inefficient for Hive to work with tables with a large amount of DML operations. Query latencies, along with the number of small files for a given table, will increase, resulting in unnecessary I/O. To combat this, Hive will run compactions that consolidate these delta files into a more manageable file set. These compactions can be executed manually or automatically by Hive when a delta size threshold has been met. When a compaction runs, there are two variants that can be executed. Minor compactions attempt to compact all the delta files into a single delta file within a given bucket. A major compaction, however, consolidates all delta files in the current base file into a new base file per bucket. It is recommend to run a major compaction for migrating data to BigQuery. Minor compactions, while helpful, don’t consolidate the data to a level that is required for migrations to BigQuery.

Pro tip: Before running Hive compactions, work with your Hadoop administrator to schedule running the compaction operation. Depending on the size of the table, major compactions can be very resource-intensive, so take caution when running on production systems. Also ensure that the loads to the Hive table are stopped prior to the compaction and migration. Any new data added post-compaction would result in the creation of new delta files.

In the example below, a major compaction is run on the table that previously had updates applied.

Running a compaction creates a MapReduce job that compacts the table in the background. In Hive, running the command show compactions can display the running compaction jobs in the background.

Once the compaction has been completed, the table directory should look much as it was prior to the first update.

When the data is reloaded into BigQuery, the results will now match the row count totals in Hive. The compaction process eliminated all the ancillary delta data and created a single consolidated file set. Now you can migrate Hive ACID tables confidently, without fear of mismatching row counts.

Pro tip: Once the data has been moved to Cloud Storage, the _metadata_acid and _orc_acid_version files will need to be deleted prior to loading into BigQuery. Instead of deleting the files one by one, you can automate the process with gsutil. Gsutil can recursively list the objects in a bucket, and you can grep the file names and direct the output to a file. Below, the command is grepping the _metadata_acid objects and saving the object path into an output file.

gsutil ls -r gs://[bucket name]/tpch | grep _metadata_acid >> input.dat

The input file can be directed to stdin and fed to gsutil to delete the objects.

cat input.dat | gsutil -m rm -I

Planning Hadoop migrations to BigQuery

This section is primarily for Hadoop clusters running the Hortonworks distribution. Hortonworks was the only Hadoop distribution to support ACID transactions.

Clusters running HDP 2.6.x or HDP 3.x Hadoop distributions need to pay close attention to whether or not the tables being migrated have transactions enabled. Determining whether or not a table is transactional depends on your HDP release version. If the cluster is running HDP 2.6.x, first check to see if transactions are enabled in Ambari. Clusters running HDP 3.x have ACID transactions enabled by default, so if you have an ORC table, the table is likely transactional. This will save time combing through DDL when it isn’t necessary. If transactions are enabled, check the table DDL to determine which tables have transactions. To identify tables with transactions enabled, look for the table property “transactions=true.” Finally, check to see if the data is even updated. While transactions may be enabled, the table may not be receiving updates or deletes and therefore won’t need a major compaction. Understanding the use of the table will go a long way in scoping out the migration and reducing the amount of effort to migrate the data to BigQuery.

For more information on migrating your Hadoop infrastructure to Google Cloud, check out these links for more information: