Over half a billion videos are watched on JW Player video player every day resulting in about 7 billion events a day which generates approximately 1.5 to 2 terabytes of compressed data every day. We, in the data team here at JW Player, have built various batch and real time pipelines to crunch this data in order to provide analytics to our customers. For more details about our infrastructure, you can look at JW at Scale and Fast String Matching. In this post, I am going to discuss how we got Hive with Tez running in our batch processing pipelines.

All of our pipelines run on AWS and a significant portion of our daily batch pipelines code is written in Hive. These pipelines run from 1 to 10 hours every day to clean and then aggregate this data. We have been looking ways to optimize these pipelines.

Tez

We recently came across Tez, which is a framework for YARN-based data processing applications in Hadoop. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a complex dataflow graph. One can run any Hive or Pig job using the Tez engine without any modification to the code. We have been reading a lot about how Tez can help boost performance and hence we decided to give it a try and see how it could potentially fit in our pipelines.

Prerequisites for running Tez with Hive:

Hadoop Cluster with YARN framework

Hive-0.13.1 or later version

Tez installed and configured on hadoop cluster

Experiments

We used Boto project's Python API to launch Tez EMR cluster. Amazon has open sourced Tez bootstrapping script (though it’s an older version) and we used it to bootstrap Tez on EMR cluster for our initial testing. Below is a sample script to launch EMR cluster with Tez:

https://gist.github.com/rohitgarg/9c8071fc7c9eef6de879

After installing and configuring Tez on EMR cluster, one can easily run Hive tasks using Tez query engine by just setting the following property before running the query:

SET hive.execution.engine=tez;

We ran 2 tests to see if Tez can run on a large dataset like ours:

Test 1

Data Size of test task: 930 MB

Cluster Composition: m1.medium master node, 3 r3.8xlarge core nodes, 4 r3.8xlarge task nodes

Memory: 1.7 TB

CPU core’s: 224

Test Hive Query: FROM table1 INSERT OVERWRITE TABLE table2 PARTITION (day={DATE}) SELECT column1, column2, COUNT(DISTINCT column4), SUM(column5), SUM(column3), SUM(column6) WHERE day={DATE} AND an_id IS NOT NULL AND column2 IS NOT NULL GROUP BY column1, column2;

The first test was successful and the query ran just fine without changing any default Tez settings.

Now, we wanted to test Tez query engine against a larger dataset since many of our Hive tasks run on hundreds of gigabytes of data (sometimes terabytes) every day.

Test 2

Data Size of test task: 2.5 TB

Cluster Composition: m1.xlarge master node, 4 r3.8xlarge core nodes, 2 r3.8xlarge task nodes

Memory: 1.3 TB

CPU cores: 192

Test Hive Query: INSERT OVERWRITE TABLE table2 PARTITION(day) SELECT column1, column2, column3, column4, column5, SUM(column6), SUM(column7), SUM(column8), SUM(column9), SUM(column10), day FROM table1 WHERE day={DATE} GROUP BY column1, column2, column3, column4, column5, day;

This query was failing initially with out of memory errors. At this point, we started poking around default Tez settings to figure out best way to redistribute memory among Yarn containers, application masters and tasks.

We modified the following Tez’s properties to get rid of memory errors:

tez.am.resource.memory.mb: amount of memory to be used by the AppMaster

tez.am.java.ops: AppMaster java process memory size

tez.am.launch.cmd-opts: Command line options that are provided during the launch of the Tez AppMaster process

tez.task.resource.memory.mb: The amount of memory to be used by launched tasks

tez.am.grouping.max-size: Upper size limit (in bytes) of a grouped split, to avoid generating an excessively large split

hive.tez.container.size: size of Tez container

hive.tez.java.opts: tez java process memory size

By default, hive.tez.container.size = mapreduce.map.memory.mb and hive.tez.java.opts = mapreduce.map.java.opts

Below is an illustration for various Tez memory settings:

We used the following formulae to guide us in determining YARN and MapReduce memory configurations:

Number of containers = min (2 * cores, 1.8 * disks, (Total available RAM) / min_container_size)

Reserved Memory = Memory for stack memory

Total available RAM = Total RAM of the cluster - Reserved Memory

Disks = Number of data disks per machine

min_container_size = Minimum container size (in RAM). Its value is dependent on RAM available

RAM-per-container = max(min_container_size, (Total Available RAM) / containers)

For example, for our cluster, we had 32 CPU cores, 244 GB RAM, and 2 disks per node.

Reserved Memory = 38 GB

Container Size = 2 GB

Available RAM = (244-38) GB = 206 GB

Number of containers = min (2*32, 1.8* 2, 206/2) = min (64,3.6, 103) = ~4

RAM-per-container = max (2, 206/4) = max (2, 51.5) = ~52 GB

We used the above calculations just to get some idea and tried out different configuration settings.

One such setting we tried is:

set tez.task.resource.memory.mb=10000;

set tez.am.resource.memory.mb=59205;

set tez.am.launch.cmd-opts =-Xmx47364m;

set hive.tez.container.size=59205;

set hive.tez.java.opts=-Xmx47364m;

set tez.am.grouping.max-size=36700160000;

The query finally ran after tweaking the above properties.

Conclusion

Hopefully, this post will help you get started with running Tez on EMR. We still have to benchmark its performance against Hive and see if it makes sense for us to use Tez. If anyone has any comments or suggestions, you can contact me at rohit@jwplayer.com or Linkedin. Last but not least, I would like to thank Tez core team for bringing such a good product to life.