We demonstrate two semantically equivalent HiveQL queries that use Amazon DynamoDB's storage handler but compile to MapReduce jobs with vastly different performance. The combination of Amazon DynamoDB, Amazon Elastic MapReduce (Amazon EMR) and Apache Hive is a useful one for storing and analyzing large amounts of data.

We can use Amazon DynamoDB to store time-stamped information across several applications by creating a table, myData, with the following structure. We set the primary key type to hash and range: the hash attribute is named appId with type number, and the range attribute is named ts (a Unix timestamp), also with type number. We can then store any number of items with an arbitrary set of key-value pairs, indexed by an appId and a ts. If we want to store more than one item with the same appId and ts, we can use time-based UUIDs instead of plain Unix timestamps.

In the following example we set the read and write provisioned throughputs to one unit each. We populate the table using a Node.js script. The script sequentially adds 1000 items to the table. Each item contains the two required attributes: appId, which ranges from 0 to 9, and ts, which ranges from 0 to 99 for each appId. The script warns when it exceeds the table's provisioned write throughput, automatically backing off and retrying to ensure that all items are properly recorded.
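The original population script is Node.js; the loop it runs can be sketched in Python as follows. This is a minimal sketch, not the original script: put_item stands in for the actual DynamoDB write call, and RuntimeError stands in for a throughput-exceeded error.

```python
import time

def populate(put_item, app_ids=10, timestamps=100, max_retries=5):
    """Sequentially write app_ids * timestamps items, backing off and
    retrying whenever put_item raises (a stand-in for a DynamoDB
    throughput-exceeded error). Returns the number of items written."""
    written = 0
    for app_id in range(app_ids):
        for ts in range(timestamps):
            delay = 1.0
            for attempt in range(max_retries):
                try:
                    put_item({"appId": app_id, "ts": ts})
                    written += 1
                    break
                except RuntimeError:
                    time.sleep(delay)  # back off before retrying
                    delay *= 2         # exponential backoff
    return written
```

With ten appIds and one hundred timestamps each, the loop writes the 1000 items described above.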

Now we spin up an Amazon EMR cluster, myCluster, to perform the analyses. We choose "Hive program" for the job flow and "Start an Interactive Hive Session". We leave all other options unchanged: the master instance type is small (m1.small), the core instance count is two, the core instance type is small (m1.small), and the task instance count is zero.

When the cluster has started, we SSH into the master node and start an interactive Hive session. We connect to myData as follows:

CREATE EXTERNAL TABLE myData (appId BIGINT, ts BIGINT)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  'dynamodb.table.name' = 'myData',
  'dynamodb.column.mapping' = 'appId:appId,ts:ts',
  'dynamodb.throughput.read.percent' = '1.0'
);

We can count the total number of items in the table using:

SELECT count(1) FROM myData;

This correctly returns 1000 and takes approximately 1045 seconds to complete. The bottleneck is the table's provisioned read throughput: it takes approximately 1000 seconds to read the 1000 items, and the remaining time is spent compiling the HiveQL statement into MapReduce jobs and submitting the jobs to the cluster for execution. If we want to improve the query's performance we can increase the provisioned read throughput.
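The timing above is easy to model. A back-of-the-envelope sketch, assuming each item consumes one read capacity unit and taking the fixed overhead (job compilation and submission) to be roughly 45 seconds, which is our guess from the observed 1045-second run:

```python
def scan_time_estimate(items, read_units, read_percent=1.0, overhead=45):
    """Rough wall-clock estimate for a Hive query against DynamoDB:
    items read divided by the effective read rate (provisioned units
    scaled by dynamodb.throughput.read.percent), plus a fixed overhead
    for compiling and submitting the MapReduce jobs (assumed ~45 s)."""
    return items / (read_units * read_percent) + overhead

# Full-table count: 1000 items at 1 provisioned read unit.
print(scan_time_estimate(1000, 1))  # close to the 1045 s we observed
```

The same model predicts roughly 145 seconds for a query that reads only 100 items, in line with the single-appId timing below.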

Suppose we only want to analyze a subset of the data in our table. Of our ten appIds, maybe only some are active at any one time, and we only want to count the items with a specified appId. We can use the following:

SELECT count(1) FROM myData WHERE appId = 0;

This correctly returns 100 and takes approximately 143 seconds to complete, which is reasonable: the hash key for the table is appId, so we only need to read 100 items from the table, but of course we incur a similar penalty as before for compiling the HiveQL statement and submitting the jobs to the cluster.

What happens if we have more than one active application?

SELECT count(1) FROM myData WHERE appId = 0 OR appId = 9;

This correctly returns 200 but takes approximately 1037 seconds to complete! The query scans the entire table, even though the attribute or column in the predicate is indexed.

What about the following query?

SELECT sum(value) FROM (
  SELECT count(1) AS value FROM myData WHERE appId = 0
  UNION ALL
  SELECT count(1) AS value FROM myData WHERE appId = 9
) u;

This also returns 200 but takes approximately 327 seconds to complete.

Amazon DynamoDB's storage handler (org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler) uses a DynamoDB query when there is only one condition on the hash key in the WHERE clause. However, it uses a scan when there is more than one condition. This implementation is defined in hive-bigbird-handler.jar on the Amazon EMR cluster. Unfortunately, Amazon hasn't released the source code for the storage handler, and it is against Amazon's rules to decompile this file (see 8.5 License Restrictions).
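The behaviour we observed can be summarised in a toy model. This is our reconstruction from the timings above, not the actual handler logic: a single equality condition on the hash key triggers a DynamoDB query that reads only the matching partition, while anything else falls back to a full table scan.

```python
TOTAL_ITEMS = 1000       # 10 appIds x 100 timestamps each
ITEMS_PER_APP_ID = 100

def items_read(hash_key_values):
    """Items read from DynamoDB for one Hive query, in this toy model:
    exactly one condition on the hash key -> DynamoDB query (reads only
    that partition); more than one condition -> full table scan."""
    if len(hash_key_values) == 1:
        return ITEMS_PER_APP_ID
    return TOTAL_ITEMS

print(items_read([0]))                    # WHERE appId = 0: 100 items
print(items_read([0, 9]))                 # WHERE appId = 0 OR appId = 9: 1000 items
print(items_read([0]) + items_read([9]))  # UNION ALL of two queries: 200 items
```

The item counts track the timings: the OR predicate reads the whole table (about 1037 s), while the UNION ALL reads only 200 items but pays the compilation and submission overhead of two extra MapReduce jobs (about 327 s in total).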

The HiveQL language abstracts away the procedural steps of a series of MapReduce jobs, allowing us to focus on what we want to retrieve and/or manipulate. However, this means that statements that are semantically equivalent may compile to MapReduce jobs with vastly different performance. Of course, all non-trivial abstractions are, to some degree, leaky.