A step-by-step guide on how to proceed with Twitter analytics tasks using Elastic MapReduce, DynamoDB and Amazon Data Pipeline.

In this post I will use the Flume agent configured in the previous post to deliver raw JSON data to S3 storage. By Twitter analytics I mean aggregations like “Top 100 users mentioned per day” and “Top 100 URLs mentioned per day”.

Overall, a similar process is described in the Amazon blog, but here I will cover a few more details related to the Twitter data.

Example of EMR cluster configuration:
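(A rough sketch of the settings I use; they match the create-cluster CLI command at the end of this post, and any similarly small Hive cluster will do.)

Release label:   emr-4.6.0
Applications:    Hive
Instance type:   m1.medium
Instance count:  3 (1 master + 2 core)
EC2 key pair:    YOUR_KEYPAIR_NAME
Logging:         s3://S3_BUCKET_NAME/logs/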

Later in the post I will use the AWS CLI tool to create the cluster and deploy jobs to it, so it is good to prepare it as well. For the initial configuration of the AWS CLI run the following command:

aws configure
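The command prompts interactively for credentials and defaults; a typical session looks roughly like this (the values are placeholders):

$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: us-east-1
Default output format [None]: json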

You can also follow the instructions provided here if you have any issues with the CLI configuration.

Once you are done with the test cluster and the AWS CLI configuration, we can proceed with the actual data processing. In this post I will create the following things:

Hive metadata for an S3 external table and a DynamoDB external table

Hive script to load aggregated data from S3 to DynamoDB on a daily basis

CMD and bash scripts to deploy and run everything on the EMR cluster with bootstrap actions and custom cluster steps

So, let’s prepare everything step by step.

Hive tables

Our Twitter agent delivers the data in JSON format. To read it from a Hive table we need a JSON SerDe library. Here is the most commonly used one: https://github.com/rcongiu/Hive-JSON-Serde. It is not shipped with Hive by default, so we need to compile it and put it into the deployment folder.
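A minimal build sketch, assuming Git and Maven are installed (the module layout, Maven profile and resulting jar name can differ between SerDe versions, so treat the paths below as assumptions):

git clone https://github.com/rcongiu/Hive-JSON-Serde.git
cd Hive-JSON-Serde
mvn package
# copy the "jar-with-dependencies" artifact into the local deployment folder (path is an assumption)
cp json-serde/target/json-serde-*-jar-with-dependencies.jar ../emr_deployment/hivelibs/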

create_tables.hql

CREATE EXTERNAL TABLE twitter_publicstream (
  coordinates string,
  retweeted BOOLEAN,
  source string,
  entities STRUCT<
    trends: string,
    symbols: string,
    urls: ARRAY<STRUCT<expanded_url: string, indices: ARRAY<INT>, display_url: string, url: string>>,
    hashtags: ARRAY<STRUCT<text: string, indices: ARRAY<INT>>>,
    user_mentions: ARRAY<STRUCT<id: BIGINT, name: string, indices: ARRAY<INT>, screen_name: string, id_str: string>>
  >,
  favorite_count INT,
  in_reply_to_status_id_str string,
  geo STRUCT<type: string, coordinates: string>,
  id_str string,
  in_reply_to_user_id string,
  timestamp_ms string,
  truncated BOOLEAN,
  text string,
  retweet_count INT,
  id string,
  in_reply_to_status_id string,
  possibly_sensitive BOOLEAN,
  filter_level string,
  created_at string,
  place STRUCT<
    id: string,
    place_type: string,
    bounding_box: STRUCT<type: string, coordinates: ARRAY<ARRAY<ARRAY<DOUBLE>>>>,
    name: string,
    attributes: string,
    country_code: string,
    url: string,
    full_name: string,
    country: string
  >,
  favorited BOOLEAN,
  lang string,
  contributors string,
  in_reply_to_screen_name string,
  in_reply_to_user_id_str string
)
PARTITIONED BY (day_key INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://S3_BUCKET_NAME/flume/publicstream';

Here I created a table with all possible fields from Twitter. In fact this is not necessary, and you can create the table with only the fields needed for your processing.

The second table should be mapped to DynamoDB:

create_tables.hql

CREATE EXTERNAL TABLE twitter_analytics_dynamodb (
  day_key bigint,
  mention_screen_name string,
  cnt bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "twitter_analytics",
  "dynamodb.column.mapping" = "day_key:day_key,mention_screen_name:mention_screen_name,cnt:cnt"
);

The target DynamoDB table, twitter_analytics, stores the day_key, mention_screen_name and cnt attributes.
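If the table does not exist yet, a sketch of creating it via the AWS CLI could look like this (the key schema, with day_key as the hash key and mention_screen_name as the range key, and the throughput values are my assumptions; adjust them to your setup):

aws dynamodb create-table \
    --table-name twitter_analytics \
    --attribute-definitions AttributeName=day_key,AttributeType=N AttributeName=mention_screen_name,AttributeType=S \
    --key-schema AttributeName=day_key,KeyType=HASH AttributeName=mention_screen_name,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10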

Aggregation script

Add the existing S3 partitions to the Hive metadata:

aggregate_script.hql

msck repair table twitter_publicstream;
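msck repair only discovers partitions that are laid out as Hive-style key=value prefixes, so this assumes the Flume sink writes into paths like the following (the file names are purely illustrative):

s3://S3_BUCKET_NAME/flume/publicstream/day_key=20160101/FlumeData.1451649600000
s3://S3_BUCKET_NAME/flume/publicstream/day_key=20160102/FlumeData.1451736000000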

Load the aggregated data from S3 to DynamoDB (Top 100 users mentioned per day):

aggregate_script.hql

insert into table twitter_analytics_dynamodb
select
  day_key,
  user_mention.screen_name as mention_screen_name,
  count(*) as cnt
from twitter_publicstream
  lateral view explode(entities.user_mentions) mentions1 as user_mention
where day_key > '${hiveconf:LAST_SUCCESS_DAY}'
group by day_key, user_mention.screen_name
order by cnt desc
limit 100;
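To smoke-test the script on the cluster master node before wiring it into EMR steps, you can run it directly with the Hive CLI (a sketch; the local file path is an assumption):

hive --hiveconf LAST_SUCCESS_DAY=20160101 -f aggregate_script.hql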

Cluster deployment scripts

Let’s create a bootstrap script for the EMR cluster:

bootstrap.sh

#!/bin/sh
# put json serde lib to hive classpath
sudo aws s3 cp s3://S3_BUCKET_NAME/emr_deployment/hivelibs /usr/lib/hive/auxlib/ --recursive

It is required to create the Hive tables and execute queries over the raw JSON data in S3 storage.

After all that, my local deployment folder looks like this:

emr_deployment/
  hivelibs/
    json-serde-1.1.9.9-Hive13-jar-with-dependencies.jar
  hivescripts/
    aggregate_script.hql
    create_tables.hql
  bootstrap.sh

Let’s put it into S3:

aws s3 cp emr_deployment s3://S3_BUCKET_NAME/emr_deployment/ --recursive
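To double-check that everything landed where the bootstrap script and the steps expect it, list the uploaded prefix:

aws s3 ls s3://S3_BUCKET_NAME/emr_deployment/ --recursive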

Create cluster and submit job

Almost there. To submit a job to the cluster we need to prepare steps (units of work for EMR).

Prepare a step to create the Hive tables:

Type=HIVE,Name=create_tables,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,"$s3_path/hivescripts/create_tables.hql"]

Create the aggregation step:

Type=HIVE,Name=aggregate_data,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,"$s3_path/hivescripts/aggregate_script.hql",--hiveconf,LAST_SUCCESS_DAY=20160101]

Pay attention to the substitution value for the LAST_SUCCESS_DAY variable. It is set to 20160101 just for cluster testing; it will be replaced with real values in Amazon Data Pipeline.

Finally, the command to set up and run the EMR cluster:

export s3_root=s3://S3_BUCKET_NAME
export s3_path=$s3_root/emr_deployment

aws emr create-cluster --name "Twitter Analytics" --use-default-roles \
  --release-label "emr-4.6.0" --instance-type m1.medium --instance-count 3 \
  --applications Name=Hive --ec2-attributes KeyName=YOUR_KEYPAIR_NAME \
  --bootstrap-action Path="$s3_path/bootstrap.sh" \
  --steps Type=HIVE,Name=create_tables,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,"$s3_path/hivescripts/create_tables.hql"] \
          Type=HIVE,Name=aggregate_data,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,"$s3_path/hivescripts/aggregate_script.hql",--hiveconf,LAST_SUCCESS_DAY=20160101] \
  --log-uri $s3_root/logs/ --auto-terminate

It is important to specify --auto-terminate, since it is turned off by default for clusters created from the CLI. If you omit this flag, your cluster will keep running after all steps are executed. I’ve written this command for Linux; just put it all on a single line (without the line continuations) if you want to run it on Windows.
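While the cluster is running, you can also watch its state and step progress from the CLI (the cluster id below is a placeholder; create-cluster prints the real one):

aws emr list-clusters --active
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX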

This is how my DynamoDB table looks after the cluster terminates:
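A quick way to spot-check the loaded rows without opening the console is a small scan (the output shape will vary):

aws dynamodb scan --table-name twitter_analytics --max-items 5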

In Part 2 of this post I will cover the Data Pipeline configuration to schedule this processing on a daily basis.