Preparing the Data

With our Hadoop cluster up and running, we can move the reddit comment data from Amazon S3 to HDFS.

We can SSH into the head node of the cluster and run the following command with valid AWS credentials, which will transfer the reddit comment data (975 GB of JSON data) from a public Amazon S3 bucket to the HDFS data store on the cluster:

hadoop distcp s3n://{{ AWS_KEY }}:{{ AWS_SECRET }}@blaze-data/reddit/json/*/*.json /user/ubuntu

The time required to move the data from Amazon S3 to HDFS was about 1 hour and 45 minutes.
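Before loading anything into Hive, it is worth a quick check that the files actually landed in HDFS; for example (using the same destination path as above):

$ hadoop fs -du -s -h /user/ubuntu
$ hadoop fs -ls /user/ubuntu | head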

We can run the following command on the head node to download the JSON serializer/deserializer (SerDe) module to enable Hive to read and write in JSON format:

$ wget http://s3.amazonaws.com/elasticmapreduce/samples/hive-ads/libs/jsonserde.jar

On the head node, we open the interactive Hive shell using the hive command, load the JSON SerDe module, create a table, and load the data into Hive:

$ hive

hive> ADD JAR jsonserde.jar;

hive> CREATE TABLE reddit_json (
    archived boolean,
    author string,
    author_flair_css_class string,
    author_flair_text string,
    body string,
    controversiality int,
    created_utc string,
    distinguished string,
    downs int,
    edited boolean,
    gilded int,
    id string,
    link_id string,
    name string,
    parent_id string,
    removal_reason string,
    retrieved_on timestamp,
    score int,
    score_hidden boolean,
    subreddit string,
    subreddit_id string,
    ups int
)
ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ('paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups');

hive> LOAD DATA INPATH '/user/ubuntu/*.json' INTO TABLE reddit_json;

The time required to load the data into Hive was less than 1 minute.
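As an optional sanity check before running anything heavier, we can confirm the schema and look at a few rows (the column choices below are just illustrative):

hive> DESCRIBE reddit_json;
hive> SELECT author, subreddit, created_utc FROM reddit_json LIMIT 5;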

Once the data is loaded into Hive, we can query it with SQL statements such as SELECT count(*) FROM reddit_json;. However, responses will be fairly slow because the data is stored as JSON text, which must be parsed on every query.
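For example, a simple count and an illustrative per-subreddit aggregation both force a full scan and parse of the JSON text:

hive> SELECT count(*) FROM reddit_json;
hive> SELECT subreddit, count(*) AS num_comments FROM reddit_json GROUP BY subreddit ORDER BY num_comments DESC LIMIT 10;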

Alternatively, we can migrate the data to Parquet format. Apache Parquet is a columnar storage format designed for the Hadoop ecosystem that performs very well for many query workloads.

We can use the following commands in the interactive Hive shell to create a new table and convert the data to Parquet format:

hive> CREATE TABLE reddit_parquet (
    archived boolean,
    author string,
    author_flair_css_class string,
    author_flair_text string,
    body string,
    controversiality int,
    created_utc string,
    distinguished string,
    downs int,
    edited boolean,
    gilded int,
    id string,
    link_id string,
    name string,
    parent_id string,
    removal_reason string,
    retrieved_on timestamp,
    score int,
    score_hidden boolean,
    subreddit string,
    subreddit_id string,
    ups int,
    created_utc_t timestamp
)
STORED AS PARQUET;

hive> SET dfs.block.size=1g;

hive> INSERT OVERWRITE TABLE reddit_parquet
    select *, cast(cast(created_utc as double) as timestamp) as created_utc_t
    FROM reddit_json;

The time required to convert the data to Parquet format was about 50 minutes.
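A simple way to verify the conversion is to compare the row count of the new table against the count obtained from reddit_json earlier; on the Parquet-backed table the query should also return much faster:

hive> SELECT count(*) FROM reddit_parquet;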

Note that we added a new column in timestamp format (created_utc_t) based on the original created_utc column. The original column stores the Unix timestamp as a string of digits, so we first cast it to a double and then cast the resulting double to a timestamp.
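To spot-check the conversion logic, the same cast can be run on a handful of rows of the JSON table:

hive> SELECT created_utc, cast(cast(created_utc as double) as timestamp) FROM reddit_json LIMIT 5;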

Finally, we SSH into one of the compute nodes and execute the following command from the interactive Impala shell to refresh the table metadata from the Hive metastore.
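With the standard impala-shell client, this is typically done with the INVALIDATE METADATA statement, which makes Impala reload table definitions from the Hive metastore:

$ impala-shell
> INVALIDATE METADATA;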