Introduction

Prerequisites



Create a Twitter Application

Extract the Tweets

Define the variables

# Enter the HDInsight cluster name

$clusterName = "chervinehadoop" # Enter the OAuth information for your Twitter application. These information in obtained in part 1 $oauth_consumer_key = ""; $oauth_consumer_secret = ""; $oauth_token = ""; $oauth_token_secret = "";

# Path to save the Tweets on the Azure Blob Storage

$destBlobName = "Tweets/MRUTweets.txt"

# This script gets the tweets containing these keywords.

$trackString = "Mauritius" $track = [System.Uri]::EscapeDataString($trackString);



$lineMax = 24

Connect to an Azure Account

Add - AzureAccount



If the authentication is successful, your ID and subscriptions should be displayed in PowerShell.





Create a Blob Storage where the tweets shall be saved

$myCluster = Get - AzureHDInsightCluster - Name $clusterName $storageAccountName = $myCluster.DefaultStorageAccount.StorageAccountName.Replace( ".blob.core.windows.net" , "") $containerName = $myCluster.DefaultStorageAccount.StorageContainerName $storageAccountKey = get - azurestoragekey $storageAccountName | % {$_.Primary} $storageConnectionString = "DefaultEndpointsProtocol=https;AccountName=$storageAccountName;AccountKey=$storageAccountKey" $storageAccount = [Microsoft.WindowsAzure.Storage.CloudStorageAccount]::Parse($storageConnectionString) $storageClient = $storageAccount.CreateCloudBlobClient(); $storageContainer = $storageClient.GetContainerReference($containerName) $destBlob = $storageContainer.GetBlockBlobReference($destBlobName)

Build the OAuth Connection String to connect to twitter

$oauth_nonce = [System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes([System.DateTime]::Now.Ticks.ToString())); $ts = [System.DateTime]::UtcNow - [System.DateTime]::ParseExact( "01/01/1970" , "dd/MM/yyyy" , $null) $oauth_timestamp = [System.Convert]::ToInt64($ts.TotalSeconds).ToString(); $signature = "POST&" ; $signature + = [System.Uri]::EscapeDataString( "https://stream.twitter.com/1.1/statuses/filter.json" ) + "&" ; $signature + = [System.Uri]::EscapeDataString( "oauth_consumer_key=" + $oauth_consumer_key + "&" ); $signature + = [System.Uri]::EscapeDataString( "oauth_nonce=" + $oauth_nonce + "&" ); $signature + = [System.Uri]::EscapeDataString( "oauth_signature_method=HMAC-SHA1&" ); $signature + = [System.Uri]::EscapeDataString( "oauth_timestamp=" + $oauth_timestamp + "&" ); $signature + = [System.Uri]::EscapeDataString( "oauth_token=" + $oauth_token + "&" ); $signature + = [System.Uri]::EscapeDataString( "oauth_version=1.0&" ); $signature + = [System.Uri]::EscapeDataString( "track=" + $track); $signature_key = [System.Uri]::EscapeDataString($oauth_consumer_secret) + "&" + [System.Uri]::EscapeDataString($oauth_token_secret); $hmacsha1 = new - object System.Security.Cryptography.HMACSHA1; $hmacsha1.Key = [System.Text.Encoding]::ASCII.GetBytes($signature_key); $oauth_signature = [System.Convert]::ToBase64String($hmacsha1.ComputeHash([System.Text.Encoding]::ASCII.GetBytes($signature))); $oauth_authorization = 'OAuth ' ; $oauth_authorization + = 'oauth_consumer_key="' + [System.Uri]::EscapeDataString($oauth_consumer_key) + '",' ; $oauth_authorization + = 'oauth_nonce="' + [System.Uri]::EscapeDataString($oauth_nonce) + '",' ; $oauth_authorization + = 'oauth_signature="' + [System.Uri]::EscapeDataString($oauth_signature) + '",' ; $oauth_authorization + = 'oauth_signature_method="HMAC-SHA1",' $oauth_authorization + = 'oauth_timestamp="' + [System.Uri]::EscapeDataString($oauth_timestamp) + '",' $oauth_authorization + = 'oauth_token="' + [System.Uri]::EscapeDataString($oauth_token) + '",' ; $oauth_authorization + = 'oauth_version="1.0"' ; $post_body = [System.Text.Encoding]::ASCII.GetBytes( "track=" + $track);

Read the tweets

Write - Host "signature= " + $signature [System.Net.HttpWebRequest] $request = [System.Net.WebRequest]::Create( "https://stream.twitter.com/1.1/statuses/filter.json" ); $request.Method = "POST" ; $request.Headers.Add( "Authorization" , $oauth_authorization); $request.ContentType = "application/x-www-form-urlencoded" ; $body = $request.GetRequestStream(); $body.write($post_body, 0 , $post_body.length); $body.flush(); $body.close(); $response = $request.GetResponse() ; $memStream = New - Object System.IO.MemoryStream $writeStream = New - Object System.IO.StreamWriter $memStream $sReader = New - Object System.IO.StreamReader($response.GetResponseStream()) $inrec = $sReader.ReadLine() $count = 0 while (($inrec - ne $null) - and ($count - le $lineMax)) { if ($inrec - ne "") { Write - Host "`n`t $count tweets received." - ForegroundColor Yellow $writeStream.WriteLine($inrec) $count + + } $inrec = $sReader.ReadLine() }

$writeStream.Flush() $memStream.Seek( 0 , "Begin" ) $destBlob.UploadFromStream($memStream) $sReader.close()

At this stage, the file has been uploaded to the blob storage in the default container at path Tweets/MRUTweets.txt







The file may also be downloaded to view its contents.









Process the Tweets using Hive

1. Open the Query Console

2. Create Staging table RAW_TWEETS

DROP TABLE IF EXISTS RAW_TWEETS; - - create the raw Tweets table on json formatted twitter data CREATE EXTERNAL TABLE RAW_TWEETS(json_response STRING) STORED AS TEXTFILE LOCATION 'wasb://chervinehadoop@chervinestoragehadoop.blob.core.windows.net/Tweets/' ;

SELECT * FROM RAW_TWEETS;

3. Create table TWEETS

DROP TABLE IF EXISTS TWEETS; CREATE TABLE TWEETS( id BIGINT, created_at STRING, created_at_date STRING, created_at_year STRING, created_at_month STRING, created_at_day STRING, created_at_time STRING, in_reply_to_user_id_str STRING, text STRING, contributors STRING, retweeted STRING, truncated STRING, coordinates STRING, source STRING, retweet_count INT, url STRING, hashtags array<STRING>, user_mentions array<STRING>, first_hashtag STRING, first_user_mention STRING, screen_name STRING, name STRING, followers_count INT, listed_count INT, friends_count INT, lang STRING, user_location STRING, time_zone STRING, profile_image_url STRING, json_response STRING);

4. Load table TWEETS

FROM RAW_TWEETS INSERT OVERWRITE TABLE TWEETS SELECT CAST(get_json_object(json_response, '$.id_str' ) as BIGINT), get_json_object(json_response, '$.created_at' ), CONCAT(SUBSTR (get_json_object(json_response, '$.created_at' ), 1 , 10 ), ' ' , SUBSTR (get_json_object(json_response, '$.created_at' ), 27 , 4 )), SUBSTR (get_json_object(json_response, '$.created_at' ), 27 , 4 ), CASE SUBSTR (get_json_object(json_response, '$.created_at' ), 5 , 3 ) WHEN 'Jan' then '01' WHEN 'Feb' then '02' WHEN 'Mar' then '03' WHEN 'Apr' then '04' WHEN 'May' then '05' WHEN 'Jun' then '06' WHEN 'Jul' then '07' WHEN 'Aug' then '08' WHEN 'Sep' then '09' WHEN 'Oct' then '10' WHEN 'Nov' then '11' WHEN 'Dec' then '12' end, SUBSTR (get_json_object(json_response, '$.created_at' ), 9 , 2 ), SUBSTR (get_json_object(json_response, '$.created_at' ), 12 , 8 ), get_json_object(json_response, '$.in_reply_to_user_id_str' ), get_json_object(json_response, '$.text' ), get_json_object(json_response, '$.contributors' ), get_json_object(json_response, '$.retweeted' ), get_json_object(json_response, '$.truncated' ), get_json_object(json_response, '$.coordinates' ), get_json_object(json_response, '$.source' ), CAST (get_json_object(json_response, '$.retweet_count' ) as INT), get_json_object(json_response, '$.entities.display_url' ), ARRAY( TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[0].text' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[1].text' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[2].text' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[3].text' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[4].text' )))), ARRAY( TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[0].screen_name' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[1].screen_name' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[2].screen_name' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[3].screen_name' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[4].screen_name' )))), TRIM(LOWER(get_json_object(json_response, '$.entities.hashtags[0].text' ))), TRIM(LOWER(get_json_object(json_response, '$.entities.user_mentions[0].screen_name' ))), get_json_object(json_response, '$.user.screen_name' ), get_json_object(json_response, '$.user.name' ), CAST (get_json_object(json_response, '$.user.followers_count' ) as INT), CAST (get_json_object(json_response, '$.user.listed_count' ) as INT), CAST (get_json_object(json_response, '$.user.friends_count' ) as INT), get_json_object(json_response, '$.user.lang' ), get_json_object(json_response, '$.user.location' ), get_json_object(json_response, '$.user.time_zone' ), get_json_object(json_response, '$.user.profile_image_url' ), json_response;





SELECT * FROM TWEETS

5. Create Summarized Hive table for Analysis

i. View top users who are tweeting about Mauritius

DROP TABLE IF EXISTS topusers; - - create the topusers hive table by selecting from the HDISample_Tweets table CREATE TABLE IF NOT EXISTS topusers(name STRING, screen_name STRING, tweet_count INT); INSERT OVERWRITE TABLE topusers SELECT name, screen_name, count( 1 ) as cc FROM TWEETS WHERE UPPER(text) LIKE '%MAURITIUS%' GROUP BY name, screen_name;

ii. View from which region people are tweeting about Mauritius

DROP TABLE IF EXISTS topregion; - - create the topusers hive table by selecting from the HDISample_Tweets table CREATE TABLE IF NOT EXISTS topregion (region STRING, tweet_count INT); INSERT OVERWRITE TABLE topregion select user_location, count( 1 ) from TWEETS group by user_location

SELECT * FROM topregion

Analyze the results in Microsoft Excel

1. Install the Microsoft HDInsight Hive Driver.

2. Configure the Data Source





3. View the result on Excel

Conclusion

Social Media/ Big data analytics much more easier and affordable.

References

Social Media Analytics is one of the major uses of Big Data. This article demonstrates how to use Hive in Azure HDInsight to analyse Twitter feeds.In this article, feeds about "Mauritius" shall be extracted and Analysed using Azure HDInsight.The article explains the whole process of analyzing Tweets using Hive, from gathering the Tweets to viewing them in Microsoft Excel.To create an Application on Twitter. This allows you to use the Twitter APIs, read data from Twitter and also post tweets if required.To create your first Twitter Application1. Go to https://dev.twitter.com/apps 2. Click on Create New App3. Fill in the required details and create the AppOnce you app is created, Navigate totokens in your Application, this is where the keys to read data from Twitter can be obtained.The tweets shall be extracted using a PowerShell script, uploaded to Azure Storage before being processed by Azure HDInsight.Below are the steps to extract the tweets and save them to an Azure Blob storage.In this step, all the variables used in the PowerShell Script is defined.This will open an interface to capture your login credentialsSave the tweets to the blob storageThe below steps describes how to read the Tweets from the Blob Storage and analyse them using Hive on HDInsight.From your Hadoop Cluster, click on Query Console.Fill in your credentials and go to the Hive EditorLoad all the data from the file in table RAW_TWEETS.The location above should point to the same path where the tweets are saved on the Blob Storage.Below are the contents of the table RAW_TWEETS.This is where processed (parsed) JSON twitter data will be stored.Parse the JSON tweets from table RAW_TWEETS and store them into table TWEETS.View the data into the TWEETS tableThe result can also be downloaded and viewed locally.View the data in the topregion table.The driver can be found at the following locationNote: It was noticed that both the 32 and 64 bit versions for it to work.Open theby clickingIf you're prompted for an administrator password or confirmation, type the password or provide confirmation.Under, click add and select,Fill in the required information as below. Of course include your password.a.In Excel, go to the Data tab, From other sources, Microsoft Queryb.Select your data sourcec.Add your required tablesd.View your data in ExcelThis article focused on demonstrating how Twitter feeds can be analysed using Hive and HDInsight. However, the Analysis does not end here this can be enhanced to discover lots of information about the customers of a company.Imagine a company extracting twitter feeds about its products, retrieve the data into it's data warehouse and link the twitter data to its existing customer base, the possibilities of customer information here is of a really high scope, from discovering where customers are having negative views on its products, hence doing more advertising to understanding which customers needs which products.Social Media analytics is definitely crucial to understand customer behaviors nowadays.Moreover, having technologies like HDInsight where everything is managed by Azure and the user just focus on the business aspects makes1. https://azure.microsoft.com/en-us/documentation/articles/hdinsight-analyze-twitter-data/ 2. http://windows.microsoft.com/en-us/windows/using-odbc-data-source-administrator