Getting Twitter Data

load creds.mat
connection = twitter(consumerKey,consumerSecret,accessToken,accessTokenSecret);
response = search(connection,'mathworks','count',100,'lang','en')

response = 
  ResponseMessage with properties:
    StatusLine: 'HTTP/1.1 200 OK'
    StatusCode: OK
        Header: [1×26 matlab.net.http.HeaderField]
          Body: [1×1 matlab.net.http.MessageBody]
     Completed: 0

tweets = cellfun(@(x) extractBefore(string(x.text),50), response.Body.Data.statuses);
tweets(1:3)

ans = 
  3×1 string array
    "listening to NPR and there's a mathworks ad? :thi"
    "RT @MetrowestSTEM: Thank you @MathWorks, Underwri"
    "Basics of eigenvalues and eigenvectors with MIT P"

pctencode = @(str) replace(char(java.net.URLEncoder.encode(str,'UTF-8')),'+','%20');
st = pctencode('from:mathworks')

st = 'from%3Amathworks'
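
The same encoding can be sketched without calling Java directly. This assumes the `urlencode` and `urldecode` helper functions are available in your MATLAB release; the query string is an illustrative assumption, not one used in the original analysis.

```matlab
% Illustrative query (an assumption, not from the original post):
% tweets from one account that also mention a keyword.
query = 'from:mathworks matlab';

% urlencode writes spaces as '+'; swap in '%20', as the Java-based helper does.
encoded = replace(urlencode(query), '+', '%20');

% Round-trip check: decoding should recover the original query.
decoded = urldecode(encoded);
disp(encoded)
disp(decoded)
```

Decoding the result is a cheap way to confirm that nothing in the query was mangled before it is sent to the API.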

response = search(connection,st,'count',100,'lang','en');
tweets = cellfun(@(x) extractBefore(string(x.text),50), response.Body.Data.statuses);
tweets(1:3)

ans = 
  3×1 string array
    "#HardTech Is it possible to add EtherCAT Master t"
    "Uses #drones and kites as #renewable energy sourc"
    "Are We Taking the “U” Out of UX? - What is a UX d"

There Was More to Defining Popularity Than I Expected

data = readtable('someTrumpTweets.csv','TextType','string');
data = table2timetable(data);

subplot(2,1,1)
plot(data.whenTweeted,[data.retweetsNo,data.likesNo],'o')
legend('Number of Retweets','Number of Likes','Location','northwest')
axis('tight')
subplot(2,1,2)
plot(data.retweetsNo,data.likesNo,'o')
xlabel('Number of Retweets')
ylabel('Number of Likes')
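
The scatter plot shows the relationship visually. To put a number on it, one option is the Pearson correlation coefficient. The sketch below uses toy values made up for illustration, since the post does not report the actual coefficient; with the real data you would pass `data.retweetsNo` and `data.likesNo` instead.

```matlab
% Toy values, made up for illustration only (not the author's data).
retweets = [1200; 3400; 5100; 8000; 15000];
likes    = [5000; 14000; 21000; 33000; 60000];

R = corrcoef(retweets, likes);   % 2-by-2 correlation matrix
r = R(1,2);                      % Pearson correlation between the two metrics
fprintf('Correlation between retweets and likes: %.3f\n', r)
```

A coefficient close to 1 would support dropping one of the two variables, since it adds little independent information.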

data = removevars(data,'retweetsNo');

byHour = groupsummary(data(:,'likesNo'),'whenTweeted','hourofday');
figure
bar(0:23,byHour.GroupCount)
xlabel('Hour of the Day')
ylabel('Total Number of Tweets')
title('Number of Tweets By Hour of the Day')

data.wkEnd = isweekend(data.whenTweeted);
boxplot(data.likesNo,data.wkEnd,'Labels',{'Weekday' 'Weekend'})
ylabel('Number of Likes')
title('Weekday vs. Weekend')
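
The like counts compared here were first corrected for collection-time bias by merging the raw pulls and keeping the most recent like count for each tweet. The post does not show that step, so the following is only a minimal sketch of one way to do such a merge. The column names `tweetId` and `pulledAt` and all values are hypothetical.

```matlab
% Two hypothetical pulls; tweet "t2" appears in both with different like counts.
t1 = datetime(2018,1,5,9,0,0);
t2 = datetime(2018,1,8,9,0,0);
pull1 = table(["t1";"t2"], [10;20], [t1;t1], ...
    'VariableNames', {'tweetId','likesNo','pulledAt'});
pull2 = table(["t2";"t3"], [35;5], [t2;t2], ...
    'VariableNames', {'tweetId','likesNo','pulledAt'});

% Stack the pulls and sort so the most recent recording comes first.
allPulls = sortrows([pull1; pull2], 'pulledAt', 'descend');

% Keep the first (that is, latest) row for each tweet.
[~, latestIdx] = unique(allPulls.tweetId, 'stable');
merged = allPulls(latestIdx, :);
disp(merged)
```

Sorting before calling `unique` with the `'stable'` option is what guarantees that the retained row is the latest recording rather than an arbitrary one.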

ttest2(data.likesNo(data.wkEnd),data.likesNo(~data.wkEnd))

ans = 1
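
The logical 1 only says that the null hypothesis of equal means was rejected at the default 5% level. To also see the p-value and to test the direction (weekend greater than weekday), `ttest2` accepts additional outputs and options. The numbers below are made up for illustration and are not the author's data.

```matlab
% Made-up like counts, for illustration only.
weekendLikes = [120; 135; 150; 160; 145];
weekdayLikes = [100;  95; 110; 105;  90];

% One-sided Welch test: is the weekend mean greater than the weekday mean?
[h, p] = ttest2(weekendLikes, weekdayLikes, ...
    'Vartype', 'unequal', 'Tail', 'right');
fprintf('Reject null: %d, p-value: %.4f\n', h, p)
```

Using `'Vartype','unequal'` avoids assuming that both groups have the same variance, which is a cautious default when group sizes differ as much as weekdays and weekends do.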

What are all the tweets about?

tweetDocuments = tokenizedDocument(data.tweets,'DetectPatterns',{'at-mention','hashtag','web-address'});
tweetDeets = tokenDetails(tweetDocuments);
tweetDeets(37:39,:)

ans =
  3×4 table
     Token      DocumentNumber    LineNumber       Type    
    ________    ______________    __________    ___________
    "Jersey"          2               1         letters    
    "."               2               1         punctuation
    "#MAGA"           2               1         hashtag    

hashtags = tweetDeets(tweetDeets.Type == 'hashtag',:);
wordcloud(hashtags.Token);

cleanDocs = lower(tweetDocuments);
cleanDocs = replaceHashtags(cleanDocs);

cleanDocs = erasePunctuation(cleanDocs);
cleanDocs = removeWords(cleanDocs,stopWords);
cleanDocs = removeShortWords(cleanDocs,2);
bag = bagOfNgrams(cleanDocs,'NgramLengths',3);
w = wordcloud(bag);

And which topics are the most popular?

bag2Model = bagOfWords(cleanDocs);
rng(123)
numTopics = 18;
mdl = fitlda(bag2Model,numTopics,'Verbose',0);
[~,topics] = max(mdl.DocumentTopicProbabilities,[],2);
data.topics = topics;

p = mdl.DocumentTopicProbabilities;
maxs = max(p,[],2);
cutoff = quantile(maxs,.5);
dataTrimmed = data(maxs>=cutoff,:);

for ii = 1:numTopics
    figure
    wordcloud(dataTrimmed.tweets(dataTrimmed.topics == ii));
end

popCats = groupsummary(data(:,{'likesNo','topics'}),'topics','mean');
load topicNames.mat
bar(topicNames,popCats.mean_likesNo);
title('Popularity by Topics')
ylabel('Number of Likes')

[~,idx] = sort(popCats.mean_likesNo,'descend');
rank = ["The" "Second" "Third" "Fourth"];
nGrams = [1 2 1 2];
for ii = 1:4
    subplot(2,2,ii)
    bag = bagOfNgrams(cleanDocs(data.topics == idx(ii)),'NgramLengths',nGrams(ii));
    wordcloud(bag);
    % Avoid naming the variable "text", which would shadow the text function
    titleText = rank(ii) + " Most Popular, N-Grams = " + nGrams(ii);
    title(titleText)
end

nGrams = [2 2 1 2];
for ii = 0:3
    subplot(2,2,ii+1)
    bag = bagOfNgrams(cleanDocs(data.topics == idx(end-ii)),'NgramLengths',nGrams(ii+1));
    wordcloud(bag);
    % Avoid naming the variable "text", which would shadow the text function
    titleText = rank(ii+1) + " Least Popular with N-Grams = " + nGrams(ii+1);
    title(titleText)
end

Do you work with text data?

function cleanTweets = replaceHashtags(tweets)
% Expand run-together hashtag text into the full phrases it stands for.
cleanTweets = tweets;

oldFN = "fakenews";
newFN = "fake news";
cleanTweets = replace(cleanTweets,oldFN,newFN);

oldHH = string({"hurricaneharvey"});
newHH = "hurricane harvey";
cleanTweets = replace(cleanTweets,oldHH,newHH);

oldNK = string({"northkorea" "noko"});
newNK = "north korea";
cleanTweets = replace(cleanTweets,oldNK,newNK);

oldTR = string({"taxreform"});
newTR = "tax reform";
cleanTweets = replace(cleanTweets,oldTR,newTR);

oldPR = string({"puertorico"});
newPR = "puerto rico";
cleanTweets = replace(cleanTweets,oldPR,newPR);

oldPRs = string({"prstrong"});
newPRs = "puerto rico strong";
cleanTweets = replace(cleanTweets,oldPRs,newPRs);

oldStand = string({"standforouranthem"});
newStand = "stand for our anthem";
cleanTweets = replace(cleanTweets,oldStand,newStand);

oldMAGA = string({"maga"});
newMAGA = "make america great again";
cleanTweets = replace(cleanTweets,oldMAGA,newMAGA);

oldAF = string({"americafirst"});
newAF = "america first";
cleanTweets = replace(cleanTweets,oldAF,newAF);

oldTC = string({"taxcutsandjobsact"});
newTC = "tax cuts and jobs act";
cleanTweets = replace(cleanTweets,oldTC,newTC);
end

Today I'd like to introduce a guest blogger, Grace Kennedy, who works for the Training Services team here at MathWorks. She has been interested in learning what we can find from Twitter, and she will share techniques she uses in her Twitter data analysis.

Hi, I'm Grace Kennedy. I've been fascinated with how President Trump uses Twitter - what he talks about and how people respond to him. So when the Text Analytics Toolbox was introduced in R2017b, I could not wait to use it to work with Twitter data. In this blog post, I'll show you how to retrieve and work with tweets and investigate what President Trump is tweeting about that is so popular.

As of R2017b, you can use MATLAB functions from the Datafeed Toolbox to retrieve tweets. First you will need to sign up for a developer account with Twitter to obtain credentials. I have my developer credentials saved as a MAT-file for privacy. We pass our credentials to the twitter function to open a connection to the Twitter API. Then, searching Twitter is an easy one-liner with the search function. The body of the response holds a struct with information about the individual tweets (the statuses) and the search metadata.

Each status contains information about a single tweet, such as the number of likes, when the tweet was posted, and the text of the tweet.

I was looking for tweets from a specific user. The syntax for queries that are more advanced than just a search string can be found in the Twitter documentation. This query takes the form 'from:mathworks'. However, the query must be percent-encoded, so we need to change the colon to %3A. We can automate percent encoding using a trick from Toshi Takeuchi's February blog post, which sparked my original interest in using MATLAB for Twitter analysis.

Several months and 1033 tweets later, here is what I found.

My first thought was to define popularity as the number of retweets and likes a tweet receives.
So, as I pulled Twitter responses, I saved these metrics, the date of each tweet, and the tweets themselves.

Let's take a quick peek at the numeric data.

It turns out that the number of retweets and the number of likes are highly correlated, so we can remove the number of retweets. In R2018a, functionality to work more easily with table variables was added, including the removevars function.

I also looked at whether the time of day or day of the week had any effect on the popularity of a tweet. This was easy with the groupsummary function in R2018a, which enables group summary statistics on timetables. While I didn't find that the popularity of the tweets varied by time of day, I did find that President Trump tweets more at some times than others.

After about five months of data collection, I realized that I had introduced systematic bias into my data. Tweets that are up longer have had more time to collect likes. While the Twittersphere moves quickly, I observed that weekend posts were getting a lot more likes than weekday ones. This was in part because I collected data several times a day... during the workday. This meant weekend tweets would naturally have more time to accumulate likes. Luckily, I had been stashing my raw data from each pull for reproducibility, so I was able to go back and correct this. I merged datasets, keeping likes from the latest recording available. This is not perfect, but each tweet had a few days to collect likes. After adjusting for the bias, we test for statistical significance using ttest2 (the returned value of 1 rejects the null hypothesis of equal means at the default 5% significance level) and still see weekend tweets receive more likes than workday tweets. Perhaps this is because people have more time during the weekend to be active on social media.

The president likes to call out individuals, themes, and slogans with hashtags and @ mentions. It is a concise way of linking ideas and people or corporations. As of R2018a, we can identify many of these common elements in text data with the 'DetectPatterns' option for the tokenizedDocument function.
Hashtags give us a pretty good idea of what President Trump likes to call out.

Not only does looking at the hashtags give us an idea of what President Trump is tweeting about (#taxreform, #fakenews, #maga), but we see some abbreviations that represent common phrases he likes to use. I used the custom function replaceHashtags to replace non-word hashtags with their complete phrases (like "MAGA" with his campaign slogan, "Make America Great Again") before visualizing common phrases.

As of R2018a, we can create bags of N-grams, which enables us to look at phrases rather than just individual words. I first did a little processing to remove some noise in the form of punctuation or words that don't tell us much about the content.

We can sort the tweets into a set number of previously undetermined topics using the fitlda function from the Text Analytics Toolbox. I chose eighteen as the number of topics using the perplexity metric in the LDA model. For more details, please refer to the Text Analytics Toolbox documentation example on choosing the number of topics for an LDA model.

Not every tweet fits perfectly into one of eighteen well-defined topics. To help me label the topics, I removed some of the noise by excluding tweets classified with the lowest confidence (those whose highest topic probability fell below the median).

Finally, let's get the average number of likes by topic so we can look at the four most and least popular topics.

In the word clouds for the most popular topics, the most popular topic looks like a collection of subjects that includes the FBI investigation and Hillary Clinton. The next three most popular topics are clearer: NFL players kneeling during the national anthem, fake news, and tweets about North Korea.

Now let's create word clouds for the four least popular topics defined by the fitlda function. As we see in the first of these word clouds, the topic that received the fewest likes included tweets about something being a great honor to President Trump. Some of the "great honors" were to sign bills into law, meet foreign dignitaries, and visit military sites.
These tweets weren't exactly controversial and didn't attract as much attention. The second least popular topic included tweets about Hurricane Maria, which could be because people are hesitant to click "like" on a tragedy. Interestingly, it appears President Trump's Twitter followers weren't particularly excited about tweets on the tax overhaul, as we see in the last two word clouds. The third and fourth least popular topics were on this signature legislation, with tweets in the third least popular topic containing more partisan language (dems, republican, party...).

I've enjoyed watching my word clouds change daily with the speed of the news, and now stepping back to analyze the popularity of various topics. I am wondering who else uses MATLAB to work with Twitter data. What are you investigating? What methods do you use to retrieve and analyze data? More generally, what kinds of text data do you come across in your work or hobbies? What insight could topic modeling provide? Let us know here.

Published with MATLAB® R2018a