Earlier today, I saw a post via the aggregating R-Bloggers service on Using Text Mining to Find Out What @RDataMining Tweets are About. The post provides a walkthrough of how to grab tweets into an R session using the twitteR library, and then do some text mining on them.

I’ve been meaning to have a look at pulling Twitter bits into R for some time, so I couldn’t but have a quick play…

Starting from @RDataMining’s lead, here’s what I did… (Notes: I use R in an RStudio context. If you follow through the example and a library appears to be missing, search for it from the Packages tab and install it, then try to reload the library in the script. The # denotes a commented out line.)

require(twitteR)
#The original example used the twitteR library to pull in a user stream
#rdmTweets <- userTimeline("psychemedia", n=100)
#Instead, I'm going to pull in a search around a hashtag.
rdmTweets <- searchTwitter('#mozfest', n=500)
# Note that the Twitter search API only goes back 1500 tweets (I think?)
#Create a dataframe based around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
#Here are the columns
names(df)
#And some example content
head(df,3)
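If you don’t have Twitter access to hand, the do.call/rbind idiom can still be tried out on made-up data. This is only a sketch: the records list and its two fields are invented stand-ins for the results of a search, not real twitteR status objects.

```r
# Made-up records standing in for the results of a Twitter search (an assumption)
records <- list(
  list(screenName = "ann", text = "Off to #mozfest"),
  list(screenName = "bob", text = "RT @ann: Off to #mozfest"),
  list(screenName = "ann", text = "@bob see you at #mozfest")
)

# Turn each record into a one-row data frame, then stack the rows into one table
toy_df <- do.call("rbind", lapply(records, as.data.frame, stringsAsFactors = FALSE))

# Same inspection steps as for the real data
names(toy_df)
head(toy_df, 2)
```

The point of the idiom is that lapply builds a list of one-row data frames and do.call hands the whole list to rbind in a single call.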

So what can we do out of the can? One thing is to look at who was tweeting most in the sample we collected:

counts=table(df$screenName)
barplot(counts)
# Let's do something hacky:
# Limit the data set to show only folk who tweeted twice or more in the sample
cc=subset(counts,counts>1)
barplot(cc,las=2,cex.names =0.3)
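To see what the counting and filtering steps actually produce, here is a small sketch on a hand-made vector of screen names (the names are invented, and I’ve added a sort so the busiest tweeters come first, which the code above doesn’t do):

```r
# Invented screen names standing in for df$screenName
screen_names <- c("ann", "bob", "ann", "cat", "ann", "bob")

# Count how many tweets each user sent
toy_counts <- table(screen_names)

# Keep only users with two or more tweets, most active first
frequent <- sort(toy_counts[toy_counts > 1], decreasing = TRUE)
print(frequent)

# las=2 turns the axis labels perpendicular to the axis so longer names fit
barplot(frequent, las = 2)
```

Indexing the table with a logical test does the same job here as the subset call.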

Now let’s have a go at parsing some tweets, pulling out the names of folk who have been retweeted or who have had a tweet sent to them:

#Whilst tinkering, I came across some errors that seemed
# to be caused by unusual character sets
#Here's a hacky defence that seemed to work...
df$text=sapply(df$text,function(row) iconv(row,to='UTF-8'))
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#A couple of tweet parsing functions that add columns to the dataframe
#We'll be needing this, I think?
library(stringr)
#Pull out who a message is to
df$to=sapply(df$text,function(tweet) str_extract(tweet,"^(@[[:alnum:]_]*)"))
df$to=sapply(df$to,function(name) trim(name))
#And here's a way of grabbing who's been RT'd
df$rt=sapply(df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
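To check what the two patterns pick up, here is a sketch run against three invented tweets. It leans on the fact that the stringr functions are vectorised, so the whole column can be passed in one go rather than via sapply; the strip_at helper name is my own, doing the same job as trim.

```r
library(stringr)

# Invented tweets: one reply, one retweet, one plain tweet
tweets <- c("@alice are you at #mozfest?",
            "RT @bob: Great session on open data",
            "Just arrived at #mozfest")

# Helper (hypothetical name) that strips the leading @ from a user name
strip_at <- function(x) sub("@", "", x)

# Who is the message to? Only matches an @name at the very start of the tweet
to <- strip_at(str_extract(tweets, "^@[[:alnum:]_]+"))

# Who was retweeted? Column 2 of the str_match result holds the captured group
rt <- strip_at(str_match(tweets, "^RT (@[[:alnum:]_]+)")[, 2])

print(data.frame(to = to, rt = rt, stringsAsFactors = FALSE))
```

Tweets that match neither pattern simply come back as NA, which is why the plotting step needs to drop NA values.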

So for example, now we can plot a chart showing how often a particular person was RT’d in our sample. Let’s use ggplot2 this time…

require(ggplot2)
#theme()/element_text() are the current spellings of the older opts()/theme_text() calls
ggplot()+geom_bar(aes(x=na.omit(df$rt)))+theme(axis.text.x=element_text(angle=-90,size=6))+xlab(NULL)
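One variation I’d be tempted to try is ordering the bars by how often each person was retweeted. A rough sketch, assuming an invented rt vector in place of df$rt, and only drawing the chart if ggplot2 happens to be installed:

```r
# Invented retweet targets standing in for df$rt (NA means the tweet was not an RT)
rt <- c("bob", NA, "carol", "bob", NA, "bob", "carol", "dave")

# table() drops NA by default; name the dimension so the column is called "user"
rt_counts <- as.data.frame(table(user = rt), stringsAsFactors = FALSE)

# Most retweeted first
rt_counts <- rt_counts[order(-rt_counts$Freq), ]
print(rt_counts)

# Draw the chart only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(rt_counts, aes(x = reorder(user, -Freq), y = Freq)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = -90, size = 6)) +
    xlab(NULL)
  print(p)
}
```

Counting first and plotting the counts makes it easy to reorder or filter before drawing anything.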

Okay, enough for now… if you’re tempted to have a play yourself, please post any other avenues you explored in a comment, or in your own post with a link in my comments;-)
