Updated 2014 November 26th to reflect changes in the tm package

Updated 2015 February 18th to reflect changes in the twitteR package

A short post on using the R twitteR package for text mining and using the R wordcloud package for visualisation. I did this on my Windows machine, which has this problem. I’ve updated the code due to changes in the recent update of the twitteR package. In addition, I have included a screenshot below from my Twitter Apps Keys and Access Tokens page to indicate where to get the consumer_key, consumer_secret, access_token, and access_secret values.

Please enter your values in the code below.

#install the necessary packages install.packages("twitteR") install.packages("wordcloud") install.packages("tm") library("twitteR") library("wordcloud") library("tm") #necessary file for Windows download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem") #to get your consumerKey and consumerSecret see the twitteR documentation for instructions consumer_key <- 'your key' consumer_secret <- 'your secret' access_token <- 'your access token' access_secret <- 'your access secret' setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret) #the cainfo parameter is necessary only on Windows r_stats <- searchTwitter("#Rstats", n=1500, cainfo="cacert.pem") #should get 1500 length(r_stats) #[1] 1500 #save text r_stats_text <- sapply(r_stats, function(x) x$getText()) #create corpus r_stats_text_corpus <- Corpus(VectorSource(r_stats_text)) #clean up r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(tolower)) r_stats_text_corpus <- tm_map(r_stats_text_corpus, removePunctuation) r_stats_text_corpus <- tm_map(r_stats_text_corpus, function(x)removeWords(x,stopwords())) wordcloud(r_stats_text_corpus) #alternative steps if you're running into problems r_stats<- searchTwitter("#Rstats", n=1500, cainfo="cacert.pem") #save text r_stats_text <- sapply(r_stats, function(x) x$getText()) #create corpus r_stats_text_corpus <- Corpus(VectorSource(r_stats_text)) #if you get the below error #In mclapply(content(x), FUN, ...) : # all scheduled cores encountered errors in user code #add mc.cores=1 into each function #run this step if you get the error: #(please break it!)' in 'utf8towcs' r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')), mc.cores=1 ) r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(tolower), mc.cores=1) r_stats_text_corpus <- tm_map(r_stats_text_corpus, removePunctuation, mc.cores=1) r_stats_text_corpus <- tm_map(r_stats_text_corpus, function(x)removeWords(x,stopwords()), mc.cores=1) wordcloud(r_stats_text_corpus)

I learned who Hadley Wickham is after seeing this

Let’s try the hash tag #bioinformatics

bioinformatics <- searchTwitter("#bioinformatics", n=1500, cainfo="cacert.pem") bioinformatics_text <- sapply(bioinformatics, function(x) x$getText()) bioinformatics_text_corpus <- Corpus(VectorSource(bioinformatics_text)) bioinformatics_text_corpus <- tm_map(bioinformatics_text_corpus, content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')), mc.cores=1 ) bioinformatics_text_corpus <- tm_map(bioinformatics_text_corpus, content_transformer(tolower), mc.cores=1) bioinformatics_text_corpus <- tm_map(bioinformatics_text_corpus, removePunctuation, mc.cores=1) bioinformatics_text_corpus <- tm_map(bioinformatics_text_corpus, function(x)removeWords(x,stopwords()), mc.cores=1) wordcloud(bioinformatics_text_corpus) #if you're getting the error: #could not be fit on page. It will not be plotted. #try changing the scale, like #wordcloud(bioinformatics_text_corpus, scale=c(2,0.2))

I had a small chuckle on finding the word “ultratricky” on the far left towards the top

Wordcloud with less words, minimum frequency and in colour

library(RColorBrewer) pal2 <- brewer.pal(8,"Dark2") wordcloud(bioinformatics_text_corpus,min.freq=2,max.words=100, random.order=T, colors=pal2)

I guess bioinformatics and genomics go hand in hand. Good to see that people are tweeting on training, workshops and resources for bioinformatics.

Conclusions

There are several posts on the web already on using the R twitteR package for text mining; I found them after I had the idea of making a word cloud using people’s tweets that had a particular hash tag. There are many other uses of the R twitteR package, for example you could run a daily cron job to check which of your followers have unfollowed you and so on.

me <- getUser("davetang31", cainfo="cacert.pem") me$getId() [1] "555580799" getUser(555580799,cainfo="cacert.pem") [1] "davetang31" me$getFollowerIDs(cainfo="cacert.pem") #or me$getFollowers(cainfo="cacert.pem") #you can also see what's trending trend <- availableTrendLocations(cainfo="cacert.pem") head(trend) name country woeid 1 Worldwide 1 2 Winnipeg Canada 2972 3 Ottawa Canada 3369 4 Quebec Canada 3444 5 Montreal Canada 3534 6 Toronto Canada 4118 trend <- getTrends(1, cainfo="cacert.pem") trend name 1 #My15FavoritesSongs 2 #MikaNoLegendários 3 #CitePessoasMaisGostosasDoTwitter 4 #SiMiMamáFueraPresidenta 5 #HappySiwonDay 6 C.J Fair 7 Mewtwo 8 Mitch McGary 9 Carreño 10 John Wall url 1 http://twitter.com/search?q=%23My15FavoritesSongs 2 http://twitter.com/search?q=%23MikaNoLegend%C3%A1rios 3 http://twitter.com/search?q=%23CitePessoasMaisGostosasDoTwitter 4 http://twitter.com/search?q=%23SiMiMam%C3%A1FueraPresidenta 5 http://twitter.com/search?q=%23HappySiwonDay 6 http://twitter.com/search?q=%22C.J+Fair%22 7 http://twitter.com/search?q=Mewtwo 8 http://twitter.com/search?q=%22Mitch+McGary%22 9 http://twitter.com/search?q=Carre%C3%B1o 10 http://twitter.com/search?q=%22John+Wall%22 query woeid 1 %23My15FavoritesSongs 1 2 %23MikaNoLegend%C3%A1rios 1 3 %23CitePessoasMaisGostosasDoTwitter 1 4 %23SiMiMam%C3%A1FueraPresidenta 1 5 %23HappySiwonDay 1 6 %22C.J+Fair%22 1 7 Mewtwo 1 8 %22Mitch+McGary%22 1 9 Carre%C3%B1o 1 10 %22John+Wall%22 1

See also

twitteR vignette

Evaluation of twitteR

Text data mining twitteR

Word cloud in R