Introduction

We previously wrote a post on using the rvest web-scraping package to create a .csv file of Presidential Inaugural Addresses. That post can be found here.

In this article, we will attempt a preliminary analysis of this data using the tidytext package by Julia Silge and David Robinson.

Libraries

library(tidytext)
library(tidyverse)
library(stringr)
library(ggplot2)
library(plotly)
library(scales)
library(knitr)
library(DT)
library(widyr)
library(ggraph)

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 10, colour = "black", hjust = 0.5),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size = 8),
    axis.title = element_text(size = 5),
    axis.text = element_text(size = 5),
    axis.title.x = element_text(hjust = 1),
    axis.title.y = element_text(hjust = 1),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "bold"),
    legend.text = element_text(colour = "black", face = "bold")
  )
}

Importing Data

inaug <- read.csv("inaug_speeches.csv", header = TRUE, stringsAsFactors = FALSE)
dim(inaug)
names(inaug)

The data consists of 58 observations and 5 variables.

Extracting the Years

inaug$year <- as.numeric(str_sub(inaug$Date,start = -4, end=-1))
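To see what that substring extraction does, here is the base-R equivalent applied to a made-up date string in the same format (the date below is a hypothetical example, not a row from the data):

```r
date_string <- "Friday, January 20, 2017"  # hypothetical example

# str_sub(date_string, start = -4, end = -1) takes the last four characters;
# in base R the same thing uses substr() with nchar()
year <- as.numeric(substr(date_string, nchar(date_string) - 3, nchar(date_string)))
year  # 2017
```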

Number of Words Used by Each President

inaug %>%
  unnest_tokens(word, text) %>%
  group_by(year, Name) %>%
  summarise(n_words = n()) %>%
  mutate(mean_words = mean(n_words)) %>%
  ggplot(aes(x = year, y = n_words)) +
  geom_bar(stat = "identity", width = 0.5, position = position_dodge(0.7)) +
  theme(axis.text.x = element_text(vjust = 1, angle = 90)) +
  plotTheme() +
  geom_text(aes(label = Name), vjust = 0, angle = 90, size = 2.5, hjust = 0) +
  ylim(c(0, 20000)) +
  labs(title = "How do the speech lengths change by year?",
       caption = "William Henry Harrison had the longest speech")

Highly influential Presidents such as Lincoln, John F. Kennedy, Theodore Roosevelt, and Franklin D. Roosevelt gave short speeches.

President William Henry Harrison had the longest speech.

According to US News, the worst Presidents include Herbert Hoover, William Henry Harrison, Ulysses S. Grant, and John Tyler. From the plot above, we observe that most of these Presidents gave long speeches.

Lexical Diversity

Lexical diversity is a measure of the vocabulary of a body of text: the ratio of unique words to total words. The higher the lexical diversity, the richer the vocabulary.
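As a toy illustration of the calculation (using a made-up word vector, not the speech data):

```r
# lexical diversity = number of unique words / total number of words
words <- c("we", "the", "people", "of", "the", "united", "states")
lex_div <- length(unique(words)) / length(words)
lex_div  # 6 unique words out of 7, so about 0.857
```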

inaug %>%
  unnest_tokens(word, text) %>%
  # normalizing the text
  mutate(word = tolower(word)) %>%
  # counting the number of unique words
  group_by(year, Name) %>%
  summarise(lex_div = length(unique(word)) / length(word)) %>%
  ggplot(aes(x = as.numeric(year), y = lex_div)) +
  geom_line() +
  geom_point() +
  theme(axis.text.x = element_text(vjust = 1, angle = 90)) +
  labs(x = "Year", y = "Lexical Diversity", title = "Lexical Diversity Through the Years") +
  plotTheme() +
  geom_text(aes(label = Name), hjust = -0.1, vjust = 0, size = 2, angle = 90,
            check_overlap = TRUE) +
  ylim(c(0, 0.9))

Almost all Presidential Inaugural Addresses have a lexical diversity of less than 0.5.

The speech with the highest lexical diversity was George Washington’s second Presidential Inaugural Address. This was also the shortest speech, at just 135 words.

Lexical Diversity and Number of Words

How does the lexical diversity of inaugural addresses change with the number of words used?

inaug %>%
  unnest_tokens(word, text) %>%
  # normalizing the text
  mutate(word = tolower(word)) %>%
  # counting the number of unique words
  group_by(year, Name) %>%
  summarise(lex_div = length(unique(word)) / length(word),
            n_words = length(word)) %>%
  ggplot(aes(x = n_words, y = lex_div)) +
  geom_point(alpha = 1/3, aes(size = lex_div)) +
  geom_smooth(se = FALSE) +
  plotTheme() +
  labs(title = "Plot Shows the Variation Between Lexical Diversity and Number of Words")

Using ggplot2’s geom_smooth() function lets us see the relationship clearly.

As the number of words increases, the lexical diversity decreases.

Linear Approximation

How can we describe the relationship between number of words and lexical diversity using a simple linear model?

To achieve this, we will use the lm() function.

inaug_word_length <- inaug %>%
  unnest_tokens(word, text) %>%
  # normalizing the text
  mutate(word = tolower(word)) %>%
  # counting the number of unique words
  group_by(year, Name) %>%
  summarise(lex_div = length(unique(word)) / length(word),
            n_words = length(word) / 1000)  # word count in thousands

summary(lm(lex_div ~ n_words, data = inaug_word_length))

The linear model then produces the equation,

$lexical\_diversity = -0.038043 \times n\_words + 0.4451738$, where $n\_words$ is the word count in thousands.

If the number of words is zero, the predicted lexical diversity is around 44.5%. This intercept has no practical meaning.

For every 1,000-word increase, the lexical diversity decreases by around 3.8 percentage points.

This model explains around 60% of the variability in the data (the adjusted R-squared).
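To make the interpretation concrete, we can plug the fitted coefficients into the line by hand. The 2,000-word speech below is a hypothetical value, and predict_lex_div is a helper name of our own, not part of the model output:

```r
intercept <- 0.4451738
slope <- -0.038043  # change in lexical diversity per 1,000 words

# predicted lexical diversity for a speech of n_thousand thousand words
predict_lex_div <- function(n_thousand) {
  intercept + slope * n_thousand
}

predict_lex_div(2)  # ~0.369 for a hypothetical 2,000-word speech
```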

Ratio of Percentage Usage of "We" to Percentage Usage of "I"

I spend a lot of time on YouTube and sometimes watch videos that are informative. There is an insightful study by Vox on the usage of "I" and "We" in Presidential Addresses.

The usage of "We" helps convey a message of unity.

In this section, we calculate the ratio of "We"s to "I"s.

To avoid undefined values (e.g., division by zero), the logistic function will be used to map values into (0, 1).

The closer the number is to 1, the higher the ratio.
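A quick sketch of how the inverse logit squashes a non-negative ratio into (0, 1); the input values below are arbitrary illustrations:

```r
invlogit <- function(x) {
  1 / (1 + exp(-x))
}

invlogit(0)  # 0.5: "we" and "i" used equally rarely (ratio of 0)
invlogit(2)  # ~0.88: "we" used twice as often as "i"
invlogit(5)  # ~0.99: "we" strongly dominates
```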

library(reshape2)

invlogit <- function(x) {
  1 / (1 + exp(-x))
}

t <- inaug %>%
  unnest_tokens(word, text) %>%
  mutate(word = tolower(word)) %>%
  group_by(year, Name, word) %>%
  summarise(n = n()) %>%
  mutate(percent = n / sum(n)) %>%
  filter(word == "we" | word == "i") %>%
  dcast(year ~ word, value.var = "percent")

t1 <- inaug %>%
  unnest_tokens(word, text) %>%
  mutate(word = tolower(word)) %>%
  group_by(year, Name, word) %>%
  summarise(n = n()) %>%
  mutate(percent = n / sum(n)) %>%
  filter(word == "we" | word == "i") %>%
  select(year, Name)

inaug_i_we <- t %>%
  inner_join(t1, by = "year") %>%
  unique()

# replacing NA with zeros
inaug_i_we[is.na(inaug_i_we)] <- 0

library(plotly)
p <- inaug_i_we %>%
  mutate(ratio = invlogit(we / i)) %>%
  plot_ly(x = ~year, y = ~ratio, type = "bar", text = ~paste(Name)) %>%
  layout(xaxis = list(title = "Year"),
         yaxis = list(title = "Logit Ratio of %We to %I"),
         title = "Logit Ratio of %We to %I")


The plotly library is used here to create an interactive plot; a .png screenshot of the plot is shown here.

This library is an R wrapper around the plotly.js JavaScript library.

Theodore Roosevelt, Franklin D. Roosevelt, Dwight Eisenhower, Bill Clinton, Barack Obama, and Donald J. Trump had the highest logit percentage ratios of "We" to "I".

The interactive plot can be seen here.

The lowest was George Washington’s second Inaugural Address. This is because his second speech was just 135 words long.

Sentiment Scores during Tough Times

How do the sentiments of each Inaugural Address change with time?

To access sentiment lexicons, we will use the sentiments data frame made available through the tidytext package. The sentiment score for each speech is calculated as the sum of the sentiment scores of its words.

We will use the AFINN lexicon.
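As a toy sketch of this scoring scheme (using a made-up mini-lexicon, not the real AFINN table):

```r
# hypothetical mini-lexicon standing in for AFINN
mini_lexicon <- data.frame(
  word  = c("great", "terrible", "hope"),
  score = c(3, -3, 2)
)

speech_words <- c("great", "hope", "hope", "terrible")

# the speech score is the sum of the per-word sentiment scores
total <- sum(mini_lexicon$score[match(speech_words, mini_lexicon$word)])
total  # 3 + 2 + 2 - 3 = 4
```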

sent_scores <- sentiments %>%
  filter(lexicon == "AFINN")

inaug %>%
  unnest_tokens(word, text) %>%
  inner_join(sent_scores, by = "word") %>%
  select(-c(sentiment, lexicon)) %>%
  group_by(year, Name) %>%
  summarise(score = sum(score)) %>%
  ggplot(aes(x = year, y = score)) +
  geom_point() +
  geom_line() +
  ylim(c(-10, 500)) +
  plotTheme() +
  labs(title = "Sentiment Scores of Speeches by Year") +
  geom_vline(xintercept = 1861, linetype = 2) +
  geom_text(aes(1861, 0), label = "Start of Civil War", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 1865, linetype = 2) +
  geom_text(aes(1865, 0), label = "End of Civil War", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 1914, linetype = 2) +
  geom_text(aes(1914, 0), label = "WWI", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 1929, linetype = 2) +
  geom_text(aes(1929, 0), label = "Great Depression", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 1941, linetype = 2) +
  geom_text(aes(1941, 0), label = "WW2", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 1963, linetype = 2) +
  geom_text(aes(1963, 0), label = "JFK Assassination", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 2001, linetype = 2) +
  geom_text(aes(2001, 0), label = "9/11", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE) +
  geom_vline(xintercept = 2008, linetype = 2) +
  geom_text(aes(2008, 0), label = "Wall Street Crash", show.legend = FALSE,
            hjust = -1, vjust = 1, angle = 90, inherit.aes = FALSE)

Using ggplot2, we plot a line graph depicting the sentiment score of each Presidential Inaugural Address, annotated with important events in US history.

The sentiment scores were lowest during the period of the Civil War.

The Great Depression, the worst crisis America had faced since the Civil War, also led to a fall in sentiment scores.

Scores went down further during the Second World War.

Percentage Sentiment for each Presidential Address

We will use the nrc lexicon available in the sentiments data frame.

The gganimate package allows us to animate plots frame by frame.

The tweenr package is used here for smooth transitions between frames.

library(gganimate)
library(tweenr)
library(ggthemes)

sent <- sentiments %>%
  filter(lexicon == "nrc")

sent_by_year <- inaug %>%
  unnest_tokens(word, text) %>%
  inner_join(sent, by = "word") %>%
  group_by(year, Name, sentiment) %>%
  summarise(n = n()) %>%
  mutate(n = n / sum(n))

tw_df <- sent_by_year %>%
  ungroup() %>%
  mutate(sentiment = as.factor(sentiment), Name = as.factor(Name)) %>%
  split(.$year) %>%
  tween_states(tweenlength = 2, statelength = 3,
               ease = rep('cubic-in-out', 3), nframes = 150)

library(animation)
oopt <- ani.options(interval = 0.15)

saveGIF({
  for (i in 1:max(tw_df$.frame)) {
    temp <- subset(tw_df, .frame == i)
    g_bar <- ggplot(data = temp, aes(x = sentiment, y = n, fill = sentiment)) +
      geom_bar(stat = "identity", alpha = 0.4) +
      labs(title = paste("Percentage Sentiments for President", temp[1, ]$Name),
           subtitle = temp$year) +
      geom_text(aes(label = paste(round(n * 100, 2), "%")),
                vjust = 1.5, colour = "black", size = 2.5) +
      plotTheme() +
      scale_fill_manual(values = c("#24576D", "#A113E2", "#000000", "#D91460",
                                   "#28AADC", "#40cc49", "#F2583F", "#96503F",
                                   "#ffc100", "#918d58")) +
      theme(panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            axis.title.y = element_blank(),
            axis.text.y = element_blank(),
            axis.ticks.y = element_blank()) +
      scale_y_continuous(limits = c(0, 0.5))
    print(g_bar)
    print(paste(i, "out of", max(tw_df$.frame)))
    ani.pause()
  }
}, movie.name = "letsee.gif", ani.width = 650, ani.height = 600)

Conclusion