Text Analysis of Tim Ferriss Podcast Episodes Using R

I recently came across a great respository of transcripts from Tim Ferriss’ podcast, courtesy of transcripts.io. I don’t actually listen to Ferriss’ podcast (too heavily monetized for my taste), but I know that many do and he gets great guests. I thought I’d analyze the text from the transcripts a bit and ended up using Term Frequency/Inverse Document Frequencys to identify the main topics of each episode.

I used the great Rvest package to scrape the data from the web, and then used tidytext to analyze it. I won’t go into depth on these but there are great resources out there to learn about them; I recommend this post from Justin Law and Jordan Rosenblum for Rvest and this great site from Julia Silge and David Robinson for tidytext.

This is a good introductory web scraping / text analysis project, and is especially easy to start out with because the transcript repository is structured very simply and is straightforward to scrape.

If you have any questions or would like the original data, let me know via email (joe at this website) or twitter.

R Packages Used Load the packages we’ll use: library(rvest) ## Loading required package: xml2 library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union library(tidytext) library(ggplot2) library(stringr) library(tidyr) library(lubridate) ## ## Attaching package: 'lubridate' ## The following object is masked from 'package:base': ## ## date library(babynames) library(readr) ## ## Attaching package: 'readr' ## The following object is masked from 'package:rvest': ## ## guess_encoding Using http://transcripts.io for the transcripts, thanks very much for putting that together! We’ll use the rvest package to pull the data into R. I use the SelectorGadget chrome extension to find css elements which is incredibly useful.

Scraping the Data # save the url; this is the page with links to each episode url <- "https://www.transcripts.io/transcripts/tim_ferriss_show/" # read the html of the url tran <- read_html(url) # start html session session <- html_session(url) # all the episode names. Found ".post-link" using the SelectorGadget chrome extension episode_names <- tran %>% html_nodes('.post-link') %>% html_text() Now that we have a list of the episode links, I loop through each episode and extract the text and dates, and create a dataframe # new empty data frame which we will then bind all the episode information to iteratively ep_df <- data_frame() # looping through each episode ## Not run: for (i in episode_names){ ep_name <- i # setting the episode name to the link title ep_date <- session %>% # pulling the date follow_link(i) %>% # following each episode link html_nodes("time") %>% # again found the node using SelectorGadget html_text() ep_text <- session %>% # pulling actual transcript text follow_link(i) %>% html_nodes("p") %>% html_text() %>% data_frame() %>% # making this a dataframe, mutate() to ad the other variables mutate(ep_name = ep_name, ep_date = ep_date) ep_df <- rbind(ep_df, ep_text) # binding them all together Sys.sleep(3) # Sys.sleep(3) pauses the loop for 3s so as not to overwhelm website's server }

Cleaning the data Now have a dataframe of all of the text from these episodes. It’s messy, so first will remove some of the stuff that we don’t want– “tweet” and stuff like that which snuck into the transcription #rename first column, this is annoying because it's called "." ep_df$text <- ep_df$. ep_df <- ep_df %>% select(2:4) # trim white spice from text, removes lurking spaces at the end of ep_df$text <- trimws(ep_df$text, which = c("both")) # remove "Tweet", "Listen", "Link", etc. which snuck into the episode text ep_df <- ep_df %>% filter(text != "Listen") %>% filter(text != "Tweet") %>% filter(text != "Link") %>% filter(text != "[applause]") head(ep_df)

Analysis using tidytext Now that we have a reasonably clean dataframe, going to tokenize and look at common words etc # tokenized words, using tidytext package. This breaks the dataframe up by word ep_words <- ep_df %>% unnest_tokens(word, text) head(ep_words) Calculating term frequnecy/inverse document frequencies (tf-idfs). This is a metric often used in search engines to determine a word’s uniqueness / importance to a body of text # calculating the tf-idf for each word, each year tfs <- ep_words %>% count(ep_name, word, sort = TRUE) %>% ungroup() %>% bind_tf_idf(word, ep_name, n) # removing names from the dataset, because they generally have the highest tf-idf score for a given episode. Thanks Hadley Wickham for the babynames package! names <- babynames %>% filter(year > 1980) %>% # taking names from the past 40ish years rename(word = name) %>% # renaming the column to "word" to make the anti-join simple mutate(word = tolower(word)) # making the names lowercase topics <- tfs %>% anti_join(names, by = "word") %>% # removingt he names arrange(-tf_idf) # removing certain unusual names that don't appear in the babynames dataset. topics <- topics %>% filter(word != "tripleh") %>% filter(word != "naval") %>% filter(word != "jocko") # writing a csv to save this locally write.csv(topics, "tf_topics.csv") We have a list of top terms by tf_idf for each episode, and can now visualize these Showing highest TF-IDF by episode, this is a useful way of doing topic analysis. I bring in the data here from local sources so as to not have the above code run in the demo. # I cleaned up the episode titles by hand a little in google sheets to make them clearer because they weren't all in the same format, easy because there were only ~20 episodes topics <- read_csv("~/RProjects/tf_topics.csv") ## Warning: Missing column names filled in: 'X1' [1] ## Parsed with column specification: ## cols( ## X1 = col_integer(), ## ep_name = col_character(), ## word = col_character(), ## n = col_integer(), ## tf = col_double(), ## idf = col_double(), ## tf_idf = col_double(), ## name = col_character() ## ) tf_ep_titles_cleaned <- read_csv("~/RProjects/tf_ep_titles_cleaned.csv") ## Parsed with column specification: ## cols( ## name = col_character(), ## ep_name = col_character() ## ) # joining the topics dataset with the cleaned up names topics <- topics %>% inner_join(tf_ep_titles_cleaned) ## Joining, by = c("ep_name", "name") head(topics) ## # A tibble: 6 x 8 ## X1 ep_name word n tf idf tf_idf name ## <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <chr> ## 1 1 #244: The Quiet Master… bitco… 149 0.00527 2.22 0.0117 Naval … ## 2 2 Ep 57: Pavel Tsatsouli… kettl… 8 0.00240 3.61 0.00866 Pavel … ## 3 3 Ep 65: Supplements, Bl… gluco… 39 0.00339 2.22 0.00754 Dr. Pe… ## 4 4 Ep 65: Supplements, Bl… vo2 24 0.00209 3.61 0.00753 Dr. Pe… ## 5 5 Ep 65: Supplements, Bl… insul… 34 0.00296 2.51 0.00743 Dr. Pe… ## 6 6 #187: Jocko Willink on… leade… 29 0.00310 2.22 0.00689 Jocko …