14 min read

The TidyTuesday project by the R for Data Science community is a great resource for practicing up on Data analytics. Every Tuesday, they release a dataset to the public, and I’ve found it to be a great source of motivation and practice. Every dataset is different, and every dataset has its own story to tell. Plus, it’s always a treat to look at the various ways in which people go through the same dataset.

I’ve always been an avid fan of anime in general, ever since I first started watching Naruto back in 8th Grade. Doing something in R related to anime has been in the back of my mind for a while now, so when I came across this particular dataset on the TidyTuesday initiate, I was stoked! I decided to get my hands dirty with this dataset, which brings us to this, my second blog post.

The first thing I do is load the dataset via the URL provided on the TidyTuesday page of the dataset.

After reading it, I load up the tidyverse package to streamline my analysis, along with briefly going through the data via the glimpse() function. At first glance, a few columns seem to be of note to me, namely the synopsis , score , and studio columns.

I use select() to keep only the columns I think are relevant to my analysis. In this case, I remove columns containing Japanese characters, redundant information or columns that simply don’t seem that interesting to me. Keeping columns that you won’t be using will just slow down your analysis later on.

tidy_anime <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-23/tidy_anime.csv") library(tidyverse) tidy_anime %>% glimpse()

## Observations: 77,911 ## Variables: 28 ## $ animeID <dbl> 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,... ## $ name <chr> "Cowboy Bebop", "Cowboy Bebop", "Cowboy Bebop",... ## $ title_english <chr> "Cowboy Bebop", "Cowboy Bebop", "Cowboy Bebop",... ## $ title_japanese <chr> "<U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>", "<U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>", "<U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>", "<U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0>... ## $ title_synonyms <chr> "[]", "[]", "[]", "[]", "[]", "[]", "[\"Cowboy ... ## $ type <chr> "TV", "TV", "TV", "TV", "TV", "TV", "Movie", "M... ## $ source <chr> "Original", "Original", "Original", "Original",... ## $ producers <chr> "Bandai Visual", "Bandai Visual", "Bandai Visua... ## $ genre <chr> "Action", "Adventure", "Comedy", "Drama", "Sci-... ## $ studio <chr> "Sunrise", "Sunrise", "Sunrise", "Sunrise", "Su... ## $ episodes <dbl> 26, 26, 26, 26, 26, 26, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ status <chr> "Finished Airing", "Finished Airing", "Finished... ## $ airing <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE... ## $ start_date <date> 1998-04-03, 1998-04-03, 1998-04-03, 1998-04-03... ## $ end_date <date> 1999-04-02, 1999-04-02, 1999-04-02, 1999-04-02... ## $ duration <chr> "24 min per ep", "24 min per ep", "24 min per e... ## $ rating <chr> "R - 17+ (violence & profanity)", "R - 17+ (vio... ## $ score <dbl> 8.81, 8.81, 8.81, 8.81, 8.81, 8.81, 8.41, 8.41,... ## $ scored_by <dbl> 405664, 405664, 405664, 405664, 405664, 405664,... ## $ rank <dbl> 26, 26, 26, 26, 26, 26, 164, 164, 164, 164, 164... ## $ popularity <dbl> 39, 39, 39, 39, 39, 39, 449, 449, 449, 449, 449... ## $ members <dbl> 795733, 795733, 795733, 795733, 795733, 795733,... ## $ favorites <dbl> 43460, 43460, 43460, 43460, 43460, 43460, 776, ... ## $ synopsis <chr> "In the year 2071, humanity has colonized sever... ## $ background <chr> "When Cowboy Bebop first aired in spring of 199... ## $ premiered <chr> "Spring 1998", "Spring 1998", "Spring 1998", "S... ## $ broadcast <chr> "Saturdays at 01:00 (JST)", "Saturdays at 01:00... ## $ related <chr> "{'Adaptation': [{'mal_id': 173, 'type': 'manga...

tidy_anime %>% head(20) #different genres are separated into different rows

## # A tibble: 20 x 28 ## animeID name title_english title_japanese title_synonyms type source ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1 Cowb~ Cowboy Bebop <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>~ [] TV Origi~ ## 2 1 Cowb~ Cowboy Bebop <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>~ [] TV Origi~ ## 3 1 Cowb~ Cowboy Bebop <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>~ [] TV Origi~ ## 4 1 Cowb~ Cowboy Bebop <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>~ [] TV Origi~ ## 5 1 Cowb~ Cowboy Bebop <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>~ [] TV Origi~ ## 6 1 Cowb~ Cowboy Bebop <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7>~ [] TV Origi~ ## 7 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 8 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 9 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 10 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 11 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 12 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 13 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 14 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 15 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 16 5 Cowb~ Cowboy Bebop~ <U+30AB><U+30A6><U+30DC><U+30FC><U+30A4><U+30D3><U+30D0><U+30C3><U+30D7> <U+5929><U+56FD><U+306E>~ "[\"Cowboy Be~ Movie Origi~ ## 17 6 Trig~ Trigun <U+30C8><U+30E9><U+30A4><U+30AC><U+30F3> [] TV Manga ## 18 6 Trig~ Trigun <U+30C8><U+30E9><U+30A4><U+30AC><U+30F3> [] TV Manga ## 19 6 Trig~ Trigun <U+30C8><U+30E9><U+30A4><U+30AC><U+30F3> [] TV Manga ## 20 6 Trig~ Trigun <U+30C8><U+30E9><U+30A4><U+30AC><U+30F3> [] TV Manga ## # ... with 21 more variables: producers <chr>, genre <chr>, studio <chr>, ## # episodes <dbl>, status <chr>, airing <lgl>, start_date <date>, ## # end_date <date>, duration <chr>, rating <chr>, score <dbl>, ## # scored_by <dbl>, rank <dbl>, popularity <dbl>, members <dbl>, ## # favorites <dbl>, synopsis <chr>, background <chr>, premiered <chr>, ## # broadcast <chr>, related <chr>

theme_set(theme_light()) tidy_anime_select <- tidy_anime %>% select(-title_japanese, -title_synonyms, -background, -related)

It’s always a good idea to look at the distribution of the data you have. I don’t have a particular preference for anime, but I’m still curious as to what genre generally gets the highest ratings from the plethora of users out there. To get some idea of the same, I plot a histogram of the ratings, across genres, using facet_wrap() and geom_histogram() .

tidy_anime_select %>% filter(!is.na(genre)) %>% ggplot(aes(score)) + geom_histogram(aes(fill = genre)) + facet_wrap(~genre) + guides(fill = FALSE) + labs(title = "Distribution of ratings across genres")

## Warning: Removed 173 rows containing non-finite values (stat_bin).

Notice I used colorRampPalette() to create a custom color scheme. Most pre-defined palettes present in RColorBrewer have around 8 colors, which leads to a problem when you’ve got more than 8 categorical variables. colorRampPalette() creates a new palette having as many unique colors as you want by interpolating between the colors that are already present in the pre-defined palette.

In the following section of code, I create a simple bar-plot that shows the median ratings for each genre, followed by the median ratings for the studios with the 20 highest median ratings. I do this using geom_col() and geom_label() .

#creating a custom color scheme library(RColorBrewer) custColorsDiscrete <- colorRampPalette(brewer.pal(9, "Blues"))(41) tidy_anime_select %>% group_by(genre) %>% summarise(avgScore = median(score, na.rm = TRUE)) %>% arrange(desc(avgScore)) %>% mutate(genre = fct_reorder(genre, avgScore)) %>% filter(!is.na(genre)) %>% ggplot(aes(genre, avgScore)) + geom_col(aes(fill = genre), color = "black") + geom_label(aes(label = avgScore), label.padding = unit(0.10, "lines"), alpha = 0.8, nudge_y = -0.15) + coord_flip() + scale_fill_manual(values = custColorsDiscrete) + labs(title = "Median rating for all genres", x = "Genre", y = "Median score", subtitle = "People love mystery thrillers") + guides(fill = FALSE)

custColorsDiscrete2 <- colorRampPalette(brewer.pal(9, "Blues"))(21) tidy_anime_select %>% group_by(studio) %>% summarize(avgScore = median(score, na.rm = TRUE)) %>% arrange(desc(avgScore)) %>% top_n(20) %>% mutate(studio = fct_reorder(studio, avgScore)) %>% ggplot(aes(studio, avgScore)) + geom_col(aes(fill = studio), color = "black") + geom_label(aes(label = avgScore), label.padding = unit(0.15, "lines"), alpha = 0.8, nudge_y = -0.20) + scale_fill_manual(values = custColorsDiscrete2) + coord_flip() + guides(fill = FALSE) + labs(title = "Median rating per studio", subtitle = "Top 20 studios taken", x = "Studio", y = "Median score")

## Selecting by avgScore

There are more than 40 genres, which really only leads to cluttered visualizations. To clear up my visualizations a bit, I store the most common genres across the dataset in a vector called topGenres .

A couple of studios I’m familiar with include studios like Toei animation, Ufotable, and of course, Studio Ghibli. However, I’m not really sure as to how common they really are, and how diverse they are in their offerings.

I attempt to answer that with the following code.

I group the dataset according to the genre and studio, after which I can easily see the most common genre-studio pairings via summarize() and arrange() . I also use the topGenres vector created earlier to look at studios that only have shows present in the most common genres.

fct_reorder() from the forcats package is something I’ve used countless times to reorder categorical features based on a metric.

topGenres <- tidy_anime_select %>% count(genre, sort = TRUE) %>% head() %>% pull(genre) #41 genres tidy_anime %>% filter(!is.na(studio) & !is.na(genre)) %>% group_by(genre, studio) %>% summarize(count = n()) %>% arrange(desc(count)) %>% top_n(5) %>% add_count(genre) %>% ungroup() %>% filter(genre %in% topGenres) %>% mutate(studio = fct_reorder(studio, count, sum)) %>% ggplot(aes(studio, count)) + geom_col(aes(fill = genre), alpha = 0.8) + coord_flip() + scale_fill_brewer(palette = "Spectral") + labs(title = "Top anime studios and the genres of their products", x = "Studio name", y = "Number of shows", fill = "Genre")

## Selecting by count

Eventually I want to be predicting the score based off various other parameters present in the dataset. Looking at the distribution, the score variable seems to be present in a log-normal distribution. I’ll try to predict the logarithm of the scores later.

tidy_anime_select %>% ggplot(aes(score)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 174 rows containing non-finite values (stat_bin).

The description column features a wealth of unstructured data worth looking into. The tidytext package is a great tool for exploring the relationships within text, while also being great with the rest of the tidyverse .

The unnest_tokens function here tokenizes the text into individual words by default, after some basic pre-processing steps (removing numbers, making everything lower case, etc.) I then use anti_join() with a lexicon of stop words that is provided by the package itself to get rid of words used commonly (words like a, the, an, of…).

I then perform steps similar to what I performed earlier while examining the relationships between studios and genres, except this time I also use scale_x_reordered() (from the drlib package) to reorder the variables within each individual facet. This gives us the most common words present in each popular genre.

library(tidytext) library(drlib) tidy_anime_words <- tidy_anime_select %>% unnest_tokens(word, synopsis) %>% anti_join(stop_words)

## Joining, by = "word"

tidy_anime_words %>% filter(genre %in% topGenres) %>% count(genre, word, sort = TRUE) %>% filter(!word %in% c("source", "mal", "written", "rewrite", "world", "life", "ann")) %>% group_by(genre) %>% top_n(10) %>% ungroup() %>% mutate(word = fct_reorder(word, n)) %>% ggplot(aes(reorder_within(word, n, genre),n)) + geom_col(aes(fill = genre)) + coord_flip() + scale_x_reordered() + facet_wrap(~genre, scales = "free") + labs(title = "Most common words in synopses per genre", subtitle = "Extremely common words removed", x = "Word", y = "Count") + guides(fill = FALSE)

## Selecting by n

When dealing with text data, there may be some words that are common throughout the dataset despite not being a common English stop word. It is highly likely, then, that these words might be present across categories, thus limiting the insights you could gather from each category (mysterious in the above plot, for example).

The tf-idf metric attempts to solve this problem by trying to find words that are more specific to a particular category than the rest. Extremely common words have lower ranks according to this metric. The bind_tf_idf() function from the tidytext package helps us calculate the metric for each word present within each genre, visualized here:

tidy_anime_words %>% filter(genre %in% topGenres) %>% count(genre, word, sort = TRUE) %>% bind_tf_idf(word, genre, n) %>% arrange(desc(tf_idf)) %>% group_by(genre) %>% top_n(10) %>% ungroup() %>% mutate(word = fct_reorder(word, tf_idf)) %>% ggplot(aes(reorder_within(word, tf_idf, genre),tf_idf)) + geom_col(aes(fill = genre)) + coord_flip() + scale_x_reordered() + scale_y_continuous(labels = scales::percent_format()) + facet_wrap(~genre, scales = "free") + labs(title = "Most specific words in synopses per genre", subtitle = "Most of them seem to be character names", x = "Word", y = "tf_idf", fill = "Genre")

## Selecting by tf_idf

Recall that our data also contains a variety of columns containing time data, leading to a wealth of questions worth looking into. This could allow us to get information pertaining to, say, timings related to specific genres, or whether timings affect average ratings.

The lubridate package contains a multitude of functions that help in parsing data-time data, present in a variety of formats. I use that here in order to parse the time-related features as date-time features, and not as regular numeric/character features.

I then use geom_density() to look at the density of the time slots between the most common genres.

library(lubridate)

## ## Attaching package: 'lubridate'

## The following object is masked from 'package:base': ## ## date

tidy_anime_select %>% mutate(year = year(start_date)) %>% ggplot(aes(year)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 238 rows containing non-finite values (stat_bin).

tidy_anime_time <- tidy_anime_select %>% extract(broadcast, into = c("time"), regex = "(\\d\\d:\\d\\d)") %>% select(time, genre, score) %>% filter(!is.na(time), genre %in% topGenres) %>% mutate(time = hm(time), hour = hour(time)) tidy_anime_time %>% ggplot(aes(hour)) + geom_density(aes(fill = genre), alpha = 0.5) + labs(title = "Density of time slots", subtitle = "Adventure shows show a sharp spike around 18:00 JST", x = "Hour", y = "Density", fill = "Genre", caption = "All timings in JST") + scale_y_continuous(labels = scales::percent_format())

I also attempted to train a rudimentary linear model to see the impact the hour a show airs might have on average ratings. Turns out, the hour is statistically significant when it comes to predicting ratings.

I use the tidy() function from the broom package to convert my linear model into a tibble, after which I plot the average impact each hour has on ratings, along with their respective confidence intervals.

library(broom) tidy_anime_time %>% mutate(hour = as.factor(hour)) %>% lm(score~hour, data = .) %>% tidy(conf.int = TRUE) %>% mutate(term = str_replace(term, "hour", "")) %>% #for readability filter(term!='(Intercept)') %>% arrange(desc(estimate)) %>% mutate(term = fct_reorder(term, estimate)) %>% ggplot(aes(term, estimate)) + geom_point() + geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) + coord_flip() + labs(title = "Impact of airing hours on average ratings", subtitle = "Afternoon shows do not seem to go well with the viewers", x = "Hour", y = "Impact on rating", caption = "Hours in JST")

I had mentioned earlier that I wanted to see how I could predict the average score of an anime, given the features of the dataset provided. One way of going about that is to use the description itself to see which words are more likely to positively or negatively impact ratings. One way to do this is to use lasso regression, which is an example of regularized regression. In a nutshell, this means that instead of using all the available words all at once as predictors, our model will associate a penalty term that increases as the complexity of our model increases. This helps in minimizing the number of features in our model, while maximizing the accuracy of it.

I use the glmnet and broom packages here. The cast_sparse() function present in the tidytext package makes converting our data frame into a sparse matrix extremely easy. Our matrix will contain 1 row for each unique anime, and 1 column for each word present in our data set.

I use cross-validation to train my model.

library(Matrix) library(glmnet) tidy_anime_word_matrix <- tidy_anime_words %>% filter(str_detect(word, "[a-zA-Z]")) %>% cast_sparse(animeID, word) dim(tidy_anime_word_matrix)

## [1] 13114 42100

anime_ids <- as.integer(rownames(tidy_anime_word_matrix)) scores <- log2(tidy_anime_words$score[anime_ids]) #for getting the log of the scores present in the sparse matrix #Training a cross-validated lasso model cv_glmnet_mod <- cv.glmnet(tidy_anime_word_matrix, scores)

Upon visualizing the model, we can see that overfitting takes place relatively soon, after we cross around 527 terms. The dotted lines represent the values of lambda that minimized the error of our model. We can easily retrieve those exact values by using broom .

In order to see the impact each word present in our data has on the score response variable, we need to first get the model that had the best fit on the data as a function of the lambda penalty parameter. This is done by looking at the glmnet.fit model to find the optimal, final model built after cross-validation.

To get the impact of each word based off our best fit model, I simply filter for the terms which were put into the model when the value of lambda led to predictions within 1 standard error, after which I use tidy() to visualize the corresponding terms along with their coefficients.

plot(cv_glmnet_mod)

cv_glmnet_mod$glmnet.fit %>% tidy() %>% filter(term %in% c("official", "video", "season", "franchise", "rewrite")) %>% ggplot(aes(lambda, estimate)) + geom_line(aes(color = term)) + labs(title = "Lambda vs estimate", subtitle = "Overfitting on a few of these terms", caption = "Terms picked arbitrarily")

cv_glmnet_mod$glmnet.fit %>% tidy() %>% filter(lambda == cv_glmnet_mod$lambda.1se) %>% filter(term!="(Intercept)") %>% mutate(direction = ifelse(estimate>0,"positive", "negative")) %>% group_by(direction) %>% top_n(9, abs(estimate)) %>% ungroup() %>% mutate(term = fct_reorder(term, estimate)) %>% ggplot(aes(term, estimate)) + geom_col(aes(fill = direction)) + coord_flip() + labs(title = "Impact of each word on anime score", subtitle = "Calculated using lasso regression", x = "Word", y = "Coefficient", fill = "Direction of impact", caption = "Majorly Japanese words seem to be negative impactors on the overall score")

And there you have it. Through this analysis we did quite a few things:

Exploratory visualizations,

tf-idf to find out words specific within genres,

Manipulating time data, and using the same as predictors,

Manipulating text as predictors.

That was fun, but there are quite a few things I could’ve done as well, or improved upon: