Woohoo, so tidy! Now comes the fun part: visualization. The following plots how often houses are mentioned overall, and in each book separately.

The Harry Potter font looks wonderful, right?

In terms of the data, Gryffindor and Slytherin definitely play a larger role in the Harry Potter stories. However, as the storyline progresses, Slytherin as a house seems to lose its importance. Their downward trend since the Chamber of Secrets results in Ravenclaw being mentioned more often in the final book (Edit – this is likely due to the diadem horcrux, as you will see later on).

I can’t help but feel sorry for house Hufflepuff, which never really gets too involved throughout the saga.

Let’s dive into the specific words used in combination with each house. The following code retrieves and counts the single words used in the sentences where houses are mentioned.
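As a rough illustration of that step, here is a minimal sketch. It assumes a hypothetical data frame house_sentences with one row per sentence and the house mentioned in it; the toy sentences are invented for illustration and are not quotes from the books. Only the output column names (house, word, word_n) follow the code used later in this post, and the result is stored under a separate name so it does not stand in for the real words_by_houses.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per sentence that mentions a house.
house_sentences <- tibble(
  house    = c("Gryffindor", "Gryffindor", "Slytherin"),
  sentence = c(
    "Ten points to Gryffindor!",
    "The Gryffindor tower was empty.",
    "The heir of Slytherin opened the Chamber."
  )
)

words_by_houses_sketch <- house_sentences %>%
  # split each sentence into one lowercase word per row
  unnest_tokens(word, sentence) %>%
  # count how often each word occurs per house
  count(house, word, name = "word_n", sort = TRUE)

words_by_houses_sketch
```

unnest_tokens() lowercases the words and strips punctuation by default, which is why gryffindor shows up in lowercase in the tables further down.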

Visualize Word-House Combinations

Now we can visualize which words relate to each of the houses. Because facet_wrap() has trouble reordering the axes (words may relate to multiple houses with different frequencies), I needed some custom functionality, which I happily recycled from dgrtwo’s GitHub. With reorder_within() and scale_x_reordered(), we can now make an ordered barplot of the top-20 most frequent words per house.

reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  reorder(new_x, by, FUN = fun)
}

scale_x_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

w = 10; h = 7; words_per_house = 20

words_by_houses %>%
  group_by(house) %>%
  arrange(house, desc(word_n)) %>%
  mutate(top = row_number()) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(reorder_within(word, -top, house), word_n, fill = house)) +
  geom_col(show.legend = F) +
  scale_x_reordered() +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  facet_wrap(~ house, scales = "free_y") +
  coord_flip() +
  labs(title = default_title,
       subtitle = "Words most commonly used together with houses",
       caption = default_caption,
       x = NULL, y = 'Word Frequency')
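To see what the two helpers do under the hood, here is a small base-R sketch that repeats their logic step by step. The toy data frame and its counts are made up for illustration.

```r
# Toy data: "harry" appears under two houses with different counts.
toy <- data.frame(
  house  = c("Gryffindor", "Gryffindor", "Slytherin", "Slytherin"),
  word   = c("harry", "sword", "harry", "snake"),
  word_n = c(50, 10, 5, 20)
)

# Same idea as reorder_within(): make every word unique per house by
# appending the house name, then order the factor levels by the count.
sep <- "___"
key <- paste(toy$word, toy$house, sep = sep)
key_fct <- reorder(key, toy$word_n, FUN = mean)
levels(key_fct)

# Same idea as scale_x_reordered(): strip the suffix again for display.
axis_labels <- gsub(paste0(sep, ".+$"), "", levels(key_fct))
axis_labels
```

Because the house name is part of each level, harry can sit at a different position in the Gryffindor facet than in the Slytherin facet, while the axis still just reads harry.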

Unsurprisingly, several stop words occur most frequently in the data. Intuitively, we would rerun the code but use dplyr::anti_join() on tidytext::stop_words to remove stop words.

words_by_houses %>%
  anti_join(stop_words, 'word') %>%
  group_by(house) %>%
  arrange(house, desc(word_n)) %>%
  mutate(top = row_number()) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(reorder_within(word, -top, house), word_n, fill = house)) +
  geom_col(show.legend = F) +
  scale_x_reordered() +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  facet_wrap(~ house, scales = "free") +
  coord_flip() +
  labs(title = default_title,
       subtitle = "Words most commonly used together with houses, excluding stop words",
       caption = default_caption,
       x = NULL, y = 'Word Frequency')

However, some stop words have a different meaning in the Harry Potter universe. For instance, points is quite informative about the Hogwarts houses, yet it is included in stop_words.
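One simple way to keep such domain words would be to drop them from the stop word list before the anti_join(). This is only a sketch of that idea, not the approach taken in the rest of this post: the allow-list keep_words and the toy counts are assumptions made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical allow-list: words to keep although stop_words lists them.
keep_words <- c("points")

custom_stop_words <- stop_words %>%
  filter(!word %in% keep_words)

# Made-up toy counts, only to show the effect of the filter.
toy_counts <- tibble(
  house  = "Gryffindor",
  word   = c("the", "points", "sword"),
  word_n = c(5L, 3L, 2L)
)

kept <- toy_counts %>%
  anti_join(custom_stop_words, by = "word")

kept
```

The drawback is that the allow-list has to be curated by hand, word by word.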

Moreover, many of the most frequent words above occur in relation to multiple or all houses. Take, for instance, Harry and Ron, which are in the top-10 of each house, or words like table, house, and professor.

We are more interested in words that describe one house, but not another. Similarly, we only want to exclude stop words which are really irrelevant. To this end, we compute a ratio-statistic below. This statistic displays how frequently a word occurs in combination with one house rather than with the others. However, we need to adjust this ratio for how often houses occur in the text as more text (and thus words) is used in reference to house Gryffindor than, for instance, Ravenclaw.

words_by_houses <- words_by_houses %>%
  group_by(word) %>%
  mutate(word_sum = sum(word_n)) %>%
  group_by(house) %>%
  mutate(house_n = n()) %>%
  ungroup() %>%
  mutate(ratio = (word_n / (word_sum - word_n + 1) / (house_n / n())))

words_by_houses %>%
  select(-word_sum, -house_n) %>%
  arrange(desc(word_n)) %>%
  head()

## # A tibble: 6 x 4
##   house      word       word_n     ratio
##   <chr>      <chr>       <int>     <dbl>
## 1 Gryffindor the          1057  2.373115
## 2 Slytherin  the           675  1.467926
## 3 Gryffindor gryffindor    602 13.076218
## 4 Gryffindor and           477  2.197259
## 5 Gryffindor to            428  2.830435
## 6 Gryffindor of            362  2.213186
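To make the formula concrete, here is the same computation for a single word, written as a small base-R function. The function name house_ratio and all input numbers are made up for illustration. Note that in the pipeline above, house_n is the number of rows for a house and n() is the total number of rows, so, assuming words_by_houses holds one row per house-word pair, the adjustment is based on how many distinct words each house has.

```r
# Hypothetical helper mirroring the ratio computed in the pipeline above.
house_ratio <- function(word_n, word_sum, house_n, total_n) {
  # how often the word occurs with this house vs. with all other houses;
  # the + 1 avoids dividing by zero for words unique to one house
  odds <- word_n / (word_sum - word_n + 1)
  # share of all house-word rows that belong to this house
  house_share <- house_n / total_n
  odds / house_share
}

# A word seen 30 times with one house and 10 times with the others,
# where that house accounts for 200 of 1000 rows.
example_ratio <- house_ratio(word_n = 30, word_sum = 40,
                             house_n = 200, total_n = 1000)
example_ratio
```

Here the odds are 30 / 11, and dividing by the house share of 0.2 gives a ratio of roughly 13.6. A house with a larger share of the rows would get a smaller ratio for the same counts.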

words_by_houses %>%
  group_by(house) %>%
  arrange(house, desc(ratio)) %>%
  mutate(top = row_number()) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(reorder_within(word, -top, house), ratio, fill = house)) +
  geom_col(show.legend = F) +
  scale_x_reordered() +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  facet_wrap(~ house, scales = "free") +
  coord_flip() +
  labs(title = default_title,
       subtitle = "Most informative words per house, by ratio",
       caption = default_caption,
       x = NULL, y = 'Adjusted Frequency Ratio (house vs. non-house)')

This ratio statistic (x-axis) should be interpreted as follows: night is used 29 times more often in combination with Gryffindor than with the other houses.

Do you think the results make sense:

Gryffindors spent dozens of hours during their afternoons, evenings, and nights in the often empty tower room, apparently playing chess? Neville Longbottom and Hermione Granger are Gryffindors, obviously, and Sirius Black is also on the list. The sword of Gryffindor is no surprise here either.

Hannah Abbott, Ernie Macmillan and Cedric Diggory are Hufflepuffs. Were they mostly hot curly blondes interested in herbology? Nevertheless, wild and aggressive seem unfitting for Hogwarts’ most boring house.

There are a lot of names on the list of Helena Ravenclaw’s house. Roger Davies, Padma Patil, Cho Chang, Miss S. Fawcett, Stewart Ackerley, Terry Boot, and Penelope Clearwater are indeed Ravenclaws, I believe. Ravenclaw’s Diadem was one of Voldemort’s horcruxes. Alecto Carrow, Death Eater by profession, was apparently sent on a mission by Voldemort to surprise Harry in Ravenclaw’s common room (source), which explains what she is doing on this list. Can anybody tell me what bust, statue and spot have to do with Ravenclaw?

House Slytherin is best represented by Gregory Goyle, one of the members of Draco Malfoy’s gang along with Vincent Crabbe. Pansy Parkinson also represents house Slytherin. Slytherins are famous for speaking Parseltongue, and their house’s gem is an emerald. House Gaunt were pure-blood descendants of Salazar Slytherin, and apparently Viktor Krum would not have misrepresented the Slytherin values either. Oh, and only the heir of Slytherin could control the monster in the Chamber of Secrets.

Honestly, I was not expecting such good results! However, there is always room for improvement.

We may want to exclude words that occur only a few times in the books (e.g., Alecto) as well as the house names. Additionally, these barplots are not the optimal visualization if we would like to include more words per house. Fortunately, Hadley Wickham helped me discover treemaps. Let’s draw one using the ggfittext and the treemapify packages.

w = 12; h = 8;
library(ggfittext)
library(treemapify)

words_by_houses %>%
  filter(word_n > 3) %>%
  filter(!grepl(regex_houses, word)) %>%
  group_by(house) %>%
  arrange(house, desc(ratio), desc(word_n)) %>%
  mutate(top = seq_along(ratio)) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(area = ratio, label = word, subgroup = house, fill = house)) +
  geom_treemap() +
  geom_treemap_text(aes(col = house), family = "HP", place = 'center') +
  geom_treemap_subgroup_text(aes(col = house), family = "HP", place = 'center',
                             alpha = 0.3, grow = T) +
  geom_treemap_subgroup_border(colour = 'black') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  theme(legend.position = 'none') +
  labs(title = default_title,
       subtitle = "Most informative words per house, by ratio",
       caption = default_caption)

A treemap can display more words for each of the houses and shows their relative proportions better. New words regarding the houses include the following, but do you see any others?

Slytherin girls laugh out loud whereas Ravenclaw had a few little, pretty girls?

Gryffindors, at least Harry and his friends, got in trouble often; that is a fact.

Yellow is the color of house Hufflepuff whereas Slytherin is green indeed.

Zacharias Smith joined Hufflepuff and Luna Lovegood joined Ravenclaw.

Why is Voldemort in camp Ravenclaw?!

In the earlier code, we specified a minimum number of occurrences for words to be included, which is a bit hacky but necessary to make the ratio statistic work as intended. Fortunately, there are other ways to estimate how unique or informative words are to houses that do not require such hacks.