The Words of Witches & Wizards

Using Data Science techniques to analyse scripts and language from the Harry Potter film franchise

As one of the most successful movie franchises of all time, Harry Potter is ingrained in the psyche of several generations of adults. Having grossed over $9bn worldwide, it’s safe to say that if you don’t know what Harry Potter is, you must have been living under a rock for the last 25 years. Nevertheless, if for some reason you haven’t got round to watching or reading it yet, this article contains spoilers…

Those of you who have seen my work will have an idea of how this is going to go down - I’m simply going to apply what I’ve done before to a new dataset. I’ve published a total of 5 analytical pieces to date which look at scripts from the Game of Thrones TV series and pick them apart using Data Science techniques — 3 of which look at the words used and how they reflect characters and storylines (see here for: the original article, part 2, and Season 8. Or view my profile to see all analytical articles). I intend to do the same thing here, looking at how patterns in the use of language can bring new insight to the world of Harry Potter.

Tech Note/Caveat

As with my Game of Thrones analysis, I sourced and compiled the data myself, primarily using the R programming language. This time I have used a variety of sources for the scripts themselves (see footnote) and again spent a long time transforming them into a dataset that could be analysed. Please note that there is one important caveat to this analysis: I was unable to find a usable script for the 5th film, Order of the Phoenix. As such, all analysis here excludes this film. It is also worth noting that one or two of the scripts are ‘final drafts’ or equivalent, so they may have the odd line or two that differs slightly from the actual film. Despite these caveats, the data in the 7 remaining scripts (6,300 lines and over 66k words, spoken by 180 different characters) can be analysed to give some very interesting results.

The star of the show

To begin with I decided to stick to the simple but effective technique of looking at the most common word spoken by each character. Below you can see the results for 24 of the most prominent characters. I’m sure you’ll be able to see the pattern…

The most common word used by each character in 7 of the 8 Harry Potter films excluding film 5 (Order of the Phoenix). All stopwords (he, it, the, of etc.) have been removed. *proportion of non-stopwords that the most common word accounts for
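For readers who want to try this on their own scripts, here is a minimal sketch of the technique in Python (my analysis was done in R, so this is an illustration rather than the code behind the chart). It assumes the dialogue has already been reduced to (character, text) pairs. The function name, the tiny stopword list and the sample lines are all made up for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"he", "it", "the", "of", "a", "is", "you", "i", "to", "and"}


def top_word_per_character(lines, stopwords=STOPWORDS):
    """Return {character: (most_common_word, share_of_non_stopwords)}.

    `lines` is an iterable of (character, text) pairs.
    """
    counts = defaultdict(Counter)
    for character, text in lines:
        # Lowercase and keep runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts[character].update(w for w in words if w not in stopwords)

    result = {}
    for character, counter in counts.items():
        if not counter:
            continue  # the character only ever spoke stopwords
        word, n = counter.most_common(1)[0]
        result[character] = (word, n / sum(counter.values()))
    return result


# Made-up sample lines, for illustration only (not taken from the dataset).
sample = [
    ("Dobby", "Dobby has come to warn Harry Potter"),
    ("Dobby", "Dobby is sorry"),
    ("Draco", "Watch it, Potter. My father will hear about this, Potter."),
]
top = top_word_per_character(sample)
for character, (word, share) in top.items():
    print(f"{character}: {word} ({share:.0%})")
```

The share is computed against each character's non-stopword total, which is what the asterisked proportion in the caption above refers to.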

Perhaps unsurprisingly, 13 of the 19 main characters mention the eponymous hero more than they talk about anything else.

Dobby refers to himself in the third person and this is reflected here.

Note that both Tom Riddle and Voldemort are listed: I decided to keep them separate to allow comparison between the young Tom Riddle and the man he became, Lord Voldemort. Here we can see that Riddle was a polite and courteous student who later in life became obsessed with ‘Potter’.

I should also mention that the analysis for Mad-Eye Moody faces a similar problem to the Riddle/Voldemort issue: the lines spoken by Barty Crouch Jr while he is disguised as Alastor “Mad-Eye” Moody are listed as lines for Moody.

You can easily see the difference between friends who refer to Harry by his first name (except Voldemort) and those who instead call him ‘Potter’ disdainfully - Draco, Snape and Barty Crouch Jr (as Moody).

A key theme throughout the Harry Potter series is the similarities (real or imagined) between Harry and a young Tom Riddle. It is interesting to see that they had the same most commonly spoken word.

As this Potter-centric pattern amongst characters isn’t a result I was expecting, I thought it might be interesting to see what these characters’ most common word would be if it wasn’t ‘Harry’ or ‘Potter’. This can be seen below for the 13 characters mentioned above:

The most common word used by each character in 7 of the 8 Harry Potter films excluding film 5 (Order of the Phoenix). All stopwords (he, it, the, of etc.) have been removed. *proportion of non-stopwords, excluding ‘harry’ and ‘potter’, that the most common word accounts for.

Some of these are unsurprising (Luna) and others aren’t particularly insightful, even if I do personally find them amusing (Lockhart & Hagrid).

It is interesting to see that Malfoy refers to his father so often, either as part of a threat or as a boast - clearly he sees Lucius as a role model.

Hermione & Ron have the same most common word, but they use it in different contexts. Hermione is concerned with knowledge and who “knows” what, whereas Ron frequently uses it in the phrase “you know” (e.g. “It’s for your own good, you know”). Although it is also the most common word for Neville and Sirius, this is mostly because they don’t say as much overall.

Hagrid’s Slow Death

Another way to look at the data is to see how it varies over the course of the films. This can give an indication of how prevalent each character was in the respective films. This is shown below for the 5 characters who spoke the most over all 7 films:

Word count over the 7 films for the 5 characters who had the highest total word count (each spoke over 3000 words over the franchise).
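A chart like this boils down to a grouped word count. The sketch below is a Python illustration, not the R code I actually used. It assumes each line of dialogue is a (film, character, text) tuple and counts words by splitting on whitespace. The function name and the two placeholder lines are hypothetical; the Hagrid line is the one quoted later in this article.

```python
from collections import Counter, defaultdict


def words_by_film(lines, top_n=5):
    """Return {character: {film: word_count}} for the top_n talkers overall.

    `lines` is an iterable of (film, character, text) tuples.
    """
    totals = Counter()
    per_film = defaultdict(Counter)
    for film, character, text in lines:
        n_words = len(text.split())  # simple whitespace tokenisation
        totals[character] += n_words
        per_film[character][film] += n_words
    # Keep only the characters with the highest total word count.
    return {c: dict(per_film[c]) for c, _ in totals.most_common(top_n)}


sample = [
    (8, "Hagrid", "Harry! No! What're yeh doin' 'ere?!"),
    (1, "Hagrid", "placeholder line of five words"),  # made-up filler
    (1, "Harry", "another placeholder line"),         # made-up filler
]
print(words_by_film(sample, top_n=1))
```

Splitting on whitespace is the simplest possible definition of a word; a different tokeniser would shift the totals slightly but not the overall shape.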

Harry is the most frequent talker in all films except Goblet of Fire (which is more action-based and in fact has the lowest overall word count of any of the films) and the Deathly Hallows Part 1 (where Ron and Hermione feature more prominently, as they are on the run).

Despite Dumbledore’s untimely death and subsequent exclusion from the Deathly Hallows Part 1, he goes on to have plenty to say in the final film, in the form of flashbacks.

Hagrid seems to have been phased out over the course of the films, having less of a role as the franchise progressed, to the point where he had only 6 words in the final film: “Harry! No! What’re yeh doin’ ‘ere?!”, spoken when Harry goes to see Voldemort in the Forbidden Forest.

‘Accio spells’

The next thing I was interested to look at was the use of spells throughout the films. Magic is such an important part of the world of Harry Potter that no analysis of the films would be complete without considering it.

Going by the data I’ve sourced, a total of 97 spells (which were clearly spoken aloud) were cast over the 7 films in question. Below you can see the 10 most popular:

The 10 most common spells spoken aloud according to scripts for Harry Potter films 1–8, excluding film 5. ‘Most frequent caster’ indicates the character who spoke the name of the spell the most. In cases where there is a tie for ‘Most frequent caster’, it is broken alphabetically.

Lumos is an incredibly useful spell (it casts light) and Harry uses it 4 times in a single scene at the beginning of the Prisoner of Azkaban, so it is unsurprising that it comes in first place.

Similarly, Dementors are such a significant part of the third film that Expecto Patronum’s popularity reflects this, with 7 of the 9 uses being in the Prisoner of Azkaban. (Note: this would likely be much higher if the 5th film were included, due to the scene in which Harry teaches Dumbledore’s Army how to cast a Patronus.)

The numbers for Riddikulus have been boosted by a ‘teaching scene’ whereby Remus Lupin instructs his Defence Against the Dark Arts class how to use the spell and a succession of students all cast it.

‘Avada Kedavra’: firstly, you may have expected this to be higher. It is probably one of the spells most often cast without being spoken aloud (Voldemort, who uses it the most, rarely says the full phrase aloud). Secondly, Gregory Goyle being the most frequent caster may raise a few eyebrows. He uses it once in the Deathly Hallows Part 2, in an attempt to kill Harry. The two other appearances of the spell are from different people (Voldemort & Snape), so this three-way tie is broken alphabetically and as a result Goyle looks much more dangerous than he actually is!
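To make the tie-break rule concrete, here is a small Python sketch of how such a ranking could be built (again an illustration, not my original R code). It assumes the spoken spells have already been extracted from the scripts as (caster, spell) pairs. The function name is hypothetical, and ordering equally common spells alphabetically is my own assumption. The sample reproduces the three-way Avada Kedavra tie described above.

```python
from collections import Counter, defaultdict


def rank_spells(casts, top_n=10):
    """Return [(spell, total_casts, most_frequent_caster)], most common first.

    `casts` is an iterable of (caster, spell) pairs.
    """
    by_spell = defaultdict(Counter)
    for caster, spell in casts:
        by_spell[spell][caster] += 1

    rows = []
    for spell, casters in by_spell.items():
        # Highest count wins; ties go to the alphabetically first name.
        top_caster = min(casters, key=lambda name: (-casters[name], name))
        rows.append((spell, sum(casters.values()), top_caster))

    # Sort by total (descending), then by spell name so the order is stable.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:top_n]


casts = [
    ("Voldemort", "Avada Kedavra"),
    ("Snape", "Avada Kedavra"),
    ("Gregory Goyle", "Avada Kedavra"),
]
ranking = rank_spells(casts)
print(ranking)  # Gregory Goyle wins the three-way tie alphabetically
```

A different tie-break, such as listing all tied casters, would avoid singling out Goyle at the cost of a messier table.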

Alohomora? That’s so last year

Another way to break down the data is to look at overall casting of spells over the films and how the popularity of various spells fluctuates. This can be seen below, with a label denoting the most commonly used spell in each film.

The total number of spells cast in each film, with a label denoting the most used spell in each.
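The per-film view is the same kind of count, grouped by film instead of by spell. Below is a Python sketch (not the R code I used) that assumes the spoken spells are available as (film, spell) pairs. The function name and the sample counts are made up for illustration, and breaking ties for the label alphabetically is my own assumption.

```python
from collections import Counter, defaultdict


def spells_per_film(casts):
    """Return {film: (total_spells, most_used_spell)}.

    `casts` is an iterable of (film, spell) pairs.
    """
    per_film = defaultdict(Counter)
    for film, spell in casts:
        per_film[film][spell] += 1

    summary = {}
    for film, spells in sorted(per_film.items()):
        # Most used spell; ties broken alphabetically (an assumption).
        label = min(spells, key=lambda s: (-spells[s], s))
        summary[film] = (sum(spells.values()), label)
    return summary


# Made-up counts, for illustration only.
sample = [
    (1, "Alohomora"),
    (3, "Riddikulus"),
    (3, "Riddikulus"),
    (3, "Lumos"),
]
summary = spells_per_film(sample)
print(summary)
```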

The number of spells cast tends to increase as the series progresses, which makes sense as the later films contain more fighting and duel scenes. There are two major exceptions:

The Prisoner of Azkaban: the spike here is caused by the same thing that makes Riddikulus the third most used spell — the scene where Lupin teaches the spell to a class of students.

The Goblet of Fire: this drop is unexpected and I believe it may be due to the version of the script I am using. It appears to exclude spoken spells, preferring only to say in the stage directions “Harry uses a spell” etc. The two references to the Imperius Curse are just that: references, rather than the spell actually being cast.

Summary

Thank you for reading. If you enjoyed this, please do ‘clap’ below and give Data Slice a follow for similar articles looking at all sorts of topics! Leave a comment below if you have a particular subject you’d like me to analyse and present back in a post. Finally, thank you to my good friend Lauren for giving me the idea to apply my script analysis techniques to the Harry Potter films.