What defines a genre? If an Action film is large enough, when does it become an Epic? Is No Country for Old Men a Mystery or a modern Western? Is there any value at all in trying to classify Being John Malkovich?

Thanks to The Internet Movie Script Database, I can start to answer a piece of that question: which words make up the text of each genre. The database contains 897 scripts across 13 categories. That is not enough for a comprehensive understanding of film scripts, but it is more than enough to draw some basic insights. So the first thing I wanted to see was a simple count of word frequencies by genre: how many times the word “love” appeared in Romance compared to how many times the word “money” appeared in Crime. The results are below:

Word Frequency

Most of these results are unsurprising: “Woman” is used the most in Romance, “Man” in War. “Hope” and “Fear” are both the most prevalent in War. And, as mentioned, Crime seems to be the most about “Money,” Romance about “Love.” But some results were less obvious: If you look at the scale of the top left chart, it seems that there’s still a gender gap in Hollywood pronouns.
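For readers who want to try a count like this themselves, here is a minimal sketch of one way it might be done. It is not the code behind the charts above. The toy `SCRIPTS` list, the `tokenize` helper, and the choice to normalize to occurrences per 10,000 words are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (genre, script text) pairs standing in for real scripts.
SCRIPTS = [
    ("Romance", "I love you. Love is all a woman wants."),
    ("Crime", "Where is the money? The man took the money."),
    ("War", "Every man has hope. Every man has fear."),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def genre_word_rates(scripts, per=10_000):
    """Return raw counts and per-`per`-words rates for every genre."""
    counts = defaultdict(Counter)
    for genre, text in scripts:
        counts[genre].update(tokenize(text))

    rates = {}
    for genre, counter in counts.items():
        total = sum(counter.values())
        # Normalizing by genre size keeps large genres from dominating.
        rates[genre] = {word: per * n / total for word, n in counter.items()}
    return counts, rates

counts, rates = genre_word_rates(SCRIPTS)
print(counts["Romance"]["love"], counts["Crime"]["money"])
```

Normalizing by the total number of words matters because genres with more scripts would otherwise appear to use every word more often.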

The Codex

After simple word counts, I wanted to look at something I found more interesting: the “characteristic” words of each genre. I wanted to find small groups of words that, if you read them in any script, would immediately tell you what genre you’re in. For example, if you heard the line “Commander of the Armies of the North, General of the Felix Legions…” you would know without a doubt that you’re watching a War Epic, even if you’ve never seen Gladiator. To find this kind of list, you don’t just use the most frequently used words, but the most frequently used words that are not used in other genres. Below are the results and methodology.
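As a rough illustration of the idea of "frequent here, rare elsewhere," one possible scoring scheme is a smoothed log-ratio between a word's rate inside a genre and its rate in all other genres combined. This is a swapped-in technique chosen for the sketch, not necessarily the method used for the Codex. The `GENRE_COUNTS` numbers are made up, and `characteristic_words` and its `alpha` smoothing parameter are hypothetical names.

```python
import math
from collections import Counter

# Made-up word counts per genre, purely for illustration.
GENRE_COUNTS = {
    "War": Counter({"the": 50, "general": 8, "legions": 5, "love": 1}),
    "Romance": Counter({"the": 48, "love": 12, "wedding": 4}),
    "Crime": Counter({"the": 52, "money": 9, "cops": 6, "love": 1}),
}

def characteristic_words(genre_counts, genre, top_n=3, alpha=1.0):
    """Rank a genre's words by how much more often they occur there than elsewhere."""
    vocab = set().union(*genre_counts.values())
    target = genre_counts[genre]

    # Pool the counts of every other genre into one comparison corpus.
    rest = Counter()
    for other, counter in genre_counts.items():
        if other != genre:
            rest.update(counter)

    # Add-alpha smoothing so words absent elsewhere do not divide by zero.
    target_total = sum(target.values()) + alpha * len(vocab)
    rest_total = sum(rest.values()) + alpha * len(vocab)

    scores = {}
    for word in target:
        p_in = (target[word] + alpha) / target_total
        p_out = (rest[word] + alpha) / rest_total
        scores[word] = math.log(p_in / p_out)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(characteristic_words(GENRE_COUNTS, "War", top_n=2))
```

A word like “the” is very frequent everywhere, so its ratio stays near 1 and it drops out of the list, while a word like “legions” scores highly because it barely appears outside War.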

Genre Wheels

Finally, I wanted to use both of the above tricks and reverse them. That is, I’ve taken all the words in the Codex, and counted their use in each genre. With this information, I’ve constructed a simple model that can take a script and predict that script’s genre. To represent the results, I’ve created what I’m calling “Genre Wheels,” which are charts that indicate exactly how much each script is an Action film, or a Romance, or a War, etc. For example:

The chart shows that based on the script of Sleepless in Seattle, it’s mostly a Romance and a Drama, but also a bit of a Comedy – in this case, the Codex has predicted genre exceptionally well. Here are a few more examples:

However, genre is a complex construction, built from far more than just the words of a film, and 897 scripts aren’t even enough to fully inform each genre’s word choice, so some results are less than perfect:

Ultimately, these results aren’t accurate enough to take as more than entertaining glimpses into what a script can tell you about a genre. Below are 150 randomly selected maps, and you can see them all on this page.
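To make the idea behind a Genre Wheel concrete, here is a very simplified sketch: count how many of each genre's characteristic words appear in a script, then turn those counts into shares that sum to one. The tiny `CODEX` dictionary and the `genre_shares` function are hypothetical, and the real model may weight words quite differently.

```python
import re
from collections import Counter

# Hypothetical miniature codex: a few characteristic words per genre.
CODEX = {
    "Romance": {"love", "wedding", "kiss"},
    "Comedy": {"joke", "laugh", "funny"},
    "War": {"general", "legions", "battle"},
}

def genre_shares(script_text, codex):
    """Return each genre's share of the codex-word hits found in the script."""
    words = Counter(re.findall(r"[a-z']+", script_text.lower()))
    hits = {
        genre: sum(words[word] for word in vocab)
        for genre, vocab in codex.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No codex words found, so there is no evidence for any genre.
        return {genre: 0.0 for genre in codex}
    return {genre: n / total for genre, n in hits.items()}

shares = genre_shares("A kiss, a laugh, and love. Love at the wedding.", CODEX)
print(shares)
```

Each share would correspond to the size of one slice of the wheel; a script with hits spread across several genres produces a more evenly divided chart.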

Notes:

As mentioned, the scripts and the genre categorization are all from imsdb.com. It’s not nearly a complete list of every script written, and the scripts range from first drafts to final cuts. Finally, genre categorization is always a judgment call, and I left the categories as they were presented.