Semantics: What does data science reveal about Clinton and Trump?

Data science has many fields of application. From image processing to AI, its use is ubiquitous. One of its applications, semantic analysis, is very helpful in social media monitoring. Here, we focused not on tweets or Facebook comments, but on politics.

On the 21st of July 2016, Donald Trump accepted the Republican nomination for President of the US on the last day of the Republican National Convention (RNC) in Cleveland, Ohio. One week later, on the 28th, Hillary Clinton accepted the Democratic Party’s nomination for president in Philadelphia.

Supported by their families and hundreds of thousands of fans, they wrote a new page in the history of the United States by delivering their acceptance speeches. We analysed their words to better understand the hidden components of their political communication. This study focuses on three main features: vocabulary, style and rhythm.

You have my word

One way to evaluate who has the larger vocabulary is to count how many unique words each speaker uses. To do that, we first need to remove some of the most common words in the English language (“the”, “a”, “of”, …), as they carry little information. These words are called stopwords: a list of these words can be found here. Secondly, we don’t want to count the same word twice: “leaders” and “leader” must be treated as the same word, as must “problems” and “problem”. For this, we used the Snowball Stemmer algorithm.
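As a rough illustration of the counting step described above, here is a minimal Python sketch. The tiny stopword list, the crude_stem helper and the sample sentence are all assumptions made for this example. In particular, crude_stem only strips a trailing “s” and merely stands in for the Snowball stemmer, which is available in libraries such as NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real list is much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "we", "our"}

def crude_stem(word):
    """Strip a trailing plural "s". A crude stand-in for the Snowball stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def stem_counts(text):
    """Lowercase, tokenize, drop stopwords, and count the stemmed words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)

# Made-up sample text, not taken from either speech.
sample = "The leaders of the country are great. A great leader solves problems."
counts = stem_counts(sample)

# Share of distinct stems among all remaining words.
distinct_ratio = len(counts) / sum(counts.values())
```

On this sample, “leaders” and “leader” are merged into a single entry, and distinct_ratio is the share of distinct stems among the words left after stopword removal.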

We found that Trump’s speech is made up of roughly 13% distinct root words (965 distinct stemmed words for a 7460-word text). Each word is, on average, repeated 7.7 times. Clinton’s speech, on the other hand, has 17% distinct words, and each word is repeated about 6 times on average. The difference is significant: only 480 words are needed to write 80% of Donald Trump’s speech, while Clinton needs 665 words to reach the same share. That’s a 38% difference. Good, it means we are starting to see some results!

How many words are needed to write 80% of a candidate’s speech?
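The “80% of the speech” figure can be obtained by sorting words from most to least frequent and accumulating their counts until the target share is reached. The sketch below shows one way to do it. The function name words_for_coverage and the toy counts are assumptions for illustration, not the code used for the study.

```python
from collections import Counter

def words_for_coverage(counts, share=0.8):
    """Number of most frequent words needed to cover `share` of all tokens."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank
    return len(counts)

# Toy word counts (made up): 10 tokens in total.
toy = Counter({"great": 5, "really": 3, "nice": 1, "problem": 1})
needed = words_for_coverage(toy)  # "great" + "really" cover 8 of 10 tokens
```

A speaker who leans heavily on a few words reaches the threshold with fewer distinct words, which is the effect described above for Trump (480 words) compared with Clinton (665 words).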

The efficiency of a speech relies partly on the style of its orator. Here, we would like to find each candidate’s favorite words. To get the “Trumpian” and the “Clintonian” vocabularies, we have to find the words that occur the most in one candidate’s speech and, at the same time, the least in the opponent’s. For example, the word “really” is found 15 times in Trump’s speech but only once in Clinton’s. One way to quantify this is to calculate an odds ratio for each word. The odds ratio (here named OR) was computed for each word w using the following formula:

OR(w) = log( P(w | Trump) / P(w | Clinton) )

The first term of the ratio is the probability of a word being in Trump’s vocabulary, and the second is the probability of the same word being in Clinton’s. Taking the log lets us sort each word into one category or the other: when the two probabilities are equal, the log is zero. In any other case, it is either negative (the word is Clintonian) or positive (the word is Trumpian). Here are the results we got:

Words almost exclusive to Donald Trump

Words almost exclusive to Hillary Clinton
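The scoring described above, the log of the ratio between a word’s probability in each speech, can be sketched as follows. The article does not say how words absent from one speech were handled, so the additive smoothing used here is an assumption, as are the function name log_ratio and the toy counts (only the 15-to-1 split for “really” comes from the text).

```python
import math
from collections import Counter

def log_ratio(word, counts_a, counts_b, smoothing=1.0):
    """log(P_a(word) / P_b(word)), with additive smoothing (an assumption)
    so that words missing from one speech do not cause a division by zero."""
    vocab = len(set(counts_a) | set(counts_b))
    p_a = (counts_a[word] + smoothing) / (sum(counts_a.values()) + smoothing * vocab)
    p_b = (counts_b[word] + smoothing) / (sum(counts_b.values()) + smoothing * vocab)
    return math.log(p_a / p_b)

# Toy counts: only the "really" figures (15 vs 1) come from the article.
trump = Counter({"really": 15, "great": 10, "job": 5})
clinton = Counter({"really": 1, "together": 24, "job": 5})

# Sort the shared vocabulary from most "Clintonian" to most "Trumpian".
scores = {w: log_ratio(w, trump, clinton) for w in set(trump) | set(clinton)}
ranked = sorted(scores, key=scores.get)
```

With these toy counts, “really” gets a positive score, “together” a negative one, and “job”, which is equally frequent in both, scores zero.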

The first thing we notice is that Trump uses short, common words and tends to repeat them a great deal: “really”, “nice”, “great”, “problems”. On a side note, we get a sense of some of the Republican candidate’s preoccupations: “Mexico”, “China” and “Iran”. Overall, Trump’s concerns seem to be more focused on international “problems” than Clinton’s. Most of his mentions of the outside world were aimed at inspiring fear and scapegoating.

On Hillary’s side, the word range is wider. “Clintonian” words tend to be much rarer than Trump’s. Hillary Clinton actually refers to America a whole lot more than Trump does: 27 occurrences (only 5 for Trump). The Clintonian wordset suggests that Clinton’s speech is more focused on national themes. Her typical words include “together”, “campaign” and “hard[working]”. Donald Trump is also mentioned many times in her speech.

Keen observers will note that “Trump” does not appear in the Clintonian wordset above. The reason is that Trump himself mentioned his last name numerous times in his speech (10 times), bringing the ratio way down. As a point of comparison, Clinton’s name is used only twice: once in Hillary’s speech (about her husband Bill Clinton), and once in Trump’s. Moreover, the Clintonian word “wants” that shows up in the list is mostly used to criticize her opponent (“He wants to divide us […]”, “He wants us to fear the future and fear each other.”). It clearly shows that Clinton talked about Trump, and Trump talked about… himself!

Everyone talks about Trump.

We can also look at the words that both candidates use equally. They represent, in a way, their common concerns. Not surprisingly, this is the case for “job(s)”, “country” and “thinking”. They both use “thank(s)” numerous times, but in different ways: while Clinton specifically thanked a group of people or an individual, Trump’s “thanks” mostly came when the crowd was applauding him.

What’s your tempo?

Each candidate has his or her own tempo, depending on his or her background as a speaker. A good start is to evaluate the inner rhythm of their language: we can slice each speech into sentences, and each sentence into words. We discover that Trump has the longer speech, with 625 sentences and 7460 words, while Clinton’s has 405 sentences and 6088 words. In other words, Trump’s speech has 54% more sentences than his opponent’s while being only 23% longer.

Trump’s average sentence length is around 12 words. Clinton writes slightly longer sentences, with an average of 15 words per sentence. Many of Trump’s sentences are short: more than 21% of his speech is made of sentences that contain 5 or 6 words. Clinton’s sentence lengths are more evenly distributed, 12-word sentences being the most frequent.
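The slicing into sentences and words described above can be approximated with a few lines of Python. This sketch splits on end-of-sentence punctuation with a regular expression, which is an assumption: the article does not say which tokenizer was used, and a real tokenizer handles abbreviations and quotes more carefully. The sample text is made up.

```python
import re
from collections import Counter

def sentence_lengths(text):
    """Split on '.', '!' or '?' and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

# Made-up sample, not taken from either speech.
sample = ("We will win. We will win big! Believe me. "
          "Together we can build a better future for everyone.")

lengths = sentence_lengths(sample)      # words per sentence
average = sum(lengths) / len(lengths)   # average sentence length
histogram = Counter(lengths)            # how many sentences have each length
```

The same arithmetic applied to the totals reported above gives 7460 / 625 ≈ 11.9 words per sentence for Trump and 6088 / 405 ≈ 15.0 for Clinton.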

Obama employed as many words per sentence as Trump and Clinton combined

We see a clear difference between the two candidates: Trump’s speech is simpler and faster, Clinton’s more diverse and slower. But wait! She does not hold the record: Obama, during his first nomination speech, employed an average of 25.7 words per sentence, which is almost equal to Clinton and Trump combined. Obama also repeated himself 24% less than Clinton, and 42% less than Trump. It shows, we think, that while Clinton’s tempo is a bit slower and her sentences a bit more complex in their structure, her speech rhythm is still very close to her opponent’s.

Last word

Natural Language Processing is not an exact science. It only gives us clues and elements by which to understand how speeches are delivered. The dataset is small, and further analysis is required to extract more precise features. But what did we learn from this analysis?

1. Trump thinks everything is “really” “great” and “nice”, and Clinton keeps talking about how to “work” “together” for “America”.

2. Trump talks about himself, Clinton talks about Trump. While Clinton uses a larger vocabulary and more complex sentence structures, it seems that she more or less adapted to Trump’s way of talking.

3. Obama’s nomination speeches (both of them) enjoyed a larger vocabulary and a significantly more intricate sentence structure, which suggests that Trump has disruptively simplified the national discourse.

Original artwork by Fanny Algeyer, graphic and web designer @ReputationSquad

With the contribution of Giulio Zucchini.