The Brexit debate has reached fever pitch and the EU referendum vote is upon us. Very soon, citizens of the United Kingdom will cast their vote on whether or not to remain in the European Union, a decision that will have repercussions throughout the Western world and beyond.

As expected, Twitter is bursting at the seams with Brexit debate, with citizens, blogs, news outlets and popular figures taking to the social media platform to set out their views on the effect that Brexit will have on the economy, immigration, trade, cultural identity and Boris Johnson’s sense of self-importance.

I thought it would be an interesting exercise to produce a snapshot of the debate through an analysis of 10,000 Tweets produced on the 21st of June between around 2pm and 2.30pm GMT (yes, that’s 10,000 Tweets in 30 minutes). I extracted the Tweets into R using the ‘twitteR’ package and the Twitter API, and used a number of different tools for analysis, manipulation and visualisation.

The number of Tweets over time remains relatively stable at around 333 Tweets per minute (10,000 Tweets over 30 minutes), most likely because we only analysed a 30-minute window and everybody is clearly very bored in work.
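For anyone wanting to reproduce the per-minute count, here is a rough sketch of the idea. The analysis itself was done in R, so this Python version is only an illustration, and the timestamps are invented stand-ins for each Tweet's creation time.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps standing in for each Tweet's creation time.
created = [
    "2016-06-21 14:00:05",
    "2016-06-21 14:00:41",
    "2016-06-21 14:01:12",
    "2016-06-21 14:01:30",
    "2016-06-21 14:01:59",
    "2016-06-21 14:02:20",
]

def tweets_per_minute(timestamps):
    """Count how many Tweets fall into each calendar minute."""
    counts = Counter()
    for ts in timestamps:
        # Truncate to the minute so all Tweets in that minute share a key.
        minute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(second=0)
        counts[minute] += 1
    return dict(sorted(counts.items()))

per_minute = tweets_per_minute(created)
for minute, n in per_minute.items():
    print(minute.strftime("%H:%M"), n)

# The headline rate is simply the total divided by the window length.
average_rate = 10_000 / 30
print(round(average_rate), "Tweets per minute on average")
```

Plotting the values of `per_minute` against time would give the kind of flat line described above.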

The second thing I wanted to know was how popular the use of hashtags was amongst Twitter users discussing Brexit. Are people repping ‘#brexit’ to try and whip up support for the leave campaign, or are people engaging in more genuine discussion and dispensing with tags? It turns out the debate is a distinct mixture of the two, with slightly over 50% of Tweets containing hashtags. The most common hashtag, understandably, is #brexit, followed by #voteleave. #china and #brazil feature heavily, which is also interesting, while #remain, #voteremain and #strongerin lag behind in the popularity charts.
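The hashtag figures come down to two counts: how many Tweets contain at least one tag, and how often each tag appears. A minimal sketch of that calculation follows, in Python rather than the R used for the analysis, with made-up Tweets and a simple regular expression that I am assuming is good enough to pick out tags.

```python
import re
from collections import Counter

# Invented example Tweets, not taken from the dataset.
tweets = [
    "Time to decide #Brexit #VoteLeave",
    "Still undecided about Thursday, to be honest",
    "Stronger together #StrongerIn #brexit",
    "What would leaving mean for trade?",
]

HASHTAG = re.compile(r"#(\w+)")

def hashtag_summary(texts):
    """Return the share of Tweets with a hashtag and a count of each tag."""
    tag_counts = Counter()
    tagged = 0
    for text in texts:
        # Lower-case tags so #Brexit and #brexit are counted together.
        tags = [t.lower() for t in HASHTAG.findall(text)]
        if tags:
            tagged += 1
        tag_counts.update(tags)
    share = tagged / len(texts) if texts else 0.0
    return share, tag_counts

share, tag_counts = hashtag_summary(tweets)
print(f"{share:.0%} of Tweets contain a hashtag")
print(tag_counts.most_common(3))
```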

The breakdown of Tweets into original content, retweets and replies is shown below. This is valuable as it provides an insight into how much content is original and how much of the Brexit activity on Twitter consists of users retweeting existing content. The first chart shows the raw values for original Tweets (dark blue), replies (light blue) and retweets (light green), whereas the second chart displays the values as a percentage of the total dataset.

The split is fairly even, suggesting that a significant percentage of Tweets are not original content but retweets of existing messages. However, this is relatively unexciting given the popularity of retweeting and the nature of Twitter as a social media platform.
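The three-way split can be sketched as a small classification step followed by a count. This is a Python illustration only (the analysis was done in R), and the field names `is_retweet` and `reply_to` are hypothetical labels for whatever retweet and reply metadata the API returns.

```python
from collections import Counter

# Invented records; 'is_retweet' and 'reply_to' are hypothetical field names.
tweets = [
    {"text": "RT @someone: Vote on Thursday", "is_retweet": True, "reply_to": None},
    {"text": "@someone I disagree", "is_retweet": False, "reply_to": "someone"},
    {"text": "My thoughts on the referendum", "is_retweet": False, "reply_to": None},
    {"text": "RT @other: Polls open at 7am", "is_retweet": True, "reply_to": None},
]

def classify(tweet):
    """Label a Tweet as a retweet, a reply or original content."""
    if tweet["is_retweet"]:
        return "retweet"
    if tweet["reply_to"]:
        return "reply"
    return "original"

# Raw counts (first chart) and percentages of the total (second chart).
counts = Counter(classify(t) for t in tweets)
total = sum(counts.values())
percentages = {kind: 100 * n / total for kind, n in counts.items()}

print(dict(counts))
print(percentages)
```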

We now know the breakdown of content and the most common tags, but how long is an original Tweet relative to the average length of all Tweets in the dataset? The truth is, fairly similar. Moving on...

How much does the popularity of a Tweet relate to its length? On average, longer Tweets receive slightly more retweets (y-axis value) whereas the number of favourites (colour chart) is largely random. However, of the Tweets that received 20 retweets or more, we see a clear trend with longer Tweets tending to receive many more retweets.
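One simple way to look at the relationship is to group Tweets into length bands and average the retweets in each band, first for everything and then only for Tweets with 20 retweets or more. The sketch below does this in Python with invented numbers; the 50-character band width is my own assumption, and the original charts were produced in R.

```python
from collections import defaultdict

# Invented (length in characters, retweet count) pairs.
tweets = [(40, 1), (45, 0), (90, 3), (95, 5), (130, 25), (138, 40)]

def mean_retweets_by_length(pairs, bin_width=50):
    """Average retweet count per length band (band keyed by its lower edge)."""
    buckets = defaultdict(list)
    for length, retweets in pairs:
        buckets[(length // bin_width) * bin_width].append(retweets)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

all_tweets = mean_retweets_by_length(tweets)

# Restrict to the popular Tweets (20 retweets or more) and repeat.
popular = mean_retweets_by_length([p for p in tweets if p[1] >= 20])

print(all_tweets)
print(popular)
```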

When we break the Tweets into those containing either the word ‘leave’ or ‘remain’, the distribution of retweets changes slightly. The leave campaign appear to have a higher number of popular Tweets but the remain campaign have the most popular Tweet with around 69 retweets. Quality over quantity?

#Brexit and the Economy

For this analysis, we collected 1,000 Tweets containing the words ‘brexit’, ‘economy’ and either ‘leave’ or ‘remain’. Once collected, we created a Corpus and conditioned the data for analysis (removing stopwords, URLs, user names etc.). We then unlisted each word from the Corpus and summed every instance of each word to create a list of words and their frequency of appearance in the Corpus. Finally, we converted it to a term document matrix and plotted a chart of the most common words (words occurring more than 50 times or appearing in 5% of all Tweets).
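The cleaning and counting steps described above can be sketched roughly as follows. This is a Python stand-in for the R workflow, not the code that produced the chart: the stopword list is a tiny illustrative one, the Tweets are invented, and the threshold is scaled down to suit three Tweets rather than 1,000.

```python
import re
from collections import Counter

# A deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "for", "on", "will", "our", "we", "rt"}

def clean(text):
    """Lower-case a Tweet and strip URLs, user names, punctuation and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user names
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation, digits and '#'
    return [w for w in text.split() if w not in STOPWORDS]

def frequent_terms(texts, min_count):
    """Count every word across all Tweets and keep those seen more than min_count times."""
    counts = Counter()
    for text in texts:
        counts.update(clean(text))
    return {w: n for w, n in counts.most_common() if n > min_count}

# Invented example Tweets.
tweets = [
    "RT @someone: Brexit will hurt the economy https://example.com/a",
    "Leave and the economy will recover #brexit",
    "Remain is safer for the economy and for jobs",
]

top = frequent_terms(tweets, min_count=1)
print(top)
```

With the full 1,000 Tweets, `min_count=50` would correspond to the "more than 50 times" cut-off used for the chart.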

#Brexit and the Immigration Debate

Arguably the most heated debate about the effects of a British exit from the EU has been around immigration control. We followed a similar method to before, creating a term document matrix and plotting the most frequent terms (terms appearing more than 75 times in the dataset or in 7.5% of all Tweets).

Topic Modelling the Debate

Finally, we examined the use of topic modelling to summarise the key topics in the dataset. For this, we used Latent Dirichlet Allocation (LDA) and Gibbs sampling, which is one way of fitting an LDA model, and compared the results.
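To give a feel for what the Gibbs sampling approach does, here is a small hand-rolled collapsed Gibbs sampler for LDA. It is a teaching sketch in Python, not the R code used for the analysis, and the documents, the number of topics and the hyperparameters `alpha` and `beta` are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_words = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document uses it and
                # how much the topic uses this word, then resample.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, topic_word, doc_topic

def top_words(vocab, topic_word, n=3):
    """Return the n most frequent words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]

# Invented, already-cleaned Tweets.
docs = [
    ["economy", "trade", "jobs", "economy"],
    ["immigration", "border", "control"],
    ["trade", "economy", "pound"],
    ["border", "immigration", "control", "immigration"],
]

vocab, topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(f"Topic {k}: {', '.join(words)}")
```

On a real dataset you would normally reach for an established library rather than a sampler like this, but the loop shows the core idea: each word is repeatedly reassigned to a topic based on the current counts.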

Topic modelling is often useful for summarising large amounts of data. However, I feel as though its performance and usefulness are hard to show here due to the short timescale of the Tweets. Over such a short period the data is prone to large spikes in the popularity of a single Tweet or small cluster of related Tweets, leading to a misrepresentation of the true nature of the Brexit debate over the entire duration of, say, the last two or three months.

In or Out?

The future of Britain’s relationship with the EU will be shaped tomorrow and will have long-lasting implications for trade, travel and employment. Through various data analysis techniques we’ve been able to provide a snapshot of attitudes and opinions toward the referendum.