The ultimate aim of this will be to create a model that uses this demographic data to predict which party is likely to win in each constituency. Once we have developed an accurate model, we can analyse it to see what the most important drivers were, thus inferring the biggest electoral factors.

Before we do this, however, it can be eye-opening to investigate the geographical distribution of some of these KPIs, and the regional disparities they demonstrate.

For example, we see that wages are higher in constituencies in London and the South East, which should be no surprise to any UK resident. However, house prices have grown so much in these places relative to the rest of the UK that the average property is worth up to 30 times the annual wage (compared with 5–10 times elsewhere).

Where you live also correlates strongly with the type of industry you’re likely to work in.

Blue collar work (for example, heavy industry and hospitality) is more prevalent in the Midlands and North of England — white collar industries (financial services, IT, and sciences) are clustered in cities, especially London. Perhaps unsurprisingly, constituencies with more white collar workers have higher wages, and lower unemployment.

Amber markers suggest lower wages; purple markers suggest higher wages

This ‘London vs. The Rest’ narrative runs through much of the data; Londoners tend to be better qualified, more ethnically diverse, and more international than the rest of the country.

These factors all come into play when we think about the great bogeyman of current British politics: Brexit. This is not the place to discuss any benefits (or lack thereof) of leaving the European Union, but to ignore its effect on the 2019 election would clearly be a mistake.

The first thing to note is the diversity of opinion that Brexit generates. Though the national margin of victory was very narrow, at 52% vs 48%, this vote was not spread evenly across constituencies. The leave vote was as low as 20% in some areas, but as high as 75% in others.

Unsurprisingly, this spread appears to be dependent on geography. We note strong Leave votes outside of London — mostly in middle England.

The drivers for Brexit were complex and diverse; however, there are some individual features that correlate very strongly with a constituency’s leave vote: qualification levels and industry.

The correlation coefficient between the share of leave votes in a constituency and the share of people with level 4+ qualifications (degrees or equivalent) is strongly negative, at -0.72, whereas the coefficient between the share of leave votes and the share of people working in heavy industry is strongly positive, at +0.72.
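Coefficients like these were presumably calculated with something like pandas’ `.corr()`; as a minimal self-contained sketch, a Pearson correlation between two columns can be computed by hand (the function name and inputs here are purely illustrative):

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# A perfectly increasing linear relationship scores ~ +1,
# a perfectly decreasing one ~ -1; our two features sit at -0.72 and +0.72
pearson([1, 2, 3], [2, 4, 6])
```

The value always lies between -1 and +1, which is what makes the -0.72 and +0.72 figures directly comparable in strength.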

Though one would have to deal with feature collinearity (which this dataset has in spades), I imagine you could fairly easily create a model that predicts a constituency’s leave vote with high accuracy.

YouGov’s MRP modelling

Though analysing the EU Referendum is a rabbit hole we could happily spend time descending, the aim of the above data gathering was to create a predictive model for how each constituency would vote in a general election.

This therefore becomes a classification problem, where the model will decide whether the constituency is a safe seat for a given party, or a marginal seat between two given parties (we define a seat as ‘marginal’ if the difference between first and second place is less than 15 percentage points).

We should also remember that part of this exercise was about working out if anything could have been done during the campaign to change the eventual result. For this reason, rather than using the final 2019 election results as our target, we will use the results of YouGov’s MRP Poll published on November 27th (just over two weeks before polling day).

MRP (Multilevel Regression with Poststratification) is an advanced polling methodology, and the YouGov poll in question had a huge sample size, with over 100,000 people responding in the space of seven days. It was therefore able to give a projected vote share for each party in every constituency. In the event, it slightly underestimated the Conservative success in seat terms, though its projection of overall vote share was within 1–2% of the final result for the main parties.

A Sankey Diagram showing how seat types were projected to change between the 2017 election and the 2019 election, based on YouGov’s MRP poll

YouGov’s MRP poll showed a harrowing picture for Labour. A third of the seats that had previously been Conservative / Labour marginals now looked like they were safe Conservative, and up to 30% of seats that had been safe Labour in 2017 now looked as if they were going to be Conservative marginals.

By contrast, the Conservatives were set to lose very few safe seats, and they were still competitive in most seats that had previously been marginals.

Modelling the 2019 General Election

Now that we have features and target values for each constituency, we can do the final two steps:

Build a model that predicts the target as accurately as possible for each constituency, given the feature set.

Investigate the model using Permutation Importance, to see which features most impact the model’s accuracy.

As a first step, we need to address the dataset’s class imbalance problem. The dataset currently has over 250 Conservative safe seats, but only four Liberal Democrat safe seats (and one each of four other types of seat).

This is not an appropriate dataset to train a model on — we would likely overfit a model to the majority classes. To combat this, we will use the SMOTE algorithm to generate synthetic datapoints for the minority classes. This is relatively straightforward to implement in Python.

```python
# Import the library
from imblearn.over_sampling import SMOTE

# Create a SMOTE object
smote = SMOTE(k_neighbors=3)

# Use the .fit_resample() method to create new feature and target sets
# that have an equal number of datapoints from each class
X_smote, y_smote = smote.fit_resample(X, y)
```

Note, since SMOTE uses a ‘nearest neighbour’ algorithm, we cannot synthesise datapoints for classes where there was only one datapoint to begin with. We will therefore drop such classes from our analysis going forward (they only represent half a percent of the seats in parliament, so this shouldn’t affect the analysis unduly).
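Dropping the singleton classes before resampling is a simple filter; here is a minimal sketch, assuming `X` and `y` are plain lists (the helper name is my own, not part of imblearn):

```python
from collections import Counter

def drop_singleton_classes(X, y, min_count=2):
    """Remove datapoints whose class has fewer than min_count members,
    since SMOTE's nearest-neighbour step needs at least two examples."""
    counts = Counter(y)
    keep = [i for i, label in enumerate(y) if counts[label] >= min_count]
    return [X[i] for i in keep], [y[i] for i in keep]
```

In practice `k_neighbors=3` in the SMOTE call above means classes also need at least four members to resample cleanly, so `min_count` could be raised accordingly.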

Given that our analysis will depend on our ability to train a highly accurate model, we should both:

Try different types of algorithm

Try different hyperparameters for each type of algorithm

The easiest way to do this is through a ‘grid search’ approach, where we specify all the different values for each hyperparameter that we want to test out, then let the code iterate through them in turn, to see which combination produces the best results.

This is quite easy to implement in code, and if we use Scikit Learn, it also comes with bonus in-built cross validation, which means we don’t need to bother with a train/test split.

```python
# Here, we will perform a grid search using the XGBoost algorithm
# Bring in the required libraries
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Create an XGBoost object
XGB = xgb.XGBClassifier()

# Define a 'grid' of parameters that we want to test
# This is done using a dictionary
param_grid = {
    "eta": [0.01, 0.05, 0.2],
    "min_child_weight": [0, 1, 5],
    "max_depth": [3, 6, 10],
    "gamma": [0, 1, 5],
    "subsample": [0.6, 0.8, 1],
    "colsample_bytree": [0.6, 0.8, 1]
}

# Define a GridSearchCV object
# This links the model object and the parameter grid
gs_xgb = GridSearchCV(XGB, param_grid, cv=4, n_jobs=-1)

# Fit the data. This will iterate through the grid
# and find the model with the highest accuracy
gs_xgb.fit(X_smote, y_smote)

# Extract the best model as an XGBClassifier object
model = gs_xgb.best_estimator_
```

Though conceptually straightforward, grid search can be computationally intensive and time consuming. If we have six hyperparameters that we want to tune, each with three possible values, that’s 729 potential models. If we then do 4-fold cross-validation, then we’re actually fitting some 3,000 different models.

You can improve performance by setting the ‘n_jobs’ parameter in GridSearchCV to -1 (this forces it to use all of the computer’s processor cores in parallel), but this is still the sort of code that takes hours to run.

Happily, in this case it was worth the wait — we created a model that predicts seat types with 100% accuracy.

Unpacking the Black Box

This is all well and good, but we now need to investigate which features are driving the model to make its decisions. There might be a way to unpick, say, a simple decision tree or a logistic regression model. But XGBoost, for all its accuracy, is notoriously opaque.

Machine learning explainability is a topic that is becoming increasingly important, especially as models become simultaneously more complex, and more open to public scrutiny (how, for example, can you prove that your bank’s credit scoring algorithm isn’t racist?)

I highly, highly recommend this free course on Kaggle (approximately four hours’ worth of material), which covers three techniques with which you can unpack any machine learning model. For this blog, I will use the most straightforward of these — permutation importance.

The concept of permutation importance is very intuitive:

Suppose we have a model, with a given performance metric (for example, accuracy) of x

Now, take a single column in the feature set, and shuffle the values for each of the datapoints

Re-calculate the accuracy using this new dataset with the shuffled column

Note how much the model’s accuracy suffered from this shuffling. This loss in accuracy is the ‘permutation importance’ of that feature

Return the column to its un-shuffled state, and move onto the next feature — shuffling its values, and calculating how much the model’s accuracy falls

This procedure breaks the relationship between each feature and the target, thus the drop in accuracy is indicative of how much the model depends on that feature.
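The steps above can be sketched from scratch in pure Python, with a generic `predict` function standing in for any fitted model (this is an illustrative sketch, not the eli5 implementation the blog actually uses):

```python
import random

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    accuracy = lambda X_: sum(p == t for p, t in zip(predict(X_), y)) / len(y)
    base = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)                       # break feature/target link
            X_shuf = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(X_shuf))  # loss in accuracy
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores scores (close to) zero; the more the model leans on a feature, the larger the accuracy drop when that feature is shuffled.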

It’s worth noting that these values don’t tell you in which way the feature impacted the model’s decision for a given datapoint. For that, SHAP values can be used (see the Kaggle course for an introduction to these).

Calculating permutation importance can be done in Python using the eli5 package (though an equivalent function has been recently added to Scikit Learn).

```python
# Import the required libraries
import eli5
from eli5.sklearn import PermutationImportance

# Create a PermutationImportance object using our existing XGBoost model,
# then fit it to our data
perm = PermutationImportance(model).fit(X, y)

# Call the show_weights method
eli5.show_weights(perm,
                  feature_names=X.columns.tolist(),
                  top=15)
```

This outputs an ordered table, showing the features and their associated expected loss in accuracy. Note — the function performs many random shuffles of each feature, hence the ranges in the ‘Weight’ column.

The most impactful feature is whether or not a constituency is in Scotland. This should not be surprising given the presence of the Scottish National Party, which is competitive in 58 of Scotland’s 59 seats.

The next most important is the share of people who voted to leave the EU. Let’s see how this breaks down depending on the seat type.

We can see how the leave vote share could be an important factor in deciding the seat type — there are stark differences across the categories. Two key things jump out from Labour’s perspective:

The seats that Labour was competitive in (i.e. Labour safe seats, plus its marginals) cover a very wide range of values, from 20% to nearly 70%. Thus, a single position on EU membership would have failed to appease all of its potential voters simultaneously.

The Labour / Conservative marginal seats look much more similar to safe Conservative seats than to safe Labour seats. Thus, the Conservatives were able to have a more focussed Brexit policy, without the risk of alienating either their core base or their potential voters in Labour marginals.

These two findings manifest themselves across many other KPIs with high permutation importance. Consider the two most important features pertaining to home ownership:

The potential Labour seats are spread over a very wide range of values in both cases, and, as with the Brexit vote, the Conservative / Labour marginals look much more like safe Conservative seats than they do safe Labour seats.

This is even true when we think about the geographical type of seat — the Conservative / Labour marginals were mostly in towns (rather than cities). Again, the profile of the marginal seats much better reflected existing Conservative safe seats.

All of this meant that the Conservatives were better placed to run a focussed campaign that would still appeal to a broad range of constituencies. Labour, by contrast, were forced to appeal to a very wide range of voters, and their campaign consequently came across as muddled and too wide-ranging.

These demographics are not likely to shift around any time soon, and Labour is obliged to take back these marginal seats if it ever wants to win an election. Does this mean that it is doomed to failure?

Uniting the various types of voter that might support Labour looks like a difficult task, especially when the Conservatives are able to be much more targeted. If nothing else, it would take seriously talented leadership, and there is significant evidence to suggest that Labour was sorely lacking this in the 2019 election.

There is a sea of anecdotal evidence that Jeremy Corbyn, their leader, was profoundly unpopular on the doorstep during the campaign. More scientific polling since the election seems to bear this out.

To win back power then, Labour’s next leader is going to have to be nothing short of spectacular. No pressure, then, for the candidates already lining up to replace Mr. Corbyn, and the party members who will ultimately try to pick a winner from them.