An in-depth examination of what went wrong and how to fix the polls

The long-awaited report into the 2015 pollpocalypse finally came out the other week, and the question on everyone’s mind is now answered: why did the polls show so few Conservative voters? Because they didn’t interview enough Conservative voters!

Now, the inquiry did actually answer some questions. If there was a late swing to the Conservatives, it would have been a fairly modest one. Differential turnout (where all those would-be Labour voters actually stayed home) was probably not an issue. And pollsters were probably herding, a bit, but not on purpose, maybe.

While a lot of work was put into the report by some very smart people, that won’t stop me from digging deeper. It is easy to cherry-pick bits and pieces of analysis that corroborate your theory, so I’m trying to look at the bigger picture: setting aside the fact that the polls got a particular party’s vote wrong and looking at the overall patterns of survey answers. In the end, I hope to show that the real problem isn’t Conservative vs. Labour votes, but turnout.

A good source of data is the British Election Study (BES), whose main advantages are that it is a stratified random sampling study and that it was fielded quite close to the election (interviewing started the day after the vote and ran for about three months). The BES randomly selects respondents from the adult British population (with some practical limitations). As it happens, the BES got both turnout and vote choice close (enough) to the actual numbers, which the polls did not. A similar study, the British Social Attitudes survey, ran later in the year with similar results – but that data isn’t available yet.

Since both studies use random sampling followed by weighting and got it right, while the polls use quotas followed by weighting and got it wrong, the solution must lie somewhere in that difference. But how? A theory is not a good theory unless you can specify the underlying mechanism, and Conservatives do not simply decide in unison to stop answering polls. So what makes someone less likely to answer polls? A neat trick is to take these random sampling studies that got it right and divide respondents according to how difficult it was to reach them (I first saw this approach in John Curtice’s The benefits of random sampling). It might not be ideal, but until someone asks the general public how often they answer polls, this proxy will have to do.

All analyses in this piece are run on unweighted data. The purpose is not to get a representative view of the British public, but to analyse the samples themselves.

For the BES, respondents were called on average more than three times before an interview was conducted. The majority of respondents were reached at the first or second call, with 90% of the sample reached within 7 calls. The theory goes that these more accessible respondents, reached at the first or second call, are also more likely to have been contacted by pollsters, and that they’re to blame for the avalanche of press articles and blog posts about how the polls got it so wrong. So are they more Labour? Well, sort of. In the chart below I plot the Conservative lead over Labour against the number of calls it took to contact respondents. While the Conservative lead does increase as respondents become more difficult to contact, after the 4th call it actually drops.

We get a better view by looking at the actual vote recall spread among those more and less accessible to surveys. The general trend is one of increasing Conservative votes: up 5 points between those answering at the first call and those contacted at the 4th. But the trend is much more noticeable for Labour votes, which increase steadily as respondents become more difficult to contact, by up to 10 points by the time you get to 8 or more calls.

Surprisingly, votes for 3rd parties actually decrease among the less reachable. But the striking difference becomes obvious once you take into account those who did not vote at all. Non-voting among those easily reachable (1 call) is 22%, whereas at the 2015 election some 34% of those eligible to vote did not turn up. Only among those called a 6th time does the share of voters resemble actual turnout. If pollsters only get to talk to members of the public who are easy to contact, then they are getting an inherently skewed sample. Polls consistently overestimate turnout, yet no one seems to mind until they get the votes wrong.
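As a sketch of the bookkeeping behind these charts, here is the grouping step on made-up numbers (not the actual BES data): split respondents by the number of calls it took to reach them, then compute vote shares, including non-voters, within each group.

```python
from collections import Counter, defaultdict

# Hypothetical respondents (illustrative only): calls needed to reach each
# person, plus their recalled vote ("none" = did not vote).
respondents = [
    {"calls": 1, "vote": "con"}, {"calls": 1, "vote": "con"},
    {"calls": 1, "vote": "lab"}, {"calls": 1, "vote": "none"},
    {"calls": 2, "vote": "con"}, {"calls": 2, "vote": "lab"},
    {"calls": 2, "vote": "none"}, {"calls": 2, "vote": "none"},
]

def shares_by_calls(rows):
    """Vote shares (including non-voters) within each call-count group."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r["calls"]][r["vote"]] += 1
    return {calls: {vote: count / sum(tally.values())
                    for vote, count in tally.items()}
            for calls, tally in groups.items()}

shares = shares_by_calls(respondents)
print(shares[1]["none"], shares[2]["none"])  # 0.25 0.5 on this toy data
```

On the real data, the same non-voter share climbs from 22% in the first-call group towards the true 34% only among the hardest-to-reach respondents.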

I propose we look for factors that make respondents difficult to contact, until vote choice no longer makes a difference. The rationale is that if we can find the other factors that explain respondent accessibility, we can use them to build a representative sample, which can then be used to estimate our variable of interest, vote. I’m using a negative binomial model to predict the number of calls it takes before a respondent is contacted. Exponentiated coefficients are interpreted much as you would odds ratios in logistic regression – there is a reference category and every other category in that variable is compared to it. A coefficient larger than 1 indicates that a category is that much more difficult to reach than the reference category. I’m also showing 95% confidence intervals, which indicate whether the estimate is distinguishable from 1 or not.

Which brings me to the first model. As we’ve already seen from the chart above, it is actually Labour voters who are more difficult to contact. What’s more, non-voters are 30% more difficult to contact than voters of 3rd parties. Interestingly, Conservatives don’t actually seem to be more difficult to contact. The chart below shows the predicted number of calls needed to reach each type of voter – and since there are no other variables in the model, these are simply the average number of calls made for each category. It shows that Conservative and 3rd-party voters are similarly easy to find, while Labour voters and especially non-voters are more difficult.

The next step is to control for some basic demographics. The second model controls for gender, region, age, home ownership and working status. People living in Scotland and the North East are contacted faster. 25 to 44 year olds are more difficult to find, as are those renting from private landlords and people working full time. Interestingly, adding socio-demographic controls reduces the impact of party vote: there are no longer significant differences between how difficult it is to contact Labour or Conservative voters versus other party voters. Yet non-voters are still harder to reach even when controlling for these demographic factors – 15% more difficult than 3rd-party voters.
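Why controls shrink the raw gap is easy to see on a constructed example. Here age (a stand-in for the real demographic controls, with entirely made-up numbers) drives both non-voting and contactability, so the raw gap between non-voters and voters overstates the direct effect; comparing within age groups tells a different story:

```python
from statistics import mean

# Constructed toy data: younger people are both harder to reach and less
# likely to vote, so age confounds the raw voter/non-voter comparison.
def group(age, voted, calls_list):
    return [{"age": age, "voted": voted, "calls": c} for c in calls_list]

rows = (
    group("18-24", True,  [4] * 10) +
    group("18-24", False, [4] * 20 + [5] * 20) +
    group("65+",   True,  [3] * 40) +
    group("65+",   False, [3] * 5 + [4] * 5)
)

def mean_calls(rows, **match):
    vals = [r["calls"] for r in rows
            if all(r[k] == v for k, v in match.items())]
    return mean(vals)

# Raw comparison pools the age groups together...
raw_gap = mean_calls(rows, voted=False) - mean_calls(rows, voted=True)

# ...while the controlled comparison averages the within-age gaps.
within = [mean_calls(rows, voted=False, age=a) - mean_calls(rows, voted=True, age=a)
          for a in ("18-24", "65+")]
adjusted_gap = mean(within)

print(round(raw_gap, 2), round(adjusted_gap, 2))  # 1.1 0.5
```

The regression does the same thing with many variables at once: the non-voter coefficient that survives after adding controls is the within-group comparison, not the raw one.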

And finally, the most obvious theory out there is that survey respondents are more interested in politics, so they are more likely to vote and their preferences differ. The final model includes party identification (whether respondents identify with any party or none) and the number of days respondents discuss politics. Both have an impact on the number of calls needed to reach a respondent.

Respondents who discuss politics more often are easier to contact (the estimate is lower than 1) and those who claim no party identification are 15% more difficult to find than partisans. Looking at our variable of interest, including these two factors in the model further reduces the differences between how difficult it is to contact the different types of voters, as shown in the chart above.

But this is by no means the clear-cut story I was hoping for – and frankly it was probably never going to be. Even when controlling for political interest, non-voters are slightly more difficult to contact (though the difference is not significant at the 95% confidence level). And other measures of political interest (like interest in politics, caring about the GE2015 result, having undertaken political activities or reading about politics) just do not have a clear impact on how easy respondents are to contact. Still, based on this analysis I would say political interest definitely matters and is likely to make respondents more eager to answer polls.

Controlling for it, though, is not as easy as asking whether respondents read about politics in the Guardian. It will probably mean calculating voting probabilities from more specific measures and measuring political engagement better – more active engagement, less attitudinal. Moreover, my own view is that more needs to be done to hide the purpose of the survey: disguising polls within more general surveys should help reach those less interested respondents. And of course, pollsters need to realise it’s not all about the final scores printed in the papers. The failure to get a representative sample can’t be ignored any more, and getting too many voters will, in the end, mess up the headline figures.

My syntax, here. The BES 2015 dataset, here. Any comments, feedback or constructive criticism are more than welcome – please use the section below.