Note: I have found a problem with this calculation. The national-poll-based Clinton win probability is closer to 70%. I will have an update and explanation soon.

Obama: "Next year someone else will be standing here in this spot, and it's anyone's guess who she will be." #WHCD pic.twitter.com/XVFxZUzKeJ — Huffington Post (@HuffingtonPost) May 1, 2016

General-election matchup polls (e.g. Clinton v. Trump) started to become informative in February. In May, they tell us quite a lot – and give a way to estimate the probability of a Hillary Clinton victory.

First, let us examine the primary evidence. Wlezien and Erikson have gathered presidential preference polls from 1952-2008:

These graphs show that during the year of the general election, polls gradually converge to a point that is close to the actual November outcome.

Wlezien and Erikson expressed their findings in terms of correlation coefficients. In early February (about 280 days from the election), the correlation between polls and November outcomes is +0.2, where 0.0 corresponds to no relationship and +1.0 indicates a perfect relationship. The correlation rises to +0.9 by October. However, this measure is not easily used by consumers of polls.

Instead, a more intuitive measure is how far polls tend to move over time.

To calculate this box-and-whisker plot I also included 2012 data (spreadsheet here). Positive values indicate that the Democratic candidate did worse in November than in polls. The box indicates the interquartile range, i.e. the middle 50%, and the whiskers indicate the range. The red points indicate two outliers: the elections of 1964 (Johnson v. Goldwater) and 1980 (Carter v. Reagan v. Anderson). In May of those years, polls overestimated support for the Democratic candidate by over 10 percentage points. For obvious reasons, Republican-leaning pundits like to write about 1980. But that is one case out of 16 elections.

Instead of such cherrypicking, it is more accurate to include these outliers as part of an analysis of all 16 elections. The full range and estimated standard deviation of poll-outcome differences look like this:

On average, polls have little or no bias relative to November, but have some variation, which is what we care about. That variation is quantified by the standard deviation (SD). I estimated SD using median absolute deviation (MAD), and verified this approach using interquartile range divided by 1.35. For March and April, the standard deviation is around 4 percentage points.
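For readers who want to try the two robust SD estimates on their own numbers, here is a minimal Python sketch. The factor 1.35 for the interquartile range comes from the text above; the factor 1.4826 for MAD is the usual normal-consistency constant and is an assumption here, since the post does not state which scaling was used. The function names and the sample data are hypothetical and are not taken from the spreadsheet.

```python
import statistics

# Scale factors that make each statistic estimate the SD of normal data.
# 1.35 is from the post; 1.4826 is the standard MAD constant (an assumption).
MAD_TO_SD = 1.4826
IQR_TO_SD = 1.35


def sd_from_mad(errors):
    """Robust SD estimate: scaled median absolute deviation from the median."""
    center = statistics.median(errors)
    mad = statistics.median([abs(e - center) for e in errors])
    return MAD_TO_SD * mad


def sd_from_iqr(errors):
    """Cross-check: interquartile range divided by 1.35."""
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return (q3 - q1) / IQR_TO_SD


# Hypothetical poll-minus-outcome differences in percentage points
# (illustrative only, NOT the values behind the plots).
example_errors = [-6, -4, -2, 0, 2, 4, 6]
print(round(sd_from_mad(example_errors), 2))
print(round(sd_from_iqr(example_errors), 2))
```

Both estimates ignore how extreme the largest misses are, which is why years like 1964 and 1980 do not inflate them the way they would inflate an ordinary standard deviation.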

The November outcome should be within 1 SD of current polls approximately two-thirds of the time. Hillary Clinton’s polling margin over Donald Trump is currently +8 percentage points (median of 19 pollsters since mid-March), twice the standard deviation. Based on past years, how likely is it that Trump can catch up? It is possible to convert Clinton’s lead to a probability using the t-distribution*, whose heavy tails can account for outlier events like 1964 and 1980. Using this approach, the probability that Trump catches up by November is 9%, and the probability that Clinton remains ahead of Trump is 91%**. This probability doesn’t take into account Electoral College mechanisms. But since the bias of the Electoral College is quite small, it does not make a difference in the calculation.
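The conversion from lead to probability can be sketched in a few lines of Python. This mirrors the MATLAB and Excel one-liners in the first footnote, using 3 degrees of freedom and the effective SD of 4.5 points from the second footnote. To avoid needing a statistics library, it uses the closed-form CDF of the t-distribution with 3 degrees of freedom; the function names are hypothetical.

```python
import math


def t3_cdf(t):
    """CDF of Student's t-distribution with 3 degrees of freedom.

    Closed form: F(t) = 1/2 + (1/pi) * [x / (1 + x^2) + atan(x)],
    where x = t / sqrt(3).
    """
    x = t / math.sqrt(3.0)
    return 0.5 + (x / (1.0 + x * x) + math.atan(x)) / math.pi


def win_probability(margin, sd):
    """Probability that the current leader is still ahead in November.

    margin: polling lead in percentage points
    sd: standard deviation of poll-to-outcome movement, same units
    """
    return t3_cdf(margin / sd)


# Clinton +8 with an effective SD of 4.5 points, as in the text.
print(round(win_probability(8.0, 4.5), 2))  # 0.91
```

The same function reproduces the other figures quoted in this post: a 9-point lead with an SD of 4.5 gives about 0.93, and an 8-point lead with an SD of 7 gives about 0.83.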

I should note that the polls have been telling us this information for some time. In the first half of March, Clinton led Trump by a median of 9 percentage points. Using an SD of 4.5 percentage points, her win probability would come out as 93%. So today’s estimate has been knowable for several months.

This is a result that may excite Democrats. However, it is subject to change. For example, the SD increases to about 7 percentage points in June, which combined with a Clinton lead of 8 points corresponds to an 83% win probability, less certain than today. And of course the polls themselves could change. I don’t know why polls would be less predictive in summer. Maybe general election campaign events drive polls away from where they would naturally go otherwise. Post-convention bounces would be examples of such events.

This estimate is also independent of other factors, such as the state of the economy and Clinton and Trump’s net favorability/unfavorability. Most such factors should already be partially baked into the polls, and therefore might not add much information. Now that polls are predictive, they give us a more direct measure of what will happen in November.

*In MATLAB: prob=tcdf(clinton_trump_margin/4.5,3). In Excel: =1-TDIST(clinton_trump_margin/4.5,3,1)

**Modified to allow for the possibility of systematic error in polls. I assumed that polls will be off systematically by +/-2 percentage points, even on Election Eve. Calculating effective SD using the formula sqrt(SD^2 + 2^2) gives an effective standard deviation of about 4.5 percentage points instead of 4.
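As a quick check of that arithmetic, the two error sources can be combined in quadrature as below. This is a sketch that assumes the historical poll movement and the systematic polling error are independent; the function name is hypothetical.

```python
import math


def effective_sd(drift_sd, systematic_error=2.0):
    """Combine two independent error sources in quadrature.

    drift_sd: SD of historical poll-to-outcome movement (percentage points)
    systematic_error: assumed systematic polling error (percentage points)
    """
    return math.hypot(drift_sd, systematic_error)  # sqrt(a^2 + b^2)


print(round(effective_sd(4.0), 2))  # 4.47, which the post rounds to 4.5
```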