Predicting the Democratic Primary Winner

A Monte Carlo simulation of the final 19 contests

In a presidential primary season marked by outsider candidates, bizarre theatrics, and plot twists suited for an episode of House of Cards (or better yet, Veep), one of the most surprising things has been just how predictable the Democratic race has been. While few media commentators foresaw Bernie Sanders’ impressive success thus far, the race itself has largely played out the way polling and demographic data suggested it would.

Upsets like Sanders’ near-record-breaking performance in Michigan (where he beat his polling average by more than 20 points) have been cause for much hand-wringing by pollsters and celebration by Sanders fans. But looking at the race as a whole, as a race for delegates and not states, both candidates have performed largely as expected.

One of the most consistent predictors of the overall race has been FiveThirtyEight’s delegate tracker. This tool, rather than relying on polls (which are spotty or nonexistent in many primary and caucus states), creates benchmarks for each candidate in each state based on demographics. To put it simply, FiveThirtyEight created a minimum target for each candidate to win a majority of elected delegates by predicting best-case scenarios in each state based on its demographics and past voting behavior.

FiveThirtyEight’s delegate benchmarks are notable because they have predicted the overall course of the Democratic primary with surprising accuracy. Hillary Clinton has, on average, over-performed her targets by only 8 to 10%, while Sanders has under-performed his by the same margin. The consistency of these predictions is all the more impressive considering that the targets were set back in January and haven’t been altered since.

Based on the strength of this model and its consistency in the past, we can predict a range of likely outcomes in the final 8 weeks of the primary season with something called the Monte Carlo method.

The Monte Carlo Method

The basics of the Monte Carlo method are simple. Imagine you have a series of independent events, each with an outcome drawn from a probability distribution. We can create a statistical prediction for a string of such events by randomly sampling an outcome for each one according to its probabilities. In this case the sequence of events is a series of primaries. For each upcoming primary we can make a prediction about a candidate’s performance by making a random “guess” of how many delegates s/he will win. After we make a sample prediction for each primary, we add them up to get a sample delegate total for our trial run.

A single simulated run is not itself predictive, in the same way that flipping a coin 10 times and getting 8 heads does not indicate that outcome is statistically likely. But if we repeat the process over and over again, perhaps thousands of times with the help of a computer, a pattern should emerge. Together, these outcomes approximate a probability distribution, from which we can state the general likelihood of a given outcome.
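To make the coin-flip analogy concrete, here is a minimal sketch in Python. The function name, trial count, and seed are illustrative choices, not part of the original analysis. It repeats the 10-flip experiment many times and counts how often exactly 8 heads come up.

```python
import random


def estimate_probability(trials=50_000, flips=10, target_heads=8, seed=0):
    """Estimate the chance of exactly `target_heads` heads in `flips` fair flips."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(trials):
        # One "run": flip the coin `flips` times and count the heads.
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if heads == target_heads:
            hits += 1
    # The fraction of runs that hit the target approximates its probability.
    return hits / trials


estimate = estimate_probability()
print(f"Estimated chance of exactly 8 heads in 10 flips: {estimate:.3f}")
```

The exact answer is 45/1024, or a little over 4%, and the estimate drifts toward it as the number of trials grows. No single run tells you that, but the pile of runs does.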

Simulating the Primary Season

Let’s start with how the candidates currently stand.

Total elected and unelected delegates by candidate: Clinton (1,428/502), Sanders (1,151/38). Figures as of April 25, 2016.

In total there are 4,766 delegates who will eventually vote at the Democratic Convention. Of these, 4,051 are elected (pledged) delegates and 715 are unelected (so-called “super”) delegates. 2,581 elected delegates have already been assigned in 38 previous state and territorial contests, leaving 1,470 to be distributed between now and June 14.

Hillary Clinton is currently leading the Democratic primary in terms of both elected and unelected delegates. Sanders’ 275 elected delegate deficit may seem slight, but it means he has to win at least 59% of the elected delegates in the upcoming primaries.
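The 59% figure can be checked with a few lines of arithmetic. This sketch uses the elected-delegate counts quoted in the caption above; the totals quoted elsewhere in the text differ by a couple of delegates, so treat the result as approximate. The variable names are illustrative.

```python
TOTAL_ELECTED = 4051  # elected (pledged) delegates, as quoted in the article
clinton = 1428        # Clinton's elected delegates as of April 25, 2016
sanders = 1151        # Sanders' elected delegates as of April 25, 2016

# Delegates still to be awarded in the upcoming contests.
remaining = TOTAL_ELECTED - clinton - sanders

# Smallest whole number that is more than half of the elected delegates.
majority = TOTAL_ELECTED // 2 + 1

# What Sanders still needs, and the share of the remaining delegates that implies.
needed = majority - sanders
share = needed / remaining

print(f"Sanders needs {needed} of {remaining} remaining delegates ({share:.1%})")
```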

For this analysis we can factor out the much-maligned superdelegates. Even though Clinton has amassed 516 of these unbound delegates (versus Sanders’ 39), these delegates are not technically obliged to vote for her and could switch allegiance at any time. We’ll just assume that these delegates will follow tradition and vote for whoever is leading in elected delegates when voting is over. The target for both candidates is therefore 2,026 elected delegates (a simple majority of the 4,051 elected delegates) after the last contest in Washington D.C. on June 14.

To set up our simulation we first assume that each candidate has an equal likelihood of either over-performing or under-performing their FiveThirtyEight delegate targets. It’s important to note that this assumption already biases our model slightly in Sanders’ favor. So far in the race he has actually been more likely to under-perform his targets, by about 10 points. However, with his continued rise in national polling, I’m going to give Bernie the benefit of the doubt.

The next step is determining by how much a candidate will over- or under-perform his or her target, and with what probability. To do this I looked at all of the primaries thus far and the difference between a candidate’s delegate target and the actual number of delegates won, state by state (all data available on this Google Sheet).

It turns out that the most either candidate has over- or under-performed her/his targets is 25 percentage points. This occurred in Mississippi, when Clinton won 88.9% of the available delegates vs. her target of 63.9% (meaning Sanders had the inverse, -25%), and in Alaska, when Sanders won 81.3% against his target of 56.3%. But the majority of state delegates have been won or lost by much narrower margins. In fact, if we plot how often each margin has occurred, we get something not too far from a normal distribution.

A plot of the frequency distribution of the margin of error between actual results and delegate targets in the primaries so far.

This model is inherently limited because there are only 38 past state and territorial results to include in the sample. But using the data at hand we can presume that the state-by-state margin of error between actual and target delegate percentages follows a normal distribution with a mean of 0.0 and a standard deviation of 8.33 percentage points.

A plot of a normal distribution with marked standard deviation (σ). The area under the curve represents how likely a given outcome is. A random sample would have a 34.1% chance of falling between 0 (the mean) and 1 standard deviation. (Source: Wikipedia)

If you don’t remember your Statistics 101, that means there is a 68.2% chance that a random sample will return a delegate target error margin within ±8.33%, a 95% chance of it being within ±16.66%, and a 99.7% chance of it being within ±25%.
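Those percentages come straight from the normal distribution and can be reproduced with Python's standard library. This is a small sketch; the function name is illustrative, and the only input taken from the article is the 8.33-point standard deviation.

```python
import math

SIGMA = 8.33  # standard deviation in percentage points, as quoted in the article


def prob_within(limit, sigma=SIGMA):
    """Chance that a Normal(0, sigma) sample falls between -limit and +limit."""
    # For a zero-mean normal, P(|X| < limit) = erf(limit / (sigma * sqrt(2))).
    return math.erf(limit / (sigma * math.sqrt(2)))


for k in (1, 2, 3):
    limit = k * SIGMA
    print(f"within ±{limit:.2f} points: {prob_within(limit):.1%}")
```

The three lines it prints land at roughly 68%, 95%, and 99.7%, which is the familiar one, two, and three standard deviation rule.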

The final step is to create a simulation that randomly samples this distribution for each upcoming primary state. I did this using Python. A sample “run” is created by randomly generating delegate wins for each candidate in each state and summing all the results to get a final delegate total. The simulation is looped to repeat this process for thousands of runs.

For this simulation I forced my computer to live through a Groundhog Day-esque nightmare, replaying every primary between April 26 and June 14 a total of 10,000 times.
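For readers who want to try something similar, here is a minimal sketch of how such a simulation could be written. It is not the script used for this article. The contests, delegate counts, target shares, and starting totals below are hypothetical placeholders, not FiveThirtyEight's actual targets or the real standings, and the function names are my own. The only value taken from the article is the 8.33-point standard deviation.

```python
import random

# Hypothetical placeholder contests: (name, delegates at stake, Sanders' target share).
# These are NOT real targets; swap in the actual values to reproduce an analysis.
CONTESTS = [
    ("Contest A", 100, 0.55),
    ("Contest B", 250, 0.45),
    ("Contest C", 60, 0.60),
]
SIGMA = 0.0833  # standard deviation of the target error, as a fraction


def simulate_run(contests, rng, sigma=SIGMA):
    """Return Sanders' simulated delegate haul for one pass through the contests."""
    total = 0
    for _name, delegates, target_share in contests:
        # Sample an error around the target: mean 0, so over- and
        # under-performing are equally likely.
        share = target_share + rng.gauss(0.0, sigma)
        share = min(1.0, max(0.0, share))  # keep the share between 0% and 100%
        total += round(delegates * share)
    return total


def simulate(contests, current_sanders, current_clinton, runs=10_000, seed=42):
    """Return the fraction of runs in which Sanders finishes ahead of Clinton."""
    rng = random.Random(seed)
    available = sum(delegates for _name, delegates, _share in contests)
    wins = 0
    for _ in range(runs):
        sanders_gain = simulate_run(contests, rng)
        sanders = current_sanders + sanders_gain
        # In this two-candidate sketch, Clinton gets whatever Sanders does not.
        clinton = current_clinton + (available - sanders_gain)
        if sanders > clinton:
            wins += 1
    return wins / runs


# Hypothetical starting totals, chosen only to make the toy example interesting.
win_rate = simulate(CONTESTS, current_sanders=300, current_clinton=310)
print(f"Sanders finishes ahead in {win_rate:.1%} of simulated runs")
```

Each contest is sampled independently here, which matches the assumption described earlier but ignores any momentum that carries from one state to the next.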

So Who Wins?

Sorry, Bernie, it doesn’t look good…

Out of all the trial runs I simulated, Senator Sanders finished the race ahead in elected delegates just 2.9% of the time. That of course means Secretary Clinton finished ahead (either by a little or a lot) 97.1% of the time.

Here is the distribution of final elected delegate totals: