State and National Poll Aggregation

Pierre-Antoine Kremp

Twitter: @pakremp

Last update: Thursday, November 3, 3:37pm ET.

This is a Stan implementation of Drew Linzer’s dynamic Bayesian election forecasting model, with some tweaks to incorporate national poll data, pollster house effects, correlated priors on state-by-state election results and correlated polling errors.

For more details on the original model:

Linzer, D. 2013. “Dynamic Bayesian Forecasting of Presidential Elections in the States.” Journal of the American Statistical Association. 108(501): 124-134.

The Stan and R files are available here.

1268 polls available since April 01, 2016 (including 960 state polls and 308 national polls).

Electoral College

Note: the model does not account for the specific electoral vote allocation rules in place in Maine and Nebraska.

National Vote

This graph shows Hillary Clinton’s share of the Clinton and Trump national vote, derived from the weighted average of latent state-by-state vote intentions (using the same state weights as in the 2012 presidential election, adjusted for state adult population growth between 2011 and 2015). In the model (described below), national vote intentions are defined as:

\[\pi^{clinton}[t, US] = \sum_{s \in S} \omega_s \cdot \textrm{logit}^{-1} (\mu_a[t] + \mu_b[t, s])\]

The thick line represents the median of the posterior distribution of national vote intentions; the light blue area shows the 90% credible interval. The thin blue lines represent 100 draws from the posterior distribution. From today to November 8, Hillary Clinton’s share of the national vote is predicted to shrink partially towards the fundamentals-based prior (shown with the dotted black line).

Each national poll (raw numbers, unadjusted for pollster house effects) is represented as a dot (darker dots indicate narrower margins of error). On average, Hillary Clinton’s national poll numbers seem to be running slightly below the level that would be consistent with the latent state-by-state vote intentions.
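The national vote formula above can be sketched in a few lines. This is a Python illustration, not the R/Stan code behind the page; the function names, the three states and all numeric values are hypothetical and chosen only to show the computation.

```python
import math


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def national_share(mu_a, mu_b, weights):
    """Weighted average of state-level Clinton shares on one day.

    mu_a    -- national component (log-odds scale)
    mu_b    -- dict: state -> state-specific component (log-odds scale)
    weights -- dict: state -> omega_s (normalised here so they sum to 1)
    """
    total = sum(weights[s] for s in mu_b)
    return sum(weights[s] / total * inv_logit(mu_a + mu_b[s]) for s in mu_b)


# Hypothetical inputs for three made-up states (not model output).
mu_a = 0.02
mu_b = {"A": 0.30, "B": -0.10, "C": -0.40}
weights = {"A": 0.5, "B": 0.3, "C": 0.2}

share = national_share(mu_a, mu_b, weights)
print(f"National Clinton share: {share:.3f}")
```

Note that the average is taken on the probability scale, after applying the inverse logit to each state, rather than on the log-odds scale.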

State Vote

The following graphs show vote intention by state (with 100 draws from the posterior distribution represented as thin blue lines):

\[\pi^{clinton}[t,s] = \textrm{logit}^{-1} (\mu_a[t] + \mu_b[t, s])\]

States are sorted by predicted Clinton score on election day.

Current Vote Intentions and Forecast By State

State-by-State Probabilities

Map

Pollster House Effects

Most pro-Clinton polls:

| Poll Origin | Median | P05 | P95 |
|---|---|---|---|
| Saint Leo University | 2.6 | 1.4 | 4.0 |
| Public Religion Research Institute | 1.8 | 0.7 | 2.9 |
| AP | 1.7 | 0.4 | 3.1 |
| Michigan State University | 1.7 | -0.3 | 3.8 |
| RABA Research | 1.7 | 0.5 | 3.0 |
| GQR | 1.4 | 0.4 | 2.4 |
| ICITIZEN | 1.4 | 0.2 | 2.7 |
| McClatchy | 1.4 | 0.1 | 2.9 |
| Baldwin Wallace University | 1.2 | -0.6 | 3.2 |
| Siena | 1.2 | -0.2 | 2.7 |

Most pro-Trump polls:

| Poll Origin | Median | P05 | P95 |
|---|---|---|---|
| Rasmussen | -2.4 | -3.0 | -1.8 |
| UPI | -1.9 | -2.4 | -1.3 |
| Remington Research Group | -1.8 | -2.7 | -0.9 |
| Clout Research | -1.7 | -3.7 | 0.2 |
| Hampton University | -1.7 | -3.2 | -0.2 |
| PPIC | -1.7 | -3.3 | -0.3 |
| Emerson College Polling Society | -1.6 | -3.1 | -0.1 |
| IBD | -1.6 | -2.7 | -0.6 |
| InsideSources | -1.6 | -3.6 | 0.3 |
| Dixie Strategies | -1.5 | -3.1 | -0.1 |

Discrepancy between national polls and weighted average of state polls

Data

The runmodel.R R script downloads state and national polls from the HuffPost Pollster website as .csv files before processing the data.

The model ignores third-party candidates and undecided voters. I restrict each poll’s sample to respondents declaring vote intentions for Clinton or Trump, so that \(N = N^{clinton} + N^{trump}\). (This is problematic for Utah.)

When multiple polls are available by the same pollster, on the same date, and for the same state, I pick polls of likely voters rather than registered voters, and polls for which \(N^{clinton} + N^{trump}\) is the smallest (assuming that these are poll questions in which respondents are given the option to choose a third-party candidate, rather than questions in which respondents are only asked to choose between the two leading candidates).

Polls by the same pollster and of the same state with partially overlapping dates are dropped so that only non-overlapping polls are retained, starting from the most recent poll.

To account for the fact that polls can be conducted over several days, I set the poll date to the midpoint between the day the poll started and the day it ended.
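Two of the preprocessing steps described above, restricting a poll to its Clinton and Trump respondents and dating it at its midpoint, can be sketched as follows. This is a Python illustration and not the contents of runmodel.R; the helper names, the rounding conventions (nearest integer for counts, rounding down for the midpoint) and the example poll are assumptions.

```python
from datetime import date, timedelta


def two_party_counts(n_respondents, pct_clinton, pct_trump):
    """Return (N_clinton, N_trump, N), keeping only two-party respondents.

    Percentages are on a 0-100 scale; rounding to whole respondents is an
    assumption made for this sketch.
    """
    n_clinton = round(n_respondents * pct_clinton / 100)
    n_trump = round(n_respondents * pct_trump / 100)
    return n_clinton, n_trump, n_clinton + n_trump


def poll_midpoint(start, end):
    """Midpoint between the first and last field day (rounded down)."""
    return start + timedelta(days=(end - start).days // 2)


# Hypothetical poll: 1000 respondents, 45% Clinton, 41% Trump,
# in the field from October 30 to November 3, 2016.
n_clinton, n_trump, n = two_party_counts(1000, 45, 41)
mid = poll_midpoint(date(2016, 10, 30), date(2016, 11, 3))
print(n_clinton, n_trump, n, mid)
```

In this made-up example the 14% of respondents who chose neither candidate are dropped, so the effective sample size falls from 1000 to 860.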

Model

The model is in the file state and national polls.stan. It has a backward component, which aggregates poll history to derive unobserved latent vote intentions; and a forward component, which predicts how these unobserved latent vote intentions will evolve until election day. The backward and forward components are linked through priors about vote intention evolution: in each state, latent vote intentions follow a reverse random walk in which vote intentions “start” on election day \(T\) and evolve in random steps (correlated across states) as we go back in time. The starting point of the reverse random walk is the final state of vote intentions, which is assigned a reasonable prior, based on the Time-for-Change, fundamentals-based electoral prediction model. The model reconciles the history of state and national polls with prior beliefs about final election results and about how vote intentions evolve.

Backward Component: Poll Aggregation

For each poll \(i\), the number of respondents declaring they intended to vote for Hillary Clinton \(N^{clinton}_i\) is drawn from a binomial distribution:

\[ N^{clinton}_i \sim \textrm{Binomial}(N_i, \pi^{clinton}_i) \]

where \(N_i\) is the poll sample size, and \(\pi^{clinton}_i\) is the share of the Clinton vote for this poll. The model treats national and state polls differently.

State polls

If poll \(i\) is a state poll, I use a day/state/pollster multilevel model:

\[\textrm{logit} (\pi^{clinton}_i) = \mu_a[t_i] + \mu_b[t_i, s_i] + \mu_c[p_i] + u_i + e[s_i]\]

This model simply decomposes the log-odds of reported vote intentions towards Hillary Clinton \(\pi^{clinton}_i\) into a national component, shared across all states (\(\mu_a\)), a state-specific component (\(\mu_b\)), a pollster house effect (\(\mu_c\)), a poll-specific measurement noise term (\(u\)), and a polling error term (\(e\)) shared across all polls of the state (the higher \(e\), the more polls overestimate Hillary Clinton’s true score).
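To make the state-poll decomposition concrete, here is a minimal Python sketch that builds a poll's expected Clinton share from the five additive terms and then simulates a binomial poll result. It is not the Stan model itself; the function names and every numeric value are hypothetical.

```python
import math
import random


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def state_poll_share(mu_a, mu_b, mu_c, u, e):
    """Expected Clinton share in one state poll (all inputs in log-odds)."""
    return inv_logit(mu_a + mu_b + mu_c + u + e)


def simulate_poll(n, p, rng):
    """Draw N_clinton ~ Binomial(n, p) as a sum of Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


# Hypothetical components: a slightly pro-Clinton state, a pollster with a
# pro-Trump house effect, a little noise, and no state polling error.
p = state_poll_share(mu_a=0.0, mu_b=0.08, mu_c=-0.05, u=0.01, e=0.0)

rng = random.Random(2016)  # fixed seed so the sketch is reproducible
n_clinton = simulate_poll(800, p, rng)
print(f"expected share {p:.3f}, simulated Clinton respondents {n_clinton}")
```

Because the terms add on the log-odds scale, a fixed house effect moves the reported share by a little less in lopsided states than in states near 50%.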
On the day of the last available poll \(t_{last}\), the national component \(\mu_a[t_{last}]\) is set to zero, so that the predicted share of the Clinton vote in state \(s\) (net of pollster house effects and measurement noise) after that date and until election day \(T\) is:

\[\pi^{clinton}_{ts} = \textrm{logit}^{-1} (\mu_b[t, s])\]

To reduce the number of parameters, the model only takes weekly values for \(\mu_b\), so that:

\[\mu_b[t, s] = \mu_b^{weekly}[w_t, s]\]

where \(w_t\) is the week of day \(t\).

National polls

If poll \(i\) is a national poll, I use the same multilevel approach (with random intercepts for pollster house effects \(\mu_c\)) but I add a little tweak: the share of the Clinton vote in a national poll should also reflect the weighted average of state-by-state scores at the time of the poll. I model the share of vote intentions in national polls in the following way:

\[\textrm{logit} (\pi^{clinton}_i) = \textrm{logit}\left( \sum_{s \in \{1 \dots S\}} \omega_s \cdot \textrm{logit}^{-1} (\mu_a[t_i] + \mu_b^{weekly}[w_{t_i}, s] + e[s]) \right) + \alpha + \mu_c[p_i] + u_i\]

where \(\omega_s\) represents the share of state \(s\) in the total votes of the set of polled states \(1 \dots S\) (based on 2012 turnout numbers adjusted for adult population growth in each state between 2011 and 2015).

The \(\alpha\) parameter corrects for possible discrepancies between national polls and the weighted average of state polls. Possible sources of discrepancies may include:

the fact that when polls are not available for all states, polled states can be on average more blue or more red than the country as a whole (not a problem since the first 50-state Washington Post/SurveyMonkey poll in early September);

changes in state weights since 2012;

any possible (time-invariant) bias in national polls relative to state polls.

The idea is that while national poll levels may be off and generally not very indicative of the state of the race, national poll changes may contain valuable information to update \(\mu_a\) and (to a lesser extent) \(\mu_b\) parameters.

How vote intentions evolve

In order to smooth out vote intentions by state and obtain latent vote intentions on dates when no polls were conducted, I use two reverse random walk priors for \(\mu_a\) and \(\mu_b^{weekly}\) from \(t_{last}\) to April 1:

\[\mu_b^{weekly}[w_t-1, s] \sim \textrm{Normal}(\mu_b^{weekly}[w_t, s], \sigma_b \cdot \sqrt{7})\]

\[\mu_a[t-1] \sim \textrm{Normal}(\mu_a[t], \sigma_a)\]

Both \(\sigma_a\) and \(\sigma_b\) are given uniform priors between 0 and 0.05. Their posterior marginal distributions are shown below. The median day-to-day total standard deviation of vote intentions is about 0.5%. The model seems to find that most of the changes in latent vote intentions are attributable to national swings rather than state-specific swings (national swings account on average for about 92% of the total day-to-day variance).

Forward Component: Vote Intention Forecast

Final outcome

I use a multivariate normal distribution for the prior of the final outcome. Its mean is based on the Time-for-Change model, which predicts that Hillary Clinton should receive 48.6% of the national vote (based on Q2 GDP figures, the current President’s approval rating and number of terms). The prior expects state final scores to remain on average centered around \(48.6\% + \delta_s\), where \(\delta_s\) is the excess Obama performance relative to the national vote in 2012 in state \(s\).
\[\mu_b[T, 1 \dots S] \sim \textrm{Multivariate Normal}(\textrm{logit} (0.486 + \delta_{1 \dots S}), \mathbf{\Sigma})\]

For the covariance matrix \(\mathbf{\Sigma}\), I set the variance to 0.05 and the covariance to 0.025 for all states and pairs of states, which corresponds to a correlation coefficient of 0.5 across states. This prior is relatively imprecise as to the expected final scores in any given state; for example, in a state like Virginia, which Obama won with 52% of the vote in 2012 (a score identical to his national score), Hillary Clinton is expected to get 48.6% of the vote, with a 95% certainty that her score will not fall below 38% or exceed 59%.
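The Virginia interval quoted above follows directly from the prior's mean and variance. A short Python check, assuming \(\delta_s = 0\) for Virginia (as the text implies) and a normal 95% interval of \(\pm 1.96\) standard deviations on the log-odds scale; the variable names are illustrative:

```python
import math


def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


PRIOR_VARIANCE = 0.05      # diagonal of Sigma, from the text
PRIOR_COVARIANCE = 0.025   # off-diagonal of Sigma, from the text

prior_mean = logit(0.486)            # delta_s = 0 assumed for Virginia
prior_sd = math.sqrt(PRIOR_VARIANCE)

low = inv_logit(prior_mean - 1.96 * prior_sd)
high = inv_logit(prior_mean + 1.96 * prior_sd)
correlation = PRIOR_COVARIANCE / PRIOR_VARIANCE

print(f"95% prior interval: {100 * low:.0f}% to {100 * high:.0f}%")
print(f"implied correlation across states: {correlation}")
```

The interval is slightly asymmetric around 48.6% because it is symmetric on the log-odds scale, not on the probability scale.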

State scores are also expected to be correlated with each other. For example, according to the prior (before looking at polling data), there is only a 3.4% chance that Hillary Clinton will perform worse in Virginia than in Texas. If the priors were independent, this unlikely event could happen with a 10% probability.

The covariance matrix implies that the correlation between the 2012 state scores and 2016 state priors is expected to be about 0.94 (as opposed to 0.89 if covariances were set to zero). The simulated distribution of correlations between state priors and 2012 scores is in line with observed correlations of state scores with previous election results since 1988 [http://election.princeton.edu/2016/06/02/the-realignment-myth/].

To put it differently, the model does not have a very precise prior about final scores, but it does assume that most of this uncertainty is attributable to national-level swings in vote intentions.

How vote intentions evolve

From election day to the date of the latest available poll \(t_{last}\), vote intentions by state “start” at \(\mu_b[T,s]\) and follow a random walk with correlated steps across states:

\[\mu_b^{weekly}[w_t-1, 1 \dots S] \sim \textrm{Multivariate Normal}(\mu_b^{weekly}[w_t, 1 \dots S], \mathbf{\Sigma_b^{walk}})\]

I set \(\mathbf{\Sigma_b^{walk}}\) so that all variances equal \(0.015^2 \times 7\) and all covariances equal 0.00118 (\(\rho =\) 0.75). This implies a 0.4% standard deviation in daily vote intention changes in a state where Hillary Clinton’s score is close to 50%. To put it differently, the prior is 95% confident that Hillary Clinton’s score in any given state where she is currently polling around 50% should not move up or down by more than 1.6% over the remaining 5 days until the election.
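The figures in the random-walk paragraph above can be reproduced with a few lines of arithmetic. The Python sketch below assumes independent daily steps with a log-odds standard deviation of 0.015 (so the variance grows linearly with the number of days) and evaluates the result for a state sitting at exactly 50%; the variable names are illustrative.

```python
import math


def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


DAILY_SD = 0.015   # log-odds standard deviation of one daily step
RHO = 0.75         # correlation of steps across states
DAYS_LEFT = 5      # November 3 to November 8

weekly_var = DAILY_SD ** 2 * 7   # variance of one weekly step
weekly_cov = RHO * weekly_var    # covariance between any two states

# Daily standard deviation in percentage points for a state at 50%.
daily_sd_pct = 100 * (inv_logit(DAILY_SD) - 0.5)

# 95% bound on the total move over the remaining days, for a state at 50%.
drift_sd = DAILY_SD * math.sqrt(DAYS_LEFT)
move_95_pct = 100 * (inv_logit(1.96 * drift_sd) - 0.5)

print(f"weekly covariance: {weekly_cov:.5f}")
print(f"daily sd: {daily_sd_pct:.1f} points, 95% move: {move_95_pct:.1f} points")
```

Near 50% the inverse logit has slope 1/4, which is why a log-odds standard deviation of 0.015 corresponds to roughly 0.4 percentage points.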
Poll house effects

Each pollster \(p\) can be biased towards Clinton or Trump:

\[\mu_c[p] \sim \textrm{Normal}(0, \sigma_c)\]

\[\sigma_c \sim \textrm{Uniform}(0, 0.1)\]

Discrepancy between national polls and the average of state polls

I give the \(\alpha\) parameter a prior centered around the observed distance of polled state voters from the national vote in 2012 (this was useful until early September, when lots of solid red states had still not been polled and the average polled state voter was more pro-Clinton than the average US voter):

\[\bar{\delta_S} = \sum_{s \in \{1 \dots S\}} \omega_s \cdot \pi^{obama'12}_s - \pi^{obama'12}\]

\[\alpha \sim \textrm{Normal}(\textrm{logit} (\bar{\delta_S}), 0.2)\]

Measurement noise

The measurement noise term \(u_i\) is normally distributed around zero, with standard deviation \(\sigma_u^{national}\) for national polls, and \(\sigma_u^{state}\) for state polls. I give both standard deviations a uniform prior between 0 and 0.10.

\[\sigma_u^{national} \sim \textrm{Uniform}(0, 0.1)\]

\[\sigma_u^{state} \sim \textrm{Uniform}(0, 0.1)\]

Polling error

To account for the possibility that polls might be off on average, even after adjusting for pollster house effects, the model includes a polling error term \(e[s]\) shared by all polls of the same state. For example, the presence of an unexpectedly large share of Trump voters (undetected by the polls) in a given state would translate into large positive \(e\) values for that state. This polling error will remain unknown until election day; however, it can be included in the likelihood of the model as an unidentified random parameter that increases the uncertainty in the posterior distribution of \(\mu_a\) and \(\mu_b\).
Because I expect polling errors to be correlated across states, I use a multivariate normal distribution:

\[e \sim \textrm{Multivariate Normal}(0, \mathbf{\Sigma_e})\]

To construct \(\mathbf{\Sigma_e}\), I set the variance to \(0.04^2\) and the covariance to \(0.7 \times 0.04^2 \approx 0.00112\); this corresponds to a standard deviation of about 1 percentage point for a state in which Clinton’s score is close to 50% (or a 95% certainty that polls are not off by more than 2 percentage points either way), and a 0.7 correlation of polling errors across states.
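As an illustration of the polling error prior, the Python sketch below builds an equicorrelated covariance matrix from a log-odds standard deviation of 0.04 and a correlation of 0.7, converts the standard deviation to percentage points for a state at 50%, and draws correlated errors. The shared-factor sampling scheme is one standard way to obtain equicorrelated normal draws and is an assumption of this sketch, not necessarily how the Stan model does it; all function names are hypothetical.

```python
import math
import random


def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def equicorrelated_cov(n_states, sd, rho):
    """Covariance matrix with sd^2 on the diagonal and rho*sd^2 elsewhere."""
    return [[sd ** 2 if i == j else rho * sd ** 2 for j in range(n_states)]
            for i in range(n_states)]


def draw_errors(n_states, sd, rho, rng):
    """One draw of correlated state errors via a shared national factor."""
    common = rng.gauss(0, 1)
    return [sd * (math.sqrt(rho) * common + math.sqrt(1 - rho) * rng.gauss(0, 1))
            for _ in range(n_states)]


def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


SD, RHO = 0.04, 0.7
sigma_e = equicorrelated_cov(3, SD, RHO)

sd_points = 100 * (inv_logit(SD) - 0.5)            # about 1 point at 50%
bound_points = 100 * (inv_logit(1.96 * SD) - 0.5)  # about 2 points at 50%

rng = random.Random(0)  # fixed seed for reproducibility
draws = [draw_errors(2, SD, RHO, rng) for _ in range(20000)]
emp_rho = correlation([d[0] for d in draws], [d[1] for d in draws])
print(f"sd {sd_points:.2f} pts, 95% bound {bound_points:.2f} pts, "
      f"empirical correlation {emp_rho:.2f}")
```

With a correlation this high, most of the polling error is effectively a single national miss shared by every state, with smaller state-specific deviations around it.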

Recently added polls

| Entry Date | Source | State | % Clinton / (Clinton + Trump) | % Trump / (Clinton + Trump) | N (Clinton + Trump) |
|---|---|---|---|---|---|
| 2016-11-03 | ABC | – | 51.1 | 48.9 | 1074 |
| 2016-11-03 | CBS | – | 51.6 | 48.4 | 1213 |
| 2016-11-03 | IBD | – | 50.0 | 50.0 | 763 |
| 2016-11-03 | Lucid | – | 53.0 | 47.0 | 714 |
| 2016-11-03 | Rasmussen | – | 48.3 | 51.7 | 1305 |
| 2016-11-03 | UPI | – | 50.5 | 49.5 | 1289 |
| 2016-11-03 | SurveyMonkey | AK | 42.7 | 57.3 | 248 |
| 2016-11-03 | SurveyMonkey | AL | 39.8 | 60.2 | 546 |
| 2016-11-03 | SurveyMonkey | AR | 40.4 | 59.6 | 528 |
| 2016-11-03 | U of Arkansas | AR | 37.8 | 62.2 | 480 |
| 2016-11-03 | Saguaro | AZ | 50.6 | 49.4 | 1984 |
| 2016-11-03 | SurveyMonkey | AZ | 50.0 | 50.0 | 1256 |
| 2016-11-03 | Field | CA | 61.6 | 38.4 | 1548 |
| 2016-11-03 | SurveyMonkey | CA | 65.9 | 34.1 | 1969 |
| 2016-11-03 | SurveyMonkey | CO | 52.4 | 47.6 | 1370 |
| 2016-11-03 | U Colorado Boulder | CO | 56.4 | 43.6 | 783 |
| 2016-11-03 | SurveyMonkey | CT | 57.8 | 42.2 | 678 |
| 2016-11-03 | SurveyMonkey | DE | 58.1 | 41.9 | 341 |
| 2016-11-03 | Dixie Strategies | FL | 47.7 | 52.3 | 614 |
| 2016-11-03 | Opinion Savvy | FL | 52.1 | 47.9 | 567 |
| 2016-11-03 | SurveyMonkey | FL | 51.6 | 48.4 | 2640 |
| 2016-11-03 | SurveyMonkey | GA | 50.0 | 50.0 | 2450 |
| 2016-11-03 | SurveyMonkey | HI | 62.7 | 37.3 | 352 |
| 2016-11-03 | SurveyMonkey | IA | 44.0 | 56.0 | 1030 |
| 2016-11-03 | SurveyMonkey | ID | 38.0 | 62.0 | 349 |
| 2016-11-03 | SurveyMonkey | IL | 59.6 | 40.4 | 887 |
| 2016-11-03 | SurveyMonkey | IN | 41.4 | 58.6 | 687 |
| 2016-11-03 | SurveyMonkey | KS | 43.5 | 56.5 | 955 |
| 2016-11-03 | SurveyMonkey | KY | 38.5 | 61.5 | 578 |
| 2016-11-03 | SurveyMonkey | LA | 42.0 | 58.0 | 480 |
| 2016-11-03 | SurveyMonkey | MA | 66.3 | 33.7 | 755 |
| 2016-11-03 | SurveyMonkey | MD | 70.8 | 29.2 | 687 |
| 2016-11-03 | SurveyMonkey | ME | 55.3 | 44.7 | 450 |
| 2016-11-03 | SurveyMonkey | MI | 51.7 | 48.3 | 1734 |
| 2016-11-03 | SurveyMonkey | MN | 56.0 | 44.0 | 780 |
| 2016-11-03 | SurveyMonkey | MO | 44.8 | 55.2 | 673 |
| 2016-11-03 | SurveyMonkey | MS | 45.5 | 54.5 | 595 |
| 2016-11-03 | SurveyMonkey | MT | 40.0 | 60.0 | 320 |
| 2016-11-03 | SurveyMonkey | NC | 54.4 | 45.6 | 1697 |
| 2016-11-03 | SurveyMonkey | ND | 36.0 | 64.0 | 224 |
| 2016-11-03 | SurveyMonkey | NE | 40.5 | 59.5 | 503 |
| 2016-11-03 | ARG | NH | 47.3 | 52.7 | 546 |
| 2016-11-03 | MassINC | NH | 49.4 | 50.6 | 395 |
| 2016-11-03 | Suffolk | NH | 50.0 | 50.0 | 420 |
| 2016-11-03 | SurveyMonkey | NH | 56.0 | 44.0 | 553 |
| 2016-11-03 | SurveyMonkey | NJ | 58.7 | 41.3 | 757 |
| 2016-11-03 | SurveyMonkey | NM | 53.3 | 46.7 | 615 |
| 2016-11-03 | SurveyMonkey | NV | 49.4 | 50.6 | 815 |
| 2016-11-03 | SurveyMonkey | NY | 64.4 | 35.6 | 1580 |
| 2016-11-03 | SurveyMonkey | OH | 47.1 | 52.9 | 1503 |
| 2016-11-03 | SurveyMonkey | OK | 36.4 | 63.6 | 649 |
| 2016-11-03 | SurveyMonkey | OR | 58.6 | 41.4 | 813 |
| 2016-11-03 | SurveyMonkey | PA | 52.8 | 47.2 | 1938 |
| 2016-11-03 | SurveyMonkey | RI | 55.8 | 44.2 | 336 |
| 2016-11-03 | SurveyMonkey | SC | 47.8 | 52.2 | 1351 |
| 2016-11-03 | SurveyMonkey | SD | 35.4 | 64.6 | 283 |
| 2016-11-03 | SurveyMonkey | TN | 45.1 | 54.9 | 824 |
| 2016-11-03 | Dixie Strategies | TX | 45.8 | 54.2 | 647 |
| 2016-11-03 | Dixie Strategies | TX | 42.9 | 57.1 | 892 |
| 2016-11-03 | SurveyMonkey | TX | 47.7 | 52.3 | 1802 |
| 2016-11-03 | Monmouth University | UT | 45.6 | 54.4 | 273 |
| 2016-11-03 | SurveyMonkey | UT | 47.6 | 52.4 | 786 |
| 2016-11-03 | SurveyMonkey | VA | 54.5 | 45.5 | 1709 |
| 2016-11-03 | SurveyMonkey | VT | 69.6 | 30.4 | 335 |
| 2016-11-03 | SurveyMonkey | WA | 61.4 | 38.6 | 710 |
| 2016-11-03 | SurveyMonkey | WI | 51.2 | 48.8 | 1093 |
| 2016-11-03 | SurveyMonkey | WV | 31.4 | 68.6 | 284 |
| 2016-11-02 | YouGov | – | 51.7 | 48.3 | 1097 |
| 2016-11-02 | CNN | AZ | 47.3 | 52.7 | 715 |
| 2016-11-02 | Lucid | AZ | 49.4 | 50.6 | 924 |
| 2016-11-02 | Lucid | CO | 54.3 | 45.7 | 787 |
| 2016-11-02 | CNN | FL | 51.0 | 49.0 | 742 |
| 2016-11-02 | Quinnipiac | FL | 50.5 | 49.5 | 570 |
| 2016-11-02 | TargetSmart | FL | 54.5 | 45.5 | 632 |
| 2016-11-02 | SurveyUSA | KS | 43.7 | 56.3 | 543 |
| 2016-11-02 | Lucid | LA | 48.2 | 51.8 | 510 |
| 2016-11-02 | Michigan State University | MI | 62.7 | 37.3 | 560 |
| 2016-11-02 | DFM Research | MO | 44.7 | 55.3 | 432 |
| 2016-11-02 | Quinnipiac | NC | 51.6 | 48.4 | 548 |
| 2016-11-02 | Lucid | NM | 55.7 | 44.3 | 397 |
| 2016-11-02 | CNN | NV | 46.7 | 53.3 | 727 |
| 2016-11-02 | JMC Analytics | NV | 50.0 | 50.0 | 540 |
| 2016-11-02 | Lucid | NV | 54.2 | 45.8 | 740 |
| 2016-11-02 | Quinnipiac | OH | 47.1 | 52.9 | 512 |
| 2016-11-02 | DHM Research | OR | 54.7 | 45.3 | 378 |
| 2016-11-02 | CNN | PA | 52.2 | 47.8 | 735 |
| 2016-11-02 | Monmouth University | PA | 52.2 | 47.8 | 371 |
| 2016-11-02 | Quinnipiac | PA | 52.7 | 47.3 | 557 |
| 2016-11-02 | Susquehanna | PA | 51.1 | 48.9 | 599 |
| 2016-11-02 | Hampton University | VA | 48.2 | 51.8 | 682 |
| 2016-11-02 | Winthrop University | VA | 53.0 | 47.0 | 591 |
| 2016-11-02 | Marquette Law School | WI | 53.5 | 46.5 | 1079 |