KEY POINTS FROM THIS ARTICLE

— Two approaches to forecasting — one formally statistical, one rigorous yet flexible handicapping — produce different tools that we can use to evaluate the battle for control of the U.S. House in the 2018 midterms.

— The Crystal Ball and other political handicappers use a “qualitative” method to generate ratings of individual seats using election news, candidate evaluation, and some hard data. Others use quantitative modeling to produce probabilities of how likely it is for one party or the other to win each House seat.

— The quantitative model described below is more bullish on the Democrats’ House prospects than the Crystal Ball’s race ratings, but both indicate considerable uncertainty about which party will win a House majority this November.

— Those following this year’s House elections would be wise to take into account both qualitative race ratings, like those done by the Crystal Ball, as well as quantitative models, like the model described below, when assessing the race for the House.

Introduction

To understand the differences between quantitative, data-driven predictions and those made from traditional, data-influenced handicapping, one should look to the names of two websites: Sabato’s Crystal Ball at the University of Virginia Center for Politics, and my blog, The Crosstab. One is a reference to the soothsayer, a fortune-teller who stares into a glass ball and derives the fate of an event by evaluating some known and unknown factors. The other is a reference to the contingency table, a common tool in survey research that breaks down responses to one question by subsets of responses to another. The names of my website and this one are coincidentally descriptive of the ways in which our predictive methods differ.

In forecasting outcomes there are both harms and benefits to these two approaches. This piece evaluates those differences in the context of the 2018 midterm elections and delivers some much-needed attention to that which is common, not just contrasting, between the two.

Before I begin, allow me to highlight a guiding principle of this article. The contrast between my quantitative and the Center for Politics’ “qualitative” handicapping of the 2018 House elections is mostly the difference between continuous predictions (those that assign a chance to outcomes between 100% win or 100% loss) and binary predictions (those that assign either “win” or “lose” to a party). Whereas the method I employ uses data to generate a probability of victory for Democrats in all 435 U.S. congressional districts and their chance of winning the majority of seats, the Crystal Ball’s method tells you that either Democrats or Republicans are favored to win a particular race (technically, the Crystal Ball’s method is a discrete method, with set degrees of certainty on both sides, but is more proximate to binary than continuous prediction). Keep this difference between continuous (even better: distributional) and binary/discrete projections in mind.

This piece is broken up into three sections. In the first, I go through what the two methods take into account when projecting race outcomes. I then describe the differences between what they tell us, and in the final section I break down the differences in the models’ past and current forecasts.

Inputs

My method and the Crystal Ball’s method of predicting the 2018 midterm elections to the United States House of Representatives are both processes that (1) take in information, called inputs; (2) do something with that information; and (3) spit out other information, called outputs. If you remember ninth grade mathematics, such processes are called functions (Dr. Seuss had a youthful explanation of functions that I remember from high school pre-calculus). However, after this rough categorization, the two functions diverge considerably.

My model to forecast the 2018 United States House midterms is a probabilistic statistical model that takes in a variety of inputs and, through four stages, produces outputs. The overall approach was developed by political scientists Joseph Bafumi of Dartmouth College, Robert Erikson of Columbia University, and Christopher Wlezien at the University of Texas at Austin. You can read their paper here. The model performs its estimation in four stages (note that the technical details of my model differ slightly from Bafumi et al.’s methodology):

1. Calculate an estimate of the national environment today:
   1.1. Compute a weighted average of all congressional generic ballot polls taken for the 2018 cycle so far.
   1.2. Compute the average change in post-2016 special elections from the previous Democratic margin in a seat to the margin in the special election.
   1.3. Repeat this for every day of every year going back to 1992.

2. Predict the national environment on Nov. 6, 2018:
   2.1. Use generic ballot polling averages at this point in past cycles…
   2.2. …combined with the average special election swing, again at this point in past cycles…
   2.3. …to generate a prediction of the national vote on election day. The final projection has around a six-point margin of error today.

3. Use a variety of inputs to predict results at the district level:
   3.1. Create a baseline projection for every district by combining:
      (a) The partisan lean of a district (a method developed at FiveThirtyEight that averages a seat’s 2016 Democratic presidential win/loss margin with its 2012 presidential margin, weighted 75%/25% to put more emphasis on the more recent cycle);
      (b) The previous candidate’s margin in the district;
      (c) Candidate-specific variables, like whether an incumbent is running or if one candidate is significantly qualitatively “worse” than the other.
   3.2. Swing this baseline projection of the district the appropriate amount left/right, determined by the projected Democratic margin in the national vote from step 2.3.

4. Simulate 50,000 election outcomes:
   4.1. For each trial, vary the estimated national popular vote randomly according to the margin of error of past predictions of the national vote.
   4.2. Add that error to each seat uniformly (NY-15, the seat where Hillary Clinton did the best in 2016, gets swung just as much as TX-13, the seat where Donald Trump did the best).
   4.3. Vary the forecast Democratic margin in each seat according to error that is correlated between districts. This accounts for the chance that our forecasts have more error in red than blue districts, white than minority districts, educated than uneducated districts, etc.
   4.4. Add up the number of seats Democrats win.
   4.5. Repeat this 50,000 times.
   4.6. The percentage chance that Democrats have of winning the election is simply the number of times they win 218 seats or more (a bare majority) divided by the total number of trials. Each seat has its own win probability generated the exact same way (by keeping a list of seats won/lost in each trial).
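To make the simulation stage concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the model's actual code: the function names, the toy district margins, and the error sizes are all hypothetical, and the district-level error is drawn independently for each seat rather than being correlated across similar districts as the real model describes.

```python
import random


def partisan_lean(margin_2016, margin_2012):
    """Blend two presidential margins 75%/25%, favoring the more recent cycle."""
    return 0.75 * margin_2016 + 0.25 * margin_2012


def simulate_house(district_margins, national_sd, district_sd, majority,
                   n_trials=50_000, seed=0):
    """Monte Carlo sketch of the simulation stage.

    district_margins: projected Democratic margins (points), already swung
        by the projected national vote.
    national_sd, district_sd: ASSUMED standard deviations of the national
        and district-level errors; the article does not publish these.
    Returns (chance of a Democratic majority, per-seat win probabilities).
    """
    rng = random.Random(seed)
    seat_wins = [0] * len(district_margins)
    majorities = 0
    for _ in range(n_trials):
        # One national error per trial, applied uniformly to every seat.
        national_error = rng.gauss(0.0, national_sd)
        seats_won = 0
        for i, margin in enumerate(district_margins):
            # Simplification: independent district noise, not correlated noise.
            simulated = margin + national_error + rng.gauss(0.0, district_sd)
            if simulated > 0:
                seat_wins[i] += 1
                seats_won += 1
        if seats_won >= majority:
            majorities += 1
    return majorities / n_trials, [w / n_trials for w in seat_wins]


# Toy example: five made-up districts, where three seats make a majority.
toy_margins = [12.0, 3.0, -1.0, -6.0, -20.0]
p_majority, seat_probs = simulate_house(
    toy_margins, national_sd=3.0, district_sd=5.0, majority=3, n_trials=5_000
)
print(p_majority, seat_probs)
```

For the real House, `district_margins` would hold 435 values and `majority` would be 218. Because the national error is shared by every seat within a trial, seat outcomes move together, which is what widens the range of possible seat totals.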

On any given day, the numbers generated by my forecasting model represent the best predictions we have of Democratic win margins at the national level and in each House district, and the chance that those projections will err. Remember, these projections are continuous, with outcomes occurring along a distribution of possibilities and each seat having a specific probability of victory. In the end, I produce a dataset of continuous vote shares and win probabilities, ranging from 0% to 100%, for every House seat in the nation and the nation itself.

The process by which the UVA Center for Politics generates its race ratings is different, however, and does not follow such a strict, formal statistical methodology.

The analysts at UVA consider myriad data, some quantitative and some not, and often on different scales (how, for example, do you compare a previous Democratic win margin to the following headline: “A gay Republican, the child abuse he sanctioned, and the homophobia used to defend him”?) to come up with their projections. The factors they weigh include electoral history, polling, candidate quality, modeling, and district news. The ratings ultimately reflect their judgment about the likelihood of one side or the other prevailing in a given contest.

Outputs

As discussed, these two approaches to forecasting — one formally statistical, one rigorous yet flexible handicapping — produce different tools that we use to understand upcoming elections. Whereas I rate each seat on a scale from 0% to 100% for the likelihood that it is won by Democrats, the team at UVA produce ratings that lie on a discrete scale: from Safe, Likely, and Lean Republican, to Toss-up, to Lean, Likely, and Safe Democratic. To evaluate what these differences might mean in November 2018, it is useful to explore what they meant last time around.

The Past: Accuracy of seat ratings and forecasting models

The big question everyone wants answered is: How likely is it, say, for a “Lean Republican” seat to be won by a Democrat? What about a Likely, or better yet, Safe, Republican seat? One would hope that races rated differently would convey different win probabilities for Democrats and Republicans. Indeed, they do.

To determine how well the UVA Center for Politics election ratings matched election outcomes over time, I combined their historical ratings going back to 2004 with actual results in House districts (made available by the MIT Election and Data Science Lab). The results are shown below.

Figure 1: Accuracy of race ratings by category

Notes: This figure stacks each district over Democrats’ actual November vote margin in the seat depending on its race rating from Sabato’s Crystal Ball. Ratings for all elections since 2004 are included where available.

You can see that there are certainly differences between UVA’s Safe, Likely, and Lean categories on both sides of the aisle; safer districts are less likely to see large upsets, and seats that lean toward either party are sometimes, though not frequently, won by the opposition. Overall, the ratings are relatively well calibrated, and 89% of rated seats not categorized as Toss-ups end up being won by the party that is favored to win.

If I want to compare the Center for Politics rating with my own ratings, however, I need to put them on the same continuous scale. I do so by simply taking the average Democratic win margin and raw probability of victory for every category of race rating. The figure below shows the results of this analysis.

Figure 2: Converting House race ratings to probabilities of victory

Notes: This figure graphs the implied Democratic win margin and win probability for each race rating category. To get an implied forecast of Democratic win margin and win probability, I calculated the average win margin, standard deviation, and percent of the time Democrats win for each race-rating category over all House elections since 2004. Points on the graph are sized by the number of contests in that category.

In the left panel of the graphic, I show that each category has an identifiable point estimate and band of uncertainty (or confidence interval) surrounding it. Lean Democratic seats are won, on average, with a six-point Democratic margin, for example; Lean Republican seats give GOP candidates a seven-point average margin; Likely Democratic seats give Democratic candidates a 14-point average margin, and so forth.

Each of these categories also has a corresponding Democratic probability of victory for the seats placed within. In Lean Democratic seats, Democrats win the elections 78% of the time; Lean Republican: 18%; Likely D: 95%; Likely R: 3%; Safe D: 99%; Safe R: 0%; and Toss-up districts: 59%. These values are plotted on the right of the preceding figure, with the size of each point showing the number of seats earning that designation over the years.
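The conversion from rating categories to implied margins and win probabilities can be sketched as a simple group-by. The snippet below is an illustration only: the function name and the toy records are hypothetical (they are not the historical Crystal Ball data), and the use of the population standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_ratings(records):
    """Summarize past results by rating category.

    records: iterable of (rating, dem_margin) pairs, one per rated contest,
        where dem_margin is the actual Democratic margin in points.
    Returns {rating: {"n", "mean_margin", "sd_margin", "dem_win_prob"}}.
    """
    by_rating = defaultdict(list)
    for rating, dem_margin in records:
        by_rating[rating].append(dem_margin)

    summary = {}
    for rating, margins in by_rating.items():
        summary[rating] = {
            "n": len(margins),
            "mean_margin": mean(margins),
            "sd_margin": pstdev(margins),  # assumption: population SD
            # Share of contests in this category that Democrats actually won.
            "dem_win_prob": sum(m > 0 for m in margins) / len(margins),
        }
    return summary


# Made-up records purely for illustration.
toy_records = [
    ("Lean D", 6.0), ("Lean D", -2.0), ("Lean D", 8.0), ("Lean D", 4.0),
    ("Lean R", -9.0), ("Lean R", 1.0), ("Lean R", -7.0),
]
for rating, stats in summarize_ratings(toy_records).items():
    print(rating, stats)
```

Run on the real ratings history, the `dem_win_prob` field would correspond to the percentages quoted above, and `n` to the point sizes in the figure.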

It is apparent that qualitative seat ratings have provided good forecasts of Democratic win margins and probabilities in the past, but how do they compare to the predictions generated by my formal statistical model? Below, I recreate the probability-by-ratings figures for seat ratings generated by re-running my 2018 U.S. House midterms model for the 2016 House elections. Specifically, seat ratings are assigned for each seat according to the forecast win probabilities for each seat: if both parties have a win probability below 60%, the seat is considered a Toss-up; Lean Democratic/Republican seats are those with a win probability below 80%. Likely seats are those with win probabilities below 95%. Seats rated as greater than 95% likely for either party are considered Safe R/D.
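The thresholds just described amount to a small lookup on the favored party's win probability. Here is a sketch; the function name is hypothetical, and the treatment of a probability of exactly 95% (counted as Safe here) is an assumption, since the text leaves that boundary open.

```python
def rating_from_probability(p_dem):
    """Map a Democratic win probability (0 to 1) to a discrete race rating."""
    if not 0.0 <= p_dem <= 1.0:
        raise ValueError("p_dem must be between 0 and 1")

    # Probability of whichever party is favored in the seat.
    favored = max(p_dem, 1.0 - p_dem)
    if favored < 0.60:
        return "Toss-up"

    party = "Democratic" if p_dem > 0.5 else "Republican"
    if favored < 0.80:
        tier = "Lean"
    elif favored < 0.95:
        tier = "Likely"
    else:
        tier = "Safe"  # assumption: exactly 95% is treated as Safe
    return f"{tier} {party}"


for p in (0.03, 0.22, 0.36, 0.55, 0.90, 0.99):
    print(p, rating_from_probability(p))
```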

Figure 3: Probabilities of Democratic victory based on House race ratings

Notes: This figure shows the actual Democratic probability of winning for seats rated as Safe R/D, Likely R/D, Lean R/D, or Toss-up, with the rating derived from its forecast win probability. Points sized by the number of contests in that category.

What first stands out is how pro-Democratic the Toss-up category is. However, as there are only six seats in this category, this error is caused by the Democrats winning one more seat than they ought to (four out of six instead of three out of six) — a likely insignificant difference in the long term.

What is more important is the much higher proportion of Safe to Lean/Likely seats in the quantitative forecast than in the qualitative ratings. It should be noted that this could be partly due to the Center for Politics’ omission of ratings for some lopsided seats.

My forecast predicted the 384 House elections that took place in states that did not redraw their congressional boundaries prior to the 2016 election with 98.7% accuracy, getting just five non-Toss-up seats incorrect and projecting Democrats to win three fewer seats in aggregate than they actually did in November. Two of 28 Likely Republican seats were won by Democrats, one of eight Lean Republican seats was won by a Democrat, and two of 12 Lean Democratic seats were won by Republican candidates. The predictions for all 435 seats erred 10 times, for a total error rate of about 2%.

It’s worth noting that the seat with the biggest (10 percentage points) error in my re-run 2016 forecast was AZ-1, which the Center for Politics correctly predicted would be won by now-Rep. Tom O’Halleran (D) instead of Paul Babeu (R), the candidate referenced in the “A gay Republican…” headline cited above. The UVA projections picked other Republican seats as Democratic pickups that I did not, and ended up over-shooting the Democrats’ number of seats by seven last cycle, while I low-balled them by three seats. I pick the AZ-1 example as it displays the biggest weakness of quantitative forecasting: the difficulty of accounting for deficits in candidate quality in a data-driven fashion. However, this is not as large an issue as one might think, given the overall record of the modified Bafumi et al. method.

But what if neither method alone is the correct answer? For the sake of completeness, if you had combined the forecasts with a method that accounts for the uncertainty in both projections (using a Bayesian update to the normal distribution of outcomes, which is certainly a good, though not the most sophisticated, way to do so), you would have predicted the 2016 elections spot on, with Democrats projected to win 194 seats in the U.S. House, though six individual projections were wrong and canceled each other out.

The figure and table below depict the quantitative forecast, ratings-based forecast, blend of the two, and final result in each of the top 20 closest districts in the 2016 U.S. House elections.

Figure 4 and Table 1: Blended House forecasts in 20 closest 2016 House races

Notes: This figure shows estimates for the 2016 House elections according to different methods. The “blended” forecast is a Bayesian update to the normal distribution with the quantitative forecast being used as the prior, the seat rating being used as the likelihood, and the resulting posterior estimate and credible interval being used as the final prediction and margin of error.

Above, seats with lines that cross zero are the ones where the blended prediction “missed” the result, though it should be noted that all the outcomes fell within the margin of error.
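For readers who want to see what such a blend looks like, the sketch below shows a standard normal-normal update with the quantitative forecast as the prior and the rating-implied margin as the likelihood. The exact computation behind the figure is not published here, so treat this as one plausible reading: the function names and the example numbers are hypothetical, and both standard deviations are assumed to be known.

```python
from math import sqrt


def blend_normal(prior_mean, prior_sd, rating_mean, rating_sd):
    """Precision-weighted combination of two normal estimates of a margin.

    prior_mean, prior_sd: quantitative forecast of the Democratic margin.
    rating_mean, rating_sd: margin implied by the seat's rating category.
    Returns (posterior mean, posterior standard deviation).
    """
    prior_precision = 1.0 / prior_sd ** 2
    rating_precision = 1.0 / rating_sd ** 2
    post_precision = prior_precision + rating_precision
    post_mean = (prior_mean * prior_precision
                 + rating_mean * rating_precision) / post_precision
    return post_mean, sqrt(1.0 / post_precision)


def credible_interval(mean, sd, z=1.96):
    """Approximate 95% interval for a normal posterior."""
    return mean - z * sd, mean + z * sd


# Hypothetical seat: the model says D+2 (sd 5); the rating implies D+6 (sd 8).
post_mean, post_sd = blend_normal(2.0, 5.0, 6.0, 8.0)
print(post_mean, post_sd, credible_interval(post_mean, post_sd))
```

The estimate with the smaller standard deviation pulls the blend toward itself, and the blended interval is always narrower than either input's. An interval that spans zero corresponds to a seat where the blended call is uncertain.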

It should be noted that although the combination of both ratings and the forecasting model is a useful tool for understanding U.S. House elections, the predictiveness of the blended predictions is worse earlier in the cycle. This is due to seat ratings moving less predictably than other indicators (like national congressional polling) and producing more noise in the estimates in, say, June of the election year, rather than late October or November. In other words, this method only works better than the forecast alone when the final race ratings for House seats are made available.

So, what do we know at this point in the piece — and in the 2018 midterms cycle?

First, the data are clear that discrete seat ratings perform ever-so-slightly worse than the data-driven quantitative forecasts, though both did correctly predict the outcome of the House majority in 2016. Second, it’s evident that probabilities are slightly more certain in the quantitative forecast than in the qualitative ratings and that the seat ratings give less room for flexible probability within categories. Third, I find that errors in the quantitative analysis are sometimes controlled for in the discrete ratings, though errors exist elsewhere to cancel out some of these gains. Finally, a blend of both measurements provides the best seat-by-seat understanding of the midterm elections.

Given the track record of both approaches, it is worthwhile to consider the following: what are the differences in the outputs of both models today?

The Present: Different forecasts for different seats

Given the different methods employed by myself and the team at the University of Virginia Center for Politics and the differences in forecasts in 2016, one should expect that there are variations in the predictions for November 2018 as well.

Indeed, there are (some) large differences in our forecasts. A portion of the discrepancies can be explained by the inability of qualitative ratings today to adjust for movement in the national environment (which my forecast does, and which pushes expectations toward the party out of power), and some can be explained by my quantitative method not taking the quality of some districts’ nominees well into account. Other differences may reflect real disagreement between the methodologies.

The table below details the 15 districts where my forecasts and the UVA qualitative forecasts disagree: in other words, where one of us says the Democrats/Republicans are more likely to pick up a seat, and the other says Republicans/Democrats are the more probable victors.

Table 2: Differences between Crystal Ball and Crosstab House forecast Democratic win probabilities

However, many of these differences arise in seats that either of us rates as a Toss-up, accentuating differences between forecasts for contests where we’re actually quite uncertain about the outcome of the races. Here’s what that table looks like without Toss-up seats.

Table 3: Differences between Crystal Ball and Crosstab House forecast Democratic win probabilities, excluding Toss-ups

As you can see, where it matters most (in calling districts for either party), the methods are arriving at roughly the same conclusions. It is in Toss-up seats where the biggest discrepancies arise.

However, what really counts is the probability of victory assigned to either party — in each seat and in the nation as a whole — and the range of outcomes in the upcoming midterm elections.

Next, I answer the question of how many seats Democrats are likely to win according to the seat probabilities assigned by both methods. What’s going to happen in November?

The Future: Who’s going to win the majority?

Above all else, the main advantage of the quantitative model is the ability to simulate thousands of possible elections (some where Democrats do better, some where Republicans do, within the statistical range of error we’ve observed in the past) and generate a final probability for the chance that Democrats win a majority of seats. In their current form, the House ratings here at Sabato’s Crystal Ball (and elsewhere) are not able to compete with that (and perhaps they should not try to!).

However, the cool thing about the quantitative abstraction of the UVA seat ratings detailed above is that it comes with everything we need to be able to plug it into the simulation phase of the formal model. This way, we can account for the inherent error in the ratings while producing a nationwide probability that Democrats may win the majority of seats in the U.S. House of Representatives.

Better yet, instead of doing exactly what my model does (which produced some strange-looking seat outcomes due to inflated margins of error transitioning from the seat rating to average win margin), I can skip that step and work directly with the win probabilities derived from the past accuracy of ratings. For every trial, I model the expected change in probability from varying a district’s forecast vote margin according to (1) national error and (2) correlated seat error. I use the inverse normal distribution to make sure that probabilities are adjusted properly (seats rated as Toss-ups will see larger shifts in probability than Safe seats, for example).
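One way to read the inverse-normal adjustment is as a shift on the z-score scale: convert a seat's win probability to a z-score, move it by the simulated error, and convert back. The sketch below uses Python's standard-library `statistics.NormalDist`; the function name, the `margin_sd` scaling parameter, and the example error sizes are assumptions made for illustration, not values from the model.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def shift_probability(p_dem, margin_shift, margin_sd, eps=1e-6):
    """Adjust a win probability for a shift in the underlying vote margin.

    p_dem: baseline Democratic win probability for the seat.
    margin_shift: simulated error in points (national plus seat-level).
    margin_sd: ASSUMED standard deviation of seat-level forecast error,
        used to express the shift in z-score units.
    """
    # inv_cdf is undefined at exactly 0 or 1, so clip (Safe R is rated 0%).
    p = min(max(p_dem, eps), 1.0 - eps)
    z = _STD_NORMAL.inv_cdf(p)
    return _STD_NORMAL.cdf(z + margin_shift / margin_sd)


# Rating-implied probabilities quoted earlier in the article.
toss_up, safe_d = 0.59, 0.99
for label, p in (("Toss-up", toss_up), ("Safe D", safe_d)):
    print(label, p, shift_probability(p, margin_shift=-3.0, margin_sd=6.0))
```

The same three-point swing moves the Toss-up probability far more than the Safe one, which is the behavior described above.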

Akin to the visualizations on my forecasting homepage, below I graph the range of possible seat outcomes for both models. The taller the line, the more likely it is that the Democrats win that number of seats.

Figure 5: Differing House forecasts

Notes: This figure shows the distribution of possible number of Democratic seats after the 2018 U.S. House midterms according to the same simulation method applied to two sets of seat forecasts. Forecasts according to ratings and probabilities generated on June 7, 2018.

After simulating 50,000 trial elections with the seat-level win probabilities assigned by the UVA Center for Politics, we have our answer: Democrats are much more favored in the quantitative forecast (close to a 60% chance of winning the majority of seats) than in the discrete ratings (about a 40% chance). The expected number of districts won by Democrats is 10 seats larger in my own forecast than in the ratings at this website.

Why the difference? The discrepancy is explained by two major factors:

First, the quantitative method has the advantage of being able to look into the future, estimating where the national environment is likely to be in November this year by tracing movement in past election cycles from June until election day.

Second, there is a higher number of Lean and Likely Republican seats in my continuous, probabilistic data than in the UVA ratings. This explains the large right tail of possible Democratic seats graphed in blue above; there are more seats that Democrats can pick up in a large blue “wave” in my data, pushing expectations to the right.

You can see these differences in the cross tabulation (see what I did there?) below. Although my ratings and UVA’s agree on 118 Safe Republican seats, I rate 38 GOP-held districts as Likely or Lean that are rated as Safe here. I give a Toss-up designation (between a 40% and 60% chance of Democratic victory) to nine seats that they rate as Lean or Likely.

Table 4: Comparing forecasts
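A cross tabulation of this kind simply counts how many seats fall into each pair of ratings. Here is a minimal sketch using only the standard library; the function name and the seat labels are hypothetical, and the toy ratings are invented for illustration.

```python
from collections import Counter


def cross_tabulate(ratings_a, ratings_b):
    """Count seats by (rating from source A, rating from source B).

    ratings_a, ratings_b: dicts mapping the same seat names to ratings.
    """
    if ratings_a.keys() != ratings_b.keys():
        raise ValueError("both sources must rate the same set of seats")
    return Counter((ratings_a[seat], ratings_b[seat]) for seat in ratings_a)


# Invented seats and ratings, purely for illustration.
model_ratings = {"Seat A": "Safe R", "Seat B": "Lean R",
                 "Seat C": "Toss-up", "Seat D": "Likely R"}
handicapper_ratings = {"Seat A": "Safe R", "Seat B": "Safe R",
                       "Seat C": "Lean R", "Seat D": "Safe R"}

table = cross_tabulate(model_ratings, handicapper_ratings)
for (model, handicapper), count in sorted(table.items()):
    print(f"model={model:9s} handicapper={handicapper:9s} seats={count}")
```

Cells where the two labels match are the agreements; everything else is a seat the two methods place in different categories.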

This again shows the main limitation of the ratings-based approach, at least in trying to plug it into a purely quantitative forecasting method: the discrete scale of probabilistic ratings. Since each seat is put into a category that has a historical probability of voting for a Democrat or Republican, seats are not allowed to vary in the probabilities assigned to them. Even if one Lean Republican seat looks more competitive than another Lean Republican seat, they are both placed in a bucket that elects Democrats 18% of the time.

On the other hand, my forecast generates a specific win margin and win probability for each individual seat, so NC-02 and PA-10 can have their own specific probabilities assigned to them (36% and 22%, respectively) on a continuous scale. All of these differences add up to produce a forecast that is more optimistic about Democrats’ future in the U.S. House of Representatives.

Closing thoughts

This piece has reviewed the methods, history, and current projections of two different prediction methods — a crystal ball and a statistical model — for this fall’s midterm elections to the U.S. House of Representatives. I have reviewed differences in probabilities between the measures, accuracy in the 2016 elections, and even prognosticated about the future of the House according to the two varying processes. What is clear is that the two vary considerably in some parts, and are similar in others.

No one approach is God’s gift to election handicapping, however. Both my forecasts and those of the UVA Center for Politics have erred in the past; some of those errors cancel each other out and some do not, and a combination of both projections performs best in predicting the final partisan breakdown of seats. Indeed, even within methods, there is variation; the Cook Political Report and Inside Elections race ratings, as well as those published by media outlets like CNN, all have disagreements (sometimes large ones) about ratings in some key districts. A new statistical forecasting model published by my soon-to-be colleagues at The Economist also has differences with my own personal method. While these differences can individually sometimes mislead, the truth frequently lies between them all. Apart from the technical details, there are also important differences in how we conceptually utilize continuous and discrete forecasts (some scientific, some journalistic) that this article does not discuss.

As we head into the heat of the summer of this 2018 midterm cycle, pundits, politicos, and voters alike should take note of the past, present, and future differences between quantitative forecasting methods and the typical race-based handicapping. If the past holds true, the former will do well at producing precise probabilities for each U.S. House seat based on its individual characteristics, and the latter will do well at reducing the large deviations from the forecast that typically arise from issues with candidate quality and rapidly changing districts.

Whatever method you pick (if you’ve learned anything from this piece, you ought to pick both), rest assured that the two are well-tested methods that will get us 90-98% of the way to foreseeing what will happen on November 6, 2018. It’s the remaining 2-10% that will make or break House forecasting this fall.