In this post, we will undertake a thorough exploratory data analysis of the crosstabs of national Democratic primary polling data, in order to better understand the demographic characteristics of the Bernie Sanders 2016 coalition and the Hillary Clinton 2016 coalition. To do this, we will systematically compare each candidate’s support within demographic groups to that candidate’s overall support, across a large number of polls.

This is a prelude to further analysis I intend to do of the Democratic primary in future posts. I have collected polling data, census data, and exit poll data, and plan to use this data to analyze the current state of the Democratic primary race, and what it would look like under certain interesting scenarios, such as if Sanders pulls even with Clinton in national polls, or if either Sanders or Clinton has a narrow national lead. With this information, we can form realistic expectations as to how well each candidate needs to do in order to win across different states as well as across different congressional districts (where many delegates are allocated). With that information in hand, we will be able to see more clearly that Hillary Clinton’s presidential aspirations will not be doomed if she loses Iowa or New Hampshire (or even both). Similarly, Bernie Sanders will hardly be doomed if he fails to win South Carolina.

It is possible that Bernie Sanders won’t gain much more support in the polls nationally than he currently has. If so, Hillary Clinton will win the Democratic nomination. However, particularly if Sanders comes close or wins in Iowa and New Hampshire, he may gain momentum and the race could narrow nationally as the primary process goes on. If so, the race will become a question of states and delegates. It will be possible for either candidate to win nationally, even while losing by large margins in some areas and among some demographic groups.

In the 2008 Democratic primaries, Barack Obama fairly consistently won certain demographic groups, across different states and at different points in the primary process. Young voters, college-educated voters, higher income voters, African Americans, independents, etc. tended to support Obama. This was “the Obama coalition.” Likewise, Hillary Clinton fairly consistently won certain other demographic groups, across different states and at different points in the primary process. Older voters, less educated voters, lower income voters, Hispanics, Democrats, etc. tended to support Clinton. This was “the Clinton coalition.” Both of these coalitions had fairly stable demographic correlates, and this is why it was possible to predict the outcome in states that voted later in the primary season using the results from states that voted earlier in the primary season.

Now, according to lazy reporting and casual analysis, Bernie Sanders 2016 is just the same thing as the Obama 2008 coalition, except without the African Americans. And according to the same analysis, Hillary Clinton 2016 is just the same thing as the Hillary 2008 coalition plus African Americans. The reality is more complicated. Both Bernie Sanders and Hillary Clinton are drawing portions of their support from both the Obama ‘08 and the Clinton ‘08 coalitions.

Bernie Sanders in 2016 is a different candidate running in a different time, with different issues of public concern, than Barack Obama in 2008. In some ways, and to some groups of people, Obama had more appeal in 2008 than Sanders has now. But in other ways, and to other groups of people, Sanders has more appeal now than Obama had in 2008.

Hillary Clinton is also — while obviously the same individual — not the same candidate in 2016 as she was in 2008. She is running on different policy positions, and is getting support from different groups of voters. And her coalition is not the same as her 2008 coalition, and cannot simply be understood as being the Hillary 2008 coalition plus African Americans.

So, what are the Hillary Clinton 2016 and Bernie Sanders 2016 coalitions, and how do they compare to the Obama 2008 and Clinton 2008 democratic primary coalitions?

There is a lot of information freely available that can help answer this question. One source of this sort of information is crosstabs from national polls. Unfortunately, this information is often ignored. What’s worse, in cases when reporters and political analysts do not ignore this information outright, they usually pull out a few data points from a single poll in isolation. These single data points can often vary wildly from poll to poll, because there are higher (often much higher) margins of error for sub-samples. When you add this to methodological differences between pollsters, a single crosstab from a single poll showing (for example) how much support Hillary Clinton or Bernie Sanders is getting from Hispanic voters may have little true informational content. So there is something to be gained by taking a large number of polls and more systematically comparing their crosstabs in order to get a better picture of how support for Bernie Sanders and Hillary Clinton actually varies by various demographic categories. Similarly to the way in which the Huffington Post Pollster or Real Clear Politics have time series polling averages, it can be useful to collect cross-sectional polling averages.
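To make the point about sub-sample margins of error concrete, here is a minimal sketch using the standard 95% margin-of-error formula for a simple random sample. The sample sizes are hypothetical and are not taken from any poll discussed in this post, and real polls carry design effects that would widen these ranges further.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error, in percentage points, for a
    proportion p estimated from a simple random sample of size n."""
    return 100 * z * math.sqrt(p * (1 - p) / n)


# Hypothetical sizes: a full Democratic primary sample and a small sub-sample.
full_sample = margin_of_error(0.5, 400)
sub_sample = margin_of_error(0.5, 50)

print(f"full sample (n=400): +/- {full_sample:.1f} points")
print(f"sub-sample  (n=50):  +/- {sub_sample:.1f} points")
```

With numbers like these, a single crosstab can easily move by ten points or more from poll to poll through sampling noise alone, which is the motivation for averaging across polls.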

So, I collected crosstab data for every national poll I could find from November 1 to the present, for the demographic categories of region, party identification, race, gender, age, ideology, education, income, and neighborhood type. During this period, national Democratic primary polling averages have been fairly stable, at least up until the beginning of January, when Bernie Sanders began his ongoing surge in the polls:

[Embedded chart: national Democratic primary polling averages]

So, the presumption behind this analysis is that while there have been some shifts in support between candidates, those shifts in support are unlikely to be wildly out of proportion with the pre-existing bases of support for each candidate. We are presuming that there is such a thing as a “Sanders 2016 Coalition” and a “Clinton 2016 Coalition” and that they are at least relatively stable in their internal composition. That is, when Sanders gained support in the polls in January, we are presuming that all of his gain did not come from just one demographic group, but rather was split in similar proportions to the people who supported him in November and December.

That presumption, of course, is not literally true. There are changes in relative voter preferences by demographic categories across time. There is also individual level heterogeneity lurking beneath every aggregated demographic category. But the judgement here that underlies this analysis is that the informational and predictive benefit that can be gained by ignoring (or de-emphasizing) those changes is outweighed by the informational and predictive benefit that can be gained by taking into account a large number of polls over a longer period of time.

And so that is what we will proceed to do.

Vote by Region

Before we get down to business, I will explain how to interpret the polling average charts. First, the polls that have crosstab data are listed; that should be straightforward. Below that, we show the averages across polls:

The “Average by Region Crosstab” lists the average support for each candidate in each region, from all the polls for which there is crosstab data. So, for example, Clinton’s support in the Northeast averages 50% in these polls. That is the average of 43, 60, 48, 36, 54, 61, and 50.

The “Overall Average” is the average nationwide support from all the polls for which there is data in a given region. This can differ slightly for different crosstabs (regions) because some crosstabs are reported by more polls than others. An example of that here is the CNN poll, which listed crosstab data only for the South and not for other regions. So when comparing regional performance in the South to national performance, we want to include the CNN poll, but for other regions we want to leave the CNN poll out of the overall average.

Note: Since the CNN poll does not have crosstab data for regions other than the South, one might argue that it should not be included. It is indeed misleading, in these circumstances, to directly compare the absolute regional averages with one another. However, what we are really primarily interested in is the cross-sectional variation between overall national support for each candidate and their support in each region, not the absolute level of each candidate’s regional support (which will vary, along with national support, depending on which polls happen to have crosstab data). Including the CNN poll, even though the crosstab is incomplete, tends to add cross-sectional information about the difference between regional and national support, so in these sorts of cases I included the poll.

Finally, the “Difference from Overall Average” is the “Average by Region Crosstab” minus the “Overall Average.” This measures the average amount, across polls, by which a candidate overperforms or underperforms in a given crosstab category.
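As a rough illustration of how these three averages fit together, here is a minimal sketch. The seven Northeast values are the ones quoted above for Clinton; the poll labels, the national toplines, and the South values are hypothetical placeholders, and `crosstab_summary` is a name made up for this example rather than anything from the original analysis.

```python
from statistics import mean

# Clinton's numbers in eight polls. Only the Northeast values come from the
# post; labels, toplines ("overall"), and South values are hypothetical.
polls = [
    {"poll": "A", "overall": 46, "Northeast": 43, "South": 50},
    {"poll": "B", "overall": 59, "Northeast": 60, "South": 63},
    {"poll": "C", "overall": 50, "Northeast": 48, "South": 55},
    {"poll": "D", "overall": 40, "Northeast": 36, "South": 45},
    {"poll": "E", "overall": 55, "Northeast": 54, "South": 59},
    {"poll": "F", "overall": 60, "Northeast": 61, "South": 64},
    {"poll": "G", "overall": 52, "Northeast": 50, "South": 56},
    # Like the CNN poll: a regional crosstab for the South only.
    {"poll": "H", "overall": 58, "South": 60},
]


def crosstab_summary(polls, category):
    """Return (crosstab average, overall average, difference), using only
    the polls that report the given crosstab category."""
    reporting = [p for p in polls if category in p]
    crosstab_avg = mean(p[category] for p in reporting)
    overall_avg = mean(p["overall"] for p in reporting)
    return crosstab_avg, overall_avg, crosstab_avg - overall_avg


for region in ("Northeast", "South"):
    avg, overall, diff = crosstab_summary(polls, region)
    print(f"{region}: crosstab {avg:.1f}, overall {overall:.1f}, difference {diff:+.1f}")
```

Poll H enters the overall average for the South but not for the Northeast, which is why the “Overall Average” can differ from one region to the next.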

It is important to understand that the absolute levels of these polling averages are not meant to reflect current levels of support. Instead, our purpose is to look at the differences between how a candidate polls overall, and how that candidate polls among various demographic sub-groups. If we can reasonably surmise that those differences will tend to be at least fairly stable over time, then we can get some approximate idea of how candidates will tend to fare among various demographic subgroups, if we know how well they are polling nationally overall. As one example, the polls with regional crosstabs average 52% Clinton, 32% Sanders overall. And in the Midwest region crosstabs for those polls, Clinton averages 50% (2% less than her average overall) and Sanders averages 35% (3% better than his average overall). If we want to estimate the state of the race in the Midwest now, we can start from the current polling average (52.5% Clinton - 36.3% Sanders from Pollster). Then subtract 2% from Clinton and add 3% to Sanders, and we can estimate that if the national race stands at about 52.5% Clinton - 36.3% Sanders, then the race in the Midwest region probably stands at about 50.5% Clinton - 39.3% Sanders.

As we will see a bit later, applying a uniform swing in that way is probably not the best way to go about estimating the current levels of candidate support for a particular demographic crosstab like the midwest region, but you can use it as a simple rule of thumb.
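The rule of thumb above can be written out as a few lines of code. This is only a sketch of the uniform swing arithmetic using the Midwest figures already quoted; the function name `uniform_swing` is an assumption made for illustration.

```python
def uniform_swing(national, difference):
    """Shift each candidate's national share by their average difference
    from the overall average in one crosstab category."""
    return {name: national[name] + difference[name] for name in national}


# Midwest differences from the overall average, from the regional crosstabs.
midwest_difference = {"Clinton": -2.0, "Sanders": +3.0}

# The Pollster national average quoted above.
national_now = {"Clinton": 52.5, "Sanders": 36.3}

midwest_estimate = uniform_swing(national_now, midwest_difference)
for name, share in midwest_estimate.items():
    print(f"{name}: {share:.1f}%")
```

One known weakness of a purely additive swing is that, for groups far from the national average, it can produce shares below 0% or above 100% once the national numbers move enough.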

There are significant regional differences in the distribution of Sanders’ and Clinton’s support across polls. The regions which deviate the most strongly from the national average are the South (more pro-Hillary than average) and the West (more pro-Bernie than average), but there are also differences in the Northeast and the Midwest.

In the Northeast, both Sanders and Clinton have had very slightly lower support than their national average (-2% for Sanders and -1% for Clinton).

The Midwest appears to be shaping up as a marginally strong region for Sanders (+3% relative to his national average), and a region of marginal relative weakness for Clinton (-2% relative to her national average).

The South is clearly Clinton’s strongest region, and the South is the region that differs most strongly from the rest of the country. On average, Clinton out-polls her national average by +4% in the South, while Sanders tends to be about 7% below his national average.

In the West, Sanders substantially over-polls his national average by +6%, while Clinton under-polls hers by -3%. So the West is a region of relative strength for Sanders.

Vote by Party Identification

There are clear differences in candidate support by party identification — the Sanders 2016 coalition tilts towards self-identified Independents, while the Clinton 2016 coalition tilts towards self-identified Democrats. On average, Clinton out-polls her average among self-identified Democrats by 4%, while Sanders under-polls his average among self-identified Democrats by about 3%. The other side of the coin is that Sanders substantially out-polls his average among self-identified Independents by about 11%, while Clinton does significantly worse than her average with Independents (-14%).

One important note is that we are talking about party identification, which is not the same thing as party registration. In 2008, about 24% of Democratic primary voters nationally were self-identified Independents or Republicans. So any poll that does not allow for at least some self-identified Independents and Republicans will miss part of the primary electorate, even in states with closed primaries that allow only registered Democrats to vote. As one of many examples, New Mexico has a closed primary in which only registered Democrats are eligible to participate, but nonetheless 11% of voters in the 2008 exit polls were self-identified Independents and 3% were self-identified Republicans.

Vote by Race

Probably the most discussed and most controversial demographic category is candidate support by race. We have heard many bold claims about the preferences of voters in different racial groups. So how do those claims hold up when we systematically compare a multitude of national polls? First, polls with race crosstabs differ. Some combine all non-whites into one aggregated category, while others separate out African Americans and Hispanics (and sometimes other races).

Sanders out-polls his national average among white voters by about +6%, while Clinton under-polls her national average among white voters by about -6%. The flip side of that is that Clinton tends to over-poll her average among non-white voters (taken as an aggregated group) by about +9%, and Sanders has tended to under-poll among non-whites by about -11%. In 2008, about 33% of Democratic primary voters nationally were non-white (according to exit polls). As shares of the whole primary electorate, about 17% were African American, 11% were Hispanic, and 4% were “Other” (including Asians, more than one race, and Native Americans).

However, as soon as we look at polls which disaggregate “non-whites” into African Americans, Hispanics, and Other, the picture changes, and we begin to see that non-whites are not a monolithic group. The bulk of Clinton’s support in excess of her national average among non-whites comes from one group in particular: African Americans. Among African Americans, Sanders has under-polled his national average by about -15%, while Clinton has over-polled her national average by about +14%.

But among Hispanics, in contrast to among African Americans, both candidates poll much closer to their national averages. Clinton only outperforms her national polling average by +3%, while Sanders only under-performs his national average by -3%. One caveat is that it can be difficult to poll Hispanics, so there is more uncertainty surrounding this average than there would be around other demographic groups. Nonetheless, with 7 polls taken into account here, there is substantial evidence that Hispanics have tended to be less pro-Clinton and more pro-Bernie than have African Americans. In fact, as we will see later, the “path of least resistance” in order for Sanders to win 50% of the vote nationally will most likely involve him winning something around 46% support from Hispanic voters nationally, with Clinton taking the other 54%.

Finally, both candidates over-perform their average slightly among “other” race voters. However, I would not make very much of this. “Other” is a small, highly heterogeneous group, and 3 of the polls with information are from Morning Consult. There is not much information to go on here.

The apparent heterogeneity among non-whites could present a problem for the Clinton campaign’s argument that when the primary moves to more diverse states, she will necessarily prevail. Rather, if national polling averages are accurate, Clinton’s relative strength is more concentrated among African Americans specifically, rather than among non-whites more generally.

Vote by Education

When we come to education, we find that the dominant media narrative is not really supported by the data, and should probably be re-examined. According to the simplistic media narrative, since President Obama fared well among college-educated voters and poorly among non-college educated voters in 2008, then Bernie Sanders should do the same. If anything, on the logic of that media narrative, we should expect the difference in vote by education to be greater in 2016 than it was in 2008. If it were true that Hillary’s 2016 coalition was simply her 2008 coalition plus African Americans and Sanders’ 2016 coalition was simply the Obama 2008 coalition minus African Americans, then since African Americans have benefited from fewer educational opportunities than other voters, this would mean that the Sanders 2016 coalition should consist of an even higher proportion of college educated voters than did the Obama 2008 coalition.

But across 9 national polls since November, that is fairly clearly not the case. So the data seems to contradict that story. Why?

There is only a small difference in vote by education — Clinton over-performs with college educated voters by a measly 1%, while Sanders over-performs with them by a notably small +2% and under-performs his average by -2% with non-college educated voters. Interestingly, both Clinton and Sanders overperform very slightly with college educated voters — which indicates that much of what is really going on here is that more college educated voters are decided, while more of the undecideds are non-college educated.

Vote by Income

Across 8 polls, there is essentially no difference in presidential preference by income. Clinton performs the same as her national average with both higher income and lower income voters, while Sanders does barely better with higher income voters (+1%) and barely worse with lower income voters (-1%) in comparison to his national average.

As was the case with education, this data seems to contradict the dominant media narrative, in a fairly blatant way. According to the dominant media narrative, since Obama substantially overperformed among higher income voters and underperformed among lower income voters in 2008, we should expect the same for Sanders 2016. This casual intuition-based hypothesis is what lies behind the notion that Sanders’ support is particularly concentrated among high income, high education white liberals.

And as was the case with education, if that notion were true, then we should expect that the divergence in vote by income should be even greater for Bernie Sanders in 2016 than it was for Barack Obama in 2008. In 2014, median family income for White non-Hispanic households was $60,256, while it was just $35,398 for African American households. So if the Sanders 2016 coalition is simply the Obama 2008 coalition minus African Americans, then Sanders would rely even more on votes from higher-income voters than did Obama. But again, as with education, that is not what the polling data says. Indeed, the consensus across national polling data says quite the opposite. Instead, Sanders’ support in 2016 is less concentrated among high income voters than was the case for Obama in 2008.

Together with the data on education, this suggests something important about the Bernie 2016 coalition and how it differs from the Obama 2008 coalition. While Obama’s support among white voters was disproportionately concentrated among high income, high education voters, that is not the case with the white voters who are supporting Sanders in 2016. Instead, and unlike Obama, Sanders is also drawing support from the lower income, less educated white voters who supported Hillary Clinton in 2008. This may also explain why (unlike Obama in 2008) Sanders in 2016 is performing at close to his national average with Hispanics: he is doing a relatively better job of appealing to lower income and less educated Hispanic voters than Obama was able to do.

Furthermore, to the degree that Sanders is faring any better with higher income voters than with lower income voters, this is being driven by the Morning Consult polls, which make up 3 of the 8 polls in the sample that provided crosstabs by income. Morning Consult is most likely the lowest quality pollster in the group, and these were the last polls I added. The other 5 polls, on average, have Sanders actually faring a bit better among lower income voters than among higher income voters. But in either case, this is quite different from the Obama 2008 coalition.

Vote by Gender

Across polls, there is fairly consistent evidence of a fairly normal sized gender gap. Among men, Clinton under-polls her national average by about -4%, while Sanders over-polls by about +6%. Among women, Clinton over-polls her average by about +3%, and Sanders under-polls his average by -4%. The fact that the divergence is higher among men than among women simply reflects the fact that women typically make up the majority of Democratic primary voters, which means that the overall average is nearer to the preferences of women than of men. The presence of a gender gap of this sort doesn’t doom either candidate, nor does it particularly favor either Clinton or Sanders.
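The claim that the larger divergence among men simply reflects women being a majority of the electorate can be checked with a little arithmetic. If the overall average is a weighted blend of men and women, the two differences from the overall average pin down the implied share of women. The sketch below is only a rough consistency check: the inputs are the rounded differences quoted above, it ignores differences between polls, and the function name is made up for this example.

```python
def implied_share_of_women(diff_men, diff_women):
    """If overall = w * women + (1 - w) * men, then the differences from the
    overall average satisfy w * diff_women + (1 - w) * diff_men = 0.
    Solving for w gives diff_men / (diff_men - diff_women)."""
    return diff_men / (diff_men - diff_women)


clinton = implied_share_of_women(diff_men=-4, diff_women=+3)
sanders = implied_share_of_women(diff_men=+6, diff_women=-4)

print(f"implied share of women from Clinton's numbers: {clinton:.0%}")
print(f"implied share of women from Sanders' numbers:  {sanders:.0%}")
```

Both candidates’ numbers imply an electorate that is somewhat more than half women, which is consistent with the explanation given above.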

It is also worth noting that Clinton’s relative strengths among women, among African Americans, and among older voters are all interrelated. In part because so many African American men are victims of the mass incarceration system, while others are disenfranchised former felons, a disproportionate share of African American voters are often women. Since women have greater life expectancy than men, seniors tend to be disproportionately women. For these reasons, you will often see that in states that are older than average or more heavily African American than average, a greater share of the electorate will be women.

Vote by Age

To keep age simple, I divided the electorate into two categories — age 18-44 and age 45+. Breaking things down further into age 18-29, 65+, etc would have required more disaggregation, which would reduce the accuracy of these age polling averages. Poll crosstabs report many different age brackets, depending on the poll. For example, some report age 18-29 and 30-44, while others report age categories 18-34, 34-49, etc. Consequently, I had to aggregate (and in some cases disaggregate) data into consistent age blocks, which are weighted by the electorate reported in 2008 Democratic primary exit polls. In a few cases it was necessary to disaggregate age data (as in the case of splitting age 34-49). I did this by assuming that everyone in each age category reported by the pollster voted homogeneously, and that vote shares are in the same proportions as in 2008 Democratic primary exit polls. That means that actual age differences are probably very slightly (though we are talking about decimals) understated. To a lesser extent I had to do this same sort of thing to create comparable aggregated data for other crosstab categories. I did that in the same sort of way.
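Here is a minimal sketch of one reading of that aggregation procedure, in which each reported bracket is assumed to vote homogeneously and a bracket that straddles the age-45 cutoff is split by electorate share. All of the support numbers and electorate shares below are hypothetical placeholders standing in for a real poll and for the 2008 exit poll shares, and the bracket labels are chosen only for illustration.

```python
def weighted_support(brackets):
    """Share-weighted average of (support_pct, electorate_share) pairs."""
    total_share = sum(share for _, share in brackets)
    return sum(pct * share for pct, share in brackets) / total_share


# Hypothetical Sanders support in three reported brackets, one of which
# straddles the age-45 cutoff.
support = {"18-34": 58, "35-49": 40, "50+": 30}

# Hypothetical electorate shares, with the straddling bracket split in two.
share = {"18-34": 0.22, "35-44": 0.19, "45-49": 0.10, "50+": 0.49}

# The 35-49 bracket is assumed to vote homogeneously, so its 40% is applied
# to both the 35-44 slice and the 45-49 slice.
age_18_44 = weighted_support([(support["18-34"], share["18-34"]),
                              (support["35-49"], share["35-44"])])
age_45_plus = weighted_support([(support["35-49"], share["45-49"]),
                                (support["50+"], share["50+"])])

print(f"18-44: {age_18_44:.1f}%   45+: {age_45_plus:.1f}%")
```

Because the straddling bracket contributes the same value to both sides, the gap between the two aggregated groups comes out slightly smaller than the true gap, which matches the understatement noted above.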

After aggregating the data into two consistent age categories across polls, a fairly consistent pattern emerges. Clinton substantially under-polls her national average by about -10% among younger voters (age 18-44), while she over-polls her national average among voters age 45 and up by 7%. Sanders under-polls among age 45 and up by -8%, and over-polls his average by 12% among voters age 18-44.

It might be the case that part of the age gap is caused by younger voters having been more likely to be familiar with Sanders, as a result of more widespread internet use among younger people. Supporting this idea, the age differentials seem to have diminished (while remaining substantial) in state polls in Iowa and New Hampshire, where the campaign is obviously most active.

Whether that is so or not, national polls have shown large differences in support by age since the beginning of November, with the possibility that this may be a bit less in early primary states.

Vote by Ideology

Across polls, Sanders tends to fare better among liberals, while Clinton tends to fare better among moderates and conservatives. Clinton slightly under-polls her national average by -2% among liberals, and very slightly over-polls by +1% among moderates. For Sanders, however, the divergence is a bit bigger — he over-polls his average by +7% among liberals, and under-polls his average by -5% with moderates and conservatives.

Vote by Neighborhood Type

The data by “neighborhood type” is the most limited of all the demographic categories we are analyzing here. There are only 6 polls, from only 3 pollsters. On average, Clinton fares a bit above her average in suburbs (+3%) and a bit below her average in rural areas (-4%), running even with her average in urban areas. Sanders fares slightly above his average in urban areas (+1%), slightly below his average in suburban areas (-1%), and at his average in rural areas.

Since the three Morning Consult polls make up half the polls in the sample, and 3 of the 4 polls for which there is data on rural voters, they tend to skew the results. The two CNN polls and the IBD/TIPP poll on their own suggest that Sanders is somewhat over-polling his national average in urban areas and under-polling his national average in suburban areas.

So the data is less firm here than for other demographic categories, but to the degree that we can be confident in anything, it does generally seem to support the differences we have already found from 2008. In 2008, Obama tended to over-perform in urban and suburban areas, and to under-perform in rural areas. But from this data, it is not clear that there is much of a difference in candidate support by neighborhood type — and if there is, it suggests that in 2016 Clinton may even tend to over-perform her national average in suburbs (and under-perform elsewhere) — where Obama did well with upper income, more educated voters in 2008. Sanders, for his part, may fare relatively better in 2016 in many rural areas where Clinton dominated in 2008 than did Obama.

Conclusion: how the Bernie 2016 and Hillary 2016 Coalitions compare to the Obama 2008 and Hillary 2008 Coalitions

The Bernie 2016 and the Hillary Clinton 2016 coalitions both include portions of the Obama 2008 coalition and other portions of the Hillary 2008 coalition.

It makes for a short, simple soundbite to say that Hillary Clinton’s 2016 coalition is simply her 2008 coalition plus African Americans, and that Bernie Sanders’ coalition is simply Obama’s 2008 coalition minus African Americans. The trouble is, while there is an element of truth to that stereotype, as with most stereotypes, it is not actually supported by the available polling data.

When we systematically gather the available polling data and compare across polls, we can see a number of ways in which the idea that Hillary 2016 = Hillary 2008 + African Americans fails to capture the reality:

While Clinton is winning by a large margin among African Americans, she is not winning by the same sort of 90%-10% margin that Obama was able to achieve in the 2008 primaries. If Clinton had been able to win even 20% or 30% support among African Americans in the post-South Carolina 2008 primaries, things could have turned out differently. In 2016, even though Sanders is […]. While he is clearly not winning the support of most African Americans at this point, it is about vote margins.

Support for Sanders among white voters is not as concentrated among higher education, higher income voters as was the case for Obama. In contrast to Obama in 2008, Sanders’ 2016 support is split fairly evenly between voters without college educations and with college educations, and voters earning less than $50,000 and voters earning more than $50,000.

Clinton in 2016 does not appear to have the same hold over Hispanic voters as was the case in 2008. Sanders only barely under-polls his national average with Hispanics, and Clinton only barely over-polls her national average with Hispanics.

All of these points obviously have to do with race, and with the fact that Obama is African American and Sanders is white. While Barack Obama benefited from extremely high levels of support from African Americans, Hillary Clinton was able to counter that by winning higher levels of support among lower income and lower education white voters and Hispanics, appealing to voters who racially resented Obama.

The media likes to concentrate on which candidate is “winning” which group of voters. The reality is, the outcome of elections depends on the mathematics of vote shares and vote margins, not just on how many groups a candidate “wins.” Though obviously it is necessary to “win” at least some groups of voters, winning does not just depend on that, but also depends on the vote margins among groups that a candidate “loses,” and the shares of the electorate that different groups make up.
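The arithmetic of shares and margins can be sketched in a few lines. The electorate shares below are the approximate 2008 exit poll figures quoted in the race section, with whites taken as the remainder (about 67%); the support levels in both scenarios are hypothetical and are not forecasts.

```python
# Approximate 2008 Democratic primary electorate by race, in percent.
# The rounded figures sum to 99, so the function normalizes by the total.
electorate = {"White": 67, "African American": 17, "Hispanic": 11, "Other": 4}


def national_share(support_by_group, electorate):
    """Electorate-weighted average of a candidate's support across groups."""
    total = sum(electorate.values())
    return sum(electorate[g] * support_by_group[g] for g in electorate) / total


# Hypothetical support for one candidate. The candidate "loses" African
# American voters in both scenarios, but by different margins.
scenario_a = {"White": 55, "African American": 20, "Hispanic": 46, "Other": 50}
scenario_b = {**scenario_a, "African American": 30}

share_a = national_share(scenario_a, electorate)
share_b = national_share(scenario_b, electorate)

print(f"losing African Americans 80-20: {share_a:.1f}% nationally")
print(f"losing African Americans 70-30: {share_b:.1f}% nationally")
```

In both scenarios the candidate loses the same group, yet the national total moves by well over a point, which is why margins among groups a candidate loses matter as much as which groups are won.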

According to the dominant media narrative, the way that Bernie Sanders will win the Democratic nomination — if he were in fact to win it — would be the same way as Obama won in 2008. We currently hear many people saying that in order to win nationally, Sanders needs to do well in Iowa and New Hampshire and convert the momentum into a “win” in South Carolina by winning over African American voters en masse, as Obama did.

I suppose it is possible that this could happen. To some degree, Clinton may be doing well with African American voters just because they are (still) more familiar with her than with Bernie Sanders. But if Bernie Sanders were to win the Democratic nomination, that would not be the most likely way in which it would happen. Indeed, that would be more than Sanders needs to do in order to win.

The path of least resistance to 50% plus one nationally is different for Sanders in 2016 than it was for Obama in 2008, because he is a different candidate, running in a different time, with voters concerned about different issues. That path does not simply mean getting unheard-of masses of new voters to turn out, winning overwhelming margins among young voters, cranking out every last vote from college towns, and winning massively with liberal voters, though it does mean getting some new voters to turn out, doing relatively well with young voters, doing well in college towns, and winning liberal voters. It does not mean winning over African American voters en masse, though it does mean faring better with African American voters than Hillary Clinton fared in 2008. Likewise, the path for Clinton to hold on and win in 2016 is different than the path would have been for her to do so in 2008.

In future posts, we will look more closely at what those paths to victory for each candidate most likely are.