This is the story of an idea for a simple visualisation turning into a very fascinating challenge.

The Idea

There is this beautiful animated data visualisation where you can follow 1000 Americans through what they do in a day. It might not be the most efficient way of visualising this kind of data but it sure is mesmerising and fascinating.

So I had the idea: why not visualise the voting patterns of Germans over the last 20 years in the same way? We follow 1000 points (people) that are representative of the German population and see what the flows of voters between parties look like. In the end I got there, but there were quite a few challenges along the way.

The Data

I am not aware of any representative panel dataset in Germany that tracks individual voting behaviour over time (maybe the SOEP study?). However, the company infratest.dimap (which conducts the exit polls for the German TV channel ARD) estimates the total voter shifts between the parties. As the company doesn’t report these voter-shift statistics in one consistent place (btw. why not, infratest?), I had to find the data for each election after 1998 in individual publications.

Nice visualisations like that only exist for the two most recent elections. Source: Tagesschau

Gathering this data left me with a dyadic dataset containing the estimated total number of voters shifting from one party to another (as well as new voters, deceased voters, and non-voters) for all elections after 1998.

Dyadic dataset of voter shifts

From summary statistics to individual voting behaviour

As we want to show how 1000 representative voters move from party to party over time we need my favourite thing in the world: simulation.

From the dyadic dataset I calculated transition matrices for all elections (with the probability for the change from one state to another).

E.g. the transition matrix for the 2017 election: the rows describe where the voters are coming from (2013) and the columns describe where the voters are going to (2017).
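The step from voter-shift counts to transition probabilities is just a row normalisation. Here is a minimal sketch in Python (not the original code). The function name `transition_matrix`, the `(origin, destination, count)` tuple layout and the toy numbers are all assumptions made for illustration; the numbers are made up and are not the infratest.dimap estimates.

```python
from collections import defaultdict


def transition_matrix(flows):
    """Turn dyadic flow counts into row-normalised transition probabilities.

    flows: iterable of (origin, destination, count) tuples for one election
           (hypothetical layout, assumed for this sketch).
    Returns {origin: {destination: probability}}; each row sums to 1.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for origin, destination, count in flows:
        counts[origin][destination] += count

    matrix = {}
    for origin, row in counts.items():
        total = sum(row.values())
        if total > 0:  # skip origins without any recorded voters
            matrix[origin] = {dest: n / total for dest, n in row.items()}
    return matrix


# Made-up toy numbers, NOT real election data.
toy_flows = [
    ("A", "A", 600), ("A", "B", 300), ("A", "non-voter", 100),
    ("B", "A", 200), ("B", "B", 800),
]
toy_matrix = transition_matrix(toy_flows)
print(toy_matrix["A"]["B"])  # 0.3
```

Each row of the result can be read like a row of the matrix above: given where a voter came from, it lists the probability of every destination.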

With the transition matrices at hand I can simulate the voting behaviour of one person as a Markov chain whose transition matrix changes over time (i.e. with every election). Each individual's voting history is a random walk along this Markov chain. In this case this is quite easy to implement, as the new state after an election is generated by a random draw from an election-specific multinomial distribution that depends on the previous state (i.e. as you can see above: a random person who voted SPD in 2013 has a 52.81% probability of voting SPD again).
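To make the random walk concrete, here is a small Python sketch (again not the original code). It assumes each election's matrix is a dict of the form `{origin: {destination: probability}}`, and it treats any state without a row, such as a hypothetical `"deceased"` state, as absorbing. The names `next_state` and `simulate_person` and the toy matrices are invented for illustration.

```python
import random


def next_state(current, matrix, rng):
    """Draw the state after one election, given the state before it."""
    row = matrix.get(current)
    if row is None:
        # Assumption: states without a row (e.g. "deceased") are absorbing.
        return current
    destinations = list(row)
    weights = [row[d] for d in destinations]
    # One categorical draw, i.e. a multinomial draw with a single trial.
    return rng.choices(destinations, weights=weights, k=1)[0]


def simulate_person(start, matrices, rng):
    """Walk one person through a sequence of election-specific matrices."""
    path = [start]
    for matrix in matrices:
        path.append(next_state(path[-1], matrix, rng))
    return path


# Made-up toy matrices for two elections, NOT real estimates.
toy_matrices = [
    {"A": {"A": 0.7, "B": 0.2, "deceased": 0.1}, "B": {"A": 0.3, "B": 0.7}},
    {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}},
]
rng = random.Random(42)
toy_path = simulate_person("A", toy_matrices, rng)
print(toy_path)
```

Running `simulate_person` once per individual gives the full set of simulated voting journeys.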

To start the simulation I generated an array of 1000 people, distributed across the parties in proportion to the total numbers of voters in 1998.

The starting values for the simulation are 1000 individuals representative of the election result of 1998, plus 203 individuals who will become first-time voters in one of the following elections.
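Splitting 1000 people proportionally rarely gives whole numbers, so some rounding rule is needed. The sketch below uses the largest-remainder method. That is my own choice for illustration and not necessarily how the original array was built. The function name `allocate` and the toy counts are assumptions as well.

```python
def allocate(counts, n):
    """Split n individuals across groups in proportion to counts.

    Uses the largest-remainder method so that the result sums to exactly n.
    """
    total = sum(counts.values())
    quotas = {group: c * n / total for group, c in counts.items()}
    seats = {group: int(q) for group, q in quotas.items()}  # round down first
    leftover = n - sum(seats.values())
    # Hand the remaining individuals to the largest fractional remainders.
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - seats[g], reverse=True)
    for group in by_remainder[:leftover]:
        seats[group] += 1
    return seats


# Made-up toy counts, NOT the 1998 result.
toy_counts = {"A": 1, "B": 1, "C": 1}
toy_seats = allocate(toy_counts, 1000)

# Expand the per-group numbers into one starting state per individual.
toy_population = [group for group, k in toy_seats.items() for _ in range(k)]
print(toy_seats, len(toy_population))
```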

For each election we also need to add new voters, which I added in proportion to the number of voters in that election.

This leaves me with 1203 individuals in total, whose changing party preferences are overall aligned with the “shift of vote” statistics by infratest.dimap.

Voting journey for 6 simulated individuals

Simulation Problem

There is a problem with the time-varying Markov chain though: every step (= election) introduces more variance, so the distribution of simulated voters does not necessarily match the distribution of the actual voters. As N gets larger the distributions converge. However, for N = 1000, as we need it for the visualisation, they do not. So how can we ensure that the simulated data actually represents the real voter distributions?

For this purpose I ran the simulation 1000 times and calculated the chi-squared statistic for the difference between the simulated distributions and the real distributions. Then I took the simulation output with the lowest chi-squared value. As a result:

The simulated data shows

~ representative shifts of vote between parties for each election AND

~ representative distributions of voters for each election
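The "run it many times and keep the best fit" step described above can be sketched like this in Python (not the original code). Summing the chi-squared statistic over all elections is an assumption, as the text does not say how the per-election values were combined. `chi_squared`, `best_run`, `toy_simulate` and the toy numbers are hypothetical names and values.

```python
import random
from collections import Counter


def chi_squared(observed, expected):
    """Pearson chi-squared statistic between observed and expected counts."""
    return sum(
        (observed.get(state, 0) - e) ** 2 / e
        for state, e in expected.items()
        if e > 0
    )


def best_run(simulate, expected_by_election, n_runs, seed=0):
    """Run simulate(rng) n_runs times and keep the best-fitting output.

    simulate returns a list of paths (one list of states per person);
    expected_by_election[t] holds the expected counts at time step t.
    """
    rng = random.Random(seed)
    best_score, best_paths = float("inf"), None
    for _ in range(n_runs):
        paths = simulate(rng)
        score = 0.0
        for t, expected in enumerate(expected_by_election):
            observed = Counter(path[t] for path in paths)
            score += chi_squared(observed, expected)  # assumption: sum over elections
        if score < best_score:
            best_score, best_paths = score, paths
    return best_score, best_paths


def toy_simulate(rng, n_people=100):
    """Made-up two-party, two-election process used only for this demo."""
    paths = []
    for _ in range(n_people):
        first = rng.choices(["A", "B"], weights=[0.6, 0.4])[0]
        stays = rng.random() < 0.8
        second = first if stays else ("B" if first == "A" else "A")
        paths.append([first, second])
    return paths


# Expected counts implied by the toy process for 100 people.
toy_expected = [{"A": 60, "B": 40}, {"A": 56, "B": 44}]
single_score, _ = best_run(toy_simulate, toy_expected, n_runs=1)
best_score, best_paths = best_run(toy_simulate, toy_expected, n_runs=50)
print(single_score, best_score)
```

Because both calls use the same seed, the 50-run result can never be worse than the single run; more runs only give the minimum more chances to drop.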

Is that useful? I don’t know. Is it scientifically relevant? Probably not. Does it make a beautiful visualisation? Yes, it does.

The visualisation using D3.js was then quite easy, as I followed a tutorial by flowingdata.com (paywall).

Limitations

There are some general problems with the infratest.dimap estimation of “shift of votes” as explained here.

A general problem with my kind of Markov chain is that it has no memory: the states before the last election don’t influence a future state. This leads to anomalies in the data: e.g. if a simulated first-time voter votes CDU or SPD in their first election, there is a 10% chance that they will be dead by the next election, but if they vote Green they have only a 3% chance of dying.