Fast Company recently ran an article [Link] on regression models used to predict medal counts. The modeling was performed by Dan and Tim Graettinger, two brothers who work for Discovery Corps, Inc. The Graettinger brothers discuss more modeling details in their blog post [Link].

They predicted medal counts in two steps. First, they used logistic regression (which is useful for modeling events with binary outcomes) to predict which countries would medal and which would not. They found that medaling in the prior Summer Games was the strongest predictor of whether a country would medal in the Winter Games:

At the last two Winter Games, no nation won a medal without having won at least one medal in the preceding Summer Olympics. I never expected that! Our predictive model would ultimately fill in a zero for the anticipated medal count in Sochi if the nation did not win a medal in London. Also during the profiling stage, we saw other variables rise to the top: migration rate, doctors per thousand people, latitude of the capital city, value of the nation’s exports, and some measures of gross domestic product. Ultimately, once we built our logistic model, it had a 96.5% correct rating.

I suppose this makes sense. Big countries all medal in the Summer Games, so the profiling stage essentially sorts the “small” countries into those that will medal and those that will not. The underlying mechanism for why this works might reflect government and social support for Olympic training programs of any kind. Good support would make medaling in both the Summer Games and Winter Games more likely. But this also implies that the Jamaican bobsled team has a chance due to the excellent Jamaican sprinters.
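To make the first step concrete, here is a minimal sketch of a one-variable logistic regression fit by plain gradient ascent. This is not the Graettingers' code: the function names and the ten data points are made up for illustration, and their real model used several more predictors. The toy data mimics the pattern in the quote, where no nation medals in the Winter Games without a Summer medal.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# MADE-UP toy data (not real Olympic results).
# summer[i] = 1 if nation i medaled at the prior Summer Games, else 0.
# winter[i] = 1 if nation i medaled at the Winter Games, else 0.
summer = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
winter = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(summer, winter)
p_no_summer_medal = sigmoid(b0)     # close to 0: no Summer medal, no Winter medal
p_summer_medal = sigmoid(b0 + b1)   # close to 4/6, the observed rate in the toy data
print(round(p_no_summer_medal, 3), round(p_summer_medal, 3))
```

Because no nation in the toy data medals without a Summer medal, the fitted probability for that group keeps creeping toward zero the longer the fit runs. That fits with the quoted decision to simply fill in a zero for those nations.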

The next step was to take the output of the logistic regression model (the countries that would medal) and use linear regression to predict how many medals they could expect to walk away with. Because the number of events (and medals) changes every four years, the raw number of medals is somewhat meaningless when compared across Games. They had to rescale the output to make consistent predictions using historic data.
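The article does not spell out how the rescaling was done. One plausible approach, which is my assumption and not necessarily theirs, is to model each country's share of all medals awarded and then convert a predicted share back into a count using the number of medals available at the upcoming Games. The nations and numbers below are made up.

```python
def medal_shares(counts):
    """Convert raw medal counts into each nation's share of all medals awarded."""
    total = sum(counts.values())
    return {nation: n / total for nation, n in counts.items()}

def shares_to_counts(shares, total_medals):
    """Convert shares back into medal counts for a Games with total_medals available."""
    return {nation: share * total_medals for nation, share in shares.items()}

# MADE-UP counts for three hypothetical nations at a past Games.
past_counts = {"Nation A": 30, "Nation B": 15, "Nation C": 5}
shares = medal_shares(past_counts)

# Suppose (hypothetically) the next Games awards 60 medals among these nations.
rescaled = shares_to_counts(shares, total_medals=60)
print(shares)
print(rescaled)
```

Working in shares keeps the historic data comparable even though the number of events changes from one Games to the next.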

The four variables used to predict medal counts (for countries expected to medal) are:

- geographic size (Russia, China, US). This was somewhat surprising, but could reflect geographic diversity: big countries probably have mountains somewhere where athletes can train.
- GDP per capita (it’s the economy, stupid)
- the value of its exports (it’s the economy, stupid, part 2)
- the capital city’s latitude (Norway, Sweden, Finland!)

Here is the data used by the linear regression and the output:

Linear regression is a simple model that makes a lot of assumptions that may not be valid. Something like negative binomial regression may be more accurate (it is often used to model count data such as call counts). In my experience, other more appropriate regression models don’t improve upon linear regression by a whole lot, so I expect that their results and insights would not be greatly affected by this modeling choice. But I’m also not a regression guru, so other comments here would be appreciated (leave a comment!).
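A quick way to see why a count model like the negative binomial might be a better fit is to compare the variance of the counts to their mean. A Poisson model assumes the two are equal. The negative binomial allows a variance of mean + mean^2 / k, where a smaller k means more overdispersion. The sketch below uses a method-of-moments estimate of k on made-up counts. It is a rough diagnostic, not a fitted regression.

```python
from statistics import mean, variance

def overdispersion_summary(counts):
    """Return (mean, sample variance, variance/mean ratio, moment estimate of k).

    Under the negative binomial, variance = mean + mean**2 / k,
    so k = mean**2 / (variance - mean) whenever variance > mean.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (divides by n - 1)
    ratio = v / m
    k = m * m / (v - m) if v > m else float("inf")
    return m, v, ratio, k

# MADE-UP medal counts for ten hypothetical nations.
toy_counts = [0, 0, 1, 1, 2, 3, 5, 8, 15, 25]
m, v, ratio, k = overdispersion_summary(toy_counts)
print(m, v, ratio, k)
```

A variance-to-mean ratio far above 1, as in this toy data, means a few nations win many medals while most win few, which is the situation the negative binomial is designed to handle.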

I’m going to mention validation in case any of my students are reading (I’m teaching undergraduate simulation this semester). It’s easy to build a bad model, and validation is useful for avoiding a model that spits out nonsense.
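For a predictive model like this one, a common form of validation is to hold out some of the data, fit on the rest, and check the predictions against what was held out. Here is a minimal sketch using leave-one-out validation on a one-variable least squares fit, compared against a naive baseline that always predicts the average. The data and function names are made up, and I have no idea whether the Graettingers validated their model this way.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def leave_one_out_mae(xs, ys, use_model=True):
    """Mean absolute error when each point is predicted from all the others."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if use_model:
            intercept, slope = fit_line(train_x, train_y)
            prediction = intercept + slope * xs[i]
        else:
            prediction = sum(train_y) / len(train_y)  # baseline: training average
        errors.append(abs(prediction - ys[i]))
    return sum(errors) / len(errors)

# MADE-UP data: one predictor and a roughly linear response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

model_mae = leave_one_out_mae(xs, ys, use_model=True)
baseline_mae = leave_one_out_mae(xs, ys, use_model=False)
print(model_mae, baseline_mae)
```

If the model cannot beat the baseline on data it has not seen, it is probably spitting out nonsense.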

The modeling is interesting and fun to do, but nearly all of the work involved collecting and assembling the data. This will not be a surprise to you if you have worked on a project with real data. I have also emphasized this point in the course I’m teaching this semester.

“If he had known how long it would take to assemble the data,” Dan tells Co.Design, “maybe Tim would’ve told me to work on something else.”

Wall Street Journal model

The Wall Street Journal also has a model for predicting medal counts [Link]. They use Monte Carlo simulation to generate medal counts. They are light on methodology details, but it looks like they use completely different data than the Graettinger brothers: they mention success probabilities at the individual athlete level, which are then probably aggregated into country medal counts. Here is what they predict.
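Since the Journal does not give details, here is only my guess at the general idea: give each athlete (or entry) a probability of medaling, flip a weighted coin for each one, add up the medals by country, and repeat many times. The entries and probabilities below are made up, and the sketch treats every entry as independent, which ignores the fact that only three medals are awarded per event.

```python
import random

def simulate_medal_counts(entries, n_sims=20000, seed=1):
    """Simulate country medal counts from athlete-level medal probabilities.

    entries is a list of (country, probability_of_medal) pairs.
    Returns a dict mapping each country to its list of simulated counts.
    """
    rng = random.Random(seed)
    countries = sorted({country for country, _ in entries})
    results = {country: [] for country in countries}
    for _ in range(n_sims):
        tally = dict.fromkeys(countries, 0)
        for country, p in entries:
            if rng.random() < p:  # this entry wins a medal in this replication
                tally[country] += 1
        for country in countries:
            results[country].append(tally[country])
    return results

# MADE-UP entries for two hypothetical countries.
entries = [("A", 0.6), ("A", 0.3), ("A", 0.5), ("B", 0.2), ("B", 0.1)]
results = simulate_medal_counts(entries)

# Average simulated count per country; should be near the sum of the
# probabilities (1.4 for A and 0.3 for B).
averages = {country: sum(counts) / len(counts) for country, counts in results.items()}
print(averages)
```

The payoff of simulating instead of just summing probabilities is that you get the whole distribution of medal counts for each country, not only the average.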

After the Olympics, we will have to revisit these predictions and see which did better.

For more reading on Punk Rock OR:

how to predict how many records will be broken in the Olympics (from 2012): Cox proportional hazards model!