This will be explained later. Guess which one is 2016?

It’s become cliché that unusually many prominent people died in 2016. Is this true? To answer this we need to know:

(The easy part) What is unusually many?
(The hard part) What is a celebrity?

The BBC analysis

For their analysis, the BBC defined celebrities as those with a pre-prepared obituary, that is, one written in advance and ready to run. Given this definition, it certainly looks like an unusually high number of prominent people died in 2016:

But couldn’t this just be due to an increasing number of pre-prepared obits, or some other long-term trend? You can try to account for this by extrapolating the 2012 to 2015 trend to 2016 (I used a logarithmic trend; a quadratic gave similar results). On that basis, I’d expect 36.4 celebrities to die in 2016. 49 did.
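As a rough sketch of how such an extrapolation could be done in Python: the code below fits a spreadsheet-style logarithmic trend y = a + b·ln(x) by least squares. The form of the trend, the indexing (x = 1 for 2012 up to x = 4 for 2015, so 2016 is x = 5) and the function names are all assumptions for illustration. The BBC's yearly counts are not reproduced here, so you would need to supply them yourself.

```python
import math

def fit_log_trend(xs, ys):
    """Least-squares fit of y = a + b * ln(x); returns (a, b)."""
    log_xs = [math.log(x) for x in xs]
    n = len(xs)
    mean_lx = sum(log_xs) / n
    mean_y = sum(ys) / n
    # Ordinary linear regression of y on ln(x).
    sxx = sum((u - mean_lx) ** 2 for u in log_xs)
    sxy = sum((u - mean_lx) * (v - mean_y) for u, v in zip(log_xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_lx
    return a, b

def predict(a, b, x):
    """Evaluate the fitted trend at index x (x = 5 would be 2016 here)."""
    return a + b * math.log(x)
```

With the four yearly counts in a list called, say, `counts`, the 2016 expectation would be `predict(*fit_log_trend([1, 2, 3, 4], counts), 5)`.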

Using the obvious Poisson interpretation, P(Deaths ≥ 49) = 0.026. So 2016 was roughly a 1-in-40-year freak.

Just taking January to April gives an even more extreme picture. I’d predict 13.7 deaths — instead there were 24. This has a probability of just 0.007. The specific choice of January to April stinks of data-dredging, but I’m still kinda impressed.
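The two tail probabilities can be checked with a few lines of Python. This is a minimal sketch that assumes the death count is Poisson-distributed with the expected values quoted above (36.4 for the year, 13.7 for January to April). The function name `poisson_tail` is made up for this example.

```python
import math

def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    # Sum the pmf for 0..k-1 (computed in log space to avoid overflow),
    # then take the complement.
    below = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k)
    )
    return 1.0 - below

# Expected and observed counts quoted in the text.
p_year = poisson_tail(49, 36.4)     # whole of 2016
p_jan_apr = poisson_tail(24, 13.7)  # January to April 2016

print(f"P(Deaths >= 49 | mean 36.4) = {p_year:.3f}")
print(f"P(Deaths >= 24 | mean 13.7) = {p_jan_apr:.3f}")
```

Both values should land close to the figures quoted above. Small differences in the last digit are possible because the expected counts are themselves rounded.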

Wikipedia and prominence

I’m unsatisfied with the pre-prepared BBC obit as a metric of celebrity:

It has a British bias (although it’s obviously impossible to be entirely objective).
When do they prepare obits? Maybe they just happened to write a load in December 2015.
The decision to prepare an obit still remains the subjective opinion of a few bods at the BBC.
Maybe the 2016 deaths were merely unusually expected, thus had obits ready.

Wikipedia to the rescue!

Maybe Wikipedia biographies would be a good source? Noteworthy people should have long and carefully-tended articles.

My analysis is similar to the one in the book “Who’s Bigger?”. You may just want to skip my article and read that book.

Using C#, the Wikipedia API, and plenty of regexes, I extracted a list of prominent deaths from each year’s summary page, e.g. https://en.wikipedia.org/wiki/1992#Deaths. This gives a total of 6475 people, or roughly 20 a month. Then I used the Wikipedia API to get the lengths of these biographies in bytes, and the number of revisions per article.
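The original code was C# and is not shown, so the following is only an illustrative Python sketch of the article-length lookup. It assumes the MediaWiki action API with `action=query` and `prop=info`, which reports each page's `length` in bytes. The function names and the User-Agent string are hypothetical, and counting revisions needs additional requests that are not shown here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_info_url(titles):
    """Build a query URL asking for basic page info (including byte length)."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # several titles can share one request
        "format": "json",
        "formatversion": "2",        # returns "pages" as a list
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def parse_lengths(response):
    """Map title -> length in bytes, skipping pages reported as missing."""
    pages = response.get("query", {}).get("pages", [])
    return {p["title"]: p["length"] for p in pages if "length" in p}

def fetch_lengths(titles):
    """Fetch lengths for a small batch of titles (performs a network request)."""
    request = urllib.request.Request(
        build_info_url(titles),
        # Hypothetical identifier; a real script should say who is calling.
        headers={"User-Agent": "celebrity-deaths-sketch/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_lengths(json.load(reply))
```

Batching titles into one request, rather than issuing one request per biography, is one way to go easier on the API.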

I probably hit the web API pretty hard, so I made a small donation out of guilt :(.

Article length and revisions as a measure of prominence

For those dying since 1987, these are the 11 longest biographies. Note I’m only using the English Wikipedia:

This is kinda unsatisfactory. Johan Cruyff’s long football career gives him a long, detailed article, but is he really more significant than Michael Jackson? Michael Jackson has 8x as many revisions as Johan Cruyff; I presume this is because people pay him 8x as much attention.

These are the 20 articles with the most revisions:

Ah, that’s better! Every one a mega-celebrity. Note three are from 2016.

But now I find there is a bias towards contentious figures (such as Indian guru Sathya Sai Baba), and those whom the man in the street has a lot to say about. Some important long-dead figures have good biographies that were rapidly and conclusively written in a few sessions by scholars; surely they deserve recognition?

A few other random bits:

The longest biography on Wikipedia is of Belgian astronomer Eric Walter Elst. It tediously lists thousands of asteroids that he discovered, but has few revisions.

Plotting Revisions against Lengths shows a good correlation between the two. The Spearman rank correlation coefficient is 0.884, which is quite high.
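For readers unfamiliar with it, the Spearman coefficient is just the Pearson correlation computed on ranks rather than raw values, which makes it insensitive to the heavy skew in both quantities. A minimal pure-Python sketch follows. The function names are made up for this example, and it is not necessarily how the 0.884 figure was computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Calling `spearman(revisions, lengths)` on two parallel lists, one entry per biography, would give the coefficient.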

Both revisions and lengths are heavily skewed towards a small number of articles. That is, something like 80% of the total length/revisions is in 20% of the articles.
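An 80/20-style claim like this is easy to check directly. The sketch below, with a made-up function name, computes the share of the total held by the largest fifth of the items. It could be run on the list of article lengths or on the list of revision counts.

```python
def top_share(values, fraction=0.2):
    """Share of the total held by the largest `fraction` of the items."""
    ordered = sorted(values, reverse=True)
    # Always include at least one item, even for very short lists.
    count = max(1, round(len(ordered) * fraction))
    return sum(ordered[:count]) / sum(ordered)
```

A result near 0.8 would match the rough 80/20 description, while a result near 0.2 would mean the values are spread evenly.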

Most Wikipedia editors are American, male, nerdy, and young. I suspect.

I’m only using the English Wikipedia. My analysis is Anglocentric. And US-centric.

My definition of celebrity

Neither article-length nor number-of-revisions seems ideal. Therefore I define a person’s Celebrity as the harmonic mean of the logarithms of their article-length and number-of-revisions, each normalised by the maximum achieved in that category.

A maximal celebrity will score 1.0. Unknowns will score 0.0.

The harmonic mean has the nice property that it is pulled towards the smaller of its two inputs, so it biases against those with an unusually high score for only one of Length or Revisions. So a person with a very long article that has only been revised a few times is probably an anomaly, and will score poorly. Likewise, a short biography that has been heavily revised will also score poorly.
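To make the definition concrete, here is a minimal Python sketch of one reading of it. It assumes each logarithm is divided by the logarithm of the category maximum, i.e. ln(length) / ln(max length), since that is the reading under which a maximal celebrity scores 1.0 and a one-byte, one-revision article scores 0.0. The function name and parameter names are hypothetical.

```python
import math

def celebrity(length, revisions, max_length, max_revisions):
    """Harmonic mean of the normalised log-length and log-revisions.

    Assumes length >= 1, revisions >= 1 and both maxima > 1.
    """
    a = math.log(length) / math.log(max_length)        # 0..1
    b = math.log(revisions) / math.log(max_revisions)  # 0..1
    if a <= 0 or b <= 0:
        # The harmonic mean is taken to be 0 when either input is 0.
        return 0.0
    return 2 * a * b / (a + b)
```

Because the harmonic mean of 1.0 and a small number stays close to the small number, an article at the maximum length with only a handful of revisions still ends up with a low score, which is the anomaly-damping behaviour described above.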

Here are my top-30 based on this metric: