As you’ve probably noticed, over the weekend I created The Right Tool, for discovering how we describe programming languages. In response to the massive amount of data and feedback, I’ve thrown together some pages showing the results: which statements describe which languages well.

There’s one set for which statements match a given language and one for how well each language matches a given statement.

The method I’m using is straightforward enough, but quite neat. It’s an adaptation of a method from “Rank Aggregation Methods for the Web” by Dwork, Kumar, Naor and Sivakumar, which they refer to as MC4. I’ve modified it to better take into account shades of grey and to be more resilient to rare events.

It belongs to a general class of scoring mechanisms based on Markov chains. Google’s PageRank is an instance of these. Essentially what you do is define a Markov chain on the things you want to score. You try to arrange it so that transitions to good nodes are more likely than transitions to bad nodes. You then find the stationary distribution for this Markov chain (the long-term probability of being at a particular node). The idea is that if transitions to good nodes are more likely than transitions to bad nodes, the nodes you spend more time in are more likely to be good. These probabilities then form a measure of what you are trying to rank.
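To make that concrete, here’s a minimal sketch (not the site’s actual code) of finding the stationary distribution of a small made-up chain by power iteration — the transition matrix `P` and its values are invented purely for illustration:

```python
import numpy as np

# Hypothetical 3-node chain: row i gives the transition probabilities
# out of node i. Transitions are biased towards node 2, the "best" node.
P = np.array([
    [0.2, 0.4, 0.4],   # from node 0
    [0.1, 0.3, 0.6],   # from node 1
    [0.1, 0.2, 0.7],   # from node 2
])

dist = np.full(3, 1 / 3)   # start from the uniform distribution
for _ in range(1000):
    dist = dist @ P        # one step of the chain

print(dist)  # node 2 ends up with the largest share of the probability
```

Because the chain favours node 2, the walk spends most of its time there, so node 2 gets the highest score — exactly the intuition described above.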

So, that’s all very well, but how do we decide what Markov chain to define?

In MC4 the Markov chain is defined as follows: When at node A, pick a node at random. If a majority of people agree that that node is better than the current one, transition there. Otherwise stay where you are.
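A simulation of that rule on a tiny invented tournament (the languages and vote counts below are made up, not the real dataset):

```python
import random

# Hypothetical vote data: wins[(a, b)] is the number of voters who
# ranked language a above language b.
wins = {
    ("python", "php"): 8, ("php", "python"): 2,
    ("python", "ruby"): 4, ("ruby", "python"): 6,
    ("ruby", "php"): 7, ("php", "ruby"): 3,
}
languages = ["python", "ruby", "php"]

def mc4_step(current: str) -> str:
    """One MC4 transition: jump to a randomly chosen node only if a
    strict majority of the voters who compared the pair preferred it."""
    candidate = random.choice(languages)
    if candidate == current:
        return current
    if wins[(candidate, current)] > wins[(current, candidate)]:
        return candidate
    return current

# Estimate the stationary distribution by simulating the walk.
counts = {lang: 0 for lang in languages}
state = "php"
for _ in range(100_000):
    state = mc4_step(state)
    counts[state] += 1
```

Note that in this tiny example the walk quickly gets stuck at the language that beats both others: once there, no majority ever votes it out, so it absorbs all the probability and everything else scores near zero — a small-scale preview of the tail problem described below.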

This has a couple of problems.

The first is this: How do we decide what constitutes “the majority of people”? A flat-out majority is very sensitive to small numbers of voters: If only one person ranked both of two languages and ranked them the “wrong” way, we’re still going to treat the worse one as better. I tried various things to make this work: One was just a lower bound on the number of votes we counted; another was modelling the comparison as a Bernoulli trial and requiring the win rate to be N standard deviations above average (for N = 1 or 2; 1 seemed to work slightly better). It all felt a bit arbitrary though.

The second problem was shades of grey: It makes absolutely no distinction between language A being a lot better than language B and language A being slightly better than language B. As well as creating weird biases, this results in the method being very bad at getting the tail right: You get a lot of languages which everyone thinks are the pits, and they all get assigned a score of 0, so there’s no way to distinguish between them. It happened that the approximation I was using somewhat mitigated this, but it was still a bit annoying.

So I’ve modified the chain definition as follows: First pick a language B at random. Now consider the probability of language B beating A in a comparison, and jump to B with that probability, else stay at A. This means that when there’s only a slight benefit to B over A the score of B is only slightly higher than that of A, because we’ve still got a decent chance of transitioning back (it turns out that with only two languages A and B in the chain where B beats A with probability p we assign a score of p to B and 1 – p to A).

One final refinement: Obviously we don’t know the “true” probability of B beating A, so we have to estimate it from the data. The way I’ve done this is a very simple weighted Bayesian average. We estimate it as (0.5 × N + number of times B beat A) / (N + number of times B and A were compared), where N is some number representing how hard we are to convince of a probability (I think I currently have it set to 5). For very small samples we’re artificially forcing the estimate closer to a half, and even for larger samples we’re never allowing it to quite hit 0 or 1.
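The estimate is a one-liner; here it is with N = 5, the value the text mentions:

```python
def p_beats(b_wins: int, comparisons: int, n: float = 5.0) -> float:
    """Smoothed estimate of the probability that B beats A, given that
    B won `b_wins` of `comparisons` head-to-head votes. The prior
    weight `n` pulls the estimate towards 0.5."""
    return (0.5 * n + b_wins) / (n + comparisons)

print(p_beats(1, 1))     # a single vote for B gives ~0.583, not 1.0
print(p_beats(90, 100))  # plenty of data: ~0.881, still short of 0.9
```

With no data at all the estimate is exactly 0.5, and even a perfect record over many comparisons never quite reaches 1 — which is the resilience to rare events the method is after.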

That’s about it. We now have the stationary probabilities, and the ranking is simply defined by ordering from highest to lowest.

This seems to work pretty well. I don’t have good metrics in place, but the lists it’s generating “feel” better than any of the previous incarnations, and it’s conceptually pleasing. It also appears to be quite spam resistant, which is a nice touch.