The 2015–16 Golden State Warriors achieved the greatest NBA regular season of all time (73–9), beating out the 1995–96 Chicago Bulls (72–10) for the best record ever. Naturally, the Dubs’ historic achievement has led to much discussion about the greatest NBA team of all time (GNTOAT), whether league-wide talent is better now than it has been in the past, and, obviously, who would win in a head-to-head match-up between Stephen Curry’s Warriors and Michael Jordan’s Bulls.

Finding myself surprisingly motivated to solve these pressing questions, and having recently been introduced to the Elovation App, I aimed to resolve the arguments once and for all.

Numbering the Beast

At its core, comparing two teams from different eras that have consequently never played one another is tricky business. To effectively parse out the data, we need some understanding of exactly how such analysis can be performed.

As it turns out, similar comparisons have been attempted before, particularly in the world of chess.

ELO

International chess rankings have been officially calculated using the ELO rating system since 1970. ELO works by calculating the likelihood of a player winning a game based on the difference between that player’s current rating and the opponent’s. Over time, if a player wins more games than his or her rating would predict, that player’s rating increases and the ratings of the defeated opponents decrease.
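A minimal sketch of that update, assuming the standard logistic expectation curve and a K-factor of 32 (a common choice, not one mandated by any particular implementation):

```python
def elo_expected(rating_a, rating_b):
    """Probability that A beats B under the standard logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=32):
    """Return both players' new ratings after one game.

    k controls volatility: the winner gains k * (1 - expected_win_prob),
    and the loser gives up exactly that amount.
    """
    expected_a = elo_expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# An upset: a 1400 beats a 1600, so the winner gains a large share
# (the system considered the result unlikely).
print(elo_update(1400, 1600, a_won=True))
```

Note that the exchange is zero-sum: whatever the winner gains, the loser loses.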

See, so sciency!

The ELO system has some notable problems: It encourages players to only target competitors they are likely to beat, and it discourages highly ranked players from competing at all so as to preserve their rank. Neither of these problems applies to an NBA season, where there is a set number of games and a predetermined schedule.

That said, the most relevant problem in this case is not specific to the rating system, but rather temporal circumstances: competitors are ranked relative to their present competition. There is no reliable way to objectively rank teams or players from different time periods when results are dependent on the play of their respective opponents. The most a rating can determine is how dominant a participant was relative to the field. Apparently, Arpad Elo (the creator of ELO) even believed it was an insufficient system for ranking players from different time periods.

Historically, ELO and variants of it have been adapted and used in a variety of applications including the (beloved) college football BCS rankings, FIFA Women’s World Rankings, Magic: The Gathering, Pokemon, and numerous video and computer games.

Trueskill

To address the problems with ELO, Microsoft created and subsequently patented the Trueskill rating system for use on Xbox. It works similarly to ELO in that it calculates the likelihood of players winning a match. Where Trueskill differs from ELO is that it holds additional meta-information that affects how much a player’s rating changes after a win or loss. Specifically, a “confidence score” indicates how likely the rating is to reflect the player’s actual abilities. Generally, as the player plays more games, this confidence score goes up, as there is a larger sample size to base the rating on. If a player has unexpected results (losing to a player they were expected to beat, for instance), the confidence factor can be lowered.

Confidence levels play an important role in determining how much a given result affects each player’s rating. For example: If a player with a high confidence rating — meaning the algorithm is fairly certain that player’s rating is an accurate reflection of his or her skill level — defeats a player of higher rank but with a low confidence rating, the losing, higher-ranked player will likely see a significant drop. Meanwhile, the winner will not see as much of a change, since their confidence rating was quite high and their opponent’s was not.

Taken to the extreme, if a player whose confidence level is high defeats a player of much lower rank, the winning player’s confidence level and rank will not change at all. If that highly ranked player does lose to a lower-ranked opponent, however, there will be a sizable drop in their ranking. These mechanics help guard against very skilled players continuously targeting noobs.
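The behavior above can be illustrated with a toy confidence-weighted update. To be clear, this is not Microsoft’s actual Bayesian math (real Trueskill models each rating as a Gaussian and updates it via factor graphs); it is only a sketch of the intuition, with made-up constants, where low confidence (a large sigma) means larger swings and confidence tightens with every game played:

```python
def confident_update(winner, loser, beta=4.0):
    """Toy confidence-weighted update -- NOT the real Trueskill algorithm.

    Each rating is a (mu, sigma) pair: mu is the skill estimate, sigma is
    the uncertainty (low confidence = big sigma). The less confident a
    rating is, the further it moves; sigma shrinks after every game.
    beta is an illustrative scale for how surprising an upset is.
    """
    w_mu, w_sigma = winner
    l_mu, l_sigma = loser
    # Upsets (loser was rated higher) produce a bigger "surprise";
    # an expected blowout win produces no movement at all.
    surprise = max(0.0, (l_mu - w_mu) / beta + 1.0)
    new_winner = (w_mu + surprise * w_sigma, w_sigma * 0.9)
    new_loser = (l_mu - surprise * l_sigma, l_sigma * 0.9)
    return new_winner, new_loser

# A high-confidence player (sigma=1) upsets a higher-rated but
# low-confidence player (sigma=8): the loser plummets, the winner
# barely moves.
print(confident_update((28.0, 1.0), (30.0, 8.0)))
```

With these toy numbers the low-confidence loser drops roughly eight times as far as the high-confidence winner rises, which is the asymmetry described above.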

Trueskill also allows for calculation of an individual player’s rating even when participating in a multi-player team — something ELO does not support. (This is an awesome feature of the system, but does not affect our GNTOAT exercise. Additional calculations of individual players’ contributions could be an interesting avenue to investigate.)

In general, Trueskill is more reactive than ELO to unexpected outcomes, allowing for larger swings in ratings over fewer results. It and its derivations have been used as the ratings system and in the match-making process for various video and computer games including the Halo franchise, World of Warcraft, Call of Duty, and of course Rocket League.

The Great Beyond

There are numerous adaptations of these systems, like Glicko, as well as other unrelated ratings algorithms. However, the Elovation project currently only supports ELO and Trueskill, so they are the two we will deal with.

Drawbacks

In addition to the inability to directly compare teams that have never played each other, there is also the matter that neither ELO nor Trueskill takes margin of victory into account. A team that ekes out wins is likely weaker than a team that consistently wins by double digits.
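A common patch for this, used in margin-aware ELO variants such as FiveThirtyEight’s NBA Elo, is to scale the rating change by a log-dampened multiplier on the point margin. The constants below are illustrative, not taken from any published system:

```python
import math

def mov_multiplier(point_margin):
    """Log-dampened margin-of-victory scale, so that blowouts count
    for more than squeakers without swamping the rating entirely.
    Purely illustrative constants."""
    return math.log(abs(point_margin) + 1)

def elo_update_mov(rating_a, rating_b, a_won, point_margin, k=20):
    """Standard ELO update, with the step scaled by margin of victory."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * mov_multiplier(point_margin) * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# A 20-point win moves ratings further than a 2-point win between
# the same two teams.
print(elo_update_mov(1500, 1500, True, 20))
print(elo_update_mov(1500, 1500, True, 2))
```

Since Elovation only implements vanilla ELO and Trueskill, though, nothing like this is available for the exercise here.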