Ranking Ultimate Teams With the Elo Rating Algorithm

What if we ranked teams with a different system?

Anybody who has even casually followed USA Ultimate’s college and club series since 2011 has surely heard somebody complain about the algorithm that USAU uses to rank teams and distribute bids. Opinions have ranged from neutrality (“just win all your games”) to tacit endorsement (“at least they don’t use the previous year’s nationals results anymore”) to harsh criticism, only occasionally coupled with an alternate proposal (shout out to probabilistic models).

But for all of its controversy, the USAU algorithm is generally competent at evaluating the relative strength of teams using only their box score results; the (rightful) point of contention is in how that algorithm, and its weaknesses, are leveraged as a part of the nationals bid distribution system. However, while there is enough fuel in the prior statement to power a series of articles, this piece will focus more on applying another common ranking tool — one used in many other competitions — to rank college ultimate teams: Elo Ratings.

Elo Background/Primer

Chess master and physicist Arpad Elo originally created his eponymous algorithm to improve how the US Chess Federation ranked its wide pool of competitive players. The ratings were designed to rank the relative skill levels of players in situations where particular players may never meet head to head.

Elo ratings are still used to rank chess players today, and the algorithm has found its way to the desks of many other sports statisticians, producing Elo ratings for world football and the NFL, as well as applications ranking other large groups of competitors in games like Go and League of Legends.

The concept of the algorithm is quite simple: a player’s rating is re-evaluated after every contest (or collection of contests) based on the rating of his opponent and the result of the match. All matches are evaluated chronologically so that the rating change derived from the match is a function of what the players’ ratings were at the time rather than what they might end up being later (a notable difference from USAU). Further, the rating change that occurs after a game is a function of the following:

Relative rating of the teams. Based on the ratings of the players going into the game, Elo calculates a win expectation percentage. The rating points awarded or subtracted depend on the result of the game vs. the expected outcome: a victorious team that had a 10% chance of winning will earn more points than a team that won after being rated a 90% favorite. The win expectation formula can be tuned, but is commonly set so that a 400-point rating differential gives the favorite roughly a 91% (10-to-1) win expectation.

Margin of victory (in many variations). In competitions where margin of victory (MOV) is quantifiable (ultimate, but not chess), the point change will be correlated with MOV. Different sports implement their MOV multiplier differently. Many implementations award a high margin of victory a high multiplier, but then scale it back down if the winning team was rated much higher than the losing one (a big win by a heavy favorite gets less weight than a big win in an expected toss-up).

The update parameter (or k-value). The k-value represents how much influence the latest result has on a rating change. A high k means the latest game has a larger effect on a player's rating, while a low k lightens its influence.
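Putting those three pieces together, a single-game update can be sketched in a few lines of Python. This is a minimal illustration, not USAU's or any published implementation: the win expectation is the standard logistic Elo curve, and the MOV multiplier borrows the general shape used in FiveThirtyEight's NFL Elo, with illustrative constants.

```python
import math

def expected_score(rating_a, rating_b):
    """Win expectation for A; a 400-point edge gives roughly a 91% expectation."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(winner, loser, margin, k=30):
    """Return updated (winner, loser) ratings after one game.

    The MOV multiplier grows with the log of the margin and shrinks when the
    winner was already heavily favored, so a blowout by a big favorite moves
    fewer points than a blowout in an expected toss-up.
    """
    exp_win = expected_score(winner, loser)
    mov_mult = math.log(margin + 1) * (2.2 / (2.2 + 0.001 * (winner - loser)))
    delta = k * mov_mult * (1.0 - exp_win)
    return winner + delta, loser - delta
```

Note that `delta` is always positive for the winner, which is exactly why a win can never hurt a rating and a loss can never help one.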

Elo Specs

Assumptions

– Each party/player’s performance in a match is a normally distributed random variable around their true “skill level.”

– New players enter the rankings at a fixed “average” rating (generally 1500)

Notable Properties

– A player can never hurt their ranking by winning. The size of the benefit may be affected by margin of victory and the relative ranking difference between the players, but ranking will never decrease. This differs from the USAU model where a close loss may help or hurt a ranking.

– In the same way, a loss can never help a player’s ranking. The size of the penalty may range based on the margin of loss and the relative ranking difference between the players, but ranking will never increase.

– Ratings are state functions. A player’s future rating is independent of past results; it is conditioned only on his current rating and his performance in his next match. This is in contrast to the USA Ultimate rankings, where the future rating is calculated from a new result plus all past results.

Weaknesses

– The performance curve. The assumption that performance is normally distributed, while a reasonable approximation in the aggregate, introduces bias into the system when real performances deviate from it.

– Provisional rankings. When a player first enters the system, they are given the ‘average’ rating. If the player’s true skill is much higher or lower than average, then for the player’s first n games their opponents are disproportionately rewarded or penalized.

– Tuning the k-factor. There is no definitive guide for how to tune the update speed of the algorithm. While in general high-sample sports like baseball lend themselves to low values, and lower-sample sports like American football favor faster updates, much of the guidance for the update speed is simply heuristic analysis by the investigator.
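To make the k-factor’s role concrete, here is a hypothetical comparison of the same upset evaluated at two update speeds (margin of victory ignored for simplicity; the k values are arbitrary examples, not tuned for ultimate):

```python
def expected_score(rating_a, rating_b):
    """Standard logistic Elo win expectation for player A."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def rating_change(winner, loser, k):
    """Points transferred to the winner of a single game (no MOV term)."""
    return k * (1.0 - expected_score(winner, loser))

# The same upset (a 1500 beats a 1600) under slow and fast update speeds:
slow = rating_change(1500, 1600, k=10)   # about 6.4 points
fast = rating_change(1500, 1600, k=40)   # about 25.6 points
```

The result itself is identical; only the chosen k decides whether it nudges or yanks the ratings.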

Adaptations for Ultimate

– In order to avoid the bias of provisional rankings, teams were given initial Elo ratings based on the prior year’s USAU rankings. The top team in the 2015 USAU rankings (Pittsburgh) was initialized to 1700, with each subsequent team initialized one point lower. Teams that were not ranked in 2015 received the average rating of 1500 when they entered the system. The Women’s Division received a similar initialization, but starting at 1650, so that the average initialized rating was also 1500.
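That seeding scheme can be sketched as follows. The team list is illustrative except for Pittsburgh, which the article notes topped the 2015 rankings:

```python
def initial_ratings(prior_rankings, top_rating=1700):
    """Seed Elo ratings from an ordered list of the prior year's USAU rankings:
    the #1 team starts at top_rating, each subsequent team one point lower."""
    return {team: top_rating - i for i, team in enumerate(prior_rankings)}

seed = initial_ratings(["Pittsburgh", "Oregon", "Minnesota"])  # Pittsburgh -> 1700
# Teams unranked the prior year enter at the 1500 average:
rating = seed.get("New Team", 1500)
```

For the Women’s Division, the same function would simply be called with `top_rating=1650`.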

Rankings

Below are the Elo rankings generated for the 2016 season, including Conferences and Regionals.

Men’s Division

Rank | Team | Elo Rating
1 | Oregon | 2008.51
2 | Massachusetts | 1982.85
3 | Minnesota | 1960.44
4 | Pittsburgh | 1940.95
5 | North Carolina-Wilmington | 1938.86
6 | Wisconsin | 1919.38
7 | Georgia | 1893.68
8 | Carleton College | 1889.76
9 | Case Western Reserve | 1888.41
10 | Washington | 1880.10
11 | North Carolina | 1877.94
12 | Stanford | 1868.68
13 | Harvard | 1858.89
14 | Cal Poly-SLO | 1854.86
15 | British Columbia | 1853.75
16 | Colorado | 1852.73
17 | Ohio State | 1850.29
18 | Brigham Young | 1845.31
19 | Texas A&M | 1839.42
20 | Franciscan | 1837.36
21 | Florida | 1832.04
22 | Georgia College | 1821.27
23 | Michigan | 1820.47
24 | Florida State | 1818.52
25 | Brandeis | 1818.26
26 | Georgetown | 1815.66
27 | Virginia Commonwealth | 1814.01
28 | Penn State | 1813.15
29 | Auburn | 1810.64
30 | Arkansas | 1805.33
31 | Connecticut | 1801.19
32 | Northwestern | 1800.71
33 | Georgia Tech | 1795.68
34 | Lewis & Clark | 1795.67
35 | Bryant | 1794.64
36 | Richmond | 1790.74
37 | Maryland | 1789.24
38 | New Hampshire | 1787.28
39 | California-San Diego | 1777.85
40 | Cornell | 1777.38
41 | Texas | 1773.92
42 | Missouri | 1772.18
43 | Air Force | 1771.71
44 | Virginia Tech | 1767.49
45 | Notre Dame | 1766.57
46 | James Madison | 1763.60
47 | Carleton College-GOP | 1762.44
48 | Utah | 1761.10
49 | Colorado College | 1755.26
50 | Baylor | 1754.42

Women’s Division