MLS referees are better than you think they are.

I may be setting that bar a bit low. Soccer fans (or, indeed, fans of any sport) tend to follow teams and players rather than officials. To the observer, referee performance is extremely important for a single game at a time, and the "best" performances are often the least memorable - "game-changing" calls are never a positive way to determine a match result even when called correctly. When the next game comes along, the person who made the call has usually moved on to judge a different pair of teams.

Already in 2015 there have been several referee actions that have had a greater perceived impact or, at least, been discussed longer than the game in which they occurred. Kevin Stott's erroneous red card to Jermaine Taylor didn't seem to change the earned result, but was so egregious as to merit discussion outside the game in question (and was later overturned on Disciplinary Committee review). Stott's initial (lame) defense of the call on the field didn't help matters with fan response. In the Seattle Sounders' game against the Columbus Crew, the considerably more defensible (but likely still wrong) lack of an offside call on Clint Dempsey's goal similarly did not affect the result, but received extensive post-game judgment and analysis. On the other side of the coin, Ismail Elfath's impact in deciding the game between Seattle and Sporting Kansas City may be discussed as long as the Supporters' Shield race lasts.

So... objectively measuring referee performance is a daunting task. Fans are deeply biased regarding in-game officiating impacting their own teams. High-profile mistakes that can be verified on tape are isolated, poorly-representative incidents within a game. Certain decisions may be improperly attributed to the center referee when only a linesman could have been in position to make a correct call (as with the Dempsey example above). Being a referee is also difficult - having to follow the performance of exceptional athletes over 90 minutes and a large area of the field... and then having to answer to multitudes of spectators armed with bird's eye views, Wachowski-sibling video replay, and delicious alcoholic beverages. I respect anyone who puts true effort into doing the job right.

Here, we're going to look at a sample of 410 Major League Soccer games (spanning the beginning of 2014 to the 23rd of April, 2015), and we're going to measure center referees (CRs) by one of the only objective means available. Namely - comparing referees to one another while putting officiating statistics in proper situational context, we can assess whether referee styles vary and whether some referees call a significantly different game than others.

Yes, "better" and "worse" are kinds of "different," but different is an easier place to start. The following list orders the 20 CRs who have called more than 10 games in that time by descending "abnormality" (it is important to remember that this is NOT a measure of call accuracy):

Center Referee   TotalScore
Marrufo          18.18
Guzman           11.60
Bazakos          10.60
Fischer          10.33
Stoica           10.23
Penso            10.00
Unkel             9.88
Jurisevic         8.50
Gantar            8.00
Petrescu          7.67
Chapman           6.57
Elfath            6.36
Toledo            6.00
Villarreal        6.00
Geiger            5.06
Stott             4.40
Salazar           4.32
Kelly             3.80
Grajeda           3.27
Rivero            1.09

In the end, we'll see that MLS CRs do an admirable job of avoiding significant home/away bias, that most games remain within reasonable physical control, and that the Professional Referees Organization's efforts to improve officiating have a significant impact on behavior. For the statistical basis of these claims, I invite you to read on below.

We will also see that referee performance may be improved... we know as much from the public video review of their prominent mistakes. Some possible changes for the sake of accuracy are unrealistic without a dramatic shift in soccer culture (put a mic on a 4th or 5th official in an upstairs booth with replay video - there is no legitimate competitive reason to permit avoidable bad calls), but greater awareness of how often (and to what extent) games are called poorly (and by whom) may at least give us some path to greater accountability.

If, in reading, you come up with your own ideas of how to measure CR performance or how to improve MLS officiating, please share in the comments.

Foul unto others as they are wont to foul

"Physical" play tends to describe both teams on the field when it happens - either because teams attempt to match intensity to their opponent or mutually test the margins of CR permissivity. This trend - that rough play begets rough play - may be perceived in the positive relationship (albeit with a poor fit to the data) of home team to away team duels won and fouls.

The total number of recorded duels in a game (50/50 chances between players, excluding fouls - these are essentially aerials, tackles, and attempts to dribble around) ranges from as few as ~40 to as many as ~120, but we typically assign referee stats on a "per game" basis that ignores that disparity. One referee may call a substantially greater number of contentious or boring games than another. Similarly, one team may play greater or fewer games of a particular type, but our measures of team fair play are usually based on carded incidents per game. Insofar as poorly-timed or overly physical 50/50 challenges cause fouls, a team that attempts to win the ball more often will inevitably foul more often if there is no difference in skill. Sporting Kansas City and Chivas USA have the highest fouling rates in the dataset (~14.3 fouls per game). For SKC, that metric is consistent with a reputation for physical play... but the team wins 3.26 duels for every foul committed (roughly league-average) and 6.86 duels between SKC and their opponent are won for every foul (above the league average of 6.57).
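As a minimal sketch of the two ratios quoted for SKC, the snippet below computes duels won per foul for a single game, once for the team alone and once for both sides combined. The function name `duels_per_foul` and the game totals are hypothetical, and returning `None` for a game with no fouls is my own assumption rather than anything stated in the dataset.

```python
def duels_per_foul(team_duels_won, opponent_duels_won, team_fouls):
    """Return (team duels won per foul, all duels won per foul) for one game.

    Assumption: a game in which the team commits no fouls is skipped
    (returns None) instead of producing a division by zero.
    """
    if team_fouls == 0:
        return None
    own = team_duels_won / team_fouls
    combined = (team_duels_won + opponent_duels_won) / team_fouls
    return own, combined


# Hypothetical single-game totals, not taken from the dataset.
own, combined = duels_per_foul(45, 50, 15)
print(own, round(combined, 2))
```

A higher value on either ratio means more contested balls per whistle, which is why a team can foul often per game and still look disciplined once opportunity is taken into account.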

One might fairly contend that Kansas City employed some relatively "dirty" individual players (Aurélien Collin records a relatively high number of fouls and cards, even within the context of his role), but it has not recently been a poorly disciplined team overall. The box and whisker plots above show the distribution of full-game values for duels (team and opponent) per foul (team only) and carded events per foul across the dataset. The shaded boxes mark the middle 50% of values (the inside line is the median), and the lines extended to the side of the boxes show the range (games deemed to be statistical outliers are plotted individually). Teams are sorted top-to-bottom by mean duels per foul.

This particular metric is not necessarily the only or the best means of controlling for fouling opportunities. For one thing, we may not all agree on what does or does not constitute a "duel," and the likelihood of receiving a foul for a bad tackle is considerably greater than that for attempting to dribble around a defender (i.e. not all duels are the same), but we should nevertheless question our more common perceptions concerning team discipline, particularly in cases where teams show both lower-than-average duels per foul and higher-than-average cards per foul (Toronto FC and the Chicago Fire). The Seattle Sounders, Portland Timbers, and Real Salt Lake are possibly more prone to fouling than they should be. The Vancouver Whitecaps and Chivas USA were relatively card-prone over the studied interval. Both expansion teams, in early returns, need to rein in physical play (Orlando City, in particular). These metrics may also not be representative of the current squads. Take the Seattle Sounders, for example.

Zach Scott, Leo Gonzalez, and DeAndre Yedlin were all relatively foul-prone in 2014. Brad Evans, Dylan Remick, and Tyrone Mears are not. A modest trend towards higher values seen in (a portion of) the 5-game moving average may well show Seattle to be a different defensive team in 2015 (or it may not... it's early yet). Disciplinary tasks vary over time and between teams. Consequently, accurately measuring league or referee-specific statistics depends on efforts to normalize opportunity as shown here.

Hang together, or hang separately

The duels won per foul metric relies on the reasonable assumption that more attempts to challenge for the ball will inevitably lead to a greater number of fouls. Player by player, an April, 2015 OPTA dataset (via whoscored.com) supports this assumption with a loosely positive relationship between fouls and defensive actions (DefAct; tackle attempts, clearances, interceptions, and blocks):

On the right side of the figure, the correlation between DefAct rate and DefAct per foul is complex, though fairly linear from values 0-20 on both axes (as shown in the inset plot). A linear trend should be expected and not overinterpreted (defensive actions being in the numerator of both variables). Nevertheless, the figure shows that players with a greater defensive role record fewer fouls per defensive action. Primary defenders either exhibit greater skill in challenges, have better selection of defensive opportunities, receive the benefit of the doubt from CRs more often, or (most likely) some combination of all of the above. Across full league performance, the roughly positive correlation between duels and fouls with team and player performance suggests individual games should display a similar relationship.

...but they don't. I've been fairly sanguine so far concerning poor fits to linear regression lines, but the league numbers above are exceptionally uncorrelated. Game by game, either fouls are being called inconsistently, many fouls are committed for incidents that are not clumsy challenges, or broader league trends are hiding the predicted relationship.

MLS and the Professional Referee Organization (PRO) oversaw two significant changes concerning league-wide officiating patterns over the studied interval. In the first instance, a PRO lockout of personnel affiliated with the Professional Soccer Referees Association led to MLS using lower-league and former MLS officials in their place for 2 weeks (16 games) at the beginning of 2014. In 2015, PRO called for increased emphasis on issuing cautions for persistent infringement (in 2014, the league foci were "game flow" and jostling in the box during set pieces - neither of which could reasonably be measured by this dataset even if they had some impact on behavior). Bearing in mind that team and player composition has also changed over the same time span, both changes appear to have made an impact on carded events. Emphasizing persistent infringement should impose an increase in yellow card per foul ratio:

I should add that I calculate yellow card events somewhat differently than shown in the top-level statistics at mlssoccer.com, from which the majority of the analysis in this article is derived. For MLS and OPTA, a CR issuing a second yellow is considered a red card (makes sense, as the recipient is sent off), is not separately considered a yellow card (makes sense, as it's a single carded event), and also removes the record of the first yellow issued to the player in question (makes no sense, and is among the sillier scorekeeping decisions in the sport). When I discuss yellow cards, all yellows are included (both first and second yellows). When I talk about "carded" events, second yellows and the consequential red are treated as a single incident. If this seems a minor distinction, consider Ismail Elfath, who issued 4 second yellows in 17 games. The absence of 8 yellow cards - real actions on the field he considered cautionable - makes for a difference of 0.471 yellows per game in his overall statistics. Half a card per game can be quite a bit in the proper context... that's over a quarter of the cards Jair Marrufo hands out per game.
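The Elfath arithmetic can be checked with a short sketch. The helper name `restore_yellows` is hypothetical. It assumes, as described above, that the listed yellow total drops both the first and the second caution behind every second-yellow dismissal.

```python
def restore_yellows(listed_yellows, second_yellow_dismissals):
    """Total yellows actually shown on the field.

    Assumption (from the text): each second-yellow dismissal hides two
    yellows from the listed total, the first caution and the second.
    """
    return listed_yellows + 2 * second_yellow_dismissals


# Figures quoted in the text: 4 second yellows across 17 games.
missing = restore_yellows(0, 4)      # yellows absent from the listed total
per_game = missing / 17
print(missing, round(per_game, 3))   # prints: 8 0.471
```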

In the data above, there are 16 games in the lockout phase (0.147 yellows per foul), 307 other games in 2014 (0.128), and 87 games in 2015 (0.143). The differences may seem slight (and the scatter may seem substantial) but they strongly suggest the changes to league officiating and/or composition had an on-field impact. If one selects a continuous 16-game interval from the post-lockout 2014 dataset, only 25 of the 292 potential windows have a mean yellow per foul value equivalent to or exceeding the lockout phase (i.e. a 16-game run picked at random from that span would give such a high value 8.56% of the time). Of the 221 potential 87-game windows in the post-lockout 2014 dataset, none match the 2015 average (maximum 0.137). We may, therefore, be reasonably confident in the case of the lockout, and extremely confident in the case of the 2015 officiating policy, that on-field behavior of either teams or referees has changed. Could such modifications partly explain the poor correlation of duels and fouls in the full dataset?
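The window comparison can be sketched as follows. The function names `window_means` and `share_at_or_above` are hypothetical, and the sketch assumes each game contributes one yellow-per-foul value that is averaged within a window; pooling cards and fouls across a window before dividing would be an equally plausible reading and could give slightly different numbers.

```python
def window_means(values, size):
    """Mean of every contiguous window of `size` consecutive games."""
    if size < 1 or size > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    total = sum(values[:size])
    means = [total / size]
    for i in range(size, len(values)):
        # Slide the window one game forward: add the new game, drop the oldest.
        total += values[i] - values[i - size]
        means.append(total / size)
    return means


def share_at_or_above(values, size, threshold):
    """Return (windows at or above threshold, total windows)."""
    means = window_means(values, size)
    hits = sum(1 for m in means if m >= threshold)
    return hits, len(means)


# Window counts for a 307-game span, matching the figures in the text.
placeholder = [0.0] * 307                  # stand-in values, not real data
print(len(window_means(placeholder, 16)))  # 292
print(len(window_means(placeholder, 87)))  # 221
print(round(25 / 292 * 100, 2))            # 8.56
```

Because neighboring windows share most of their games, the resulting percentage is better read as a rough gauge of how unusual a stretch is than as a formal p-value.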

Neither the post-lockout 2014 data (blue diamonds) nor the 2015 data (red squares) show a strong positive correlation. MLS officiating may change over time due to top-down league management (and this is a good thing, when it involves attempts to call the game more fairly or with better adherence to the rules), but no temporal trend explains the duel vs. foul confusion above.

They are not on anybody's side, because nobody is altogether on their side

Alternatively, individual referees may exhibit very different foul thresholds; significant duel vs. foul linear correlations may exist within the dataset for select games, but could have different slopes/intercepts depending on the man in the middle. Does such a correlation exist for the 20 referees who called at least 10 games in the studied interval?

All individual plots are depicted on the same scale as the duel vs. foul graph above. The contentiousness of a game - as measured by total duels won - is not always consistent between one referee and the next. As expected, foul calling thresholds appear to vary substantially. A positive linear relationship between duels and fouls is rare, but present. Some referees exhibit foul calling behavior that seems unrelated to game condition (poor fit to the regression line). Others appear to call even fewer fouls in relatively contentious games. Ultimately, I think the broad absence of duel vs. foul correlation in MLS officiating is mainly due to the complexity of the data (e.g. fouls called for reasons other than poor challenges, and challenges being an overly blunt metric) and CR inconsistency (both within each referee's own games and from one referee to the next).
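For readers who want to try the per-referee fits themselves, here is a minimal sketch of an ordinary least-squares line of fouls against duels won, grouped by referee. The names `fit_line` and `fits_by_referee`, the `(referee, duels, fouls)` tuple layout, and the 10-game cutoff parameter are assumptions for illustration; the plots in this article were not necessarily produced this way.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares fit of y on x: (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r_squared is reported as 0 when y never varies (nothing to explain).
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared


def fits_by_referee(games, min_games=10):
    """games: iterable of (referee, total_duels_won, total_fouls) per game."""
    grouped = defaultdict(list)
    for referee, duels, fouls in games:
        grouped[referee].append((duels, fouls))
    results = {}
    for referee, rows in grouped.items():
        if len(rows) >= min_games:   # skip referees with too few games
            xs, ys = zip(*rows)
            results[referee] = fit_line(xs, ys)
    return results
```

A positive slope with a high r_squared would describe a referee whose foul count tracks how contested the game is; a flat or negative slope, or a low r_squared, would match the inconsistent patterns described above.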

When fans discuss referee bias, home vs. away favoritism is the common target (or dislike for a particular team, but such behavior is definitely not reflected in the numbers). Academic research has identified such bias experimentally (a significant study by Nevill et al. in 2002 showed qualified referees called fouls differently on viewed incidents depending upon the addition of crowd noise) or through careful study of league statistics (Boyko et al., 2007 additionally noted that the impact of home/away bias appears to vary according to the specific referee). At first glance, the MLS data appears to confirm such disparity - an average of ~0.28 additional fouls and ~0.30 additional cards are given to away teams in each game.

However, losing teams exhibit similar bias. The graph above shows the distribution of home/away and winner/loser foul and card disparity by game. A negative value indicates the away team in home/away or the losing team in winner/loser received more fouls or cards. A bias in favor of the home team would therefore manifest as a distribution skewed towards the left side, and this certainly happens... but a losing team attempting to fight its way back for a result appears to foul more often and receive more cards than its opponent, even when the dataset is limited to those 101 games in which the away team took all 3 points (though the data is not shown, a similar distribution is observed even when red card games are excluded; sending off a player has a substantial impact on results, as away teams average greater than 1.8 ppg when playing with favorable red card disparity). In the overall MLS data, unique home/away bias independent of broader away team competitive disadvantage cannot be readily identified. For that, MLS and PRO deserve some modest accolades, though some individual referees may exhibit greater bias than others (Fotis Bazakos, for example, issues 2.40 additional fouls and 0.93 additional cards to away teams, well beyond the league average).
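The sign convention for the two disparity measures can be sketched in a few lines. The function name `disparities` is hypothetical, and leaving drawn games out of the winner/loser measure is my assumption, since a draw has no winner to compare against.

```python
def disparities(home_goals, away_goals, home_count, away_count):
    """Return (home_away, winner_loser) disparity for fouls or cards.

    home_away is negative when the away team received more.
    winner_loser is negative when the losing team received more,
    and None for a draw (assumption: draws are excluded).
    """
    home_away = home_count - away_count
    if home_goals == away_goals:
        winner_loser = None
    elif home_goals > away_goals:
        winner_loser = home_count - away_count
    else:
        winner_loser = away_count - home_count
    return home_away, winner_loser


# Hypothetical away win in which the home side, chasing the game, fouls more.
print(disparities(0, 1, 14, 10))   # (4, -4)
```

In a game like this one the home/away value points against the home side while the winner/loser value points against the loser, which is the pattern that separates game state from referee favoritism.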

We've now come up with well-contextualized metrics for referee performance, and have a fairly substantial dataset for 20 referees in particular. We can now build the table referenced at the head of the article.

The 6 metrics break into 3 categories: overall officiating (total fouls; total carded events), home/away bias (home/away fouls; home/away carded events), and consistency (fouls per yellow; duels won per foul). Note that the "fouls per yellow" metric has a value of 1 added to both the numerator and denominator to avoid division by 0. The 6 metrics are calculated for the full dataset and on a by-game basis for each referee. For assigning a referee score to each game, a value more than 2 standard deviations from the 410-game mean has been given a value of 1 (all other games are scored 0), and then all points are summed together. This should be a conservative measure of "disaster games" - occasions where officiating had a massive footprint or was completely absent from play despite substantial need. The base score has then been normalized to maximum games called (i.e. the number of disaster performances has been scaled to the value it would have if the referee had called 24 games). Additional points have been added to the score for overall values outside 2 s.d. from the mean of the 20 referees calling 10 or more games. Finally, I've assigned a qualitative modifier based on the duel vs. foul trends plotted above, giving lower scores for positive linear correlation and better fit.
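A minimal sketch of the first two scoring steps, flagging games beyond 2 standard deviations and scaling the count to 24 games, might look like the following. The names `outlier_flags` and `adjusted_score` are hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) over the sample version is an assumption; with 410 games the two differ very little.

```python
from statistics import mean, pstdev


def outlier_flags(values, n_sd=2.0):
    """1 for each value more than n_sd standard deviations from the mean, else 0.

    `values` should hold one metric for every game in the full dataset,
    so the mean and s.d. describe the league and not a single referee.
    """
    mu = mean(values)
    sd = pstdev(values)   # assumption: population s.d.
    return [1 if abs(v - mu) > n_sd * sd else 0 for v in values]


def adjusted_score(points, games_called, max_games=24):
    """Scale a referee's summed points to what they would be over max_games."""
    return points * max_games / games_called


# Hypothetical metric values: nine ordinary games and one extreme one.
flags = outlier_flags([10] * 9 + [100])
print(flags)                    # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(adjusted_score(3, 12))    # 6.0
```

The normalization step matters because a referee with 12 games and 3 flagged performances would otherwise look better than one with 24 games and 5.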

Center Referee   AdjScore   Overall   DuelTrend   TotalScore
Marrufo           14.18        3          1         18.18
Guzman             9.60        0          2         11.60
Bazakos            9.60        0          1         10.60
Fischer            9.33        0          1         10.33
Stoica             9.23        0          1         10.23
Penso              8.00        0          2         10.00
Unkel              9.88        0          0          9.88
Jurisevic          7.50        0          1          8.50
Gantar             8.00        0          0          8.00
Petrescu           6.67        0          1          7.67
Chapman            4.57        0          2          6.57
Elfath             4.36        0          2          6.36
Toledo             7.00        0         -1          6.00
Villarreal         8.00        0         -2          6.00
Geiger             7.06        0         -2          5.06
Stott              2.40        0          2          4.40
Salazar            6.32        0         -2          4.32
Kelly              4.80        0         -1          3.80
Grajeda            3.27        0          0          3.27
Rivero             2.09        0         -1          1.09
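Anyone who prefers a different weighting can start from the three components, since TotalScore is their simple sum. The sketch below re-adds the table's columns and sorts the result; the variable names are hypothetical and the figures are copied from the table.

```python
# (referee, AdjScore, Overall, DuelTrend, TotalScore) as listed in the table.
rows = [
    ("Marrufo", 14.18, 3, 1, 18.18), ("Guzman", 9.60, 0, 2, 11.60),
    ("Bazakos", 9.60, 0, 1, 10.60), ("Fischer", 9.33, 0, 1, 10.33),
    ("Stoica", 9.23, 0, 1, 10.23), ("Penso", 8.00, 0, 2, 10.00),
    ("Unkel", 9.88, 0, 0, 9.88), ("Jurisevic", 7.50, 0, 1, 8.50),
    ("Gantar", 8.00, 0, 0, 8.00), ("Petrescu", 6.67, 0, 1, 7.67),
    ("Chapman", 4.57, 0, 2, 6.57), ("Elfath", 4.36, 0, 2, 6.36),
    ("Toledo", 7.00, 0, -1, 6.00), ("Villarreal", 8.00, 0, -2, 6.00),
    ("Geiger", 7.06, 0, -2, 5.06), ("Stott", 2.40, 0, 2, 4.40),
    ("Salazar", 6.32, 0, -2, 4.32), ("Kelly", 4.80, 0, -1, 3.80),
    ("Grajeda", 3.27, 0, 0, 3.27), ("Rivero", 2.09, 0, -1, 1.09),
]

# Recompute each total; swap in your own weights here to re-rank the list.
recomputed = {name: round(adj + overall + trend, 2)
              for name, adj, overall, trend, _ in rows}

mismatches = [name for name, *_, total in rows
              if abs(recomputed[name] - total) > 0.005]
ranking = sorted(recomputed, key=recomputed.get, reverse=True)
print(mismatches)     # [] when every row adds up
print(ranking[:3])
```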

A reader may easily disagree with the weighting or merit of any or all of these factors, so each of the three is listed individually along with the final added abnormality score. High scores indicate that games have been called outside the MLS officiating norm. Importantly, this overall approach is normalized to the contentiousness of the games in question and pretty forgiving towards differences in "style." A referee who consistently scores high is effectively calling a game with different rules - teams should not be asked to change sport from week to week based on the man in the middle.