Last week, ESPN rolled out its rankings, subtle changes to its interface, and its mock draft functionality for the 2020 fantasy baseball season. For better and (mostly) for worse, not much changed. And should you want to use them, their player rater results for 2019 are still around, though as of this writing, the site’s cleanup did just lead to them being labeled as the “2020 season” values.

The ESPN player rater occupies a strange role in the fantasy community. Ostensibly, it is a tool that converts players’ categorical outputs to one convenient, comparable number so that they can be ranked and assessed. In the lead up to draft day, industry titans, high-stakes players, and office analysts alike will all point to last season’s results to justify who they take and where.

But at the same time, the particularities of how it works and whether it means anything have been, to me at least, unclear. If you dive through r/fantasybaseball, u/TyroneBrownable gave what is probably the best summary of how they do it about three years ago:

Z-scores! Essentially ESPN will calculate the average and standard deviation for each stat. Then, for each player, you take that player’s stat and subtract it from the league average, then divide that result by the standard deviation. The sum of each players’ z-scores is their total player rating.

If you recall, this isn’t all that different from how either the Fangraphs auction calculator or my Fantasy 101 article on turning projections into auction values work. This turns it into less of a black box and more of a product that we should, in theory, be able to trust.
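As a rough illustration of the z-score approach described above, here is a minimal Python sketch. The player names, stat lines, and the `z_score_ratings` helper are all hypothetical, and it uses the usual stat-minus-average convention; it is not ESPN's actual code.

```python
from statistics import mean, pstdev

# Hypothetical stat lines for a tiny player pool; these are not real data.
players = {
    "Player A": {"R": 100, "HR": 35, "RBI": 105, "SB": 5},
    "Player B": {"R": 85, "HR": 20, "RBI": 70, "SB": 30},
    "Player C": {"R": 70, "HR": 15, "RBI": 65, "SB": 10},
}

def z_score_ratings(pool):
    """Return each player's summed z-score across every category in the pool."""
    categories = list(next(iter(pool.values())).keys())
    totals = {name: 0.0 for name in pool}
    for cat in categories:
        values = [line[cat] for line in pool.values()]
        avg, sd = mean(values), pstdev(values)
        for name, line in pool.items():
            # z-score: distance from the pool average, in standard deviations
            totals[name] += (line[cat] - avg) / sd
    return totals

ratings = z_score_ratings(players)
```

Because every category is centered on the pool's own average, the ratings across the whole pool sum to zero, which is why the choice of pool matters so much.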

And yet seemingly similar methodologies at ESPN and Fangraphs produce entirely different results. Compare the top 10 batters on the player rater to the top 10 batters produced by ESPN 10-team standard settings in the Fangraphs auction calculator:

Top 10 Hitters, 2019

They agree on a top 3, that Rafael Devers and Anthony Rendon were close and both more substantially ahead of Nolan Arenado, and that Mike Trout was 8th. Beyond that, ESPN seems to really like some shortstops that Fangraphs doesn’t. To be clear: if both systems are using z-scores and league averages to rank players, they should not be getting results as disparate as this. Something is deeply broken in Bristol.

Cracking the Player Rater Code

As it turns out, 532 of the 1408 players the player rater evaluated returned a total value of 0.00 or greater. What that would seem to imply is that the season-long player rater is based on the averages and z-scores for every player in the MLB. Pitchers are excluded, but considering that even Phillies outfielder Dylan Cozens got a rating despite getting just one at-bat, it seems like everyone else was included. I’ll dig into why this breaks even the most basic rules of survey design later, but first, it’s worth seeing the outputs that approach generates.

ESPN doesn’t offer its player rater data in an easily downloadable .csv file, but it is reasonably simple to turn each category’s results into the line they create. Considering ESPN was only offering its data to two decimal places, an exact linear regression wouldn’t have increased accuracy. I was able to backward-engineer these z-scores by taking a handful of batters’ worth of data points for each category and finding the slope and y-intercept of the line those points created.

Yes, cracking ESPN’s player rater was possible using nothing more than skills from the first few weeks of Algebra I.
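For the curious, the algebra can be sketched in a few lines of Python. If a category rating is (stat - mean) / sd, then the line through any two (stat, rating) points has a slope of 1 / sd and crosses zero at the mean. The function name and the two sample points below are hypothetical, chosen only to show the arithmetic.

```python
def implied_mean_and_sd(point_a, point_b):
    """Recover (mean, sd) from two (stat, rating) points on rating = (stat - mean) / sd."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)   # equals 1 / sd
    intercept = y1 - slope * x1     # equals -mean / sd
    sd = 1 / slope
    mean = -intercept / slope       # the stat value where the rating is zero
    return mean, sd

# Hypothetical points: 10 HR rated 0.50 and 30 HR rated 2.50.
mean_hr, sd_hr = implied_mean_and_sd((10, 0.50), (30, 2.50))
```

With those made-up points, the implied average is 5 home runs and the implied standard deviation is 10. In practice, rounding to two decimal places means points that sit far apart give the steadiest estimate.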

Here are the implied averages and z-scores that we can discern from the season-long player rater:

ESPN Player Rater Batter Formulas, 2019

And here are those same stats from 10-team ESPN standard, the format the site leans toward:

Actual Batter Value Formulas for ESPN 10-Team Standard, 2019

Comparing z-scores, here is how much the ESPN player rater inflated or deflated the relative values of players from last year:

ESPN Player Rater Batter Z-Score Inflation, 2019

For a more detailed explanation of how the “actual” numbers should be calculated, you can read my Fantasy 101 piece on turning projections into rankings (again, that method can struggle to deal with the playing time extremes).

But in short, we should be able to read this as ESPN inflating the relative value of stolen bases by more than 155% while heavily devaluing runs and RBI. The effects on home runs and average are smaller but still considerably bad. For those that are curious, lower averages do not change the standard deviation or z-score of any of the counting stats; the standard deviation for hits above average just barely changes to 13.3456 when using ESPN’s batting average instead of the .2770 mark a 10-team league should have averaged.
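To make "hits above average" concrete, here is a small Python sketch under the assumption that batting average is converted to a counting stat by comparing a batter's hits to what a league-average hitter would collect in the same at-bats. The function name and the sample batter are hypothetical; the .277 baseline is the 10-team figure quoted above.

```python
def hits_above_average(hits, at_bats, league_avg):
    """Hits beyond what a league-average hitter would collect in the same at-bats."""
    return hits - at_bats * league_avg

# Hypothetical batter: 160 hits in 550 at-bats against the .277 baseline.
extra_hits = hits_above_average(160, 550, 0.277)
```

That batter comes out about 7.65 hits above average. Lowering the baseline average raises every batter's total, and it raises it most for the batters with the most at-bats.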

We should also read this to mean that being above zero in a category for the year did not necessarily mean that a player was a positive contributor in the context of any real fantasy league. About 260 batters achieved a score above zero in each category other than steals, where that number was 188. Almost every player drafted in any ESPN league that notched at least two steals and a .256 batting average qualified as a “five-category player.”
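A toy example shows why the baseline pool matters so much. The numbers below are invented home run totals, and the seven-player "rostered pool" is an arbitrary assumption rather than a real league; the point is only that the same hitters look very different depending on which average they are measured against.

```python
from statistics import mean

# Hypothetical home run totals; the low values stand in for part-time players.
all_hitters = [40, 35, 30, 28, 25, 22, 20, 8, 5, 3, 1, 0, 0, 0]
# Assumed draftable pool: the seven best totals.
rostered_pool = sorted(all_hitters, reverse=True)[:7]

def count_above_average(values, baseline):
    """Count how many of the values beat the mean of the baseline group."""
    avg = mean(baseline)
    return sum(1 for v in values if v > avg)

above_everyone = count_above_average(rostered_pool, all_hitters)
above_rostered = count_above_average(rostered_pool, rostered_pool)
```

Measured against everyone, all seven rostered hitters are above average. Measured against each other, only three are.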

I can’t access any 7-day, 15-day or 30-day scores, but it’s safe to assume that these numbers are being generated in a similar way. They’re probably much closer to being useful—far fewer players are getting playing time on a weekly basis—but they’re still not driven by anything particularly complicated or grounded in anything mathematically coherent. Over a given 7-day period, the number of platoons would still mean that its results would probably only be correct for 20-team leagues or so, and given that ESPN doesn’t offer that, it’s unclear what you should discern from them. Maybe that you should have benched Matt Carpenter much earlier than you actually did.

To clarify why this is wrong, we’re taking a short detour into the basics of survey design. If you’re polling MLB viewers on which features they’d like to see added to game broadcasts, it’s your responsibility to do your best to ask people who watch MLB games that question. You might screen your poll respondents by asking something like “How many baseball games do you watch on TV per month?” or “When was the last time you watched at least an hour of baseball on TV?” to guarantee that you can get a survey group that, as best as possible, matched the group you want data about. Ideally, you would also want your survey group to roughly match the demographics of your viewers, but if they weren’t quite right, you could weight your results accordingly to increase the reliability of your questions. Only after you’ve done your best to make sure that your survey population matches the group you want information about should you really care about the results you get.

The ESPN player rater is doing the equivalent of polling every person that owns a TV about what changes baseball should make to be entertaining. It’s about as close to Calvinball sabermetrics as you can get—there’s some real math in there, but the rules are changing without reason and the result is nonsense.

Putting Ratings in Context

While the dramatic differences in z-scores and averages should be damning, it can be reasonably hard to understand what bad results look like without some context. You can probably see the beginnings of the player rater’s issues just by comparing Jonathan Villar’s place as the 4th batter overall according to ESPN with his spot as the 15th batter on the Fangraphs auction calculator using ESPN 10-team. But considering we expect most players at the top of the list to be elite in multiple categories, it’s more informative to move farther down and check in on the results for players who only have one or two above-average skills.

So, let’s put it to the test and compare two batters from 2019 who tied for #160 overall. Who had the better season?

Player Comparison #1

If you guessed that Player A is reigning AL Rookie of the Year Yordan Alvarez, you were correct! Player B may also stick out to fans of the Chicago White Sox as Leury García, who notably finished as the #8 hitter in the Pitcher List WorstBall League.

Were they equally valuable? Absolutely not. Alvarez played 87 games, starting 83. García had 135 starts and a total of 140 games. Should you have been choosing between the two at any point last year? No.

I can hear the cascade of “unfair!” chants from the peanut gallery; Fangraphs also has problems with players who spent time in the minors. But even still, analysts will point to overall rank to determine who had better seasons, and season-long z-scores have definite problems with games played. I’ve chosen this example for a reason: any model like this will struggle when looking backward, especially with players whose playing time was vastly different.

So, to be fairer, I’ll add on some players who haven’t even sniffed the minor leagues in the past five years to the comparison.

Player Comparison #1B

Player C (153rd overall) is 2019 NLCS MVP Howie Kendrick, who was a pinch hitter in 41 of the 121 games he played last year. Player D (154th overall) is Cesar Hernandez, whom the Indians acquired after he was non-tendered by the Phillies, and who started in 155 of his 157 appearances.

Their futures aside, these four players combined to represent four astonishingly different paths to fantasy mediocrity as measured by ESPN. Alvarez and Kendrick each saw about a half-season of starts for two very different reasons, while García played roughly an average number of games and Hernandez started nearly every day. All four of them accrued total stats that would be roughly equally valuable by ESPN’s valuation.

The conclusion here is simple. By leaving playing time out of the equation, ESPN’s player rater makes it second in importance only to steals. It essentially assumes that we would have kept these hitters in our lineups on their days off or, even more absurdly, that we would have been starting Alvarez during his 56-game stint at AAA Round Rock.

Excluding Alvarez, this does make for a realistic comparison if we’re dealing with weekly lineups. Considering ESPN’s reputation for catering to more casual fans and players, tailoring their cornerstone metric to their core audience makes some sense, and I want to give this model some credit for that. And ESPN is certainly not alone in this problem—Fangraphs and my method for roughly repeating its results are both guilty of it as well.

That wrinkle aside, there are more damning results available when we don’t cherry-pick playing time extremes. The following two players were just six at-bats apart and finished as the 39th and 40th batters overall.

Player Comparison #2

The conclusion of this comparison should immediately make sense. If the z-scores didn’t make this clear, this is an incredibly strong preference for steals. Give any player 21 of them and their value should start high, but ESPN is valuing the 21 additional stolen bases that Tommy Pham picked up above the additional 13 home runs, 11 additional points of batting average, 33 runs, and 25 RBI that Carlos Santana earned.

Maybe you consider that first base is typically a strong position, which might mean Pham gets ranked higher for playing in the outfield. While this could potentially be problematic (I’m somewhat anti-adjustment for hitters other than catcher), it would at least do some work toward explaining how he gets ahead of Santana.

That is definitely not the case, though. Let’s examine another pair, the 58th and 59th batters overall:

Player Comparison #3

These two players have reasonably comparable outputs. It’s worth noting that Player B achieved his stats in 138 starts over 145 games, compared to Player A’s more limited 115 starts in 123 games. Again, it looks like four additional steals are getting a lot more value than they should.

Player A is Ramon Laureano. Player B is JT Realmuto. And ESPN isn’t adjusting for positional eligibility.

This should have been clear based on the math we did above. The outputs for each category match almost perfectly to what the slope predicts, and the total is just the categorical outputs added together. There was never any room for Realmuto to have his value pushed up; position is just straight-up excluded from the equation.

Not adjusting for positional eligibility is even more unfathomable than Realmuto’s ability to start 138 games. By not doing this, ESPN unfairly pushes down the value of catchers across the board and makes using its tool for trades involving catchers impossible. Considering that their tool handsomely rewards playing time, Realmuto should be a huge beneficiary, but he gets hung out to dry instead.

Via different mistakes, the ESPN player rater seems essentially useless for comparing players with different positions, categorical skills, or games played. So what is it good for? Let’s look at these two:

Player Comparison #4

Thanks to the help of ESPN, we can safely conclude that Gleyber Torres was better than Mike Moustakas over the course of the 2019 season.

And oh, are we thankful for that help.

What About Pitchers?

As far as pitchers go, something akin to the player rater is somewhat less interesting because of how many good pitchers will inevitably miss time throughout the year due to injury. If we know that an emphasis on playing time is going to push up pitchers, we could predict that it might overemphasize durability compared to actual skill. But considering that two of the five stats are ratio-driven and almost every starter is aiming to both win games and strike out batters, the best pitchers should still be at the top.

Again, let’s compare the top 10 players for both the ESPN player rater and the Fangraphs auction calculator for ESPN 10-team standard:

Top 10 Pitchers, 2019

Both agree on the top three in order to start and on the order of Greinke, Flaherty, and Bieber. But ESPN pushes down Ryu and pushes up both Strasburg and Hader—who comes in at 17th according to Fangraphs, for reference.

We can use a similar approach to what we described with hitters to peek under the hood; the formulas that led to ESPN’s ratings are shown below. To factor in innings pitched, it uses runs below average instead of ERA and walks plus hits below average instead of WHIP, as is standard.
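The ratio conversions can be sketched the same way as hits above average. The function names, the sample pitcher, and the 4.00 ERA / 1.25 WHIP baseline below are hypothetical placeholders, and innings are assumed to be plain decimals rather than box-score notation like 180.1.

```python
def runs_below_average(earned_runs, innings, league_era):
    """Earned runs saved relative to a league-average pitcher over the same innings."""
    return innings * league_era / 9 - earned_runs

def walks_hits_below_average(walks, hits, innings, league_whip):
    """Baserunners prevented relative to a league-average pitcher over the same innings."""
    return innings * league_whip - (walks + hits)

# Hypothetical starter: 180 IP, 70 ER, 50 BB, 150 H,
# measured against an assumed 4.00 ERA / 1.25 WHIP baseline.
runs_saved = runs_below_average(70, 180, 4.00)
baserunners_saved = walks_hits_below_average(50, 150, 180, 1.25)
```

This made-up starter saves 10 runs and 25 baserunners against that baseline. Set the baseline ERA a full run too high and the same pitcher is credited with 20 more runs saved, which is how a weak average props up mediocre arms.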

ESPN Player Rater Pitcher Formulas, 2019

Compare that against the values we would arrive at by assuming seven starters and three relievers per team:

Actual Pitcher Value Formulas for ESPN 10-Team Standard, 2019

And again, the z-score inflations:

ESPN Player Rater Pitcher Z-Score Inflation, 2019

Four of these are really close, which means it should be able to reasonably compare starters. But with saves doubling in value, it massively inflates relievers, especially those whose primary value comes from their starting job. When combined with the lack of positional adjustments, this means that ESPN has been pushing closers up way past their actual earned values, which explains Hader’s rank. Whether this changes Matthew Berry’s commandment to never pay for saves is yet to be seen, but he was definitely right that his employer’s formula was wrong on them.
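One way to read "inflation" here, offered as an assumption about the arithmetic rather than a confirmed formula: each unit of a stat is worth 1 / sd rating points, so shrinking the standard deviation raises the value of every save. The function name and the standard deviations below are hypothetical.

```python
def z_score_inflation(sd_reference, sd_observed):
    """Fractional change in per-unit value when sd_observed replaces sd_reference."""
    return sd_reference / sd_observed - 1

# Hypothetical: saves have an SD of 10 in the reference pool but 5 in the observed pool.
saves_inflation = z_score_inflation(10.0, 5.0)
```

Halving the standard deviation gives an inflation of 1.0, meaning each save is worth twice as much as it should be.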

But don’t mistake having the z-scores mostly correct for their formula being right. The averages that they start from are still far, far away from close. Once again, this is the result of seemingly including every pitcher out there aside from the occasional position player to pitch (though Russell Martin would have achieved some excellent ERA and WHIP numbers). But even so, their ERA mark was off by a full run per nine!

Another notable conclusion that we should pull from this is that many pitchers who definitely tanked their owners’ rankings in pitching categories still received very positive scores. 180 innings of a 4.00 ERA and 1.25 WHIP would need to be combined with good strikeout totals to be viable in ESPN 10-team leagues, but it would have been rated quite positively by their player rater no matter what. Consider the following example:

Player Comparison #5

Player A is Danny Duffy, whose ERA and WHIP should have kept him far away from anything other than 15-team leagues, and his lack of true workload—just 130 innings!—meant that he wasn’t providing great bulk for those teams either. Player B is Aaron Civale, whose ratios meant that he would have been well worth starting even in 10-team leagues, even if his strikeout rate left something to be desired. Neither was fantastic, but Civale likely did much more to help teams win their leagues than Duffy. By not weighting ERA and WHIP properly, this was not nearly as obvious as it should have been.

That said, neither should have been players that cracked the top-100 and factored into the z-score calculations. Let’s look at two players who just barely did, at 87th and 88th among pitchers:

Player Comparison #6

Considering his low save total, I’m not convinced that our Player A, Will Harris, is egregiously overrated here. And our player B, Caleb Smith, might be under-penalized for his ratios. But just like the comparison between Realmuto and Laureano was problematic, so too is this one—though why is much less obvious.

The issue comes from what happens when we look beyond these two and at the top 100 as a whole. The player rater places 37 relievers in its top 100 pitchers overall, excluding dual-eligible players and Rays “false starters” Yonny Chirinos and Ryan Yarbrough. But if 10 teams were to roster only these 100 players, they wouldn’t sniff the 200 games started cap that ESPN imposes. With nine pitching starters, three bench spots, and one IL spot, ESPN standard leagues simply don’t have the room for that many relievers on their teams’ rosters. It’s far more appropriate to assume that the average team would roster a maximum of three relievers and about seven starters, if not more, hence the z-scores above.

In order to make our rankings reflect who should have been owned, we have to give a smaller positional adjustment to relievers to push their value down: the opposite of what we do for catchers, but for the same reason. Applying ESPN 10-team settings to the Fangraphs auction calculator gives 2019’s starting pitchers an additional $10.4, which is about the amount owners can spend per player, and pushes the average drafted player to about that amount and the last drafted starter to about $1. And by setting the positional adjustment for relievers at just $4.8, it rightfully pushes down their relative value. ESPN neglects to make any such adjustment.

Curiously, the reliever adjustment tends to run closer to $7 when using either ATC or Steamer’s projections for 2020. We know that a few relievers are going to have massive outlier seasons with absurd WHIP and ERA numbers; we just don’t know which. By neglecting to adjust at all, ESPN happens to be more wrong when looking backward than it would be projecting forward, which it doesn’t try to do.

Conclusions

For those keeping score at home, that means the player rater should not be used for:

Comparing players with different numbers of at-bats or innings pitched

Comparing players who were called up from the minor leagues or were long-term injured to those who were not

Comparing hitters or pitchers whose value comes from different categories

Determining whether hitters or pitchers should have been rostered over the course of the entire season

Comparing catchers to other hitters

Comparing relievers to starting pitchers or relievers who don’t get saves

Determining what a “good” batting average is

Determining what a “good” ERA or WHIP is

Comparing pitchers with different ratios, regardless of playing time

Here’s what you can use it for:

Comparing players with near-identical seasons

Fleecing competitors into buying your excess steals and saves for exorbitant prices

This almost certainly does not have to be the case. In its 2019 year-end press release, ESPN bragged of over 20 million users for its fantasy sports products in the month of September alone. It certainly has the traffic and resources to fund a better product. Especially when considered against the work that ESPN Stats & Info produces, or that John Hollinger has previously produced for its NBA fans, it should strike readers as farcical that these numbers are even allowed on their site. Tabulating real-time win probabilities for football games should be far more difficult than producing a functional end-of-year fantasy baseball ranking.

And yet, I have little faith that they will—a shrinking number of features and the declining quality of those that remain at Mickey Mouse Fantasy don’t inspire confidence. And all told, this is a shame. It’s not like the site’s content is an unmitigated disaster — in the past eight years, Tristan Cockroft has three wins and three second-place finishes in NL Tout Wars, and his rankings routinely make FantasyPros’ top-10 list for end-of-year accuracy. It’s a shame that players who are aware of ESPN’s shortcomings might extend any distaste toward the site to him as well.

Clearly, ESPN is capable of delivering the improvements that its readers and writers deserve. So what would better look like? At the very least, it would be easy to convert the existing system to deliver a better version of what Fangraphs provides, automatically tailored to the individual league settings of its users. It should be able to tell players how far above or below average a player was compared to that league’s player pool. It would even be possible to replace the calculated, best-case averages that we have to rely on for our projections with the dynamic league averages: instead of the .277 batting average or 3.57 ERA my model predicts, it could easily substitute the averages that players are actually competing against.

A more serious improvement would be to move away from using season-long z-scores and averages. Instead of comparing Cesar Hernandez and Yordan Alvarez based on a system that assumes Alvarez was parked in a starting lineup for the entire season, they could be calculating how much better or worse than the average batter players were on a per-game basis, and then multiplying that by games played, games started, or innings pitched. And considering they have access to the largest trove of fantasy baseball league data in the world, they might be best positioned to determine the exact penalty that taking more games off than average provides.
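A per-game version might look something like the Python sketch below. The function name, the 0.15 home runs per game league rate, and both stat lines are hypothetical; it only illustrates measuring production against a per-game average and then scaling by games actually played.

```python
def playing_time_adjusted_value(stat_total, games, league_rate_per_game):
    """Per-game production above the league rate, scaled by games played."""
    if games == 0:
        return 0.0
    per_game_edge = stat_total / games - league_rate_per_game
    return per_game_edge * games

# Hypothetical hitters measured against an assumed league rate of 0.15 HR per game.
half_season_slugger = playing_time_adjusted_value(25, 85, 0.15)
everyday_light_hitter = playing_time_adjusted_value(10, 140, 0.15)
```

Under those assumptions the half-season slugger comes out 12.25 home runs above average for the games he actually played, while the everyday light hitter is 11 below. That is a far cry from calling them equals.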

But short of serious and transparent changes to how the rater works to guarantee that we can trust its results, it deserves to be scrapped. And if ESPN won’t do that, it’s on the community to ignore it, alongside any analysis that depends on it.

Graphic by Justin Paradis (@freshmeatcomm on Twitter)