Projection systems are wonderful tools, both for analysis and (especially during January) entertainment. They’re a mainstay of the baseball landscape at this point, and nearly everyone knows what they mean and how to treat them. The predictions they make aren’t gospel, and they’ll frequently be wrong, sometimes in ways that seem obvious after the fact. But for large numbers of players, they’re the best way we have to estimate future performance.

How good are they, though? This is an important question; there are numerous options for someone trying to determine how good Jose Bautista’s 2017 will be, and it can be nearly impossible to tell which system is more reliable or accurate without an in-depth look. The inner workings of these projection systems are generally kept secret, and with good reason; a lot of work has gone into them, and their proprietors deserve compensation. But as various events in 2016 have made apparent, it’s crucial that we question our mathematical models, because the validity of the assumptions they’re based on can fluctuate, sometimes without us noticing.

That’s what this article is for. As I did last year, I’m going to compare the accuracy with which the three major public projection systems predicted the performance of players over the last year. Those systems are:

PECOTA, originally developed by Nate Silver, now owned by Baseball Prospectus and run by Harry Pavlidis, Rob McQuown, and Jonathan Judge;

Steamer, created by Jared Cross, Dash Davidson, and Peter Rosenbloom, and available at steamerprojections.com, FanGraphs, and Razzball; and

ZiPS, from Dan Szymborski, and also available at FanGraphs.

I’ll be comparing them to each other, and also to “The Marcel the Monkey Forecasting System.” Marcel, conceptualized and then released into the wild by Tom Tango, was designed to be as simple as possible, and to serve as a way of evaluating all other projection systems to see if they add anything to an ultra-basic method of projecting. (So easy a monkey could do it, hence the name.) Marcel uses nothing but a player’s last three years of performance, with more recent performance given more weight, and a very simple age adjustment and regression component. If a player has no MLB performance in the last three years (a rookie, an international player, or a player returning from a long hiatus), they are projected to be precisely major-league average in every respect.
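To make that description concrete, here is a minimal sketch of a Marcel-style projection for a single rate stat. The specific numbers (5/4/3 season weights, 1,200 PA of league-average regression, a peak age of 29, and the 0.006/0.003 aging factors) are the commonly cited values for Marcel’s hitter projections; they are assumptions here, not something taken from this article, and the function name `marcel_rate` is made up for illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumed constants: commonly cited Marcel values, not taken from this article.
WEIGHTS = (5, 4, 3)      # most recent season first
REGRESSION_PA = 1200     # PA of league-average performance blended in
PEAK_AGE = 29


def marcel_rate(
    seasons: Sequence[Optional[Tuple[float, int]]],
    league_rate: float,
    age: int,
) -> float:
    """Project a rate stat from up to three (rate, PA) seasons, newest first.

    A season with no MLB playing time is passed as None.
    """
    weighted_events = 0.0
    weighted_pa = 0.0
    for weight, season in zip(WEIGHTS, seasons):
        if season is None:
            continue
        rate, pa = season
        weighted_events += weight * rate * pa
        weighted_pa += weight * pa

    # No MLB data in the window: project exactly league average.
    if weighted_pa == 0:
        return league_rate

    # Regress the weighted performance toward the league average.
    regressed = (weighted_events + REGRESSION_PA * league_rate) / (
        weighted_pa + REGRESSION_PA
    )

    # Very simple age adjustment (assumed form): a bump for players younger
    # than the peak age, a smaller penalty for older ones. This direction
    # only makes sense for stats where higher is better.
    if age < PEAK_AGE:
        factor = 1 + 0.006 * (PEAK_AGE - age)
    else:
        factor = 1 - 0.003 * (age - PEAK_AGE)
    return regressed * factor
```

A hitter with three 600-PA seasons at a .400 OBP in a .320 league comes out between the two figures, pulled part of the way back toward average, while a player with no seasons in the window gets .320 exactly.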

Marcel is a very useful, very blunt tool, and it’s the perfect baseline for this kind of project. Because the methodology is public, it’s available in a few places, but I got the 2016 data from Baseball Musings. And if you’ve got other questions about the projection systems, I wrote a rundown of their methods and differences last year.

The first thing I did was pull each position player with at least one PA in 2016, and eliminate those who did not have a projection from all three of PECOTA, Steamer, and ZiPS. While the breadth of the systems is possibly an important consideration, that’s not what this project is about. I then calculated each player’s performance in five categories (walk rate, strikeout rate, “wOBA,” OBP, and SLG) and scaled those stats to the MLB average under each of the projection systems. (Here, “wOBA” is a version of wOBA using hits only, since not every system splits intentional and unintentional walks.) Then, I compared their actual 2016 performance to their projected performance under each system, including Marcel, and also compared their performance to the average projection of all four systems. Finally, I took the mean of each of those measures of accuracy across systems, weighted by plate appearances, and got a combined figure. (Thanks to this old Nate Silver post and this Tom Tango thread for the basic methodology.)
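For readers who want to try something similar, the two core steps (scaling to the league average, then averaging the errors with plate appearances as weights) might look roughly like the sketch below. It assumes the error measure is a plain absolute difference and that “scaling” means rescaling a set of rates so their PA-weighted mean matches a target league average; both are my guesses at one reasonable reading, and the function names are hypothetical.

```python
from typing import List


def scale_to_league(
    rates: List[float], pas: List[int], target_mean: float
) -> List[float]:
    """Rescale rates so their PA-weighted mean equals target_mean."""
    total_pa = sum(pas)
    current_mean = sum(r * pa for r, pa in zip(rates, pas)) / total_pa
    return [r * target_mean / current_mean for r in rates]


def weighted_abs_error(
    projected: List[float], actual: List[float], pas: List[int]
) -> float:
    """Mean absolute projection error, weighted by plate appearances."""
    total_pa = sum(pas)
    weighted_sum = sum(
        abs(p - a) * pa for p, a, pa in zip(projected, actual, pas)
    )
    return weighted_sum / total_pa
```

Weighting by PA means a miss on a 650-PA regular counts far more than a miss on a September call-up with 12 PA, which keeps tiny samples from swamping the result.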

So, with that long intro out of the way, how did each system handle 2016?

There’s a lot of info there, as well as a few clear takeaways.

Steamer was the most accurate of the three main systems. Steamer, in the aggregate, was the most accurate in four of the five categories, and the second-most accurate in the fifth. While I wouldn’t read too much into a marginal difference (e.g., Steamer’s .079 average OBP error that just barely beats PECOTA’s .080), Steamer did substantially outperform the competition for walk rate, and it was never bad. PECOTA, however, was rarely far behind, and also quite accurate on the whole.

ZiPS didn’t have a good year. ZiPS was the least accurate of the three systems in each of the five categories, and never by a particularly small margin. You don’t want to conclude too much based on a single season of results, but ZiPS didn’t perform very well in last year’s review, either. (I should also note that this is Steamer’s second straight year of leading the pack in convincing fashion.)

Marcel does its job. Marcel wasn’t great, but it was almost always in the neighborhood of accurate. It beat ZiPS in four of the five categories, and even led in OBP. Marcel remains very hard to convincingly beat (or even beat at all), despite its simplicity.

Averaging the projections might be a great idea. The “Average” row in the above table is exactly what you would expect: the accuracy of the average of all four systems. It beat all four systems in four of the five categories, and fell short of only Steamer in the fifth. One would expect that an average would rarely be egregiously wrong; it’s surprising to see that the average also tended to be closer to right than each individual projection. This could be a quirk of a single season of projections, but at the very least, it seems to say that the brute-force method of resolving differences between the projection systems is credible.
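A toy example shows how the average can beat every system that goes into it: when two systems miss in opposite directions, the misses partly cancel. The numbers below are invented purely for illustration (they are not real projections), and the helper names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def consensus(projections: Dict[str, List[float]]) -> List[float]:
    """Average each player's projection across systems, equally weighted."""
    return [mean(values) for values in zip(*projections.values())]


def mean_abs_error(projected: List[float], actual: List[float]) -> float:
    """Unweighted mean absolute error across players."""
    return mean(abs(p - a) for p, a in zip(projected, actual))


# Invented OBP figures for two players; not real data.
actual = [0.340, 0.300]
systems = {
    "A": [0.360, 0.290],  # too high on player 1, too low on player 2
    "B": [0.330, 0.320],  # too low on player 1, too high on player 2
}

errors = {name: mean_abs_error(proj, actual) for name, proj in systems.items()}
errors["Average"] = mean_abs_error(consensus(systems), actual)
print(errors)
```

Here each system is off by .015 on average while the blend is off by only about .005. In general, the absolute error of an averaged projection can never be larger than the average of the individual systems’ errors, but nothing guarantees it will beat the best single system, so the result in the table is not automatic.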

Marcel’s performance really is extraordinary. Of course, it has a slight advantage, in that we’re only testing its performance on those persons who actually play at the major league level. Marcel also projects you, me, 98-year-old Bobby Doerr, Daniel Murphy’s 30-month-old child, and literally everyone else in the world, to perform at MLB average; presumably, the other projection systems would be much more doubtful of our abilities, and much more accurate. But if you are only concerned with people at or near the major league level, Marcel does a great job, and is very hard to improve upon.

What if we focus on the area where Marcel is presumably weakest: rookies, and the other players who have no recent major league experience? I re-ran the above calculations, limiting the pool of players to those who received an MLB-average projection from Marcel. I’ll refer to this group as “rookies,” though there may be a few non-rookies among them.

Marcel does very poorly in some respects (walk rate and strikeout rate see it fall off the pace set by the other three systems by a huge margin), but surprisingly well in others (it’s the most accurate for OBP and SLG). To reiterate exactly what that means: In 2016, it was more accurate to assume a rookie would have MLB-average slugging and on-base percentages than to look to any of PECOTA, Steamer, and ZiPS. That’s pretty remarkable. This was also true last year, with Marcel predicting walks/strikeouts very poorly but doing better at on-base/slugging/“wOBA.”

Unsurprisingly, the average error goes up across the board; rookies are harder for each system to predict than their peers with major league experience. ZiPS also redeems itself somewhat, mostly at the expense of Steamer. There’s more fluctuation in these results, and they’re based on a much smaller collection of PAs than the first chart (8,000 vs. 171,000), so I hesitate to draw any conclusions, but it could be that ZiPS has some subject-matter expertise, so to speak, when it comes to rookies, while Steamer has the same for established players.

There’s only so much information you can wring out of a single season by slicing the sample into smaller and smaller chunks. But I did want to test the systems on one more group of players: stars, the players every system agrees are likely to be very good. These are, by definition, players with an extraordinary track record, and figuring out to what degree that extraordinary recent past will carry over into the future is both difficult and one of the most important things we ask of projection systems. These are players whom teams will give up a lot for, in money or in prospects, and players who can make or break an entire team’s season.

I defined this group to consist of any player who ranked in the top 20 by OPS in any of the four systems. There was broad agreement; the end result is a list of 29 players, with nearly 18,000 plate appearances in 2016 between them. Here’s how the systems did at projecting those players:
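In code, that selection is just the union of each system’s top 20, something like the sketch below. The input layout (system name mapped to a dict of player name to projected OPS) and the name `star_pool` are assumptions made for this example.

```python
from typing import Dict, Set


def star_pool(
    ops_by_system: Dict[str, Dict[str, float]], top_n: int = 20
) -> Set[str]:
    """Return every player ranked in the top_n by projected OPS in any system."""
    stars: Set[str] = set()
    for projections in ops_by_system.values():
        # Sort player names by projected OPS, best first.
        ranked = sorted(projections, key=projections.get, reverse=True)
        stars.update(ranked[:top_n])
    return stars
```

If all four systems agreed perfectly, the pool would hold exactly 20 players; total disagreement would produce 80. Landing at 29 is what “broad agreement” looks like.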

Marcel does generally better than it did with the rookies, and does the best job of projecting elites in “wOBA,” OBP, and SLG. Again, it struggles most with strikeout and walk rates, which could indicate a pattern. Steamer and PECOTA are again fine, though they rarely manage to beat Marcel convincingly. And after its success with the youthful players, ZiPS really struggles with those in the top tier. For whatever reason, the results on walk rate are almost the opposite of those in every other category, with ZiPS leading the rest by a large margin.

* * *

This analysis isn’t the end-all, be-all of evaluating these projection systems. This is one way of measuring their accuracy, and it’s based on just one year of results. But we’ve got to start somewhere, and as we repeat this for multiple years and tweak the methodology, we can start to draw firmer conclusions about accuracy and the relative strengths of each system. It’s too early to make sweeping generalizations now, but if ZiPS repeats this performance for several consecutive years, it might be time to start questioning it seriously.

The only thing these results seem to say with any confidence: Marcel is shockingly good. It debuted nearly ten years ago, and these projection systems have existed for that long or longer as well. They still struggle to do any better than the system made to be as simple as possible. Predicting baseball: It’s really, really hard.