CHICAGO, UNITED STATES: Los Angeles Lakers guard Kobe Bryant (L) and Chicago Bulls guard Michael Jordan (R) talk during a free-throw attempt during the fourth quarter 17 December at the United Center in Chicago. Bryant, who is 19 and bypassed college basketball to play in the NBA, scored a team-high 33 points off the bench, and Jordan scored a team-high 36 points. The Bulls defeated the Lakers 104-83. (Photo: Vincent Laforet/AFP/Getty Images)

Over the past two decades, single-value basketball metrics have been popping up like weeds. The first composite stat to really take hold was John Hollinger’s Player Efficiency Rating (PER), which has been followed by further advancements on the box score along with metrics that harness plus-minus data, like ESPN’s Real Plus-Minus (RPM). Besides feeling like alphabet soup, what do these all-in-one stats really measure, and how valuable are they for player analysis?

To investigate this, I grabbed the following metrics for every player-season since 1997 (unless otherwise noted) in an attempt to glean value and a deeper understanding of these numbers:

I started by looking at a metric’s stability from year-to-year. It’s helpful to understand how team circumstances can influence a statistic, so I compared each metric’s variability when a player changes teams to when he remains on the same team with similar teammates — in this case, defined as having 85 percent of the same minutes return from the previous season, per Basketball-Reference’s roster continuity calculator. In other words, if LeBron James averaged 27 points per game last year, should we expect him to average 27 per game this year? And what should we expect if he changes teams?

Since different metrics are on different scales, I’ve charted variability in standard deviations, ordered by the most stable metrics when playing with similar teammates:

Unsurprisingly, points per game is the most consistent of these numbers from year to year when teams remain largely unchanged. It’s also the most consistent when players move to new teams, which makes sense given that most 15-point per game scorers don’t suddenly morph into 9- or 21-point per game players (the equivalent of a one standard deviation change). PER, which is heavily influenced by volume scoring, clocks in as the second-most change-immune metric. On the opposite end of the spectrum is the dummy statistic, iSRS, which is incredibly sensitive to team changes. This is by design: iSRS treats all starters as roughly the same, regardless of how they actually played.

There’s nothing inherently good or bad about more stability. For instance, imagine a volatile metric that says Isaiah Thomas was incredibly valuable in Boston but not so valuable in Phoenix. This could be accurately capturing how differing circumstances or roles dictate value, or it might be inaccurately tied to his teammates’ performance, much like iSRS. On the other hand, a metric that is insensitive to team change might be accurately measuring a player’s overall “goodness” (irrespective of circumstantial value), or it might be rigidly measuring something else that is consistent, but inaccurate.

What’s noteworthy here is the ratio between new team and “same” team variability. Wins Produced (WP) and Win Shares (WS) have a nearly 2:1 ratio, whereas stats like PIPM and APM are fairly similar in variability, regardless of new teammate circumstance. This implies that WP and WS are either capturing the real-world volatility of player value — think of a scale bouncing around yet still accurately measuring weight — or inaccurately allocating credit based on how well a player’s teammates play.
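One way to read the comparison above is as a calculation: measure each player's season-to-season change, express it in units of the metric's standard deviation, and compare the spread of those changes for players who stayed put against players who moved. The sketch below is a minimal illustration under that assumption, not the exact procedure behind the chart. The function names and the toy numbers are invented for the example.

```python
from statistics import pstdev


def variability_in_sd(pairs, metric_sd):
    """Spread of year-to-year changes, in units of the metric's SD."""
    changes = [(nxt - prev) / metric_sd for prev, nxt in pairs]
    return pstdev(changes)


def stability_report(rows):
    """rows: (prior_value, next_value, changed_team) per player-season pair."""
    # Assumption: scale by the SD of every observed value of the metric.
    all_values = [v for prev, nxt, _ in rows for v in (prev, nxt)]
    metric_sd = pstdev(all_values)
    same = [(p, n) for p, n, moved in rows if not moved]
    new = [(p, n) for p, n, moved in rows if moved]
    same_var = variability_in_sd(same, metric_sd)
    new_var = variability_in_sd(new, metric_sd)
    return {"same_team": same_var, "new_team": new_var,
            "ratio": new_var / same_var}


# Invented points-per-game pairs, purely for illustration.
toy_rows = [
    (15, 16, False), (20, 19, False), (8, 9, False), (25, 24, False),
    (15, 18, True), (20, 17, True), (8, 11, True), (25, 22, True),
]
report = stability_report(toy_rows)
print(report)
```

A ratio near 1 corresponds to a metric that behaves alike whether or not a player changes teams, while a ratio near 2 is the pattern described for Wins Produced and Win Shares.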

Retrodiction testing

To figure out what’s going on here, I extended an old study by Neil Paine (who himself expanded on similar studies), using prior-season values to predict team performance in the following year. The idea is to take each player’s rating in a stat and predict his contribution in the next year based on the number of minutes he ends up playing. For instance, if Win Shares says LeBron James was responsible for a win every 200 minutes in 2018, and he plays 2,000 minutes in 2019 for the Lakers, he’ll be worth 10 wins for them. Add that number to his new Laker teammates’ projected wins (based on each of their 2018 Win Shares) and we can predict the 2019 team performance. This method accounts for injuries because it uses the minutes actually played, but it doesn’t control for aging or similar confounders; every metric in the test is equally susceptible to those flaws.
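The projection arithmetic described above is simple enough to sketch directly. This is a minimal illustration, assuming a metric expressed in wins. `projected_team_wins` and its tuple layout are hypothetical names, and skipping players with no prior minutes is a placeholder choice rather than something the study specifies.

```python
def projected_team_wins(players):
    """players: (prior_wins, prior_minutes, next_minutes) for each player."""
    total = 0.0
    for prior_wins, prior_minutes, next_minutes in players:
        if prior_minutes <= 0:
            # Assumption: no prior-season rate to carry forward, so skip.
            continue
        wins_per_minute = prior_wins / prior_minutes
        total += wins_per_minute * next_minutes
    return total


# The example from the text: one win per 200 minutes, then 2,000 minutes.
# The prior totals are hypothetical, chosen only to give that rate.
james = (15.0, 3000, 2000)
# Invented teammates, for illustration only.
teammates = [(6.0, 2400, 2000), (3.0, 1500, 1000)]

print(projected_team_wins([james]))              # roughly 10 wins
print(projected_team_wins([james] + teammates))  # roughly 17 wins
```

Because the minutes come from the season being predicted, a player who misses time simply contributes fewer projected wins.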

There’s actually a larger issue with those prior tests that I wanted to suss out. Imagine a metric that thinks Kevin Love was much better than LeBron James while they were in Cleveland together. It uses 2015 to predict 2016, 2016 to predict 2017, and so on. Since the Cavs returned a similar team each year, they finished with a similar record and thus, the metric will accurately predict the Cavs, all the while claiming that Love is the driving force behind the team, not James. It’s only in 2019, when the rosters were shuffled significantly, that its claim about Love and James could be exposed (e.g. if Cleveland fell apart with Love but Los Angeles thrived with LeBron). In other words, if a metric cannot reshuffle its players and accurately predict new player combinations, then it’s likely misallocating credit in the first place.

So unlike prior versions of this test, I focused on new teammate combinations, such as James to the Lakers. I divided teams into different groups based on how much of their “core” lineup returned from the previous season. Core players drive scheme and strategy and are often fulcrums of interactive effects in basketball. Golden State is a good example today: Stephen Curry, Klay Thompson and Kevin Durant have been the centerpiece of the Warriors attack for years, while various role players around them wind up with similar shots and responsibilities. Using the continuity of the entire roster would mask whether their player churn was from peripheral parts or the main engines themselves.
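How the “core” is defined is not spelled out here, so the following is only one plausible reading: treat the top few players by prior-season minutes as the core and measure what share of their minutes belongs to players who return. Everything in this sketch is an assumption made for illustration, including `core_size=5`, the function names, and the bin edges, which are borrowed from percentages quoted elsewhere in this piece rather than from a stated binning scheme.

```python
def core_continuity(prev_minutes, returning, core_size=5):
    """Share of last season's core minutes that belong to returning players.

    prev_minutes: dict of player -> minutes for the team last season.
    returning: set of players back with the team this season.
    core_size: hypothetical cutoff for how many players count as the core.
    """
    core = sorted(prev_minutes, key=prev_minutes.get, reverse=True)[:core_size]
    core_total = sum(prev_minutes[p] for p in core)
    if core_total == 0:
        return 0.0
    kept = sum(prev_minutes[p] for p in core if p in returning)
    return kept / core_total


def continuity_bin(share, edges=(0.72, 0.88)):
    """Label a team by continuity; the edges are illustrative assumptions."""
    low, high = edges
    if share < low:
        return "low"
    if share < high:
        return "medium"
    return "high"


# Invented roster: player C leaves, everyone else returns.
prev = {"A": 3000, "B": 2500, "C": 2500, "D": 1000, "E": 1000, "F": 500}
share = core_continuity(prev, returning={"A", "B", "D", "E", "F"})
print(share, continuity_bin(share))
```

Note that bench player F returning does not move the number at all, which is the point of measuring the core rather than the whole roster.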

For these tests, I also converted all metric-values to the same margin of victory (SRS) scale via a regression. If a metric predicts a team will finish with an SRS of -5 based on its player values from the previous year, and they actually finish with an SRS of -3, then the metric is off by 2 points. If we do this for every team, we can calculate an overall average error for each metric. Below, I’ve plotted the weaker performing group of statistics, with lineup continuity running along the x-axis. The teams with lower continuity — when the core part of the roster is shuffled more — are on the left side of the graph:

(*The graph technically should not have lines, but I’ve connected the bins to clarify the trends.) Let’s start with the right side of the chart, when teams essentially return the same core. PER and Win Shares peg those teams, on average, within 3 points of their following year performance. (The error is actually RMSE, for the technically inclined.) All else being equal, that’s pretty good; it means that, for cores that stay together, we can take the PER for an entire team, assume each player will perform exactly the same in the following season, and that team will be within 3 points of its outcome, on average.
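The two steps just described, rescaling a metric to SRS points with a regression and then scoring the predictions with RMSE, can be sketched as follows. This is a minimal illustration using ordinary least squares on a single predictor. Which seasons the regression is fit on is not stated, so fitting and scoring on the same toy teams is an assumption, and all numbers are invented.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


# Invented data: each team's summed projected metric value and its actual SRS.
team_metric_totals = [30.0, 41.0, 50.0, 58.0]
team_srs = [-4.0, 0.5, 3.0, 6.5]

slope, intercept = fit_line(team_metric_totals, team_srs)
predicted_srs = [slope * x + intercept for x in team_metric_totals]
print(round(rmse(predicted_srs, team_srs), 2))

# The single-team example from the text: predicted -5, actual -3.
print(rmse([-5.0], [-3.0]))  # 2.0
```

Putting every metric on the SRS scale first is what makes the errors comparable across stats that are natively measured in wins, ratings, or points per 100 possessions.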

But look at what happens as the core is changed more and more — PER struggles to predict the new combination of players. Now look at our dummy metric, iSRS, which uses only minutes and the team’s result to evaluate players. Like PER, it’s good at dividing credit when the same core returns. As expected, iSRS also falls apart when the core changes more, largely because high-minute role players riding the coattails of stars can’t replicate the same iSRS value on other teams. Yet it still outperforms PER here! PER may be consistent from year-to-year, but it’s consistently misallocating credit.

Another baseline metric, if you will, is points per game. It’s the only metric I tested that looks better when the lineups are shuffled because it’s incredibly uninformative when lineups remain intact. This might seem counterintuitive — why would bringing back the same core make points per game less predictive? — but there’s something revealing here. The majority of those high-continuity teams are actually good; four in every five teams with a core continuity above 88 percent posted positive SRS’s, and points per game is borderline useless for predicting good teams. Winning teams returning the same core do not compile volume scorers, but instead differentiate with their playmaking, defense and role-player interactions. iSRS arguably outperforms points per game here, so it might be wiser to cite minutes-on-a-good-team instead of raw points as an indicator of player quality.

Like PER, Win Shares and Wins Produced are far less predictive with lower continuity lineups. These were the two stats that were most sensitive to new teammates when we examined year-to-year stability, yet it appears their volatility is based more on teammate performance than on accurately evaluating players. It’s implausible that these stats are correctly pegging a player’s circumstantial value (that Kevin Love really was that indispensable among his specific teammates in Cleveland and then his value shrank in 2019 without them), because this would have to happen while the inverse occurred for LeBron in LA, and then that phenomenon would need to repeat itself among hundreds of critical player-seasons. (If you’re wondering about sample size, there were 225 teams in the under-72 percent continuity group.)

The best composite metrics (today)

I’ve plotted the stronger performing set of composites on the same scale below, where “BP” is my Backpicks BPM model and “BBR” is Basketball-Reference’s version. Notice how much better this group does handling new player combinations:

Another major takeaway here is how relatively close all these stats are, especially compared to some of the Old Guard metrics from the previous group. Augmented Plus-Minus (AuPM) is designed to mimic adjusted plus-minus (APM), but it actually performs better under these conditions. This is not because it’s a superior stat, per se, but because plus-minus data is noisy and lacks the box score information used to stabilize AuPM. Unlike the previous metrics, APM doesn’t completely fall apart when lineups are shuffled, exhibiting a much softer rise in prediction error as lineup continuity is lowered.

Jacob Goldstein’s PIPM uses a luck adjustment to minimize some of the noise in that stat’s plus-minus data, and it regularly outperforms AuPM. Perhaps most interestingly, my own Box Plus-Minus model (BP) was designed to be more team-independent, and it “wins” this test when comparing lower continuity teams, despite lacking plus-minus data for players. And, when PIPM removes its plus-minus data from the equation (before 1994), PIPM looks best, slightly outperforming both of the Box Plus-Minus models. We can see this clearly when looking at the results since 1978, long before the plus-minus era (the previous graphs spanned 1998-2018):

These are all upgrades over the best of the Old Guard metrics, Win Shares. Without delving into methodological differences, it’s noteworthy that the “winner” of this type of test is a stat that removes the noise of plus-minus. This includes ESPN’s RPM, which I’ve omitted due to sample size (it’s only five years old in its current version). RPM looks about as good as these other top metrics in these tests, but certainly not better based on a few seasons of information. So is plus-minus confounding the equation by adding noise?

The Holy Grail of metrics

Since 1997, there have been 41 teams with an SRS of at least 4 and a lineup continuity of at least 88 percent. Among those teams, APM is well ahead using this testing method, with an average error of only 2.4 points per squad. (PIPM and AuPM are tied for next best at 2.8.) The sample isn’t too large, and there’s no reason to put too much stock in predicting only good teams, but clearly APM is capturing something worthwhile, without any box score information.

The improvements in box score composites themselves, like Basketball-Reference’s BPM, have come directly from mapping to plus-minus studies in order to discover valuable statistical patterns. These are strong indicators that the plus-minus family is measuring something quite important, even if it can be difficult to tease out the noise at times. Furthermore, while box stats are helpful and relatively stable, combining them with plus-minus doesn’t necessarily bring us closer to a Holy Grail player metric. (!)

I think there’s something larger going on here, and it speaks to the future of all-in-one stats in general. “Value” describes the impact a player has in a given circumstance (e.g. a big man on a team without any backups is more valuable than if he joined the Tim Duncan–David Robinson Spurs). But “goodness” is about a player’s average impact from all relevant team scenarios. Are these metrics measuring circumstantial value or overall goodness?

When the lines are blurred, it’s not clear that there is any information gain. We’ll never be able to nail a retrodiction test like the ones above with physics-like precision, because players get injured, grow older and are susceptible to variability in physiology. And I don’t think that needs to be the goal. Instead, I hope that the future of composite metrics will honor the distinctions between valuable and good, allowing us to say “with two great shooters and a dominant interior defender, LeBron James’s situational value is worth X points.” For as much as all-time rankers and talking heads want to argue about overall goodness, the coaches, general managers, and certainly many fans crave stats that answer the question “how will this player help this team?”

What we have today is still incredibly helpful — we can see this in how much smarter forecasting systems are. Many of these stats are picking up signals that clearly reflect player performance, and when they disagree, they provide fodder for discussion and more granular analysis. And maybe that’s the ultimate takeaway here: The best one-number metrics — BPM, PIPM, RAPM, and (probably) RPM — are actually quite good at ball-parking player value, but until we can generate more context-specific numbers, we’re still going to have to do the legwork of analyzing circumstances manually to filter out some of the noise.

That, and maybe we can permanently retire PER now.

Ben Taylor is the author of Thinking Basketball. He runs a podcast and YouTube channel by the same name, and once ranked the 40 best careers in NBA history.