Short Version

WPA, used as a value metric, is incomplete. You have to build in replacement level, including bullpen chaining, to get the full story. These adjustments, which are commonly accepted parts of WAR, shift value to starting pitchers relative to high-leverage relievers, while keeping the win expectancy framework of WPA at the heart of the value metric. With readily available data we can use some basic algebra to convert WPA to WPA-Above-Replacement, or WPAAR.

Much Longer Version

Thanks to Zach Britton’s near-perfect season, a reliever received real Cy Young support. In the saber community, his candidacy was supported by WPA (Win Probability Added), where he led the majors with +6.2 wins (Andrew Miller finished second with +4.8 wins). WPA-backed arguments mimic those that tout Mike Trout as the obvious MVP: “whoever helped their team win the most games is the most valuable player.” The problem with WPA-based arguments, however, lies in the assumption that the WPA leader is the player who helped his team win the most games. Even after deciding that the win probability framework is the one you want to use, WPA is just a win probability metric, not the win probability metric, and, as I’ll lay out below, it’s an incomplete one.

Background: Win Probability and Leverage

Win probability (or win expectancy) is the baseball version of the little percentages next to the cards in televised poker. Given the state of the game, and an expectation of “all” the possibilities that could occur the rest of the game, how likely is each team to win? Win probability added is the change in win probability after an event occurs. When Jose Bautista hits a home run, the Blue Jays are more likely to win, and that change in expectancy is credited to Bautista. Add up all these little changes over a full season, and you have a player’s WPA.
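As a minimal sketch of that bookkeeping, the snippet below adds up the change in win probability across a few plate appearances. The win probabilities and the `wpa` helper are hypothetical, invented for illustration; real values come from a win expectancy table.

```python
# Hypothetical (before, after) win probabilities for the batting team,
# one pair per plate appearance by the same hitter. Illustrative only.
plays = [
    (0.50, 0.62),  # home run in a tied game
    (0.62, 0.60),  # strikeout
    (0.35, 0.41),  # single with a runner on
]

def wpa(plays):
    """Sum the change in win probability over every play."""
    return sum(after - before for before, after in plays)

total = wpa(plays)
print(f"WPA: {total:+.2f}")
```

Each play is credited only with the difference it made, so a season total is just a long version of this sum.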

WPA is very similar to WAA, Wins Above Average, except for how the wins are tallied. WPA uses win probabilities, WAA uses linear weights. In the middle is REW (Run Expectancy Wins). Run expectancy, like win expectancy, uses the game situation to calculate the value of each play. The difference is that run expectancy only takes into account runners on base and number of outs, while win expectancy also accounts for inning and score. Linear weights doesn’t care about the context of an event at all, using the average value across all possible contexts. REW, like linear weights, uses a runs-per-win converter to translate runs into wins. Win probability starts with wins as the unit.

To summarize:

Metric Summary

Linear weights
Context/leverage: None
Question answered: On average, across all situations a PA might occur in, how many runs does a single add?
Common stats: wRAA, RAA, WAA (converted to wins)

Run expectancy
Context/leverage: Runners on base, outs
Question answered: How many more runs do we expect to score this inning because of this single?
Common stats: RE24, REW (converted to wins)

Win probability
Context/leverage: Inning, score, runners, outs
Question answered: How much more likely are we to win this game because of this single?
Common stats: WPA

Championship probability
Context/leverage: Standings, inning, score, runners, outs
Question answered: How much more likely are we to win the World Series because of this single?
Common stats: cWPA

cWPA References:

A: http://www.hardballtimes.com/postseason-probability-added/

B: http://baseballanalysts.com/archives/2009/04/championship_wp.php

C: http://www.hardballtimes.com/the-top-10-plays-of-2016-according-to-championship-wpa/

In all of these expectancy metrics, there is an inherent assumption that some situations are more important than others. For example, an at-bat in a tied game in the ninth inning matters more than in a six-run game in the fifth. It matters more because the outcome of the at-bat has a bigger influence on the outcome of the game. Mathematically, the average change in win expectancy is larger in the first example – there are wider swings. The difference between a strikeout and a home run is quite wide in a tied game in the ninth, while the difference is negligible in a six-run game in the fifth. And you know that intuitively, because your heart is racing. This “average change in win expectancy” is known as leverage. Every situation can be assigned a leverage value using similar math to expectancy metrics. Each expectancy metric has its own version of leverage, according to the context it cares about.
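To make the "average change in win expectancy" idea concrete, here is a rough sketch that compares the two situations above. Every number is invented for illustration, and the baseline used to scale the index (so that an average situation comes out at 1.0) is an assumption, not a published value.

```python
# Hypothetical outcome distributions: (probability, change in win expectancy).
# All numbers are invented for illustration, not taken from a real table.
tied_ninth = [(0.68, -0.06), (0.29, +0.10), (0.03, +0.40)]
six_run_fifth = [(0.68, -0.004), (0.29, +0.006), (0.03, +0.02)]

def average_swing(outcomes):
    """Probability-weighted average absolute change in win expectancy."""
    return sum(p * abs(delta) for p, delta in outcomes)

# Assumed league-wide average swing, used only to scale the index so that
# an average situation has leverage 1.0.
LEAGUE_AVERAGE_SWING = 0.035

def leverage_index(outcomes, baseline=LEAGUE_AVERAGE_SWING):
    """Swing in this situation relative to the assumed average swing."""
    return average_swing(outcomes) / baseline

print(f"tied game, ninth: {leverage_index(tied_ninth):.2f}")
print(f"six-run game, fifth: {leverage_index(six_run_fifth):.2f}")
```

The tied ninth-inning at-bat lands well above 1 and the blowout at-bat well below it, which is all the index is meant to capture.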

If you’ve heard of leverage, it’s most likely the one associated with win expectancy, but there’s also base-out leverage, championship leverage, etc. (Linear weights does not have an associated leverage, since outcomes have no context in linear weights.) FanGraphs reports a few aggregated stats measuring win expectancy leverage. pLI is a pitcher’s average leverage across all of his plate appearances. inLI is the average leverage at the first pitch of each inning he started. gmLI is the average leverage at the first pitch he throws in each game. exLI is the average leverage when he exits. When calculating reliever WAR, wins above average based on linear weights (or FIP or ERA) is multiplied by LI to give relievers who pitch more important innings more credit for their runs prevented.

Background: Bullpen Leverage Chaining

While closers pitch high-leverage innings and deserve a lot of credit for doing so, their replacements aren’t replacement-level relievers, but instead are setup guys. When a closer goes down, the guy added from Triple-A is given mop-up duty, not the closer role, while everyone else moves one step higher on the ladder. The closer is replaced by the setup guy, the setup guy is replaced by the 7th inning guy, and so on all the way down the line. All those little changes add up to yield the actual value of the closer. To account for this, we give half credit for the higher-leverage innings of good relievers. Why half? Because that’s what makes the math work out; there’s a longer explanation and an example calculation here if you are interested in said math. Closers usually deserve to close because they’re excellent relievers, but replacing them with setup guys doesn’t hurt the team as much as their raw leverage and WPA numbers suggest.
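Here is a toy version of that ladder, assuming a four-man bullpen with made-up leverage and talent numbers (none of them come from real data). It compares the team's loss when everyone moves up a rung against the naive view in which a replacement-level arm steps straight into the closer role.

```python
# Hypothetical bullpen: role leverage (LI) and each pitcher's talent in
# runs saved above a replacement-level reliever over a fixed workload.
# All numbers are invented to illustrate chaining.
leverages = [1.8, 1.4, 1.0, 0.6]   # closer, setup, 7th inning, mop-up
talents = [15.0, 10.0, 5.0, 0.0]   # best pitcher gets the biggest role

def bullpen_value(leverages, talents):
    """Leverage-weighted runs saved, pairing each role with its pitcher."""
    return sum(li * t for li, t in zip(leverages, talents))

healthy = bullpen_value(leverages, talents)

# Closer is lost: everyone moves up one rung and a replacement-level arm
# (talent 0.0) takes the mop-up role.
chained = bullpen_value(leverages, talents[1:] + [0.0])

# Naive view: a replacement-level arm steps directly into the closer role.
naive = bullpen_value(leverages, [0.0] + talents[1:])

loss_chained = healthy - chained
loss_naive = healthy - naive
print(f"loss with chaining: {loss_chained:.1f}")
print(f"loss without chaining: {loss_naive:.1f}")
```

With these numbers the chained loss is 21 runs against a naive loss of 27, and 21 is exactly what half credit gives the closer: 15 runs times (1.8 + 1) / 2. The exact match is a product of the evenly spaced ladder chosen here, but the direction of the result is the point.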

Background: Replacement Level

Again, what all these probability/expectancy stats have in common is that they are relative to average. You can interpret that as the league summing to zero net wins, or that each player is compared to an average player. But we don’t use wins-above-average very often, because it’s incomplete. It doesn’t account for the value that an average player provides over a replacement level player. It says that a 0 WAA player over 10 plate appearances was just as valuable as a 0 WAA player over 600 plate appearances. But you’d rather have the second player, because the first requires you to find another 590 plate appearances at league-average rate. That’s not easy, and not cheap. That’s the reason why we usually use WAR (Wins Above Replacement), building in the value of an average player above and beyond that of a replacement level player. This can be more than a two-win difference for full-time players.

Relative to above-average stats, above-replacement stats reward additional playing time. This shifts value from relievers to starters, because starters pitch more innings. Additionally, the replacement level for relievers is better, because performance improves moving from starting to relieving (and declines moving the other way). This adjustment isn’t too dissimilar from park adjustments, accounting for the difficulty of the job each player does. Relieving is easier than starting. The advantages of pitching in relief include throwing harder, using only your best pitches, and facing hitters only once per game. Most relievers are failed starters. Justin Verlander has a career 3.47 ERA as a starter, but can you imagine what his ERA would be as a reliever, going just one inning at a time? Research has shown that the typical pitcher would have an ERA almost a full run lower in a relief role than as a starter. Strikeouts increase about 17 percent, home runs per batted ball decrease about 17 percent, and BABIP decreases by about 17 points. Replacement level for relievers is about the league average ERA, while replacement level for starters is about a full run higher. One run of ERA over 180 innings is a difference of 20 runs, or about two wins. That, not coincidentally, is the value of a league average starter: two wins.
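The arithmetic in that last step is short enough to sketch. The 10 runs-per-win converter below is an assumed rule of thumb (it is what "20 runs, or about two wins" implies), and the function names and the 65-inning reliever workload are mine.

```python
RUNS_PER_WIN = 10.0  # assumed rule-of-thumb converter

def runs_from_era_gap(era_gap, innings):
    """Runs represented by an ERA difference over a number of innings."""
    return era_gap * innings / 9.0

def average_over_replacement_wins(era_gap, innings, runs_per_win=RUNS_PER_WIN):
    """Wins a league-average pitcher is worth over replacement level."""
    return runs_from_era_gap(era_gap, innings) / runs_per_win

# Starter: replacement level is about one run of ERA worse than average.
starter_runs = runs_from_era_gap(1.0, 180)
starter_wins = average_over_replacement_wins(1.0, 180)

# Reliever: replacement level is about league average, so the gap is zero.
reliever_wins = average_over_replacement_wins(0.0, 65)

print(starter_runs, starter_wins, reliever_wins)
```

An average starter comes out at 20 runs, or two wins, above replacement, while an average reliever gets no such credit no matter how many innings he throws.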

As you can probably guess, these adjustments comparing an average player to a replacement level player significantly reduce the value of high-leverage relievers relative to what raw WPA alone suggests. But these are all adjustments that we already make in WAR and are commonly accepted. By using win probability above replacement, we’re still giving bullpen aces lots of credit for their higher-leverage performances, just not as much as raw WPA claims.

The New Stuff: Converting WPA to WPAAR

So, what’s the solution? I’m going to call it Win Probability Added Above Replacement, and calculate it using the 2016 versions of Zach Britton and Jon Lester (the top starter by WPA) as examples. The two main adjustments are for bullpen chaining and differing replacement levels of starters and relievers.

1. Start with WPA. For Britton, this is +6.1. For Lester, this is +4.6. Because the former is a higher number than the latter, many people make claims like “WPA says Zach Britton was more valuable than Jon Lester.” The purpose of this article has been to highlight the context missing in that interpretation of the two numbers. I’d go so far as to say it’s plain wrong.

2. Adjust pLI (leverage index) halfway toward 1. This is the bullpen chaining adjustment. For Britton, a pLI of 1.8 becomes 1.4. For Lester, .94 stays at .94, since he’s a starter. WPA is giving Britton full credit for the situations he pitched in, when he really only deserves half.

3. Move WPA toward average (zero) by the ratio of LI_adj/LI. For Britton, that ratio is 1.4/1.8 = 78%, and 78% * 6.1 = 4.8. For Jon Lester, no change from +4.6. Because Britton only deserves half credit for the high-leverage situations he finds himself in, his WPA is adjusted down.

4. Credit the player for the value of league-average performance over replacement level. For starters, that’s about 2 wins per 180 innings. So Jon Lester gains 2*202/180 = 2.2, for a total of 6.8 WPAAR. But since reliever replacement level is approximately league average, there’s no extra credit for Britton. He stays at +4.8.
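The steps above can be collected into one small function. This is a sketch of the conversion as I've described it, using the rounded inputs quoted in the text; the function name, argument names, and the starter/reliever flag are my own choices for illustration.

```python
def wpaar(wpa, pli, innings=0.0, is_starter=False):
    """Sketch of the WPA -> WPAAR conversion described in the steps above."""
    if is_starter:
        li_adj = pli  # no chaining adjustment for starters
        replacement_credit = 2.0 * innings / 180.0  # ~2 wins per 180 innings
    else:
        li_adj = (pli + 1.0) / 2.0  # move leverage halfway toward 1
        replacement_credit = 0.0    # reliever replacement level ~ league average
    # Scale WPA toward zero by LI_adj / LI, then add the replacement credit.
    return wpa * li_adj / pli + replacement_credit

britton = wpaar(6.1, 1.8)                                # reliever
lester = wpaar(4.6, 0.94, innings=202, is_starter=True)  # starter

print(f"Britton WPAAR: {britton:.2f}")
print(f"Lester WPAAR: {lester:.2f}")
```

Run on the rounded inputs, this gives Britton roughly 4.7 and Lester roughly 6.8. The small difference from the 4.8 quoted for Britton comes from rounding at intermediate steps, so don't read anything into the last decimal.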

In total, Jon Lester gains 2.2 wins due to replacement level, while Britton loses 1.4 wins due to replacement level and chaining. Britton’s 1.5 win lead in WPA over Lester becomes a 2.0 win deficit in WPAAR. Here’s the top 25 leaderboard from 2016.

If you want to see the whole list, which displays more of the data, you can see it here.

Additional Notes

Now, I don’t actually suggest using WPA for starting pitchers, as their leverage is heavily dependent on run support and timing of the runs scored in the game, which are clearly not pitching skills (for more on not using WPA for starting pitchers, read these three pieces at The Book blog). A better approach is to use a different, more traditional WAR metric for starting pitchers, even if you want to compare them to the WPAAR numbers of relievers. If we remove starting pitchers from the WPAAR leaderboard above, here’s how relievers stack up:

Additionally, WPA does a poor job of parsing defensive credit between pitching and fielding (as in, it doesn’t do it). A fielder making a great play is credited to the pitcher under WPA, when really the pitcher should be held accountable for the quality of the batted balls he gave up, while the fielder is credited or debited value from that point depending on whether he makes the play. With the growing popularity and availability of Statcast data, this splitting of WPA credit between pitchers and fielders might be possible.

Conclusion

After adjusting WPA to account for replacement level and bullpen chaining, Zach Britton remains one of the top five most valuable pitchers in the American League in 2016, and only Justin Verlander is significantly ahead of him. But the lead he held in WPA has disappeared. WPA is a fine metric, but it’s incomplete. You can’t forget replacement level and all of its repercussions. With WPAAR, I think we have a metric that is more closely aligned with a pitcher’s true value.

References & Resources