One might argue that xFIP is slightly misaligned. One might make that argument in blog form, on the website FanGraphs, today, here, now.

One might argue that xFIP is slightly misaligned only because xFIP is commonly understood to serve a purpose distinct from FIP. FIP, aka Fielding Independent Pitching, is calculated as a function of strikeouts, walks, and home runs: that is, outcomes over which fielders bear no influence. The equation that underpins FIP is derived from a linear regression and is intended to resemble ERA, for ease of interpretation. Because it is based exclusively on outcomes, its purpose is more descriptive than predictive. In other words, it is better suited to describing what should have happened than to predicting what will happen.
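For reference, here is a minimal sketch of the FIP calculation, assuming the commonly published form of the equation (weights of 13, 3, and -2 per inning, with hit batters counted alongside walks). The stat line and the constant of 3.10 are hypothetical placeholders, not values taken from this post.

```python
def fip(hr, bb, hbp, k, ip, constant):
    """Commonly published FIP form: (13*HR + 3*(BB+HBP) - 2*K) / IP + constant.

    The constant is a league-and-season value that puts FIP on the ERA scale;
    the caller must supply it.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical pitcher line and hypothetical constant, for illustration only.
example_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180.0, constant=3.10)
print(round(example_fip, 2))  # 3.24
```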

xFIP, on the other hand, seeks to achieve the inverse. A large swath of evidence exists to suggest home run-to-fly ball rate (HR/FB) for pitchers is incredibly noisy season to season. Sure, certain pitchers might anecdotally buck the norm (apparently, Michael Pineda was born to be a cafeteria lunch lady, serving up meatballs and taters), but, by and large, HR/FB is a fool’s errand to predict. Accordingly, xFIP replaces home runs with expected home runs, calculated by multiplying the number of fly balls a pitcher allows by the league-average HR/FB. This normalizes home run damage and makes xFIP, in theory, a better descriptor (and perhaps a better predictor) of pitcher performance over time.
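The substitution described above can be sketched as follows. This assumes the same commonly published FIP weights (13, 3, -2); the pitcher line and the constant are hypothetical, while the 12.7% league HR/FB is the 2018 figure reported later in this post.

```python
def expected_home_runs(fly_balls, league_hr_per_fb):
    """Expected home runs: the pitcher's fly balls times the league-average HR/FB."""
    return fly_balls * league_hr_per_fb


def xfip(fly_balls, league_hr_per_fb, bb, hbp, k, ip, constant):
    """FIP with actual home runs swapped for expected home runs.

    Weights (13, 3, -2) are the commonly published ones and are an assumption here.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    x_hr = expected_home_runs(fly_balls, league_hr_per_fb)
    return (13 * x_hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical pitcher: 200 fly balls allowed over 180 innings.
example_xfip = xfip(fly_balls=200, league_hr_per_fb=0.127,
                    bb=50, hbp=5, k=200, ip=180.0, constant=3.10)
print(round(example_xfip, 2))  # 3.63
```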

And therein lies the rub, although, if you missed it, you mustn’t be blamed.

HR/FB, the backbone of xFIP, is inherently flawed because:

1. Home runs are never hit on infield fly balls (aka pop-ups), yet pop-ups are included in HR/FB (because pop-ups are included in all fly balls*); and
2. Home runs occasionally are hit on line drives, yet line drives are not included in HR/FB.

*Fly ball percentage (FB%) includes both infield and outfield fly balls.

This has bothered me a long time, this seemingly minor but potentially substantial ideological discrepancy. If we seek to normalize home run behavior for pitchers, we should endeavor to do so in a way that is most theoretically appropriate. I’m not here to reinvent the wheel — there are significantly more complex ways to normalize home run behavior — but I, at least, can pick the low-hanging fruit and make subtle adjustments to existing metrics.

I sampled all qualified pitcher-seasons from 2010 through 2018 (n = 709) and calculated unique ratios of home runs to outfield fly balls and line drives. This abbreviates to HR/(oFB+LD), which doesn’t quite roll off the tongue, but it’ll do in a pinch.
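As a rough illustration of the ratio, here is a minimal sketch. The function name and the batted-ball counts are hypothetical; it assumes fly balls are reported as a total that includes infield flies, as FB% does.

```python
def hr_per_ofb_ld(hr, fly_balls, infield_fly_balls, line_drives):
    """Home runs divided by outfield fly balls plus line drives.

    Outfield fly balls are taken as all fly balls minus infield fly balls (pop-ups).
    """
    outfield_fly_balls = fly_balls - infield_fly_balls
    denominator = outfield_fly_balls + line_drives
    if denominator <= 0:
        raise ValueError("need at least one outfield fly ball or line drive")
    return hr / denominator


# Hypothetical season: 20 HR on 200 fly balls (20 of them pop-ups) and 120 line drives.
print(f"HR/FB       = {20 / 200:.1%}")                           # 10.0%
print(f"HR/(oFB+LD) = {hr_per_ofb_ld(20, 200, 20, 120):.1%}")    # 6.7%
```

The new ratio comes out lower than HR/FB here because adding line drives enlarges the denominator by more than removing pop-ups shrinks it.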

Although HR/(oFB+LD) generally behaves proportionally to HR/FB…

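HR/FB vs HR/(oFB+LD)

| Season | HR/FB | HR/(oFB+LD) |
|--------|-------|-------------|
| 2010 | 9.4% | 6.8% |
| 2011 | 9.7% | 6.7% |
| 2012 | 11.3% | 7.5% |
| 2013 | 10.5% | 6.9% |
| 2014 | 9.5% | 6.3% |
| 2015 | 11.4% | 7.5% |
| 2016 | 12.8% | 8.5% |
| 2017 | 13.7% | 9.3% |
| 2018 | 12.7% | 8.5% |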

… HR/(oFB+LD) cannot be directly substituted for HR/FB in xFIP or else it breaks the equation. If I simply plugged in the league-average HR/(oFB+LD) in place of HR/FB, all pitchers would suddenly “underperform” their ERA because their xFIPs would improve by several tenths of a run without merit.
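To put a rough number on "several tenths of a run," here is a minimal sketch using the 2018 league rates from the table above. It assumes the commonly published FIP weight of 13 per home run, and the pitcher (200 fly balls over 180 innings) is hypothetical.

```python
HR_WEIGHT = 13            # commonly published FIP weight per home run (assumption)
LEAGUE_HR_FB = 0.127      # 2018 HR/FB from the table above
LEAGUE_HR_OFB_LD = 0.085  # 2018 HR/(oFB+LD) from the table above


def naive_swap_gap(fly_balls, ip):
    """Drop in xFIP caused by multiplying all fly balls by the lower HR/(oFB+LD) rate."""
    x_hr_original = fly_balls * LEAGUE_HR_FB
    x_hr_naive = fly_balls * LEAGUE_HR_OFB_LD
    return HR_WEIGHT * (x_hr_original - x_hr_naive) / ip


# Hypothetical pitcher: 200 fly balls allowed over 180 innings.
print(f"expected HR, original rate: {200 * LEAGUE_HR_FB:.1f}")      # 25.4
print(f"expected HR, naive swap:    {200 * LEAGUE_HR_OFB_LD:.1f}")  # 17.0
print(f"xFIP drop from naive swap:  {naive_swap_gap(200, 180.0):.2f}")  # 0.61
```

The lower rate is being applied to the wrong base (all fly balls rather than outfield fly balls plus line drives), which is why the equation needs to be refit rather than patched.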

To account for this difference, I ran a fresh regression, setting up the equation exactly as specified by xFIP, by virtue of FIP. (I also included year fixed effects, which is a component of FIP and xFIP, too, appearing in the form of the “year constant” term.) However, in lieu of normalizing home runs by all fly balls, I normalized home runs by, yes, outfield fly balls and line drives — the only batted ball events that can produce home runs (inside-the-park home runs notwithstanding).

The regression produced an adjusted r² of 0.55 — weaker than FIP (r² = 0.62) but a good deal stronger than the original xFIP (r² = 0.42). What this means is, from a purely descriptive standpoint, xFIP that relies on outfield fly balls and line drives is a better description of “skill” (or “deserved” outcomes) than xFIP that relies on outfield fly balls and also pop-ups but not line drives.

From a predictive standpoint (by measure of correlation to next-year ERA), the original xFIP outperforms the new xFIP, but not significantly, resulting in r² values of 0.20 and 0.18, respectively. Even SIERA hardly outperforms the new xFIP (r² = 0.20), and its descriptive (same-year) value is notably weaker (r² = 0.36).

Adjusted r²

| Metric | ERA | y+1 ERA |
|--------|-----|---------|
| FIP | 0.62 | 0.17 |
| xFIP | 0.42 | 0.20 |
| SIERA | 0.45 | 0.20 |
| New xFIP | 0.55 | 0.18 |

FIP, xFIP, and SIERA all boast ideological differences, yet none prevails as a superior predictive option. That “new xFIP” lands in the middle of an indistinguishable pack predictively while also prevailing above xFIP and SIERA descriptively lends merit to the original argument — the argument that xFIP might be theoretically misaligned, such that it artificially restricts its descriptive power.

You could calculate New xFIP manually using the equation above, if you’d like. (The constant term for 2019, as of now, is something like 0.834. I say “as of now” because it’s a moving target — as league-wide ERA changes, so, too, does the constant term.)

If not, you could make mental adjustments to xFIP in its current state according to a couple of intuitive rules of thumb. How does a pitcher’s line drive rate (LD%) compare to the league average? If it’s higher, then xFIP might be overrating his performance; if lower, then underrating. Same with infield fly ball percentage (IFFB%): if it’s higher, xFIP might be underrating his performance, and vice versa. It’s inexact, but, to be fair, all of this (gesturing broadly to sabermetrics) is inexact.
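Those rules of thumb can be written down as a small helper. This is only a sketch: the function name is made up, and the league-average LD% and IFFB% in the example are placeholder values, not figures from this post.

```python
def xfip_bias_hints(ld_pct, iffb_pct, league_ld_pct, league_iffb_pct):
    """Return directional hints about how xFIP may misjudge a pitcher.

    Higher-than-average LD% suggests xFIP overrates him; higher-than-average
    IFFB% suggests xFIP underrates him. Direction only, no magnitude.
    """
    hints = []
    if ld_pct > league_ld_pct:
        hints.append("LD% above average: xFIP may be overrating him")
    elif ld_pct < league_ld_pct:
        hints.append("LD% below average: xFIP may be underrating him")
    if iffb_pct > league_iffb_pct:
        hints.append("IFFB% above average: xFIP may be underrating him")
    elif iffb_pct < league_iffb_pct:
        hints.append("IFFB% below average: xFIP may be overrating him")
    return hints


# Placeholder league averages and a hypothetical pitcher.
for hint in xfip_bias_hints(ld_pct=0.24, iffb_pct=0.14,
                            league_ld_pct=0.21, league_iffb_pct=0.10):
    print(hint)
```

When the two hints point in opposite directions, as in this example, the rules of thumb alone cannot say which effect wins.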

* * *

Here’s the “New xFIP” leaderboard, as of Sunday, May 12.

* * *

[Edit (5/21/19 8:32 pm ET)] It should be noted all calculations above relied on Statcast data rather than FanGraphs data. I had a moment of panic when I realized FanGraphs and Statcast data do not perfectly align in terms of how batted ball events (fly balls, etc.) are strung/coded. Fortunately, I was also able to cross-validate the results above using FanGraphs data. The change to xFIP I recommended above still bears a substantial improvement in xFIP’s correlation with same-year ERA: its adjusted r² improves from 0.44 to 0.53, which is not exactly the same as the values shown above, but darn close. That’s all. Thanks![/Edit]