I am intrigued by the idea of stabilizing BABIP, which varies an awful lot from year to year. The first step in my process begins here, by finding the rate at which a ballplayer reaches base via line drives (LD), fly balls (FB), and ground balls (GB).

One reason you should care is that a large share of the year-to-year variability in BABIP lies in the volatility of the batted-ball outcomes it measures. For one, BABIP on LD, FB, and GB individually carries an incredible amount of noise year to year (which we will get to later).

Thus, it occurred to me that there might be a way to measure nearly the same thing as BABIP, but with much less noise year to year.

To do so, I will apply Bayes' Rule to BABIP to find this hidden batted-ball rate. For those who are not sure what Bayes' Theorem is, the formula is below:

BAYES' RULE

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

The symbol "|" signifies a conditional probability.

For a simple example, imagine you want to find the probability that Domonic Brown walks given that it's a Tuesday. That probability would be written P(Walk | Tuesday). And of course the answer would be never, or zero.

Previously in this series: "Introducing pLD, pFB, pGB: Accurately predicting batted-ball info," where we created the predictive models for batted-ball info.





APPLYING BAYES' RULE

Let's get started!

To do this, all we need is LD%, FB%, GB%, OBP, LD_BABIP, GB_BABIP, and FB_BABIP. All totals come from Retrosheet.

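We know the probability that each player gets on base is his OBP; this will be our P(B), the denominator in Bayes' Rule. Our P(A), or "prior," is the rate at which he hits the batted-ball type in question (LD%, FB%, or GB%).

The posterior we want to find is P(A | B), where "A" is the batted-ball type we want (LD, FB, GB) and "B" is the event of getting on base. In other words, our posterior is the probability of the selected batted-ball type given that the player made it on base. The remaining term, P(B | A), is the BABIP on that batted-ball type (LD_BABIP, FB_BABIP, or GB_BABIP).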

Now hopefully you can see why this is different from plain BABIP on LDs, and that once we isolate this number, there will be much less statistical noise.

Let's piece it all together using Bayes' Rule from above, with line drives as an example:

$$P(\text{LD} \mid \text{OB}) = \frac{P(\text{OB} \mid \text{LD})\,P(\text{LD})}{P(\text{OB})}$$

Here, OB is the event that a player reaches base, so P(OB) is his OBP. We will call the result of this formula bLD.

We will do the same for FB and GB to find bFB and bGB.
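For readers who want to try this themselves, here is a minimal Python sketch of the calculation described above. The function name `b_metric` and the stat line are hypothetical and made up purely for illustration; they are not a real player's numbers, and this is not necessarily how the original figures were produced.

```python
def b_metric(babip_for_type: float, type_rate: float, obp: float) -> float:
    """Bayes' Rule: P(type | OB) = P(OB | type) * P(type) / P(OB).

    babip_for_type -- P(OB | type), e.g. LD_BABIP
    type_rate      -- P(type), e.g. LD%
    obp            -- P(OB), the player's OBP
    """
    if not 0.0 < obp <= 1.0:
        raise ValueError("obp must be in (0, 1]")
    return babip_for_type * type_rate / obp


# Hypothetical stat line, for illustration only (not a real player).
player = {
    "LD%": 0.20, "FB%": 0.35, "GB%": 0.45,
    "LD_BABIP": 0.700, "FB_BABIP": 0.150, "GB_BABIP": 0.240,
    "OBP": 0.330,
}

b_ld = b_metric(player["LD_BABIP"], player["LD%"], player["OBP"])
b_fb = b_metric(player["FB_BABIP"], player["FB%"], player["OBP"])
b_gb = b_metric(player["GB_BABIP"], player["GB%"], player["OBP"])

print(f"bLD={b_ld:.3f}  bFB={b_fb:.3f}  bGB={b_gb:.3f}")
```

Note that because OBP also counts walks and other ways of reaching base, the three values need not add up to 1.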

RESULTS

Now for the moment of truth. Using data from 2008 to 2012, limited to players with at least 250 PA, we will see just how well bLD, bFB, and bGB correlate year to year, and how they compare to BABIP on LD, GB, and FB individually.
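A year-to-year R^2 of this kind can be sketched roughly as follows. This is only one plausible way to set it up: I am assuming here that each player-season is paired with the same player's next season and that both seasons must clear the 250 PA cutoff. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def paired_seasons(seasons, min_pa=250):
    """Pair each (player, year) value with the same player's year + 1 value.

    seasons maps (player_id, year) -> (plate_appearances, metric_value).
    Assumption: both seasons in a pair must have at least min_pa PA.
    """
    xs, ys = [], []
    for (player, year), (pa, value) in seasons.items():
        following = seasons.get((player, year + 1))
        if following is None:
            continue
        next_pa, next_value = following
        if pa >= min_pa and next_pa >= min_pa:
            xs.append(value)
            ys.append(next_value)
    return xs, ys


# Toy data, for illustration only (not real players or real values).
seasons = {
    ("a", 2010): (600, 0.40), ("a", 2011): (580, 0.42),
    ("b", 2010): (500, 0.30), ("b", 2011): (520, 0.33),
    ("c", 2010): (450, 0.35), ("c", 2011): (100, 0.50),  # dropped: under 250 PA
    ("d", 2010): (300, 0.36), ("d", 2011): (400, 0.35),
}

xs, ys = paired_seasons(seasons)
r2 = pearson_r(xs, ys) ** 2
print(f"pairs={len(xs)}  R^2={r2:.3f}")
```

Running the same function once on a bMetric and once on the matching per-type BABIP gives the two R^2 values to compare.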

Here are the raw results from the above study:

BABIP      R^2     bMetrics   R^2     DIFF
LD_BABIP   0.031   bLD        0.206   0.175
FB_BABIP   0.331   bGB        0.516   0.185
GB_BABIP   0.181   bFB        0.396   0.215

As you can see, the bMetrics are a big improvement over the per-type BABIPs in reducing year-to-year noise: each R^2 rises by between 0.175 and 0.215. The year-to-year correlations are still far from perfect, but they are vastly improved. In a sense, the bMetrics measure nearly the same thing as BABIP, but we are looking at the probability of each batted-ball type given that a player made it on base, rather than the probability of getting on base given a certain batted-ball type.

These bMetrics will come in handy later on when we try to predict BABIP as a whole. Since there is less noise in these metrics than in LD_BABIP and its counterparts, they will be useful when we convert projected batted-ball info into BABIP.

If you still don't understand why this new method is useful, let me explain it in a simple way:

In 2012, Mark Trumbo led the league in LD_BABIP at .857. At the same time, he had the 9th-lowest bLD at 29%.

Hopefully you get the picture: bMetrics capture the rate at which a player reaches base via a LD, FB, or GB, not his batting average when he hits a LD, FB, or GB. That lets us isolate a player's actual propensity to both hit a certain batted-ball type and reach base safely.

In closing, here is a sortable table you can play with while you wait for the next installment of the series, where we will combine these metrics with our previous pLD, pFB, and pGB models to predict BABIP.

CLOSING RESULTS

(sortable table)

All statistics are from FanGraphs or the Lahman and Retrosheet databases.

Follow @MaxWeinstein21



