Just the f(a)x: Why I think MLB is juicing baseballs (Part one)

If it seems like home run rates are off the charts this year, you’re not crazy. The uptick is real: 2.81% of plate appearances resulted in a home run this year, versus 2.5% through this date last year. When looking only at balls in play, that number jumps to 4.5% versus 3.9%. IMO there are only four possible explanations:
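For a sense of scale, here is a quick sketch of the relative jump those rates imply. It uses only the four percentages quoted above; the variable names and the `relative_increase` helper are mine, purely for illustration.

```r
# Home run rates quoted above, in percent
hr_per_pa  <- c(y2015 = 2.5, y2016 = 2.81)  # per plate appearance
hr_per_bip <- c(y2015 = 3.9, y2016 = 4.5)   # per ball in play

# Relative (not percentage-point) increase, in percent.
# Hypothetical helper, not part of any package.
relative_increase <- function(rates) {
  unname(100 * (rates["y2016"] / rates["y2015"] - 1))
}

relative_increase(hr_per_pa)   # about 12.4
relative_increase(hr_per_bip)  # about 15.4
```

So the per-plate-appearance rate is up roughly 12% in relative terms, and the per-ball-in-play rate roughly 15%.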

1. A statistical aberration
2. Pitching quality has become suddenly and markedly worse
3. Hitters have started doping at alarmingly high numbers relative to last year (and are mostly getting away with it)
4. Baseballs are juiced

And at the risk of sounding like a crazy person, I’ll just come out with it: I find number four to be the most plausible. Why? Well, number three does not pass the smell test, and I’m about to debunk the crap out of number one, meaning either pitching has gotten far worse very quickly, or something is fishier here than Tom Brady’s balls. Although there is some argument for a drop in pitching efficacy lately, particularly among prominent aces, I’m not biting. My intuition (or strong prior, if we’re being Bayesian about it) tells me that when there’s money to be made from higher scoring games, the powers-that-be will do what it takes to make scoring happen. In a future post, I’ll take a deep look at pitchFX data and see what can be learned, but for now, let’s rule out the “aberration theory.”

The Evidence.

Let’s start by looking at year over year batted ball velocity for 2016 versus 2015. Below is a density plot of all batted balls from opening day through May 12 for each year (a total of 22,787 underlying data points from 2015 and 23,347 from 2016):

Notice the obvious shift to the right. Statistically, a difference like this is VERY unlikely to be an aberration. A KS test, which measures whether two distributions differ significantly, returns a p value of 0.000001408.

> filter1 <- dat$year == 2016 & dat$numericDate <= 20160513 & dat$numericDate >= 20160404
> filter2 <- dat$year == 2015 & dat$numericDate <= 20150513 & dat$numericDate >= 20150405
> ks.test(dat$mph[filter1],dat$mph[filter2])

	Two-sample Kolmogorov-Smirnov test

data:  dat$mph[filter1] and dat$mph[filter2]
D = 0.024784, p-value = 1.408e-06
alternative hypothesis: two-sided
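If the D in that output looks mysterious: it is the largest vertical gap between the two empirical cumulative distribution functions. Here is a minimal sketch that reproduces it by hand. The data is simulated (the means, spread, and sample sizes are made-up assumptions, not the real batted-ball data), so only the mechanics carry over.

```r
set.seed(1)
# Simulated exit velocities in mph; NOT the real data set
mph_2015 <- rnorm(5000, mean = 88.0, sd = 14)
mph_2016 <- rnorm(5000, mean = 88.6, sd = 14)

# Evaluate both empirical CDFs at every observed value and
# take the largest absolute difference between them
grid <- sort(c(mph_2015, mph_2016))
d_by_hand <- max(abs(ecdf(mph_2015)(grid) - ecdf(mph_2016)(grid)))

# ks.test() reports the same quantity as its test statistic
ks <- ks.test(mph_2016, mph_2015)
c(by_hand = d_by_hand, ks_test = unname(ks$statistic))
```

With tens of thousands of batted balls per season, even a small D can produce a tiny p value, which is why the p value and the size of the shift are worth reading separately.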

Of particular note, the top 5% hardest hit balls were leaving the bat about 0.6 mph faster this year versus early 2015 (a mean of 109.53 vs 108.92).
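One way a "top 5%" average like this could be computed is sketched below. The `top_mean` helper and the simulated velocities are hypothetical; I'm not claiming the exact cutoff rule matches the one behind the numbers above.

```r
# Hypothetical helper: mean velocity of the hardest-hit `top`
# fraction of batted balls
top_mean <- function(mph, top = 0.05) {
  cutoff <- quantile(mph, probs = 1 - top, na.rm = TRUE)
  mean(mph[mph >= cutoff], na.rm = TRUE)
}

set.seed(2)
fake_mph <- rnorm(20000, mean = 88, sd = 14)  # simulated, not real data
top_mean(fake_mph)

# Against the real data, the comparison would be along the lines of
# top_mean(dat$mph[filter1]) versus top_mean(dat$mph[filter2])
```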

Interestingly, there had already been chatter surrounding an unexplained increase in power last fall, when the home run rate spiked in August through October. But even when we compare 2016 data to late-season 2015, the velocities this year are still higher:

The p value comparing late season 2015 to YTD 2016 is 0.014: weaker evidence than in the first comparison, but still significant at the usual 0.05 level.

Note that in the first comparison above, I only compared year over year data through the current date, so as to minimize confounding weather effects. Warmer temperatures translate to faster ball speeds, so in theory, the warmer weather towards the end of the year would have contributed a boost to the velocity in late 2015 compared to early 2016. This means the true increase this year is even bigger than suggested in the second plot. Ideally, we’d like to account for this effect in the analysis.

When looking at the average temperatures for early 2015, late 2015, and early 2016, we do in fact see an expected trend. Namely, the average game day temperatures in both April and May have been comparable between the two years, when considered in aggregate, and were higher in late season 2015.

So just to be completely thorough, I modeled the off-bat velocity as a function of temperature and temperature squared, and then adjusted each data point’s observed velocity based on game day temperature.

> T <- dat$temperature
> T2 <- T^2
> fit <- lm(dat$mph ~ T + T2 )
> summary(fit)

Call:
lm(formula = dat$mph ~ T + T2)

Residuals:
    Min      1Q  Median      3Q     Max
-70.561  -7.890   2.439  10.419  45.396

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.6842239  1.0842027  83.641   <2e-16 ***
T           -0.0650583  0.0311614  -2.088   0.0368 *
T2           0.0004951  0.0002217   2.233   0.0255 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.93 on 141763 degrees of freedom
Multiple R-squared:  4.528e-05, Adjusted R-squared:  3.117e-05
F-statistic: 3.209 on 2 and 141763 DF,  p-value: 0.04039

> correctedMPH <- dat$mph - (T*coefficients(fit)[2]) - (T2*coefficients(fit)[3])

Running a KS test on these adjusted velocity distributions for this year’s data versus early/late 2015 returned p values of 9.681e-08 and 1.733e-05, respectively – even more significant than the unadjusted tests. Meanwhile, the density plots were visually indistinguishable from the previous ones:
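To make the adjustment step concrete, here is a small sketch of the same idea on simulated data. Everything in it is an assumption for illustration: the temperature range (taken to be degrees Fahrenheit), the made-up relationship between temperature and velocity, and the variable names. It shows that subtracting the fitted temperature terms is the same as keeping the intercept plus each ball's residual.

```r
set.seed(3)
n <- 5000
temperature <- runif(n, min = 40, max = 95)         # simulated temps (degrees F, assumed)
mph <- 85 + 0.05 * temperature + rnorm(n, sd = 14)  # simulated velocities, not real data

# Quadratic fit; I() keeps ^ as arithmetic inside the formula
fit <- lm(mph ~ temperature + I(temperature^2))
b <- coef(fit)

# Strip out the fitted temperature contribution from each observation
corrected <- mph - b["temperature"] * temperature -
  b["I(temperature^2)"] * temperature^2

# Equivalent view: what remains is the intercept plus the residual
all.equal(unname(corrected), unname(b["(Intercept)"] + resid(fit)))
```

Using `I(temperature^2)` in the formula avoids creating a separate squared column, and it sidesteps naming a variable `T`, which is also R's shorthand for `TRUE`.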

Hmmm. Ok, so balls are DEFINITELY leaving the bat faster this year, but how much will a gain of under 1 mph actually increase the home run rate? Well, quite a bit actually. I’ll give a back of the envelope calculation here. Let’s compute the expected rate of Home Runs per Ball in Play (HR/BIP) as follows:

E(HR/BIP) = ∑ Pr(ball gets hit at velocity v) * Pr(ball hit at velocity v results in a HR)

where the sum is taken over all velocities v seen in the data (in practice, this range is roughly between 50 and 120 MPH). This analysis is, in a number of ways, more accurate than just looking at the actual rate of HR/BIP observed in a given time frame, as it removes much of the “luck” involved in observed home run rates. For instance, if balls are hit with efficient/inefficient launch angles over a short period of time, this will alter the actual home run rate, but in the long run, would be expected to even out. In other words, observed home run data is very ‘noisy,’ while batted ball velocity contains more signal, and is ‘cleaner.’

Note that the first term in the summation on the right hand side will vary depending on what subset of games we look at, while the second term is relatively invariant. Regardless of whether hitters are doping, pitchers are bombing, or baseballs are juiced, if all other things are roughly equal (and the distributions of launch angles, backspins, etc. are comparable), a ball hit at a given velocity should result in a home run at a rate that stays constant. (One possible exception is the case where in-flight mechanics are altered, for instance if balls are being juiced by way of lower seams in their construction, leading to less drag in flight. But I’m going to ignore this possibility for the time being; if anything, it will lead to a more conservative estimate of ball juicing.)

All this is to say that we can increase the sample size, and with it the accuracy of the calculation, by using the full data set to compute the second term, while restricting the first term to just the games played in a given timeframe.

Carrying out the calculation:

> runningSum1 = 0
> for(i in c(0:130)){
+   if(sum(dat$mph == i)>0){
+     runningSum1 <- runningSum1 + (sum(dat$mph[filter1] == i)/sum(filter1))* ( sum(dat$mph == i & dat$result == "Home Run")/sum(dat$mph == i))
+   }
+ }
>
> runningSum2 = 0
> for(i in c(0:130)){
+   if(sum(dat$mph == i)>0){
+     runningSum2 <- runningSum2 + (sum(dat$mph[filter2] == i)/sum(filter2) )* ( sum(dat$mph == i & dat$result=="Home Run")/sum(dat$mph == i))
+   }
+ }
>
> runningSum3 = 0
> for(i in c(0:130)){
+   if(sum(dat$mph == i)>0){
+     runningSum3 <- runningSum3 + (sum(dat$mph[filter3] == i)/sum(filter3) )* ( sum(dat$mph == i & dat$result=="Home Run")/sum(dat$mph == i))
+   }
+ }
>
> runningSum1
[1] 0.04626982
> runningSum2
[1] 0.0418485
> runningSum3
[1] 0.04437989
> # ratio of sum1 to sum2 = increase of 2016 start of season over 2015 start of season
> runningSum1/runningSum2
[1] 1.105651
> # ratio of sum1 to sum3 = increase of 2016 start of season over 2015 end of season
> runningSum1/runningSum3
[1] 1.042585

We see that the expected HR/BIP this year has risen 10.6% over the start of last year, and is even 4.26% higher than during the warm late-season months of last year’s scoring spike.
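The three loops above differ only in which filter they use, so the same sum can be wrapped in one small function. This is a sketch rather than the code that produced the numbers above: `expected_hr_rate` is a hypothetical helper, the toy data frame is invented, and it assumes velocities are recorded as whole numbers with no missing values.

```r
# Hypothetical helper: expected HR per ball in play for the rows in `keep`
expected_hr_rate <- function(mph, is_hr, keep) {
  # Pr(HR | velocity v), estimated from the FULL data set
  p_hr_given_v <- tapply(is_hr, mph, mean)
  # Pr(velocity v), estimated from the selected time window only
  p_v <- table(mph[keep]) / sum(keep)
  # Sum the products over the velocities seen in the window
  sum(as.numeric(p_v) * as.numeric(p_hr_given_v[names(p_v)]))
}

# Toy data: three velocity bins, eight batted balls (invented)
toy <- data.frame(
  mph    = c(90, 90, 100, 100, 100, 110, 110, 110),
  result = c("Out", "Single", "Out", "Home Run", "Out",
             "Home Run", "Home Run", "Out"),
  early  = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

expected_hr_rate(toy$mph, toy$result == "Home Run", toy$early)  # 0.25
```

On the real data, the two ratios would then come from calls like `expected_hr_rate(dat$mph, dat$result == "Home Run", filter1)` divided by the same call with `filter2` or `filter3`.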

Of course, this really just discredits the notion that the recent upswing is a statistical aberration, and does not discredit the argument that pitching declines may be the true culprit here. To reiterate my personal Bayesian prior regarding capitalism and corruption in sports – I’m inclined to believe we are living in the juiced ball era, and it will require some convincing data to persuade me otherwise. But when it comes to these kinds of things, I try to be fair, and I won’t make any strong claims until I’ve taken a deep dive into recent pitchFX data.

Stay tuned.