Editor’s Note: This piece was initially given as a presentation at the marvelous 2015 Saber Seminar.

Inspiration

The Error. The Unearned Run. Fielding Independent Pitching. xFIP. These increasingly complex concepts are fundamentally about sifting through all of the noise that is present in pitching statistics and determining how a pitcher should have done by acknowledging the factors outside of his control. The notion of ascribing value to players who perform well independent of their teammates and surroundings is what sabermetrics is all about; there is a great deal of value in finding out whether good or bad luck played a role in a player or team’s season. Fortunately for baseball analysts, the sport has an incredibly large sample size to work with; in no other sport is the team with the best record more likely to have earned it rather than lucked into it.

However, sometimes the sample size of 162 games, 600 plate appearances, or 200 innings pitched is not enough for our most commonly used metrics to give an entirely accurate representation of player value. The old adage – that for every bloop that manages to fall in for a hit there is also a line drive hit right at somebody – is not always true for every batter in every season. When the batting average on balls in play against Pedro Martinez rose dramatically for a single year in 2000, the commonly accepted explanation was that he was simply unlucky. FIP shows that nothing significant or predictive about him as a pitcher changed or deteriorated in that one year.

To this end, we see DIBS (Defense Independent Batting Statistics) as the equivalent metric for batters. Combining batted-ball profiles with results-based statistics can help detect when a player was unlucky, even when his results-based statistics paint a bleaker picture.

DIBS is a concept we were introduced to by Ben Jedlovec’s presentation on behalf of Baseball Info Solutions in Phoenix this March at the SABR Analytics Conference. While BIS has access to proprietary data that helps fuel its DIBS calculations, our goal matches BIS’ – create a measurement of batting ability and offensive value separate from outcomes and past results. Ideally, this statistic will hold predictive value; much like FIP can tell us which pitchers may have gotten lucky on batted balls in a certain year, DIBS uses a batted-ball profile to predict outcomes independent of defense, park and result. tDIBS (the “t” standing for Tufts University, where we all are currently students) is an attempt to satisfy this goal using only publicly available data.

2014 Model

When originally designing the model earlier this year, we found that Gameday data was the best publicly available source for evaluating 2014 batted balls. Statcast data was not publicly available for the 2014 season except for a select few highlights.

The Gameday data we used included the following variables, as entered manually by a stringer in the press box:

Batted-ball type (Fly Ball, Ground Ball, Line Drive, Pop-Up, and Bunt)

Quality of contact (sharp, soft, neither)

Location fielded – given by (x,y) pixel coordinates from a digital mockup of the relevant park

While these data were imperfect (take, for example, this Hardball Times piece by Colin Wyers on the effect of the press box location on batted-ball classifications), they still provided valuable information about a player’s process, as opposed to just his results.

To turn these data into a model to predict batted-ball outcomes, we first needed to convert the pixel coordinates into feet, as the ratio varies by park. To do this, we looked at plots of all batted-ball locations by park, found the distance in pixels between home plate and a clearly demarcated wall, and compared that to the actual distance in feet. With these unique multipliers by park, we could then approximate the distance for each batted ball in 2014.
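
The conversion described above can be sketched in a few lines. This is a minimal illustration, not the code we ran: the function names, the pixel coordinates and the 400-foot wall distance are all hypothetical values chosen to make the arithmetic easy to follow.

```python
import math


def feet_per_pixel(home_px, wall_px, wall_distance_ft):
    """Park-specific multiplier from one reference point of known distance."""
    return wall_distance_ft / math.dist(home_px, wall_px)


def batted_ball_distance_ft(home_px, fielded_px, scale):
    """Approximate distance in feet from home plate to the fielded location."""
    return math.dist(home_px, fielded_px) * scale


# Hypothetical park: home plate and a center-field wall marker, in pixels.
home = (125.0, 200.0)
center_field_wall = (125.0, 40.0)  # 160 pixels from home plate
scale = feet_per_pixel(home, center_field_wall, 400.0)

# Hypothetical batted ball fielded 80 pixels from home plate.
distance = batted_ball_distance_ft(home, (173.0, 136.0), scale)
print(f"{scale:.2f} ft per pixel, ball fielded at about {distance:.0f} ft")
```

A single reference point per park assumes the mockup is drawn to a uniform scale in every direction; checking a second wall marker would be one way to test that assumption.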

Using these data directly to predict batted-ball outcomes still leaves us with a large problem: the distances calculated are where the ball was fielded, not where it landed. For ground balls, and to some extent line drives, this leaves the distance as very defense-dependent. A ground ball that gets through the infield will have a large distance compared to one that was fielded for an out. To reduce this problem, we used the publicly available April 2009 HITf/x data set to run a regression predicting batted-ball velocity and vertical launch angle from the Gameday variables (distance and quality of contact, run separately by batted-ball type). With this regression, a ground ball that makes it through the infield will have a slightly more favorable predicted velocity and angle than one that didn’t, but the difference will be much smaller than for, say, fly balls.
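
As a rough illustration of this step, the sketch below fits an ordinary least squares regression of velocity on distance and quality of contact for a single batted-ball type. Several things here are assumptions rather than a description of our actual model: the linear form, the dummy coding of contact quality, the helper names, and above all the training rows, which are invented numbers built to lie exactly on a line so the fit is easy to check. They are not HITf/x data. The same fit would be repeated for each batted-ball type, and again with vertical launch angle as the response.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def design_row(distance_ft, quality):
    """Intercept, distance, and dummies for sharp/soft ("neither" is baseline)."""
    return [1.0, distance_ft, float(quality == "sharp"), float(quality == "soft")]


def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta, distance_ft, quality):
    return sum(b * v for b, v in zip(beta, design_row(distance_ft, quality)))


# Invented ground-ball rows: (distance in feet, quality of contact, velocity in mph).
ground_balls = [
    (100.0, "sharp", 78.0), (200.0, "sharp", 88.0),
    (150.0, "neither", 75.0), (250.0, "neither", 85.0),
    (50.0, "soft", 58.0), (120.0, "soft", 65.0),
]
X = [design_row(d, q) for d, q, _ in ground_balls]
y = [v for _, _, v in ground_balls]
beta = fit_ols(X, y)
predicted_velocity = predict(beta, 180.0, "sharp")
```

In practice a statistics library would replace the hand-written solver; it is spelled out here only to keep the example self-contained.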

Finally, using our predicted velocity and vertical angle for each batted ball in 2014, we ran a multinomial logistic regression to predict probabilities of the five basic outcomes (single, double, triple, home run, or out). For fly balls, we used a simulation from a normal distribution to account for the uncertainty inherent in our imprecise data; this eliminated an issue where our model would predict too few home runs. By summing the probabilities of each type of outcome by player, we were able to come up with an expected DIBS batting line for each player.
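
The last two steps, turning a predicted velocity and angle into outcome probabilities and then summing those probabilities by player, might look something like the sketch below. The coefficients are placeholders made up for illustration, not our fitted values, and the player names and batted balls are likewise hypothetical. "Out" is treated as the reference category, which is one common way to set up a multinomial logistic model.

```python
import math
from collections import defaultdict

OUTCOMES = ("single", "double", "triple", "home_run", "out")

# Hypothetical coefficients per outcome: (intercept, velocity in mph,
# vertical launch angle in degrees). "out" is the reference category.
COEFS = {
    "single": (-4.0, 0.04, -0.01),
    "double": (-8.0, 0.07, 0.01),
    "triple": (-10.0, 0.07, 0.01),
    "home_run": (-20.0, 0.15, 0.12),
    "out": (0.0, 0.0, 0.0),
}


def outcome_probabilities(velocity, angle):
    """Softmax over the linear scores of the five outcomes."""
    scores = {o: b0 + b1 * velocity + b2 * angle
              for o, (b0, b1, b2) in COEFS.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {o: math.exp(s - top) for o, s in scores.items()}
    total = sum(exps.values())
    return {o: e / total for o, e in exps.items()}


def expected_lines(batted_balls):
    """Sum per-ball outcome probabilities into an expected line per player."""
    lines = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0.0))
    for player, velocity, angle in batted_balls:
        for outcome, p in outcome_probabilities(velocity, angle).items():
            lines[player][outcome] += p
    return dict(lines)


# Hypothetical batted balls: (player, predicted velocity, predicted angle).
batted_balls = [
    ("Player A", 95.0, 25.0),
    ("Player A", 70.0, -5.0),
    ("Player B", 102.0, 28.0),
]
lines = expected_lines(batted_balls)
```

Because every batted ball contributes probabilities that sum to one, each player's expected line adds up to his number of batted balls, so the expected hits can be fractional.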

With our model finalized, we were ready to see who got lucky, who got unlucky, and who performed exactly as they should have last year.

2014 Results

It may be interesting to see which kinds of players our model favors, what biases it may have, and a few examples of its intuitive (and counter-intuitive) findings. What follows are our 10 largest underperformers and overperformers, ranked by the absolute difference between the wOBA our model projected (“tDIBS wOBA”) and the wOBA they actually posted (“actual wOBA”):