BABIP is a really hard thing to predict for pitchers. There have been plenty of attempts, sure, but nothing all that conclusive — probably because pitchers have a negligible amount of control over it. So naturally, when I found something that I thought might be able to model and estimate pitcher BABIP to a high degree of accuracy, I was very excited.

My original idea was to figure out the BABIP — as well as other batted ball stats — of individual pitches from details about the pitch itself. Velocity, movement, sequencing, and a multitude of other factors that are within the pitcher’s control play into the likelihood that a pitch will fall for a hit (even if to a very small degree). But much more than all of those, pitch location seems to be the most important factor (as well as one of the easiest to measure).

I got impressively meaningful results by plotting BABIP, GB%, FB%, wOBA on batted balls, and other stats based on horizontal and vertical location of the pitch. So I came up with models to find the probability that any batted ball would fall for a hit, with the horizontal and vertical location as the only inputs (the models worked very well). I even gave different pitch types different models, since there were differences between, for example, fastballs and breaking balls. I found the “expected” BABIP of every pitch each pitcher threw, and then I averaged all of those expected BABIPs; theoretically, this is the BABIP that the pitcher should have allowed.
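To make the averaging step concrete, here is a minimal sketch. The model function below is a made-up placeholder, not one of the models from this article; its coefficients, the function names, and the sample pitch locations are all hypothetical and exist only to show how per-pitch hit probabilities roll up into one expected BABIP.

```python
from statistics import mean


def toy_hit_probability(x_ft, y_ft):
    """Placeholder model (NOT the article's): hit probability peaks near the
    middle of the zone and falls off with distance from it.

    x_ft: horizontal distance from the center of the plate, in feet.
    y_ft: height relative to the top of the strike zone, in feet.
    The zone's vertical center is assumed to sit about 1 foot below the top.
    """
    p = 0.32 - 0.03 * x_ft ** 2 - 0.03 * (y_ft + 1.0) ** 2
    return min(max(p, 0.0), 1.0)  # clamp to a valid probability


def expected_babip(pitches, model=toy_hit_probability):
    """Average the per-pitch hit probabilities over a pitcher's balls in play."""
    return mean(model(x, y) for x, y in pitches)


# Three made-up batted-ball locations (x, y), in feet.
pitches = [(0.0, -1.0), (0.5, -1.5), (-0.7, -0.2)]
xbabip = expected_babip(pitches)
```

Separate models per pitch type would just mean choosing a different `model` for each pitch before averaging.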

There was absolutely no correlation to actual BABIP allowed, and there was even less correlation between one year’s “expected” BABIP and next year’s BABIP. My idea didn’t work at all. I was pretty crushed, since I thought that I’d be able to find something revolutionary with this new method. But I did find some pretty interesting things along the way, which is why I’m writing this article.

First, some details about my methodology. The way I did this was to group all pitches of a certain vertical or horizontal location together and calculate the BABIP, wOBABIP, GB%, or what have you of all the pitches in each bucket (which had widths of 0.1 feet); I then came up with an equation to model those stats based on the results. So for each pitch, you can roughly expect what the wOBABIP will be solely from its location — average all of a pitcher’s expected pitch wOBABIPs and you have his expected wOBABIP. This is a different methodology from most other ways of modeling BABIP and other similar stats, which look holistically at a pitcher’s whole season (or some other chunk of time) and draw conclusions from that.
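The bucketing step can be sketched roughly as follows, assuming each ball in play is stored as a (location in feet, hit flag) pair. That data layout, the function name, and the sample values are assumptions for illustration, not the code behind this article.

```python
from collections import defaultdict
import math

BUCKET_WIDTH = 0.1  # feet, the bucket width described above


def babip_by_bucket(balls_in_play, width=BUCKET_WIDTH):
    """Group balls in play by location bucket.

    balls_in_play: iterable of (location_ft, is_hit) pairs (hypothetical layout).
    Returns {bucket_index: (babip, n)}, where bucket k covers
    [k * width, (k + 1) * width).
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for location_ft, is_hit in balls_in_play:
        k = math.floor(location_ft / width)
        total[k] += 1
        hits[k] += int(is_hit)
    return {k: (hits[k] / total[k], total[k]) for k in sorted(total)}


# Toy data: made-up numbers, only to show the shape of the output.
sample = [(-1.23, True), (-1.27, False), (-1.21, False), (-0.48, True), (-0.42, True)]
result = babip_by_bucket(sample)
```

The same grouping works for wOBABIP or GB% by swapping the hit flag for the relevant per-ball value. Which functional form was then fitted to the bucket values is not stated above, so the fitting step is left out of the sketch.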

My data was all from pitchf/x, for which we have information going back to 2007. Errors and fielder’s choices, unfortunately, had to be excluded from all calculations as the MLBAM play description strings — which contain the batted ball type — don’t mention what kind of batted ball it was on a reached-on-error or fielder’s choice. Instead, they say something like “Player X reaches on fielding error by Player Y,” whereas a normal play description string would say something along the lines of “Player X singles on a ground ball to center fielder Y”. Additionally, all bunts were excluded.

You may also notice that popup percentages (PU%) are very high compared to the FanGraphs numbers, which is because MLBAM data has a more liberal definition of popups. (An aside: I tend to prefer popup rate, which is popups divided by all balls in play, to infield fly ball rate (IFFB%), which is popups divided by just fly balls, as it is more stable and there is really no similarity between a popup and an outfield fly ball. For that reason, I am also going to use OFFB%, outfield fly balls divided by balls in play, instead of FB%. OFFB% + PU% = normal FB%. I refer to OFFB% as FB% throughout this post.) HR/FB% numbers may also seem high because I am defining that as home runs per outfield fly ball instead of the regular FanGraphs definition, which is home runs per all fly balls (including popups).
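Since these definitions are easy to mix up, here is a small sketch that computes them side by side from raw counts. It assumes the denominator is simply the sum of the four batted-ball types and that home runs are counted among outfield fly balls; the function name and the sample counts are made up.

```python
def batted_ball_rates(gb, ld, offb, pu, hr):
    """Rates as defined in this post, from raw batted-ball counts.

    gb, ld, offb, pu: ground balls, line drives, outfield fly balls, popups.
    hr: home runs (assumed to be included in the offb count).
    """
    bip = gb + ld + offb + pu  # assumption: all four types make up the denominator
    all_fly = offb + pu        # the conventional "fly ball" bucket
    return {
        "GB%": gb / bip,
        "LD%": ld / bip,
        "OFFB%": offb / bip,               # called FB% in this post
        "PU%": pu / bip,                   # popups per ball in play
        "FB% (conventional)": all_fly / bip,
        "IFFB%": pu / all_fly,             # popups per fly ball
        "HR/FB% (this post)": hr / offb,   # per outfield fly ball
        "HR/FB% (per all fly balls)": hr / all_fly,
    }


# Made-up counts chosen so the arithmetic is easy to follow.
rates = batted_ball_rates(gb=45, ld=20, offb=25, pu=10, hr=3)
```

With these toy counts, HR/FB% is 3/25 under this post's definition but 3/35 when popups are included in the denominator, which is why the numbers here look high.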

So let’s now take a look at the results. First, we’ll look at just the vertical location of pitches and how that relates to various batted ball stats. How about BABIP to start off?

What you’re looking at here is a plot with BABIP on the x-axis and distance from the top of the strike zone (which is located at y=0) on the y-axis; negative numbers are below the top of the zone and positive numbers are above. I measured the vertical distance in feet above/below the top of the strike zone because different batters are different heights, and measuring absolute height would be less accurate. I set the axes the way I did to give a better visual representation of what we’re trying to see. (The equivalent plots for horizontal distance, which I will show later, have the axes flipped.) The size of the circles represents how many pitches there were in that location in the dataset (the scale is on the right).

There’s a pretty clear relationship between vertical location and BABIP! Nothing groundbreaking; it’s all intuitive and expected (pitches in the middle of the zone fall for hits more often). I was certainly surprised by the strength of the relationship, although I suppose you shouldn’t be, since I just told you in previous paragraphs that there was a strong relationship. Let’s now look at wOBABIP (which, for ease of typing, I will just call wOBA throughout this post; no other type of wOBA will be mentioned).

This graph is nearly identical to the BABIP one except for a different scale and a more drastic drop as the pitch gets lower in the zone. Next, let’s look at different batted ball stats – GB%, FB%, LD%, and some others. FB% and GB% come first:

Holy tight relationship! This is much better than BABIP and wOBA. Again, the findings are not so surprising (lower in the zone, you get more grounders; higher in the zone, more fly balls), but the strength of these relationships, even though they are not linear, is again surprising.

Another tight relationship! This one was even more encouraging to me when I saw it than the previous two were, because popups are one of the driving factors behind BABIP and have much more influence over it than either outfield fly balls or ground balls do.

These last two are a little weaker, as the circles don’t fit as tightly on the line. That’s to be expected, though: LD% and HR/FB% are notoriously unstable year-to-year, so it would follow that it’s harder to estimate them from secondary factors. However, there is still a clear pattern, and it’s a closer relationship than what might have been expected. This is more encouraging stuff.

Now onto the horizontal distance. As I mentioned before, these graphs have the axes switched from what they were in the previous ones, so keep that in mind when you’re looking at them. The distance here, too, is adjusted so that righties and lefties are on the same scale: a positive value is always a pitch farther outside, and a negative value is always a pitch farther inside, so it’s essentially looking at this from a righty’s perspective. x=0 is the center of the plate; the edges of the strike zone are at x=±.708333 feet (half the width of the 17-inch plate).
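The two location adjustments described so far (height relative to the top of the zone, and horizontal distance mirrored for lefties) can be sketched like this. The argument names and the sign convention for the raw horizontal coordinate are assumptions about the input data, not something spelled out above.

```python
HALF_PLATE_FT = 8.5 / 12  # half of the 17-inch plate, about 0.708333 feet


def adjusted_location(px_ft, pz_ft, sz_top_ft, batter_stands):
    """Return (x, y) on the scale used in this post.

    x: horizontal distance from the center of the plate; positive = farther outside.
    y: height relative to the top of this batter's zone; negative = below the top.

    Assumption: px_ft is measured from the catcher's view, so negative values
    are toward a right-handed batter. Left-handed batters are mirrored so that
    "inside" is always negative.
    """
    x = px_ft if batter_stands == "R" else -px_ft
    y = pz_ft - sz_top_ft
    return x, y


def in_zone_horizontally(x_ft):
    """True if the adjusted x falls between the edges of the plate."""
    return abs(x_ft) <= HALF_PLATE_FT


# The same raw pitch, seen by a righty and by a lefty (made-up numbers).
righty = adjusted_location(-0.5, 2.5, 3.4, "R")
lefty = adjusted_location(-0.5, 2.5, 3.4, "L")
```

Here the pitch is half a foot inside to the righty (x = -0.5) and half a foot outside to the lefty (x = 0.5), and in both cases about 0.9 feet below the top of the zone.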

Much like vertical location, BABIP and wOBA follow similar patterns to each other, and of course, a pitch in the middle of the zone is more likely to fall for a hit — and more likely to fall for an extra-base hit — than a pitch on the edge.

And as with vertical distance, there is an extremely close relationship between horizontal distance and both GB% and FB%…

… as well as with PU%. This was another good sign, and it looked to me at the time like we should be able to predict at least PU%, GB%, and FB% pretty well if nothing else.

And HR/FB% and LD% also show a fairly nice relationship here. This was all good.

But these don’t tell us too much by themselves. They’re interesting, but not too surprising or applicable. To get a better sense for what location does to BABIP and other stats, we need to look at both kinds of distance on the same graph. Since plotting in three dimensions is hard, the way I will display these is with heatmap-style graphs. The axis scales are read the same as the ones in the graphs above, only now they are on the same plot; the color of the boxes shows how high or low the statistic in question is at that spot. The white dotted line is a general representation of the strike zone; the zone obviously changes based on the height of the hitter, but this is the average one.
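The numbers behind a heatmap like this come from the same bucketing idea applied in two dimensions. A minimal sketch, assuming each ball in play is an (x, y, value) triple and that the boxes are 0.1 feet on a side (the box size for the heatmaps is not stated, so that is a guess):

```python
from collections import defaultdict
import math


def stat_grid(balls_in_play, width=0.1):
    """Average a per-ball value within each (x, y) box.

    balls_in_play: iterable of (x_ft, y_ft, value) triples (hypothetical layout).
    value is 1/0 for hit/out when computing BABIP, or a wOBA weight for wOBABIP.
    Returns {(ix, iy): (mean_value, n)}, where box (ix, iy) covers
    [ix * width, (ix + 1) * width) by [iy * width, (iy + 1) * width).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x_ft, y_ft, value in balls_in_play:
        cell = (math.floor(x_ft / width), math.floor(y_ft / width))
        sums[cell] += value
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in counts}


# Toy data: two balls in the same box, one in another (made-up values).
grid = stat_grid([(0.05, -1.02, 1), (0.07, -1.08, 0), (-0.33, -0.51, 1)])
```

Each box's mean would set its color, and the count tells you how much to trust it.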

Meh. Actually, there’s not so much of a relationship for us to see, although the model I came up with had a pretty high correlation (an r^2 north of .8). wOBABIP shows a much clearer graph:

There we go! There’s a really visible pattern now. A line extending from the upper-outside corner to the lower-inside corner seems to be where hitters do the most damage, which, interestingly and maybe not coincidentally, lines up with the line for effective velocity. If you look closely back at the BABIP graph, you can see the same line, only less pronounced. The message is simple: in general, hitters do better with pitches in the middle or low and inside.

These last two are, I think, more fascinating visually than analytically. The smoothness of the graphs here for OFFB% and GB% could have been anticipated because of how high the correlation is for each of vertical and horizontal location.

Nothing here at all. Oh well. The graphs plotting HR/FB% against only one type of location showed promise, but putting the two together shows none.

Line drives follow a similar pattern to that of BABIP and wOBA if you look closely enough. It’s weaker than those two, but it’s there. Which makes sense: line drives are by far the best type of batted ball for hitters, as they land for hits the most.

And, lastly, popups. This one, in the same vein as OFFB% and GB%, follows an extremely tight relationship in each of the individual location types and then also when combining the two. And since a popup is the best result a pitcher can hope for if the ball is to be put in play, maybe pitchers should start throwing high and inside more.

How useful is this information? Not terribly useful, unfortunately. There are a few reasons why the models from this don’t explain pitcher BABIP (or wOBABIP) at all.

First, these graphs are made with samples of thousands of pitches for every location bucket. If the sample is decreased to just a few dozen pitches at most, the variability shoots up and the model becomes worthless.

Second, these graphs are generalized to the population of all hitters. Every hitter is different, and pitchers have to attack them differently. Pitching only in spots where you can expect a low BABIP from the average hitter — like low and outside, for example — might work against some hitters, but not others, and pitching according to the hitter you’re facing is more important than pitching according to what the league average is.

And third, even though a very clear pattern exists for BABIP and wOBA, that pattern covers nearly the entire strike zone, save only some of the corners, and that is where the vast majority of balls in play are hit from. The expected wOBABIP for pitchers isn’t going to vary all that much, really.
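The first reason above, sample size, can be made concrete with a back-of-the-envelope calculation. This sketch treats every ball in play as an independent coin flip with the same hit probability, which is a simplifying assumption; the .300 BABIP and the two sample sizes are illustrative stand-ins for "thousands" and "a few dozen."

```python
import math


def babip_standard_error(true_babip, n_balls_in_play):
    """Binomial standard error of an observed BABIP over n balls in play."""
    return math.sqrt(true_babip * (1 - true_babip) / n_balls_in_play)


se_bucket = babip_standard_error(0.300, 3000)  # about 0.0084
se_small = babip_standard_error(0.300, 30)     # about 0.084
```

Cutting the sample by a factor of 100 makes the noise ten times larger: with 30 balls in play, an observed BABIP anywhere from roughly .215 to .385 is within one standard error of a true .300, which swamps the location effects shown in the graphs.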

I did, however, enjoy making and looking at these graphs. They provide a concrete, quantitative look at how the location of a pitch matters to the ball’s journey off the bat.