In 2015 and 2016, Statcast had a well-publicized problem with missing data, but now in 2017 every batted ball has exit velocity and launch angle information. So that means the problem was solved, right? Wrong. The problem persists. Around 11.6 percent of batted balls on Baseball Savant have their exit velocity and launch angle decided by an algorithm, not a TrackMan measurement. The consequences of this algorithm may color your understanding of Statcast and which techniques you will want to avoid when analyzing the league or its players.

How Much Of The Data Is Missing?

Statcast had a rocky first year and failed to report 20-30 percent of batted balls in 2015. There were many reasons for this, but long story short, it wasn’t necessarily a technical limitation because the data was retroactively filled in after the season ended. Today, having logged three full seasons, Statcast appears to be reporting data for about 88 to 89 percent of the batted balls. Later on in this piece I’ll explain how I am identifying missing data, but for now it is important to point out the types of batted balls that are missing from the dataset.

Stringer Totals

| Type | Total | Missing | % Missing |
|---|---|---|---|
| Ground Ball | 181,017 | 29,047 | 16.0% |
| Line Drive | 101,000 | 1,860 | 1.8% |
| Fly Ball | 82,973 | 2,541 | 3.1% |
| Pop Up | 27,282 | 12,328 | 45.2% |
| Total | 392,272 | 45,776 | 11.7% |

SOURCE: Statcast

As it turns out, TrackMan has trouble detecting balls that move perpendicularly to the face of the radar. That is, straight up, down, left (third base side) or right (first base side). The more the movement is directed away from the radar (i.e., into the field of play), the better it is at detecting and accurately measuring the velocity of the ball.

Luckily, baseball has foul territory, so many of the balls that move left or right can be ignored as foul balls. Unfortunately, balls hit down into the ground and straight up into the air are fairly common. This means that not only does TrackMan have a bias against certain types of batted balls, but these batted balls are generally very weakly hit ground balls and pop-ups.

In addition to missing individual batted balls, nine games appear to be missing all their TrackMan data: July 24 and 25, 2015, in San Francisco; July 3, 2016, in Fort Bragg, N.C.; Sept. 23 through 27, 2016, in Pittsburgh; and Aug. 20, 2017, in Williamsport, Pa.

The missing data in Fort Bragg and Williamsport shouldn’t be a surprise, considering neither of those stadiums has TrackMan installed. Well, I’m not sure Fort Bragg should even be considered a stadium, but that is neither here nor there. Missing five consecutive games in Pittsburgh is a bit odd, but I suppose hardware problems can arise from time to time. It is possible there are missing games that I have overlooked, but it is nice to see that the number appears to be minimized.

All in all, though, TrackMan has covered 7,385 out of 7,394 major league games in the past three years. These nine missing games had 726 out of the 561,666 plate appearances that occurred. There are so few of these missing games that we can effectively ignore them going forward, although it is important to acknowledge they exist, especially in 2016, which had six missing games.

However, you cannot explain all of the missing data using only weakly hit balls and missing games. There are a large number of missing line drives and fly balls as well. There may be some fraction of batted balls that are missed by random chance, or perhaps there is another mechanism for missing batted balls that I don’t quite understand.

Either way, you can generally assume that a great many of the missing data points are weakly hit. So much so that if you were to calculate average exit velocity and launch angle using only the balls measured by TrackMan, you would end up with inflated numbers. Both the exit velocity and launch angle would be too high, since you would be throwing out an enormous number of weakly hit ground balls. Pop-ups, too, but mostly ground balls.
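To make the direction of that bias concrete, here is a minimal sketch with an invented toy sample. The numbers are made up for illustration and are not Statcast data; the "true" values for unmeasured balls are exactly what we do not have in practice.

```python
from statistics import mean

# Invented toy sample: (exit velocity in mph, launch angle in degrees).
measured = [(101.0, 24.0), (95.0, 12.0), (88.0, -4.0)]

# Hypothetical "true" values for balls the radar missed: two weak ground
# balls and one pop-up. In real data these are unknown.
unmeasured_truth = [(62.0, -35.0), (58.0, -40.0), (70.0, 65.0)]

def averages(balls):
    """Return (mean exit velocity, mean launch angle) for a list of balls."""
    return mean(ev for ev, _ in balls), mean(la for _, la in balls)

ev_measured, la_measured = averages(measured)
ev_all, la_all = averages(measured + unmeasured_truth)

print(f"measured only: {ev_measured:.1f} mph, {la_measured:.1f} degrees")
print(f"all balls:     {ev_all:.1f} mph, {la_all:.1f} degrees")
```

Because the missing balls skew toward weak ground balls, both averages come out higher when they are left out.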

The Toolbox

This is a problem. We want to accurately measure exit velocity and launch angle, but to do so we need to account for these missing batted balls. Before we can address a solution we need to address the tools we have to work with.

When TrackMan data is missing we lose out on:

Exit velocity

Launch angle

Batted ball distance

Batted ball spin

It is important to understand that batted ball distance is lost. If you had batted ball distance, you might be able to reverse engineer an exit velocity using the batted ball type and fielding location data. Alas, we are left without the distance data.

We are left with:

Batted ball type

Batted ball result

Which players fielded the ball

Rough approximation of where the ball was fielded (hc_x, hc_y)

The first three items on this list are called the stringer information. The hc_x and hc_y coordinates are very rough estimates, and are not especially reliable (although they are better than nothing when you’re left with no other choice).

Last year, Jeff Zimmerman developed a method in which he found the average launch angle and exit velocity for batted balls fielded by each position. So, for example, a pop-up to the second baseman versus a fly ball to the right fielder, etc. In this way he used all three aspects of the stringer information to estimate the batted ball quality.
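A group-average fill of this kind might look roughly like the sketch below. This is one reading of the description above, not Zimmerman's actual code; the rows, type labels, and function names are invented for illustration, and it groups only by batted ball type and fielder position.

```python
from collections import defaultdict
from statistics import mean

# Invented toy rows: (batted ball type, fielder position, EV, launch angle).
# None stands for a missing TrackMan reading.
balls = [
    ("popup", "2B", 78.0, 70.0),
    ("popup", "2B", 82.0, 66.0),
    ("popup", "2B", None, None),
    ("fly_ball", "RF", 92.0, 35.0),
    ("fly_ball", "RF", 90.0, 31.0),
    ("fly_ball", "RF", None, None),
]

def group_means(rows):
    """Average EV and LA of the measured balls in each (type, position) group."""
    groups = defaultdict(list)
    for bb_type, pos, ev, la in rows:
        if ev is not None and la is not None:
            groups[(bb_type, pos)].append((ev, la))
    return {
        key: (mean(ev for ev, _ in vals), mean(la for _, la in vals))
        for key, vals in groups.items()
    }

def fill_missing(rows):
    """Replace missing readings with the group average, when one exists."""
    means = group_means(rows)
    filled = []
    for bb_type, pos, ev, la in rows:
        if ev is None or la is None:
            ev, la = means.get((bb_type, pos), (ev, la))
        filled.append((bb_type, pos, ev, la))
    return filled

filled = fill_missing(balls)
```

A group with no measured balls at all keeps its missing values, since there is nothing to average.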

I was working on a system of grouping balls using the hc_x and hc_y coordinates along with batted ball type. Before I had a chance to finish this project, MLB announced it would be filling in the missing data on its own. Since then, MLB has retroactively filled in data for 2015 and 2016 and provided data for the 2017 season. Once MLB implemented its solution, accompanied by Tom Tango’s article, I shifted my focus to other aspects of the game. The missing data problem went out of sight, out of mind. But I believe the missing data is still an issue that needs to be addressed.

The “No Nulls” Solution

As I have already stated, there are two types of missing batted ball data: missing games and missing batted balls. It comes down to an issue of measurement bias versus failure of measurement.

In the cases where the game is recorded, but an individual batted ball does not register, you can assume that the majority of the time (70-85 percent) the ball was poorly hit. Therefore the batted ball likely has a weaker than average launch angle and exit velocity.

When the game is entirely missed, you don’t have any information about the batted ball, so you cannot make any assumptions.

The data you see on Baseball Savant is going through a multi-step process to correct for this missing data. First, it is checking to see whether the game is entirely missed. If the game is missed, it gives an average launch angle and exit velocity for each stringer batted ball type. So, for example, a groundball single will have an exit velocity of 93 mph and a launch angle of -3 degrees, whereas a flyball double will have an exit velocity of 93 mph and a launch angle of 32 degrees.

If the game is recorded, but an individual batted ball is missing, it will assign a second set of values based upon the stringer data. These are, by far, the most common data corrections you see for batted balls with missing data. For example, a ground out has an exit velocity of 83 mph and a -21 degree launch angle, and a pop out is 80 mph with a 69 degree launch angle.

This is a simplistic understanding. Under further analysis, there appear to be multiple launch angle and exit velocity combinations for the various stringer types, even in missing games. So, the exact rules for how these numbers are distributed are a bit complicated and remain unknown to me. Presumably, you could reverse engineer them if you were so inclined.
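The simplified two-step version can be sketched as a pair of lookup tables. The values below are only the four examples quoted in the text; the real rule set is larger and not public, and the table names, key labels, and function are hypothetical.

```python
# Only the example values quoted above; the full tables are unknown.
# Keys are hypothetical (batted ball type, result) labels.
WHOLE_GAME_MISSING = {
    ("ground_ball", "single"): (93.0, -3.0),
    ("fly_ball", "double"): (93.0, 32.0),
}
SINGLE_BALL_MISSING = {
    ("ground_ball", "out"): (83.0, -21.0),
    ("popup", "out"): (80.0, 69.0),
}

def fill_in(bb_type, result, game_tracked, measured=None):
    """Return (exit velocity, launch angle) for one batted ball.

    measured is the radar reading, if there is one. Otherwise the value
    comes from one of two lookup tables, chosen by whether the game as a
    whole was tracked. Returns None when this sketch has no rule for the
    given stringer tags.
    """
    if measured is not None:
        return measured
    table = SINGLE_BALL_MISSING if game_tracked else WHOLE_GAME_MISSING
    return table.get((bb_type, result))

print(fill_in("ground_ball", "out", game_tracked=True))
print(fill_in("fly_ball", "double", game_tracked=False))
```

Since several value pairs appear to exist for a single stringer type, a faithful version would need more than one entry per key, plus whatever rule picks among them.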

Mountains of Balls

Since MLB is using a short list of rules to distribute exit velocity and launch angle information to the batted balls with missing data, we can reverse engineer the rules using the stringer data and frequency information.

Generally speaking, combinations of exit velocity (which has one decimal place) and launch angle (three decimals) are pretty random. There are so many possible combinations of launch angles and exit velocities that you wouldn’t expect many results for each given pair of numbers (for example 84.5 and 23.457). When you include the stringer tags, the number drops even further. In fact, there are only 75 combinations of exit velocity, launch angle, and stringer tag with five or more matches, compared to 346,956 with fewer than five matches, 99.8 percent of which are unique.

I have taken these 75 triplets and named them the most likely candidates for MLB’s “No Nulls” rules. It is possible that a few of these are not actually part of the “No Nulls” set, and it is possible there are a few combinations of “No Nulls” that are so rare that they haven’t yet occurred five times. For example, a pop-up triple.
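The counting step itself is simple. Below is a minimal sketch on an invented toy sample; the tuple layout, tag strings, and function name are assumptions for illustration, while the threshold of five comes from the text.

```python
from collections import Counter

def candidate_triplets(balls, min_count=5):
    """Count identical (EV, launch angle, stringer tag) triplets and keep
    those that occur at least min_count times."""
    counts = Counter(balls)
    return {triplet: n for triplet, n in counts.items() if n >= min_count}

# Invented toy sample: one triplet repeated six times, plus measured balls
# whose three-decimal launch angles make exact repeats very unlikely.
balls = [(83.0, -21.0, "ground_out")] * 6 + [
    (84.5, 23.457, "line_out"),
    (97.3, 11.204, "line_drive_single"),
    (71.8, -33.91, "ground_out"),
]

candidates = candidate_triplets(balls)
print(candidates)
```

Exact equality works here because the values are reported at fixed precision. With values that have been through arithmetic, rounding before counting would be safer.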

I am fairly confident that nearly all of the batted balls identified in this manner are No Nulls, and at the end of this piece I will include several tables that include all of the candidate groups of balls. For now, look at these two images, which show the distribution of No Nulls in the dataset.

The dark red bars show the batted balls measured by TrackMan, and the light red show the balls added using the No Nulls rules. Look at the enormous concentrations of balls around -24 to -20 degrees and 68 to 72 degrees. These are the majority of your 29,000 missing ground balls and 12,000 missing pop-ups. On the exit velocity chart you can see these balls between 80 and 84 mph.

There is a clear gap in frequencies of batted balls hit between 85 and 90 mph. It seems like there is a good chance the missing batted balls might fill in this gap. With the No Nulls solution, MLB has more than filled in this gap. Realistically, you’d probably expect these batted balls to be spread out more, following a more gradual curve. The exact shape of that curve is unknown, but the balls measured by TrackMan give us a good idea of what it might look like.

The MLB No Nulls solution has created very large spikes in the frequencies of very specific batted balls, but this gets even more messy when you start comparing different seasons. In the GIF below you will see vertical launch angle along the Y axis and exit velocity along the X axis. Each cell represents 2 mph by 2 degrees. The colors represent frequency as a percentile of the largest cell. Green cells are high frequency, and blue cells are low frequency.

This GIF shows the No Nulls frequency problem better than anything else I have seen to date. Do you see those cells that remain dark green in each season? Those are the missing batted balls filled in by MLB. Some of these cells mysteriously disappear in 2017. Can you guess why? I’ll give you a hint: I told you the answer up above.
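For anyone who wants to rebuild a grid like this, the binning can be sketched as follows. The function names and toy sample are invented, and expressing each cell as a share of the busiest cell is one interpretation of the color scale, not a confirmed description of how the GIF was made.

```python
import math
from collections import Counter

def cell(ev, la, size=2):
    """Lower corner (mph, degrees) of the size-by-size cell holding a ball."""
    return (math.floor(ev / size) * size, math.floor(la / size) * size)

def cell_shares(balls, size=2):
    """Map each occupied cell to its count as a share of the busiest cell."""
    counts = Counter(cell(ev, la, size) for ev, la in balls)
    if not counts:
        return {}
    biggest = max(counts.values())
    return {c: n / biggest for c, n in counts.items()}

# Invented toy sample: six balls land in one cell, one in another.
balls = [(83.0, -21.0)] * 4 + [(82.9, -20.699)] * 2 + [(101.3, 24.7)]

shares = cell_shares(balls)
print(shares)
```

Note that 83 mph at -21 degrees and 82.9 mph at -20.699 degrees fall into the same 2-by-2 cell, which is why filled-in values that differ slightly still pile up in one spot.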

The frequencies of various exit velocity and launch angle combinations are clearly changing over time. The difference between 2015 and 2017 is particularly stark. In 2017 the high exit velocity balls appear to be shifting up in launch angle, and the low launch angle balls appear to be moving down in exit velocity. You can also see growing frequencies of pop-ups. The suppressed number of TrackMan recorded pop-ups in 2015 may have been a technical limitation, perhaps. Maybe. I have no evidence for that, but it could be the case, considering how 2016 and 2017 suddenly have balls measured above 80 degrees, although it is clear that the number of pop-ups is definitely increasing each season.

However, with all of these changes, those sticky No Nulls cells are constant.

If the No Nulls are remaining constant from year to year, and there is a clear league-wide migration of batted balls, wouldn’t that mean the No Nulls will get increasingly less accurate over time? If we assume these changing trends are constant, anyway. Perhaps in 2018 some of this will reverse course, batters will start hitting lower launch angle hard-hit balls, and ground balls will increase in velocity. Maybe. That seems unlikely. It is more likely that players will hit even more balls into the air, in an effort to maximize the value of each plate appearance.

Change Is In The Air

When MLB instituted this No Nulls solution, it did so to correct the average launch angle and exit velocity data, both for the major leagues and for individual players. However, if the average distributions keep changing and the No Nulls aren’t updated to match, these No Nulls may end up overcompensating and hurting the data. For example, in the table below I have put the average exit velocity for all balls hit below -20 degrees for each of the three seasons. Notice how it is dropping with each season.

Below -20 Degrees

| Year | Exit Velocity |
|---|---|
| 2015 | 74.03 |
| 2016 | 73.82 |
| 2017 | 69.16 |

SOURCE: Baseball Savant Discounting Nulls

The difference between 2017 and 2016 is dramatic. Many of the No Nulls ground balls fall into this category — about 25,000 of them. The vast majority of these No Nulls ground balls are listed with an exit velocity of 83 mph. That seems a bit high, in light of the sudden drop in groundball exit velocity in 2017. Perhaps I am wrong. Maybe TrackMan happened to record more of the softly hit ground balls and missed all of the hard hit ones. It’s certainly a possibility.

If the batters are, in fact, producing weaker contact on ground balls and the No Nulls solution hasn’t accounted for this, then the average exit velocity may actually be lower than what you see on Baseball Savant. Perhaps MLB is doing something with the No Nulls to keep up with these trends. I can’t see that it has, and I think the above GIF speaks for itself. These clusters of batted balls haven’t changed, even though the landscape around them has.
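A per-season average of this sort, with the filled-in balls discounted, could be computed along these lines. The rows are invented and the `is_filled` flag is a hypothetical marker; Baseball Savant does not label filled-in balls, so in practice the flag would have to be derived from the candidate definitions in the appendix.

```python
from collections import defaultdict
from statistics import mean

def mean_ev_below(balls, max_angle=-20.0):
    """Average exit velocity by season for measured balls hit below max_angle.

    balls: iterable of (year, exit velocity, launch angle, is_filled).
    """
    by_year = defaultdict(list)
    for year, ev, la, is_filled in balls:
        if not is_filled and la < max_angle:
            by_year[year].append(ev)
    return {year: mean(evs) for year, evs in sorted(by_year.items())}

# Invented toy rows, not Statcast data.
balls = [
    (2016, 75.0, -25.0, False),
    (2016, 73.0, -30.0, False),
    (2016, 83.0, -21.0, True),   # filled-in ball, excluded
    (2017, 70.0, -22.0, False),
    (2017, 68.0, -40.0, False),
    (2017, 95.0, 10.0, False),   # above the angle cutoff, excluded
]

by_season = mean_ev_below(balls)
print(by_season)
```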

What To Do Going Forward

When you are looking at major league average launch angle and exit velocity, you should use the No Nulls solution put forward by MLB. You should understand that these numbers are estimates, and even as estimates they appear to have a minor flaw. The league-average launch angle probably will not be off by much, but exit velocity could be off by as much as 1 mph.

The league averages aren’t a big concern, though. Nor are the player averages. Rather, you must be careful when you examine the league-wide results when bucketing balls based upon their launch angle, exit velocity, or batted ball type, particularly balls hit between -30 and -20 degrees or above 60 degrees. Bucketing these batted balls will subject you to a double whammy: both their frequency and their exit velocity are artificially inflated by the No Nulls solution.

Whenever you are bucketing batted balls, you will want to first remove the No Nulls balls. I have created four tables consisting of the No Nulls categories I have identified. You can use these definitions to remove these balls from your own research, if you deem it necessary.
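One way to apply the definitions is a simple set lookup, sketched below. The set holds only a handful of rows copied from the appendix tables, so a real filter would need all of them; the function names and the sample balls are invented for illustration.

```python
# A few (stringer, EV, vertical angle) rows copied from the appendix tables.
# A complete filter would list every candidate row.
NO_NULLS = {
    ("Ground Out", 83.0, -21.0),
    ("Ground Out", 82.9, -20.699),
    ("Ground Out", 41.0, -39.0),
    ("Line Out", 91.0, 18.0),
    ("Fly Out", 89.0, 39.0),
}

def is_no_null(stringer, ev, la):
    """True if the ball matches one of the listed candidate triplets."""
    # Round to the reported precision so float noise cannot hide a match.
    return (stringer, round(ev, 1), round(la, 3)) in NO_NULLS

def drop_no_nulls(balls):
    """Keep only balls that do not match a candidate triplet."""
    return [b for b in balls if not is_no_null(*b)]

# Invented sample balls: two match a definition, two do not.
balls = [
    ("Ground Out", 83.0, -21.0),
    ("Ground Out", 88.4, -9.312),
    ("Fly Out", 89.0, 39.0),
    ("Fly Out", 89.0, 38.5),
]

kept = drop_no_nulls(balls)
print(kept)
```

A genuinely measured ball can land on one of these triplets by coincidence, so a filter like this will occasionally throw out real data. Given how rare exact repeats are among measured balls, that loss should be small.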

The No Nulls solution implemented by MLB could be better, but it is difficult to criticize without knowing the exact rules that are being used to classify each batted ball. Clearly the balls could be smoothed out more, perhaps using fielder location or some other metric. It appears that the rules are being applied across all three seasons evenly, but perhaps they should be tailored to each individual season. But, again, I don’t know how MLB is assigning the balls, so perhaps fielder location and seasonal variations are already included to some extent.

Ultimately, though, there is only one true solution to the problem, and that is TrackMan recording 100 percent of the data. It goes without saying that this is one of the highest priorities. Any attempt we as analysts make to manipulate the data after the fact will leave us wanting for more.

References & Resources

Appendix: No Nulls Definitions

Stringer Ground Balls

| Stringer | EV | Vertical Angle | Sample |
|---|---|---|---|
| Ground Ball Double | 90 | -13 | 124 |
| Ground Ball Double | 90.2 | -13 | 92 |
| Ground Ball Double | 93 | -1 | 12 |
| Ground Ball Error | 43 | -62 | 57 |
| Ground Ball Error | 84 | -20 | 462 |
| Ground Ball Error | 86 | -11 | 15 |
| Ground Ball Single | 40 | -36 | 988 |
| Ground Ball Single | 90 | -17 | 2340 |
| Ground Ball Single | 90.3 | -17.3 | 1128 |
| Ground Ball Single | 93 | -3 | 149 |
| Ground Ball Triple | 94 | -12 | 6 |
| Ground Ball Triple | 94.3 | -12.1 | 11 |
| Ground Out | 41 | -39 | 3777 |
| Ground Out | 82.9 | -20.699 | 6083 |
| Ground Out | 83 | -21 | 13296 |
| Ground Out | 84 | -13 | 507 |
| Total Ground Balls | 76.97 | -23.07 | 29047 |

SOURCE: Statcast

Stringer Line Drives

| Stringer | EV | Vertical Angle | Sample |
|---|---|---|---|
| Line Drive Double | 98.8 | 17.1 | 96 |
| Line Drive Double | 99 | 17 | 244 |
| Line Drive Home Run | 104 | 24 | 76 |
| Line Drive Home Run | 104.4 | 23.699 | 21 |
| Line Drive Single | 41 | 16 | 5 |
| Line Drive Single | 90 | 15 | 564 |
| Line Drive Single | 90.4 | 14.6 | 135 |
| Line Drive Triple | 98.4 | 18 | 11 |
| Line Drive Triple | 99 | 18 | 31 |
| Line Out | 37 | 31 | 37 |
| Line Out | 91 | 18 | 512 |
| Line Out | 91.1 | 18.199 | 128 |
| Total Line Drives | 91.76 | 17.24 | 1860 |

SOURCE: Statcast

Stringer Fly Balls

| Stringer | EV | Vertical Angle | Sample |
|---|---|---|---|
| Fly Ball Home Run | 103 | 30 | 160 |
| Fly Ball Home Run | 102.8 | 30.199 | 57 |
| Fly Ball Triple | 97 | 31 | 11 |
| Fly Ball Double | 95 | 29 | 18 |
| Fly Ball Double | 93.1 | 32 | 32 |
| Fly Ball Double | 93 | 32 | 48 |
| Fly Out | 89.2 | 39.299 | 594 |
| Fly Out | 89 | 38 | 213 |
| Fly Out | 89 | 39 | 1323 |
| Fly Ball Single | 73 | 34 | 11 |
| Fly Ball Single | 71.4 | 36 | 19 |
| Fly Ball Single | 71 | 36 | 55 |
| Total Fly Balls | 89.85 | 37.79 | 2541 |

SOURCE: Statcast