Two years ago in a FanGraphs article, we outlined several versions of an algorithm for estimating a pitcher’s release point distance using PITCHf/x data. The goal was to use PITCHf/x to estimate the average distance from home plate each pitcher releases his pitches, conceding that each pitch is going to be released from a slightly different distance. We’d do this by finding the distance from home plate where his pitches were most tightly clustered and label that as the “release point.” This would add a variable third dimension to current analysis of release points, since most is done at 55 feet from home plate, and help identify changes in release point that may not be evident to the eye. We considered the case of home and away separately to see if and how PITCHf/x data from various ballparks affects the estimates.

In this update, we combine the previous versions of the algorithm into a single multistage one and apply it to Stephen Strasburg as a test case, and to the 10 pitchers who threw the most pitches in 2014 (and did not change teams from 2013). The former serves as the easiest way to explain the new algorithm, as Strasburg’s data present a challenge due to their spatial distribution and also highlight some capabilities of the algorithm itself.

Test Case: Stephen Strasburg

We begin with a heat map of Strasburg’s pitch locations from his 2014 home data between 50 and 60 feet, normalized by the most dense location over this interval, to get an idea of its distribution:

There appear to be two disjointed clusters, roughly a foot apart from center to center, indicating that either Strasburg switched the horizontal location (i.e., the x-variable in PITCHf/x) of his release point once in 2014 or used two different horizontal locations for his release point over the year. The clusters appear approximately equidistant across this 10-foot interval, with the one on the right being more densely populated. The goal is to split these two clusters into separate data sets and then find where each is most compact, which will be taken to be the point of origin for the pitches. Since there are two clusters, this also provides a means of direct comparison between the two y-values, or estimated distances from home, of the release point.

The algorithm begins with a simple temporal splitting method, similar to signal processing, to sort the data in the order they were generated. This method works with small groups of consecutive pitches, finds their average location, and looks for when consecutive disjointed groups are more than some prescribed distance apart. If this occurs, the data set is separated at the point where the average locations are farthest apart and the pitches on either side are each considered a separate data set in the next stage of the algorithm.

We consider the PITCHf/x data at 55 feet (the default starting point since it is expensive computationally to run this entire process over each y-location considered) and, using 75-pitch groupings, look to see if successive groupings of this size differ by more than 0.75 feet. This is accomplished using a rolling average: find the average location in x (horizontal location) and z (vertical location) of pitches 1-75, 2-76, …, and then compare the locations of pitches 1-75 and 76-150, 2-76 and 77-151, … to see if they identify as disjointed clusters. The values of 75 pitches and 0.75 feet of separation were chosen as a reasonable lower bound for the number of pitches a starter may throw in a game and a distance that would generally lead to clusters with little to no intersection (taking it much smaller than 0.75 feet, say around 0.4 feet, works well in some cases but will generally produce several clusters with large areas of overlap, which serves to reduce the size of each data set). However, these numbers can be changed depending on the specific pitcher to improve the results, as will be seen later. As expected, this produces two clusters:

The above GIF shows all pitches (all blue data) at 55 feet followed by the data split into two sets: the red data representing the first cluster and blue the second. This confirms that Strasburg did shift his release point only once in 2014. We can also plot the aforementioned rolling averages to get an idea of the evolution of his pitches at 55 feet from home:

The first image is the rolling average of all home pitches (continuous black line) with the green square located at the average location of the first 75 pitches and the red the last 75. The presence of a change in release point is typically associated with sections of the rolling average being very linear, whereas when the release point is stationary, it appears more random. The blue square indicates the location of the change in release point location. The second image shows the rolling average over each separate cluster, with the same color scheme as the previous GIF and numbers indicating the order of the clusters in time.
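As a rough illustration of the temporal splitting stage, the following Python sketch computes the rolling averages at 55 feet and splits the data where adjacent groups of pitches are farthest apart. It is not the R code used for this article: the function names, the use of straight-line distance between group averages, and the recursive re-splitting of each side are assumptions made for the example.

```python
from math import hypot


def window_means(xs, zs, w):
    """Mean (x, z) of every run of w consecutive pitches: 1-w, 2-(w+1), ..."""
    sx, sz = sum(xs[:w]), sum(zs[:w])
    means = [(sx / w, sz / w)]
    for i in range(w, len(xs)):
        sx += xs[i] - xs[i - w]
        sz += zs[i] - zs[i - w]
        means.append((sx / w, sz / w))
    return means


def find_split(xs, zs, w=75, min_sep=0.75):
    """Index of the first pitch of the later cluster, or None if no split."""
    if len(xs) < 2 * w:
        return None
    means = window_means(xs, zs, w)
    best_gap, best_i = 0.0, None
    # Compare pitches i..i+w-1 with the adjacent group i+w..i+2w-1.
    for i in range(len(means) - w):
        gap = hypot(means[i + w][0] - means[i][0],
                    means[i + w][1] - means[i][1])
        if gap > best_gap:
            best_gap, best_i = gap, i
    return best_i + w if best_gap > min_sep else None


def split_in_time(xs, zs, w=75, min_sep=0.75):
    """Assumed extension: keep splitting each side until no gap exceeds min_sep.

    Returns (start, end) index pairs, end exclusive, in time order.
    """
    cut = find_split(xs, zs, w, min_sep)
    if cut is None:
        return [(0, len(xs))]
    left = split_in_time(xs[:cut], zs[:cut], w, min_sep)
    right = split_in_time(xs[cut:], zs[cut:], w, min_sep)
    return left + [(a + cut, b + cut) for a, b in right]
```

Here `xs` and `zs` would be the horizontal and vertical pitch locations at 55 feet, sorted in the order the pitches were thrown. Lowering `min_sep` or `w` corresponds to the parameter adjustments discussed later for overlapping clusters.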

Even though we have not yet found the y-coordinate of the release point, we can still surmise that Strasburg’s release point at home started high, compared to the rest of the season, and dropped over time. Then at home-pitch 1,221 out of 1,770, the estimated pitch where the change occurred, Strasburg moved about a foot to the left and again, over time, his release point fell. While video evidence makes it hard to validate the vertical shift in these observations, we can substantiate the horizontal shift. First, here is a GIF from a home start against Atlanta on April 5, 2014:

Note that Strasburg’s right heel is positioned on the left edge of the pitching rubber on the pitch to Jason Heyward. Now, moving to one of his last home starts, also against Atlanta and Heyward, on Sept. 10, 2014:

In this start, Strasburg’s right foot is closer to centered on the rubber, accounting for the horizontal shift to the left in the data (which is from the opposite perspective).

With these two clusters of pitches now separated, a spatial clustering algorithm is run to remove outliers and additional small clusters not identified by the previous phase, and fit the remaining values to a bivariate normal distribution. We choose this distribution under the assumption that the pitcher, within each cluster with respect to time, has a single theoretical release point and every pitch is some slight variation of this point. This would account for the dense population of data in the center that becomes more sparse moving outward. The spatial clustering is also performed at 55 feet, as performing it at all reasonable distances becomes prohibitive in terms of run time and consistency in the data at each distance considered.

The algorithm is incremental in terms of finding clusters and, starting with the most dense area of the data, labels that as the center of the first bivariate cluster. It then goes through and looks for a second center that is separate from the first cluster, and continues until no new independent clusters can be found. This is adapted from an incremental k-means clustering algorithm for n-dimensional data and is optimized here to run for the 2D case. Working only with the largest cluster, the pitches kept are those that have a 0.05 probability or better of belonging to this cluster. In practice, this usually maintains at least 90 percent of the data from the original set.
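One possible reading of the 0.05 cutoff is as a membership probability: each pitch is assigned a probability of belonging to each fitted bivariate normal cluster, and it is kept if its probability for the largest cluster is at least 0.05. The Python sketch below illustrates that reading only. The mixture-style weighting, the function names, and the `(weight, mean, cov)` representation are assumptions, and the step that finds the cluster centers incrementally is not shown.

```python
from math import exp, pi, sqrt


def gauss2d(p, mean, cov):
    """Bivariate normal density at p = (x, z); cov = (sxx, sxz, szz)."""
    sxx, sxz, szz = cov
    det = sxx * szz - sxz * sxz
    dx, dz = p[0] - mean[0], p[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    d2 = (szz * dx * dx - 2.0 * sxz * dx * dz + sxx * dz * dz) / det
    return exp(-0.5 * d2) / (2.0 * pi * sqrt(det))


def membership(p, clusters):
    """Probability that p belongs to each cluster (weight, mean, cov)."""
    dens = [w * gauss2d(p, m, c) for w, m, c in clusters]
    total = sum(dens)
    if total == 0.0:  # point is far from every cluster
        return [0.0] * len(clusters)
    return [d / total for d in dens]


def keep_largest(points, clusters, cutoff=0.05):
    """Keep the points whose membership in the heaviest cluster is >= cutoff."""
    big = max(range(len(clusters)), key=lambda j: clusters[j][0])
    return [p for p in points if membership(p, clusters)[big] >= cutoff]
```

Under this reading, a data set in which only one cluster is found keeps every pitch, while pitches sitting inside a small secondary cluster are discarded.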

From this subset of pitches, the sample mean vector and sample covariance matrix for the data are found at distances ranging, inch by inch, from 45 to 65 feet, an extremely wide range of values to ensure that we are finding the global minimum. From the mean vector and covariance matrix, the area that accounts for two standard deviations away from the center of the cluster at that distance is found, which forms an ellipse. This is the metric used for measuring the size of each cluster. The two standard deviations here (as opposed to one or three) have no effect on the location of the minimum but work well to give a general size of the cluster. We then find the distance from home plate at which this area is minimum and label this distance, as well as its associated mean x- and z-values, as the “release point.”
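The scan over distances can be sketched in Python as follows. This is a minimal illustration, not the article's R code. It assumes each pitch is stored as the nine constant-acceleration PITCHf/x parameters (initial position, velocity, and acceleration), and it takes the two-standard-deviation ellipse area to be pi * 2^2 * sqrt(det(covariance)). The function names are hypothetical.

```python
from math import pi, sqrt


def locate(pitch, y):
    """(x, z) of one pitch at distance y (feet) from home plate.

    pitch = (x0, y0, z0, vx0, vy0, vz0, ax, ay, az); assumes vy0 < 0
    (ball moving toward the plate) and ay > 0.
    """
    x0, y0, z0, vx0, vy0, vz0, ax, ay, az = pitch
    # Solve y0 + vy0*t + 0.5*ay*t^2 = y, taking the root with t = 0 at y = y0.
    t = (-vy0 - sqrt(vy0 * vy0 - 2.0 * ay * (y0 - y))) / ay
    return (x0 + vx0 * t + 0.5 * ax * t * t,
            z0 + vz0 * t + 0.5 * az * t * t)


def mean_and_cov(points):
    """Sample mean (mx, mz) and sample covariance (sxx, sxz, szz)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    mz = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    szz = sum((p[1] - mz) ** 2 for p in points) / (n - 1)
    sxz = sum((p[0] - mx) * (p[1] - mz) for p in points) / (n - 1)
    return (mx, mz), (sxx, sxz, szz)


def ellipse_area(cov, k=2.0):
    """Area of the k-standard-deviation ellipse: pi * k^2 * sqrt(det(cov))."""
    sxx, sxz, szz = cov
    return pi * k * k * sqrt(max(sxx * szz - sxz * sxz, 0.0))


def release_point(pitches, lo_ft=45, hi_ft=65):
    """Scan inch by inch; return (area, y, mean_x, mean_z) at the minimum."""
    best = None
    for inch in range(lo_ft * 12, hi_ft * 12 + 1):
        y = inch / 12.0
        mean, cov = mean_and_cov([locate(p, y) for p in pitches])
        area = ellipse_area(cov)
        if best is None or area < best[0]:
            best = (area, y, mean[0], mean[1])
    return best
```

Because the area scales with `k * k` at every distance, changing `k` rescales the whole curve without moving its minimum, which is why the choice of two standard deviations does not affect the estimated distance.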

For Strasburg’s home 2014 data, these clusters appear as:

Both give very similar distances from home plate: 54 feet three inches and 54 feet four inches. Away from Nationals Park, three clusters are found for Strasburg in 2014. Also, for his data from home and away in 2013, a single cluster each is identified:

Stephen Strasburg, 2013-2014

Year  H/A      x       y       z      Size  Sample  Percent  Area
2014  Home #1  -1.442  54.25   6.061  1220  1219    99.9     0.215
2014  Home #2  -2.35   54.333  6.005  550   549     99.8     0.155
2014  Away #1  -1.47   54.417  6.1    946   937     99       0.246
2014  Away #2  -2.467  54      6.11   305   253     83       0.182
2014  Away #3  -2.355  53.167  5.924  262   217     82.8     0.196
2013  Home     -1.326  54.417  6.218  1497  1373    91.7     0.178
2013  Away     -1.338  54.333  6.137  1347  1347    100      0.242

The values in the table are the x-, y-, and z-estimates of the release point (in feet), the size of the data set after temporal splitting, the sample size of the data after spatial clustering, the percent of the data used after spatial clustering, and the area of the two-standard deviation ellipse, in square feet. Most of the samples use over 90 percent of the data and place the estimate of the distance from home plate of release at roughly 54 feet four inches. The exceptions are two clusters away from home in 2014, which each use only about 80 percent of the pitches. The problem here is that the two clusters were separated in time by shrinking the distance between clusters from 0.75 to 0.4 feet, which was necessary since the data, upon inspection, did not appear to be a single cluster. Also, both have strongly linear rolling averages, so that there really is not a centralized point from which the data are generated, compared to the case for the first cluster, which is more dense around a central location.

Based on experience with the algorithm, when the data do not closely resemble a bivariate normal distribution, more data are discarded, in this case around 20 percent, and the release point estimates get pushed toward home plate. However, since these typically go hand-in-hand, it is easy to discern when the estimate is likely to be poor.

Estimates for Top 10 Pitchers in Pitches Thrown in 2014

To further examine this algorithm, we will run it for the top 10 pitchers in terms of number of pitches thrown in 2014, using only pitchers who did not change teams from 2013, to ensure the home data is standardized to a single ballpark for comparisons with 2013. No adjustments are made to the parameters here from the 75 pitches and 0.75 feet between clusters, such as was done with Strasburg’s 2014 away data, but some cases will be similarly explored later on.

2013-2014 Estimates for Select 2014 Pitches Thrown Leaders

Pitcher           Year  H/A   x       y       z      Size  Sample  Percent  Area
Johnny Cueto      2013  Home  -2.499  54.75   5.781  501   495     98.8     0.296
Johnny Cueto      2013  Away  -2.353  53.75   5.702  447   438     98       0.339
Johnny Cueto      2014  Home  -2.282  55      5.893  1904  1856    97.5     0.327
Johnny Cueto      2014  Away  -2.155  54.25   5.779  1724  1715    99.5     0.439
Max Scherzer      2013  Home  -3.456  54.917  5.351  1647  1619    98.3     0.412
Max Scherzer      2013  Away  -3.359  54.083  5.316  1741  1569    90.1     0.386
Max Scherzer      2014  Home  -3.575  54.583  5.375  1543  1540    99.8     0.299
Max Scherzer      2014  Away  -3.509  54.25   5.337  2092  2090    99.9     0.361
James Shields     2013  Home  -1.102  54.5    6.088  1700  1691    99.5     0.43
James Shields     2013  Away  -1.251  54.417  6.019  1957  1855    94.8     0.399
James Shields     2014  Home  -1.441  54      5.803  1614  1604    99.4     0.419
James Shields     2014  Away  -1.508  54.083  5.892  2016  2013    99.9     0.6
R.A. Dickey       2013  Home  -0.718  53.5    5.929  1833  1754    95.7     0.337
R.A. Dickey       2013  Away  -0.741  52.5    5.962  1670  1664    99.6     0.449
R.A. Dickey       2014  Home  -0.618  53.083  5.88   1750  1686    96.3     0.246
R.A. Dickey       2014  Away  -0.724  52.75   5.95   1754  1744    99.4     0.376
Corey Kluber      2013  Home  -1.763  54.75   5.681  1115  1000    89.7     0.145
Corey Kluber      2013  Away  -1.771  54.333  5.736  1172  860     73.4     0.171
Corey Kluber      2014  Home  -2.097  54.583  5.896  1736  1727    99.5     0.284
Corey Kluber      2014  Away  -1.952  53.917  5.774  1752  1469    83.8     0.3
A.J. Burnett      2013  Home  -1.782  54.417  5.797  1432  1432    100      0.256
A.J. Burnett      2013  Away  -1.74   53.833  5.864  1577  1571    99.6     0.336
A.J. Burnett      2014  Home  -1.903  54.5    6.029  1734  1701    98.1     0.259
A.J. Burnett      2014  Away  -1.857  53.5    5.9    1727  1684    97.5     0.346
Lance Lynn        2013  Home  -2.165  54.5    5.53   1655  1639    99       0.27
Lance Lynn        2013  Away  -2.167  53.667  5.501  1696  1669    98.4     0.353
Lance Lynn        2014  Home  -2.318  54.25   5.558  1894  1894    100      0.239
Lance Lynn        2014  Away  -2.153  53.5    5.555  1551  1551    100      0.312
Felix Hernandez   2013  Home  -2.024  53.083  6.005  1564  1564    100      0.397
Felix Hernandez   2013  Away  -2.106  54.917  6.198  1599  1598    99.9     0.317
Felix Hernandez   2014  Home  -2.145  52.917  5.967  1712  1602    93.6     0.329
Felix Hernandez   2014  Away  -2.149  54.333  6.16   1717  1716    99.9     0.244
Chris Tillman     2013  Home  -2.219  53.917  7.046  1859  1817    97.7     0.277
Chris Tillman     2013  Away  -2.098  54.083  6.86   1611  1549    96.2     0.36
Chris Tillman     2014  Home  -2.05   54.083  6.964  1822  1773    97.3     0.321
Chris Tillman     2014  Away  -2.032  53.197  6.858  1584  1427    90.1     0.356
Justin Verlander  2013  Home  -1.99   53.5    6.705  1877  1873    99.8     0.27
Justin Verlander  2013  Away  -1.887  53.25   6.717  1804  1799    99.7     0.426
Justin Verlander  2014  Home  -1.959  53.75   6.661  1504  1490    99.1     0.233
Justin Verlander  2014  Away  -1.857  53.083  6.597  1901  1883    99.1     0.338

Example Case: Lance Lynn

Consider Lance Lynn’s 2014 home data. No additional clusters in time are identified and 100 percent of the data are used after the spatial clustering phase of the algorithm. The area of the ellipse related to two standard deviations ranks among the smallest of the 10 pitchers in the study.

Note that there is a small centralized area where the pitch locations are most dense and as the distance changes, the cluster contracts to the theoretical point of release and then expands.

The data points themselves mirror the shape of the ellipse shown.

The rolling averages stay horizontally within four inches of each other and vary in height by slightly less (around three inches). However, for data that perform well, the release distance estimate can still vary slightly from year-to-year, with the estimate going from 54.5 feet in 2013 (a similarly good data set) to 54.25 in 2014.

Comparison of Release Point Distance Estimates

With these estimates in hand, we will focus on the y-values, or estimated release point distances, and make three comparisons to check consistency of the estimates: 2013 Home vs. 2014 Home, 2013 Home vs. 2013 Away, and 2014 Home vs. 2014 Away. As a metric, we will use the minimal distance that the ordered pair of the two estimates is from matching exactly.

The two sets of home data generate fairly consistent estimates for the 10 pitchers, with an average minimal distance for each point to the blue line, where the estimates would match, of 0.183 feet, or about two inches.
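The 0.183-foot figure can be reproduced from the home y-values in the table above if the “minimal distance” is read as the perpendicular distance from the point (2013 estimate, 2014 estimate) to the line where the two estimates are equal. That reading is an assumption, but it matches the reported average. A short Python check, with hypothetical names:

```python
from math import sqrt

# Estimated release distances in feet: (2013 home, 2014 home).
HOME_Y = {
    "Cueto": (54.75, 55.0),
    "Scherzer": (54.917, 54.583),
    "Shields": (54.5, 54.0),
    "Dickey": (53.5, 53.083),
    "Kluber": (54.75, 54.583),
    "Burnett": (54.417, 54.5),
    "Lynn": (54.5, 54.25),
    "Hernandez": (53.083, 52.917),
    "Tillman": (53.917, 54.083),
    "Verlander": (53.5, 53.75),
}


def distance_to_match(a, b):
    """Perpendicular distance from the point (a, b) to the line b = a."""
    return abs(a - b) / sqrt(2.0)


average = sum(distance_to_match(a, b) for a, b in HOME_Y.values()) / len(HOME_Y)
print(round(average, 3))  # prints 0.183
```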

Most of the estimates fall within a reasonable range of 54 to 55 feet. However, two outliers appear in R.A. Dickey and Felix Hernandez. After modifying the parameters in the algorithm, we were unable to find a better estimate for the release point for Dickey, so the problem does not appear to be with the parameters. Speculating a bit, that may be due to Dickey throwing a knuckleball. With those pitches being fitted to the very smooth trajectory of the quadratic functions used in the nine-parameter PITCHf/x model, the extension of the data to the release point distance is not as accurate as that of other pitches. As for Hernandez, upon changes to the parameters (shrinking the distance of separation, in feet, from 0.75 to 0.4 in 2013 and 0.3 in 2014), the estimates move slightly farther from home:

Felix Hernandez, 2013-2014

Year  H/A      x       y       z      Size  Sample  Percent  Area
2013  Home #1  -2.045  53.75   5.986  1169  1132    96.8     0.308
2013  Home #2  -2.133  53.833  6.339  111   108     97.3     0.194
2013  Home #3  -2.068  55      6.012  284   257     90.5     0.199
2014  Home #1  -2.317  53.417  6.2    319   318     99.7     0.209
2014  Home #2  -2.162  54.083  5.936  1393  1379    99       0.243

A more in-depth analysis of this, since the same trend does not show up in Hernandez’s away data, would likely require a study on the release point distances of pitchers at Safeco Field to see if the discrepancy is endemic to the park or relegated to Hernandez himself.

The comparison of the 2013 Home and Away data demonstrates an interesting consequence of the algorithm.

For eight out of the 10 pitchers, the away estimate of the release distance comes closer to home plate than the home estimates, some by as much as a foot. As noted earlier, the algorithm tends to push the release distance closer to home when the data do not appear to be close to a bivariate normal cluster, which is more prevalent with the away data. Possible indicators of this include a low percentage, below 90 percent or so, of the data used after spatial clustering or a large value for the area of the ellipse. If we focus on the Area value to get an idea of how spread out the data are for these two cases, the home clusters average an area of 0.309 square feet and the away 0.354 square feet, so the latter case is more spread out. This is possibly due to variations in the acquisition of PITCHf/x data from different stadiums or minor adjustments made by pitchers from park to park.

This trend holds up for the 2014 data as well:

Again, eight of 10 away estimates come in below their home counterparts. Both home/away splits have average minimal distances from a perfect match just under half a foot.

Based on the results for this small sample, we can say that it is probably better to work with only the home data when dealing with release points. For the 10 pitchers considered, and for Strasburg, the algorithm appears to work well in most cases to produce consistent estimates between the two years, with even the two outlying estimates being consistent between 2013 and 2014. However, in some instances, adjusting the parameters in the algorithm can serve as a way to better sort the data and find changes in release point when the clusters overlap. This is hard to do with any fixed set of parameters, but working with a single pitcher, the values can usually be fine-tuned to work well.

To demonstrate this, we will focus on a particular case where the data appear to be multiple overlapping clusters and try to adjust the parameters to see if we can get better results. It will also illustrate some of the difficulties that show up in the data, especially from multiple parks for the away data of a single pitcher. For this extended analysis, we will focus on Corey Kluber in 2013, since both of his percentages of data used are on the low end, at 89.7 percent at home and 73.4 percent away from Progressive Field. Percentages in this range indicate that it is likely that, in time, more clusters can be identified and, in turn, better estimates can be found based on each.

Multiple Clusters: Corey Kluber in 2013

As with Strasburg, we will start with Kluber’s Home 2013 heat map:

Notice that right around 53-55 feet, there’s a small cluster to the right of the main one. This was not picked up in the original algorithm since the required distance for separation was 0.75 feet, but shrinking this to 0.35, we can separate the two.

For the data at 55 feet, a small cluster is present early in the season (red) followed by a larger cluster to the left (blue).

The rolling averages show that Kluber’s release point, examined at 55 feet, dropped early on and then shifted to the left, where it remained for the rest of the season.

Corey Kluber, 2013 Home Results

Year  H/A      x       y       z      Size  Sample  Percent  Area
2013  Home #1  -1.312  54.917  5.8    105   96      91.4     0.098
2013  Home #2  -1.774  54.75   5.677  1010  975     96.5     0.136

The 105 pitches in the first cluster account for, almost exactly, the 106 pitches thrown in two home starts against Boston and Minnesota in 2013 (since there is some overlap between the clusters, it is reasonable for this to be off by a pitch or two). After that, his release point moved horizontally by about 5.5 inches and dropped 1.5 inches. The larger cluster places his release point distance at home for 2013 at 54 feet nine inches.
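The conversions quoted here are simple arithmetic on the table values, which are in decimal feet. As a small worked check in Python (the helper name `feet_inches` is made up for the example):

```python
def feet_inches(y):
    """Convert decimal feet to (feet, inches), rounded to the nearest inch."""
    feet = int(y)
    inches = round((y - feet) * 12)
    if inches == 12:  # e.g. 54.99 rounds up to the next foot
        feet, inches = feet + 1, 0
    return feet, inches


# Kluber, 2013 home: cluster #1 -> cluster #2, shifts converted to inches.
horizontal_shift = (-1.312 - (-1.774)) * 12  # about 5.5 inches
vertical_drop = (5.8 - 5.677) * 12           # about 1.5 inches

print(feet_inches(54.75))  # (54, 9), i.e. 54 feet nine inches
```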

The away data present a larger challenge as several more clusters appear, many overlapping.

There appears to be one dominant cluster with two lesser clusters appearing to its right and lower left. To capture all of the clusters, the parameters are modified to identify overlapping clusters (looking for clusters whose 50-pitch rolling averages are 0.3 feet apart or less). With this choice of parameters, five temporal clusters are found (ordered sequentially in time as red, blue, green, orange, and black).

As with the 2013 home results, Kluber’s pitches at 55 feet slid to the left, from the catcher’s perspective, but then made a jump down and left, then back up, and finally back down. Continuing to the spatial clustering phase, the five clusters each produce an estimate of the release point.

Corey Kluber, 2013 Away Results

Year  H/A      x       y       z      Size  Sample  Percent  Area
2013  Away #1  -1.454  54.5    5.768  329   328     99.7     0.155
2013  Away #2  -1.813  54.75   5.762  493   461     93.5     0.127
2013  Away #3  -2.088  54.5    5.389  93    74      79.6     0.054
2013  Away #4  -1.878  54.667  5.746  166   166     100      0.101
2013  Away #5  -2.088  54      5.564  91    86      94.5     0.076

The first cluster in time of 329 pitches differs by only one pitch from the 330 pitches thrown in four road starts from April 20 to May 15 (based on the data, this and the next cluster appear to be missing a pitch in PITCHf/x, which accounts for the difference). This is followed by 493 pitches which match up approximately with the 494 pitches made in five road starts from May 26 to July 2. The next cluster of 93 pitches aligns with a start at Minnesota on July 20, where Kluber’s release point shifts left and downward. The next two starts of 166 pitches at the White Sox and Royals move back to the area of the second release point, and his season concludes with the release point changing again for a 91-pitch effort at Minnesota. If we ignore, for the moment, the two starts at Minnesota, Kluber’s release point on the road shows the same pattern as at home, with a single shift early in the season, which, combining the home and away analyses, occurred on May 21 versus Detroit.

Why the change exists for the two starts at Minnesota is unclear, as it does not carry over to the home data or any other starts on the road. It may be related to the PITCHf/x data from Minnesota or solely to Kluber. Much as with Felix Hernandez, finding the cause of this would require further investigation.

Much of the away data for pitchers have similar characteristics with multiple, slightly displaced clusters, as was seen with Strasburg, and so this is the likely cause for the disparity between the y-values of the release point distance between the home and away data for the pitchers considered earlier. However, it is difficult to attribute this to the data being from multiple ballparks or to a pitcher’s release point in fact varying more on the road. Presumably, finding the right mix of parameters would generate consistent estimates for home and away data and alleviate this difficulty.

Discussion

This choice for the algorithm is just one of many possible configurations relative to each stage. The parameters left to the user are, with the values used here in brackets, (i) the number of pitches in the rolling average [75 pitches], (ii) the distance between separate clusters in time [0.75 feet], (iii) the cutoff for including data in the cluster after spatial clustering [0.05 probability of belonging to the largest cluster], and (iv) the initial distance to perform the temporal and spatial clustering [55 feet]. The goal of this is to form a single algorithm that performs well given any data set from PITCHf/x. However, this proves difficult due to the variability in the data from pitcher to pitcher, especially in the away data. As it stands now, the algorithm appears to work well provided the user takes the time to examine the results and reconfigure the parameters as needed, as with the above discussion of Kluber. When such care is taken, the results can become more consistent, with both of Kluber’s largest clusters in 2013 estimating a release distance of 54 feet nine inches.

Another difficulty that arises from the data sometimes straying too far from the bivariate normal assumption is that the algorithm finds nearly one-dimensional clusters, which causes problems since the covariance matrix of such a cluster is ill-conditioned. To curtail this, the maximum number of allowable spatial clusters occasionally has to be restricted to stop the algorithm before it reaches this point.
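To make the ill-conditioning concrete, the Python sketch below computes the condition number of a 2x2 covariance matrix (the ratio of its largest to smallest eigenvalue), which grows without bound as a cluster flattens toward a line. Flagging clusters this way is an alternative guard shown only for illustration. The article's approach is to cap the number of spatial clusters, and the threshold of 1e4 here is an arbitrary assumption.

```python
from math import sqrt


def condition_number(cov):
    """Largest / smallest eigenvalue of a 2x2 covariance (sxx, sxz, szz)."""
    sxx, sxz, szz = cov
    trace = sxx + szz
    det = sxx * szz - sxz * sxz
    root = sqrt(max(trace * trace - 4.0 * det, 0.0))
    small = (trace - root) / 2.0
    large = (trace + root) / 2.0
    return float("inf") if small <= 0.0 else large / small


def is_nearly_one_dimensional(cov, limit=1e4):
    """True when the cluster is so elongated that inverting cov is unreliable."""
    return condition_number(cov) > limit
```

A round cluster has a condition number near 1, while pitches strung out along a line give a very large value.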

Future work on this may include analysis of the aforementioned Dickey and Hernandez PITCHf/x data sets, as well as new versions of the algorithm that are less hands-on. Also, the y-location of the release point can be compared to pitcher height to get another measure of consistency, operating under the assumption that height correlates with a release point closer to home plate. As a preview, we can test this for the 2014 home data of the 10 pitchers previously analyzed:

The blue line is the best fit line through the data for all 10 pitchers, and the green corresponds to the case where Dickey and Hernandez are omitted. The Pearson correlation coefficients for these cases are -0.286 for the former and -0.738 for the latter, which appears promising provided the estimates of the two outliers can be improved.

Posted below are links to several Pastebin pages which contain the code used for the release point, 2D clustering, and heat map algorithms. All are written in R and use calls to a MySQL database to get the PITCHf/x data. Parts may need to be adapted to match with another database, but feel free to use or modify any of the files posted.

References & Resources