Although the minutiae of baseball are frequently analyzed in painstaking detail, at some level it remains a simple game. When teaching the game to kids (or in my case, some European and Asian friends who have never seen the game before), we likely begin with an explanation like this: “The pitcher throws the ball to the batter, and he wants the batter to swing and miss at the pitch. The batter, meanwhile, wants to swing and make contact with the pitch.”

That’s really the essence of the game at its most basic level. The batter wants to hit; the pitcher wants him not to hit. But even beginning with such a simple premise, it’s already possible to ask questions that require considerable rigor to answer accurately. For example, a question like: On any given offering from a pitcher to a batter, how likely is the pitcher in question to record a swinging strike?

It’s that simple question — one that characterizes the most basic element of the game — that I’d like to answer in what follows. To do so, I’ll use PITCHf/x data from the past few seasons, and employ two different models: logistic regression and random forests.

If you’re reading this article, you’re likely familiar with the individual PITCHf/x statistics, so I won’t explain them here. Note that in all the PITCHf/x data we are analyzing, we will consider only pitches on which the batter offered at the pitch, and we are not considering the count in which the pitch is thrown. Regarding the logistic regression and random forests models, don’t worry if you’re not particularly familiar with them. All you need to know for now is that both of these models take continuous inputs (such as pitch velocity) and categorical inputs (such as pitch type), and then output the probability estimate between 0.0 and 1.0 that a particular binary event occurred (in this case, whether the batter missed or not). For more detailed information, I invite you to read the Wikipedia articles that are referenced.
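To make the idea concrete, here is a minimal sketch of a classifier that takes a continuous input and a categorical input and returns a probability between 0.0 and 1.0. The article's implementation is in R; this analogue uses Python and scikit-learn, and the data, values, and column names are invented purely for illustration.

```python
# Hypothetical sketch: a logistic regression mapping continuous inputs
# (pitch velocity) and categorical inputs (pitch type) to P(swing-and-miss).
# All values and labels below are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data: six swings; 1 = batter missed, 0 = batter made contact.
velocity = np.array([[95.0], [88.5], [92.0], [84.0], [97.5], [90.0]])
pitch_type = np.array([["FF"], ["SL"], ["FF"], ["CU"], ["FF"], ["SL"]])
missed = np.array([0, 1, 0, 1, 1, 0])

# One-hot encode the categorical input, then stack with the continuous one.
encoder = OneHotEncoder()
X = np.hstack([velocity, encoder.fit_transform(pitch_type).toarray()])

model = LogisticRegression().fit(X, missed)
probs = model.predict_proba(X)[:, 1]  # P(miss) for each pitch, in [0, 1]
print(probs)
```

The same structure applies to a random forest: swap in a different estimator, and the inputs and probability outputs stay the same.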

Why is this exercise useful? For one thing, strikeouts are very important for the pitcher, and 75 percent of strikeouts occur on swings and misses (as opposed to a called third strike). Strikeouts are one of the statistics that the pitcher controls (almost) entirely, and represent an important component of Fielding Independent Pitching (FIP) metrics. Conversely, avoiding swings and misses is very important for the batter. Once the batter puts the ball in play, anything can happen; Batting Average on Balls in Play (or BABIP) demonstrates that once a batter puts a ball in play, whether he gets a hit is highly dependent on luck.

Initial single-model logistic regression

Let’s start with the logistic regression model. We’ll randomly partition our data (about 1.2 million pitches) into an 80 percent training set and a 20 percent test set. The simplest possible operation is to create a single model out of all of the pitches in the training set, and then use that single model to predict the outcome of all the pitches in the test set. Due to the effects of this random partitioning, it’s wise to repeat the calculations several times using different random partitions, and average the results together (although with a large data set of 1.2 million pitches, the final result is unlikely to change very much).
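The repeated 80/20 partitioning can be sketched as follows. This is a Python/scikit-learn analogue of the R workflow, with randomly generated stand-in data rather than the actual PITCHf/x set.

```python
# Sketch of the repeated 80/20 random partition described above.
# X and y are random stand-ins for the PITCHf/x features and outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                          # stand-in features
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)   # stand-in outcome

accuracies = []
for seed in range(5):  # several different random partitions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

print(np.mean(accuracies))  # accuracy averaged across partitions
```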

We will use the convention of “success” and “failure” from the pitcher’s point of view. This means that a “success” is the batter missing, while a “failure” is the batter making contact. Note that we could have just as easily defined “success” and “failure” from the batter’s point of view, and we would have minimal changes in the problem and code. Since the model returns an output probability between 0 and 1, we will declare that the batter missed if the output probability is greater than or equal to 0.5, and that the batter made contact if the probability is less than 0.5. By using these steps, we get an accuracy of 81.0 percent.
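The thresholding step can be sketched in a few lines. The probabilities and outcomes below are invented; only the 0.5 cutoff logic matches the procedure described above.

```python
# Thresholding the model's output probability at 0.5, as described above.
# The probabilities and actual outcomes here are made up for illustration.
import numpy as np

probs = np.array([0.12, 0.55, 0.49, 0.80, 0.50])   # model outputs, P(miss)
actual = np.array([0, 1, 0, 1, 0])                 # 1 = batter missed

predicted_miss = probs >= 0.5   # declare "missed" iff P(miss) >= 0.5
accuracy = np.mean(predicted_miss == actual)
print(accuracy)  # → 0.8
```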

Result evaluation methodology

Not bad, perhaps? Surely, 81.0 percent is better than if we just flipped a coin to determine the result. But we have to look at the context of the situation. How often does a batter swing and miss? If you watch baseball regularly, then you know that most of the time the batter makes contact. In our data set, the batter made contact 79.6 percent of the time. So if we did no analysis at all, and just always guessed that the batter made contact (which is a terrible prediction model), we would be right 79.6 percent of the time. 81.0 percent accuracy for our model doesn’t sound quite as good now.

It is important to realize that there are actually two types of errors that we can make:

Predicting “missed” when the batter actually made contact

Predicting “made contact” when the batter actually missed

Since we are calculating the probability that the batter missed, if we correctly predict a miss, that is a “true positive,” and if we correctly predict contact, that is a “true negative.” Conversely, incorrectly predicting a miss (first bullet point) is a “false positive,” and incorrectly predicting contact (second bullet point) is a “false negative.”

Previously, we used a probability threshold of 0.5 to decide between “missed” and “made contact.” While that is often a convenient choice, it is definitely not the only choice possible, as we can choose any threshold between 0.0 and 1.0. If the data set is uneven like it is here (where the presence of one event is much more likely than the other), 0.5 will often be a poor choice. For each threshold, we can compute the true positive rate and false positive rate. We can put these pieces together in the Receiver operating characteristic (ROC) curve. This handy curve plots the true positive rate against the false positive rate. Here is the ROC curve for our first model:

Note that this is a monotonically increasing function. This means that as you increase the true positive rate (which is desired), the false positive rate will also increase (which is not desired). You have to make a compromise somewhere. According to this graph, if you want a true positive rate of 50 percent, then you will have a false positive rate of 20 percent. Note that if we always guessed “made contact” (a threshold of 1.0), we would be in the lower-left point, where true positive rate and false positive rate are both 0.0, and if we always guessed “missed” (a threshold of 0.0), we would be in the upper-right point, where true positive rate and false positive rate are both 1.0.

A common metric used to evaluate ROC curves is the area under the curve (AUC). This curve has an AUC of 0.688. Note that the AUC will range between 0.5 and 1.0; a perfect classifier will have an “L” shaped curve with an AUC of 1.0, and a curve generated by random guessing will be a straight diagonal line with an AUC of 0.5.
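Computing the ROC curve and its AUC from a set of predicted probabilities can be sketched as below. The labels and scores are synthetic; the point is only the mechanics of sweeping the threshold and summarizing the curve.

```python
# Sketch of computing an ROC curve and its AUC from predicted probabilities.
# The outcomes and scores are synthetic, constructed so that positives
# tend to receive higher scores than negatives.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
actual = rng.integers(0, 2, size=500)                         # 1 = missed
probs = np.clip(actual * 0.3 + rng.random(500) * 0.7, 0, 1)   # noisy scores

fpr, tpr, thresholds = roc_curve(actual, probs)
auc = roc_auc_score(actual, probs)
print(auc)  # 0.5 = random guessing, 1.0 = perfect classifier
```

As the article notes, the curve is monotonically increasing: each point trades a higher true positive rate for a higher false positive rate.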

Improving logistic regression

We should be able to do better than creating a single model for all the training data. Let’s partition the data up into smaller sets, each with similar attributes, and then build models out of each individual set. After some experimentation, I decided to partition among these criteria:


Pitch type

Batter’s handedness

Pitcher’s handedness

Region of plate crossing (5 horizontal zones and 5 vertical zones)

A model was created if we had at least 50 pitches of data with the appropriate criteria. A sample model would be on pitches that were “fastballs from a righty pitcher to a lefty batter in the upper-left square of the 5×5 plate region”. More generic models were also created that did not use region of plate crossing, and these were used in the prediction stage if we didn’t have at least 50 pitches of data for a particular set of criteria.
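The partition-with-fallback scheme can be sketched as follows. This is a simplified Python analogue of the R approach: the grouping keys, the 5-zone region, and the data are all invented, but the "use the specific model if it had at least 50 training pitches, else fall back to the generic one" logic matches the description above.

```python
# Sketch of per-partition models with a generic fallback, as described above.
# Grouping keys and data are invented; only the >= 50 pitch rule and the
# fallback logic follow the article's scheme.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
pitch_type = rng.choice(["FF", "SL", "CU"], size=n)
zone = rng.integers(0, 5, size=n)               # simplified plate region
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(size=n) > 1).astype(int)

models = {}
for pt in np.unique(pitch_type):
    sel = pitch_type == pt
    # Generic fallback model: pitch type only, ignoring plate region.
    models[(pt, None)] = LogisticRegression().fit(X[sel], y[sel])
    for z in range(5):
        sub = sel & (zone == z)
        if sub.sum() >= 50:     # only build a model with enough data
            models[(pt, z)] = LogisticRegression().fit(X[sub], y[sub])

def predict_miss(pt, z, x):
    # Use the specific model if it exists, else fall back to the generic one.
    model = models.get((pt, z)) or models[(pt, None)]
    return model.predict_proba(x.reshape(1, -1))[0, 1]

print(predict_miss("FF", 3, X[0]))
```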

When using this set of models for prediction, with a probability threshold of 0.5, we get a total accuracy of 82.7 percent, which is somewhat better than the 81.0 percent of the previous result. Here is a plot of the new ROC curve:

This ROC curve has an AUC of 0.774, which is a nice improvement over the 0.688 of the previous ROC curve. Visually, you can see that the “hump” of the curve extends further toward the upper-left corner, which means that you get a better true positive rate at the same false positive rate.

Picking an operating point on the ROC

Now that we have a nice ROC, let’s try to pick an operating point on this curve. The probability threshold will uniquely define a point on this curve. Recall that a threshold of 1.0 leads to the lower-left hand corner of the ROC, while a threshold of 0.0 leads to the upper-right hand corner of the ROC. While there is a consistent relationship between the threshold and the TPR/FPR (specifically, lowering the threshold will always move you towards the right-hand side of this graph), it is far from a linear relationship. With the previous threshold of 0.5, we would be operating at the point where FPR is 0.0353 and TPR is 0.278. This point is actually quite close to the lower-left hand corner.

There are many criteria that can be used to select an operating point, and they all have advantages and disadvantages. One criterion is Youden’s Index. Graphically, this index is the vertical difference between the ROC curve and the diagonal “random guessing” line. Intuitively, this is the “improvement” you get at a certain FPR by using a particular classifier compared to random guessing. In this ROC, the point with the maximum Youden’s Index is the one where FPR is 0.258 and TPR is 0.664, which yields an index value of 0.406. This corresponds to using a threshold of 0.207. Here is a new plot, with the 0.5 threshold point denoted with a red circle, the Youden’s Index optimal point denoted with a green circle, and the magnitude of the maximum Youden’s Index denoted with a green vertical line:

It turns out that the accuracy of using the Youden’s Index optimal point is 72.5 percent, which is lower than the previous accuracy, and even lower than just declaring “made contact” on all pitches. That being said, Youden’s Index (and also other types of criteria) is not designed to maximize accuracy, and as mentioned earlier, each criterion will have its own advantages and disadvantages. If your criterion is to maximize accuracy, then the point you would pick would be different from both the red circle and green circle. When analyzing performance, you often look at several different criteria.
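Maximizing Youden's Index over the thresholds on the curve can be sketched as below, again on synthetic data: compute TPR minus FPR at every candidate threshold and take the largest gap.

```python
# Sketch of picking the threshold that maximizes Youden's Index (TPR - FPR),
# the vertical gap between the ROC curve and the diagonal.
# The outcomes and scores are synthetic stand-ins.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
actual = rng.integers(0, 2, size=500)
probs = np.clip(actual * 0.3 + rng.random(500) * 0.7, 0, 1)

fpr, tpr, thresholds = roc_curve(actual, probs)
youden = tpr - fpr                 # index value at every threshold
best = np.argmax(youden)
print(thresholds[best], youden[best])  # chosen threshold, max index value
```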

Using random forests

Now let’s switch to random forests. Let’s start simple, and create a random forest using all the training data without any type of partitioning. Note that each individual decision tree in the random forest is a separate model, so we are actually creating an ensemble of models, as opposed to the logistic regression case, where we had only a single model. We’ll use a random forest of 300 trees. We get an accuracy of 82.3 percent (with a 0.5 threshold), an AUC of 0.757, and a maximum Youden’s Index value of 0.384, which is quite a bit better than the initial logistic regression result:
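Fitting the 300-tree forest can be sketched as below. This is a Python/scikit-learn analogue of the R call, on stand-in data; with fully grown trees, the predicted probability is essentially the fraction of trees voting “miss.”

```python
# Sketch of the unpartitioned 300-tree random forest described above.
# X and y are random stand-ins for the PITCHf/x features and outcomes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
probs = forest.predict_proba(X)[:, 1]  # ~fraction of trees voting "miss"
print(probs[:3])
```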

Now let’s partition the data in the same manner as before. This yields an accuracy of 82.3 percent (with a 0.5 threshold), an AUC of 0.758, and a maximum Youden’s Index value of 0.379. This is no better than the initial random forests result, and is slightly worse than the partitioned logistic regression result:

Partitioning often doesn’t help with random forests, because the random forest algorithm already captures conditionality that other algorithms don’t. In general, one algorithm is not always going to outperform another algorithm consistently, and it is up to the judgment of the analyst to determine which algorithm to use and how to structure the data for training.

I also want to point out that it was relatively easy to switch to using random forests after finishing the logistic regression code implementation in R. It was a one-line change in each place where I generated a model, a one-line change in each place where I processed the prediction, and a few lines where I tallied the results (because the two algorithms return their results in a slightly different format). Note that in both algorithms, the code to create a model and to process the prediction were one-line calls to a library function. Some very smart people spent a lot of time writing these library functions so that analysts like you and me wouldn’t have to implement them ourselves; it is up to us to have the knowledge and wisdom to know how to harness the power of these library functions.

Optimizing calculations

Let’s backtrack a step, and revisit our decision on the number of trees to create. In short, the random forest algorithm randomly creates a certain number of decision trees, each of which determines success or failure, and the final probability estimate is the percentage of trees that votes for success. In using 300 trees (which was basically a guess for a starting point), the algorithm took a much longer processing time in the training phase than the logistic regression algorithm. Could we have done as good a job with using fewer than 300 trees? Fortunately, R’s random forest library provides a handy function to answer that question. Here is a plot of the error vs. number of trees for the initial unpartitioned random forests case:

The black line is the “out-of-bag” error rate (which is a fancy term for overall error estimate), the red line is the error rate for predicting contact, and the green line is the error rate for predicting swings and misses. All three of these lines start plateauing after about 50 trees, which suggests that we probably could have gotten about the same accuracy by using fewer than 300 trees.

We are most concerned with getting a low OOB error rate. Increasing the number of trees helps us achieve that. However, note that increasing the number of trees will increase the green “swings and misses” line. This phenomenon is common in unbalanced data sets such as this one. Also note that OOB error is not exactly the same error as the error associated with the accuracy metric we previously used, but in general it can be treated in the same manner.
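The error-vs.-trees analysis can be approximated outside R as well. The sketch below grows a scikit-learn forest incrementally (via `warm_start`) and records the out-of-bag error at each size, on stand-in data; R's random forest library produces this curve directly.

```python
# Sketch of tracking out-of-bag (OOB) error as trees are added, analogous
# to R's error vs. number-of-trees plot. Data are random stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)

# warm_start=True keeps existing trees and adds new ones on each fit.
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                random_state=0)
oob_errors = []
for n_trees in (25, 50, 100, 200, 300):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_errors.append(1.0 - forest.oob_score_)

print(oob_errors)  # typically plateaus after a modest number of trees
```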

If we wanted to see how the number of trees impacts Youden’s index values or other criteria, that analysis can be done separately.

The most important PITCHf/x variables

So far, we have used several different models to predict swinging strikes, but we haven’t explicitly identified which variables are the most important PITCHf/x statistics that most strongly correlate with swinging strikes. R also provides a convenient function to answer this question. Here is a plot resulting from that analysis on the initial unpartitioned random forests model:

The two graphs show the top seven most important variables according to the accuracy and Gini impurity metrics. Similarly to the situation we had in picking an operating point on the ROC, there is no single definition of an “optimal” classification tree, let alone an “optimal” random forest. A random forest that is “accurate” (meaning it includes the top variables for reducing classification error) may not be good according to other metrics. Many data scientists prefer using the Gini impurity metric over accuracy, for reasons that are out of the scope of this article.
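Extracting and ranking importances can be sketched as below. scikit-learn's built-in `feature_importances_` uses the mean decrease in Gini impurity; the feature names and the data (constructed so that the two location features carry the signal) are invented for illustration.

```python
# Sketch of ranking variable importances from a fitted random forest,
# using the Gini impurity metric. Feature names and data are invented;
# the outcome is built to depend mostly on the two location features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
features = ["plate_x", "plate_z", "velocity", "spin_rate"]  # illustrative
X = rng.normal(size=(1000, 4))
y = (np.abs(X[:, 0]) + np.abs(X[:, 1])
     + 0.3 * rng.normal(size=1000) > 2).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(features, forest.feature_importances_),
                key=lambda fi: fi[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```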

That being said, the top two variables in both cases are Plate_Z (vertical location of the pitch as it crosses the front of the plate) and Plate_X (horizontal location of pitch as it crosses the front of the plate). Intuitively, this makes a lot of sense; batters will often make contact with pitches in the middle of the plate, and they will often miss on pitches way out of the strike zone. However, you should not be misled into believing that a good strategy for the pitcher is to always throw pitches way out of the strike zone, since batters are probably very unlikely to swing at those pitches.

It should be straightforward to do similar analysis for the logistic regression models.

Further improvements

Here are some things we could try that could make our model better:

Partition on individual pitchers and/or batters – Of course, some pitchers will be better at inducing swinging strikes, and some batters will be better at avoiding swinging strikes. In the above analysis, while individual player data was used in the model, it turned out not to be a significant factor. If we partitioned on players, we would have to be careful with players with few pitches of data (perhaps to handle that case, we would fall back to a more generic model).

Weigh errors – We are treating the two types of errors similarly. We could weigh the errors to penalize one type of error more heavily.

Balance the training set – This would also be done using weights.

Try different random forest parameters – We used all the default parameters. It could be useful to try different minimum leaf sizes, for example.

Include count – We did not include the count in developing the models. With less than two strikes, a foul ball is as good as a swinging strike for the pitcher (or perhaps better, when you factor that it could be caught by a fielder for an out). Also, the mindset of the batter changes with two strikes, as he will try to protect the plate. By going in this direction, we would be changing the problem from “predicting swinging strikes” to “predicting generation of strikes.”
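Two of the improvements above, weighting one error type more heavily and balancing the training set, can both be expressed through class weights. The sketch below uses scikit-learn's `class_weight` parameter on stand-in data; the 3-to-1 penalty ratio is an arbitrary example.

```python
# Sketch of weighting errors and balancing an unbalanced training set via
# class weights. Data are random stand-ins; the 3-to-1 ratio is arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(size=1000) > 1.2).astype(int)  # unbalanced

# "balanced" reweights each class inversely to its frequency, so the rarer
# swing-and-miss class counts more heavily during training.
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Explicit weights penalize misclassifying a miss (class 1) 3x as much.
weighted = LogisticRegression(class_weight={0: 1.0, 1: 3.0}).fit(X, y)

print(balanced.predict(X).mean(), weighted.predict(X).mean())
```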

Conclusion

Predicting swinging strikes is no easy matter, but it is an important task to do. In the modern sabermetric landscape, where every last detail is analyzed, we should not forget about the fundamental concepts. Pitchers who generate a higher percentage of swinging strikes are more valuable, and likewise, batters who make contact on a higher percentage of their swings are also more valuable. A good GM would be wise to stock up on these players, in order to maximize his team’s chances of winning games.

With the analysis that we’ve done, we should be proud of the results that we’ve achieved. Our results are much better than simply guessing or using our “gut feel.” We have shown that there are multiple methods to perform and evaluate the calculations, each with its own benefits and drawbacks, and that the most difficult part of the job is not in writing the R code, but rather in planning the analysis and interpreting the results.

I suspect that by adding some of the techniques in the “Further improvements” section, we can do a little better, but I bet that even the best known result is still far from a 100 percent accuracy, a 1.0 AUC, and a 1.0 Youden’s Index. Like all good open problems, we will probably keep discovering better solutions well into the future.