2019 Improvements: New Pitch Identification & Misclassification

As noted above, pitchers are constantly tweaking their repertoires, and unfortunately, they don’t call us to let us know they are going to start throwing a new pitch the next day. We needed a way to catch these changes automatically so we could reduce the manual work that goes into spotting them.

To tackle this problem, we use a combination of supervised and unsupervised machine learning methods and some simple business rules to detect three types of discrepancies:

1) Has the pitcher added a new pitch to his arsenal?
2) Have we defined the pitcher’s arsenal incorrectly (e.g. did we label all sliders as cutters)?
3) Is a single pitch labeled incorrectly?

We will cover the first two below.

Pitch Arsenal Detection

We approach 1) and 2) similarly. As we started to test out our method, it surfaced the cutter Marco Gonzalez added at the beginning of 2018. He is a great example of how this identification process has helped us.

Here is his arsenal pre-2018:

Marco Gonzalez arsenal pre-2018

Supervised Methods

Our first goal was to build a classifier that could classify pitches at a general, league-wide level, since these new or mislabeled pitch types are not encoded in our player-level neural network. After exploring various methods, we settled on gradient boosting, specifically XGBoost.

We use various tracking metrics to help classify pitches including horizontal/vertical break, velocity, spin rate, spin axis and where the ball is released.

Using these tracking metrics alone gets us most of the way there, but we have a harder time classifying pitches from pitchers who deviate from the norm. For example, the league model might classify a 90 mph changeup from Jordan Hicks as a fastball, but we know he throws a 100+ mph fastball. To “personalize” these classifications without knowing the pitcher’s identity, we also add scaled versions of these measurements using rolling metrics. We solve the above example by adding a scaled pitch velocity, which expresses each pitch’s velocity relative to the pitcher’s own recent velocity range.

Now, given a scaled velocity of about 0.5, it is easier to determine that the 90 mph changeup is in the middle of Hicks’ velocity range, and thus, is probably not a fastball. We apply similar scalings to other metrics to define the periphery of a pitcher’s arsenal.
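The exact scaling formula is not reproduced here, so the following is only one plausible reading: a min-max scaling of each pitch against the pitcher's trailing window of pitches. The function name, the window size, and the velocities in the usage example are hypothetical.

```python
from collections import deque


def scaled_velocities(velocities, window=500):
    """Min-max scale each pitch against the pitcher's trailing window.

    Assumption: "rolling metrics" means the min and max over the most
    recent `window` pitches, including the current one.
    """
    recent = deque(maxlen=window)
    scaled = []
    for v in velocities:
        recent.append(v)
        lo, hi = min(recent), max(recent)
        # With no spread yet (e.g. the first pitch), fall back to the midpoint.
        scaled.append(0.5 if hi == lo else (v - lo) / (hi - lo))
    return scaled


# Illustrative velocities only: a pitcher ranging from 81 to 101 mph.
example = scaled_velocities([101.0, 81.0, 100.0, 91.0])
```

In this toy sequence the 91 mph pitch lands in the middle of the pitcher's own range, which is the signal the league model needs to avoid calling it a fastball.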

We perform other data manipulation tricks, such as normalizing all horizontal break and spin axis values to be right-handed. There is no need for the model to learn the differences between left- and right-handed pitchers when we can encode them ourselves.
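A sketch of the handedness normalization might look like the following. It assumes horizontal break is a signed value that mirrors across the vertical axis and that spin axis is measured in degrees from 0 to 360; real tracking systems use differing sign and angle conventions, so the mirroring would need to be adapted to the data at hand.

```python
def normalize_to_right_handed(horizontal_break, spin_axis_deg, throws):
    """Mirror a left-hander's pitch so it looks like a right-hander threw it.

    Assumptions: `throws` is "L" or "R", horizontal break is signed, and
    spin axis is in degrees on a 0-360 scale.
    """
    if throws == "R":
        return horizontal_break, spin_axis_deg % 360.0
    # Reflect across the vertical axis: flip the sign of the horizontal
    # movement and mirror the spin-axis angle.
    return -horizontal_break, (360.0 - spin_axis_deg) % 360.0


mirrored = normalize_to_right_handed(8.0, 150.0, "L")
unchanged = normalize_to_right_handed(-8.0, 210.0, "R")
```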

Many pitch types can be paired up since the differences are mostly naming conventions (sinker/two-seam fastball, knuckle curve/curve, splitter/changeup), so we group these pitch types together into one classification.
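The grouping itself can be as simple as a lookup table applied to labels before training. The two-letter codes below are common pitch abbreviations, but the specific codes and combined label names are assumptions for illustration.

```python
# Hypothetical mapping from raw pitch codes to grouped classes.
PITCH_GROUPS = {
    "SI": "SI / FT", "FT": "SI / FT",   # sinker / two-seam fastball
    "KC": "CU / KC", "CU": "CU / KC",   # knuckle curve / curve
    "FS": "CH / FS", "CH": "CH / FS",   # splitter / changeup
}


def group_pitch_type(label):
    """Return the grouped class for a label, or the label itself if ungrouped."""
    return PITCH_GROUPS.get(label, label)


grouped = [group_pitch_type(p) for p in ["FF", "FT", "KC", "CH", "SL"]]
```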

Now back to our example. After Gonzalez’s first start in 2018 we see that his arsenal looks a bit different. Note the small cluster of new pitches between 85 and 90 mph with little horizontal break and similar vertical break to his fastball.

Marco Gonzalez pitch distribution after first 2018 start

At this point in the process we have no assumptions about the pitch types in his arsenal. We take the set of unknown pitches and classify them with our generic classifier:

Marco Gonzalez’s classified arsenal using MLB league classifier

Unsupervised Methods

A classifier isn’t going to solve the problem by itself. The information we gained from the classifier only tells us that there might be 15 or so cutters mixed into his last 750 pitches. Without looking at the data manually, we wouldn’t know that these pitches form their own cluster. It could be that his fastball has more glove-side movement than most pitchers’ and we are mistakenly classifying that subset of fastballs as cutters.

Here is where unsupervised learning can help. To determine how many pitches he has, we use Gaussian mixture models (GMM) to cluster the pitches in the same 3D space as the plots shown above. Since this is an automated procedure, we can’t eyeball the plots to choose a good number of clusters (k). Selecting k can be ambiguous, and many approaches exist. We simply iterate through a range of pitch arsenal sizes and pick the one with the lowest Bayesian information criterion (BIC).
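The model-selection loop can be sketched with scikit-learn's GaussianMixture, whose bic method returns the criterion for a fitted model (lower is better). The range of k, the covariance type, and the synthetic three-pitch data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_best_gmm(X, k_range=range(1, 8), seed=0):
    """Fit one GMM per candidate arsenal size and keep the lowest-BIC model."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=3, random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model


# Toy data: three well-separated pitch clusters in
# (horizontal break, vertical break, velocity) space. Not real pitches.
rng = np.random.default_rng(0)
centers = [(-8.0, 15.0, 91.0), (2.0, 2.0, 82.0), (-12.0, 5.0, 81.0)]
X = np.vstack([rng.normal(c, 1.2, size=(150, 3)) for c in centers])

best = fit_best_gmm(X)
cluster_ids = best.predict(X)  # hard cluster assignment for each pitch
```

BIC penalizes each extra component by its parameter count, which helps keep the loop from splitting one real pitch type into several clusters.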

Applying this process to Marco Gonzalez’s pitches we get the following clusters:

Clusters identified by GMM in Marco Gonzalez’s pitches

OK, Gonzalez has 4 pitches. Now what? We take these cluster assignments, align them with the classification probabilities from our supervised model, and sum up the likelihoods in each cluster.

Pitch type likelihoods by cluster
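The aggregation step, together with the maximum-likelihood assignment it feeds, could be sketched as follows. The function names and the per-pitch probability dictionaries are hypothetical stand-ins for the classifier output; normalizing each cluster's totals into shares is an added assumption that makes clusters of different sizes comparable.

```python
from collections import defaultdict


def cluster_pitch_distributions(cluster_ids, pitch_probs):
    """Sum per-pitch class probabilities within each cluster, then normalize."""
    totals = defaultdict(lambda: defaultdict(float))
    for cid, probs in zip(cluster_ids, pitch_probs):
        for pitch_type, p in probs.items():
            totals[cid][pitch_type] += p
    dists = {}
    for cid, sums in totals.items():
        norm = sum(sums.values())
        dists[cid] = {pt: s / norm for pt, s in sums.items()}
    return dists


def label_clusters(dists):
    """Assign each cluster the pitch type with the largest share."""
    return {cid: max(d, key=d.get) for cid, d in dists.items()}


# Toy input: four pitches in two clusters, with made-up probabilities.
ids = [0, 0, 1, 1]
probs = [{"FF": 0.9, "FC": 0.1}, {"FF": 0.6, "FC": 0.4},
         {"FC": 0.8, "FF": 0.2}, {"FC": 0.7, "FF": 0.3}]
dists = cluster_pitch_distributions(ids, probs)
labels = label_clusters(dists)
```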

Each cluster now has a probability distribution over all pitch types, and we can simply assign each cluster the pitch type with the maximum likelihood (e.g. pitch 3 = FC, pitch 2 = CH / FS, etc.). In this case, the first cluster is slightly ambiguous. If you have ever used these types of methods, you know that they don’t work out this perfectly every time. Therefore, we implement some business rules to determine whether we should add clusters, which helps our recall of new or mislabeled pitches:

1) If a pitch makes up at least 40% of a cluster, add a cluster and assign it that pitch type.
2) If the total probability of a pitch across clusters sums to at least 50%, add a cluster and assign it that pitch type.
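The rules leave some room for interpretation, so the sketch below is only one reading of them. It assumes each cluster carries a normalized distribution over pitch types (shares summing to 1), treats "add a cluster" as adding that pitch type to the estimated arsenal, and takes the second rule to mean the sum of a pitch type's shares across clusters. The function name and the example shares are hypothetical.

```python
def apply_business_rules(dists, cluster_share=0.40, total_share=0.50):
    """Return the estimated arsenal after applying the two recall rules.

    `dists` maps cluster id -> {pitch type: share of that cluster}.
    """
    # Start from the maximum-likelihood pitch type of every cluster.
    arsenal = {max(d, key=d.get) for d in dists.values()}
    across = {}
    for d in dists.values():
        for pitch_type, share in d.items():
            # Rule 1: a pitch type that fills at least 40% of any cluster.
            if share >= cluster_share:
                arsenal.add(pitch_type)
            across[pitch_type] = across.get(pitch_type, 0.0) + share
    # Rule 2: a pitch type whose shares sum to at least 50% across clusters.
    arsenal.update(pt for pt, total in across.items() if total >= total_share)
    return arsenal


# Made-up shares: SI is a strong runner-up in cluster 0 (rule 1), and SL is
# spread thinly over all three clusters (rule 2).
example_dists = {
    0: {"FF": 0.55, "SI": 0.42, "SL": 0.03},
    1: {"CH": 0.70, "SL": 0.30},
    2: {"CU": 0.75, "SL": 0.25},
}
estimated = apply_business_rules(example_dists)
```

Both rules trade precision for recall: they surface pitch types that never win a cluster outright but still carry too much probability mass to ignore.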

Finally, we compare the current arsenal to our estimated arsenal and send an alert to prompt re-training if there are any differences.
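The final comparison reduces to a set difference. In this sketch the function names and the example arsenals are illustrative, not real pitcher data.

```python
def arsenal_diff(current, estimated):
    """Report pitch types that appear in only one of the two arsenals."""
    current, estimated = set(current), set(estimated)
    return {
        "added": sorted(estimated - current),    # detected but not on file
        "missing": sorted(current - estimated),  # on file but not detected
    }


def needs_retraining(current, estimated):
    """True when the estimated arsenal differs from the one on file."""
    diff = arsenal_diff(current, estimated)
    return bool(diff["added"] or diff["missing"])


# Illustrative arsenals: a cutter (FC) shows up in the estimate.
diff = arsenal_diff(["FF", "CH", "CU"], ["FF", "CH", "CU", "FC"])
alert = needs_retraining(["FF", "CH", "CU"], ["FF", "CH", "CU", "FC"])
```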

Technology & Automation

We code this process into Python modules and package it into an Airflow job. The job compiles reports and sends a daily email to alert us of any detected changes or errors. The reports have evolved to include links to generated plots (like those above) and options to ignore alerts for specific pitchers, which is very useful for pitchers like José Berríos, whose pitch movement doesn’t quite match the names they give their pitches.