Introducing Moss: Metric Over Significant Sample

I will be writing a series of these posts as I develop this approach, so this first one will be a bit rough, but it should give you an idea of what I am attempting to accomplish. Over the past year or more I have been contemplating issues with "modern" baseball, and I believe I can add more information on one of these issues: microanalytics, aka head-to-head matchups.

Many times you see fans or members of the media (less so nowadays, as knowledge of statistics becomes more common) quote a player's numbers against a specific pitcher. Unfortunately, this information is usually useless due to sample size issues: someone going 6 for 20 against a guy does not tell us as much as we want it to. So what is the solution? Grouping, or clustering, similar pitchers together. When I pitched in college, whether during the season or in summer ball, I would come across pitchers who pitched similarly to me, and I would ask my teammates how different it was to hit off me versus them. Invariably they would say that, yes, certain pitchers profile similarly, whether it is release point, extension, velocity, or just the generic "how the ball looks." This is why I am creating MOSS.

MOSS stands for Metric Over Significant Sample. The goal is simple: group like pitchers together, then extrapolate those head-to-head matchups with certain hitters, so instead of saying a hitter is 6 for 20 against a certain guy, you can say a hitter is 100 for 400 against this cluster of pitchers. That is much more informative.
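The aggregation step is just a group-and-sum. Here is a minimal sketch with made-up matchup numbers (the column names are illustrative, not the actual Statcast schema):

```python
import pandas as pd

# Hypothetical head-to-head results for one hitter: one row per pitcher,
# each pitcher already assigned to a cluster. All values are invented.
matchups = pd.DataFrame({
    "pitcher": ["A", "B", "C", "D"],
    "cluster": [3, 3, 3, 7],
    "hits":    [6, 40, 54, 2],
    "at_bats": [20, 160, 220, 9],
})

# Individual matchups are tiny samples; pooling by cluster fixes that.
by_cluster = matchups.groupby("cluster")[["hits", "at_bats"]].sum()
by_cluster["avg"] = by_cluster["hits"] / by_cluster["at_bats"]
print(by_cluster)  # cluster 3 pools to 100 for 400 (.250)
```

The 6-for-20 matchup against pitcher A becomes part of a 100-for-400 sample against cluster 3, which is the whole point.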

Methodology

The methodology of MOSS is relatively simple, and for this first run relatively elementary (more on that later). I pulled down a 40,000-pitch sample from the end of last year's Statcast data with the help of Bill Petti's amazing baseballr package. If you are looking to get into sabermetrics, it is a MUST. I selected certain input variables, normalized them, and then used k-means clustering to get my groupings. Qualifier: this is very elementary and there is a lot more to be done, so the method may seem rather simplistic at the moment. To choose the number of clusters I used the elbow method, as seen below:
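For anyone following along, the normalize-then-cluster step might look like the sketch below. The synthetic array stands in for the Statcast pull, and the feature set (release point, extension, velocity, etc.) is an assumption on my part:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for the pitch sample: rows are pitches, columns are the chosen
# inputs (e.g. release point x/z, extension, velocity). Synthetic data.
X = rng.normal(size=(2000, 5))

# Normalize so no single variable dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: track within-cluster sum of squares (inertia) across k.
wcss = []
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(km.inertia_)
# In practice you would plot range(2, 15) against wcss and look for the bend.
```

Standardizing first matters because k-means is distance-based: an unscaled velocity column (in the 70s–100s) would swamp a release-point column measured in feet.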

As you can see, it is tough to tell where the elbow (the bend) actually is. The final number of clusters will take some trial and error, and I will have to use WCSS and the silhouette score to help optimize it. So for this first run I simply selected 10 clusters for this 40,000-pitch sample. Update: right after I wrote this draft I ran a loop to compute the silhouette score for every k up to 30 clusters. k = 2 scored the highest, which does not help us, but k = 10 scored nearly as well as k = 2, so it looks like 10 may have been the best selection for this run. Again, more work to be done.
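A loop like the one in that update might look like this (again with synthetic data standing in for the scaled pitch features; note the silhouette score is bounded between -1 and 1):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(800, 4)))  # fake features

# Score every candidate k from 2 up to 30 clusters.
scores = {}
for k in range(2, 31):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```

A caveat worth knowing: silhouette almost always favors small k on data without sharp cluster structure, which is exactly the "k = 2 wins but does not help us" result described above.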

All in all, I was pleased that the initial run did not weight any variable too heavily. You can see the distribution of the clusters below (reminder: Python is zero-indexed, so the clusters run from 0 to 9):

The smallest cluster had 452 pitches, just over 1% of the sample, and the biggest held 20%. I would ideally like these to be a little more evenly distributed, so that is something I will look into; however, this may simply be how the data falls.

The two graphs below show the distribution of release point (on the x-axis) and of release speed for each cluster. My goal was a fairly even distribution across these two variables for every cluster; otherwise you could simply use release point or velocity alone as an indicator of hitter effectiveness against a grouping of pitchers, which would defeat the entire purpose of this analysis.
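A quick numerical version of the same sanity check: if the clusters were really just velocity bins, the within-cluster spread of release speed would collapse relative to the overall spread. This sketch uses synthetic pitch-level data with assumed column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Stand-in pitch-level table: a cluster label plus the two variables from
# the graphs. Values are synthetic, column names are assumptions.
df = pd.DataFrame({
    "cluster": rng.integers(0, 10, size=5000),
    "release_pos_x": rng.normal(-1.5, 1.0, size=5000),
    "release_speed": rng.normal(90, 5, size=5000),
})

# Ratio near 1 means each cluster spans the full velocity range;
# a ratio near 0 would mean the clusters are just velocity bins.
overall_sd = df["release_speed"].std()
within_sd = df.groupby("cluster")["release_speed"].std()
ratio = within_sd / overall_sd
print(ratio.round(2))
```

The same ratio computed for release point would complete the check the two graphs are making visually.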

So as you can see, each cluster has a relatively even distribution across each of these input variables, which I am happy with… for now.

Future State aka "To Do"

The next steps are rather predictable. I have amassed a database of about 7.7M pitches going back to 2008 (again, thank you Bill Petti). After this relatively successful first run, I will use that full sample for the next steps. I will also apply the aforementioned techniques to further optimize the algorithm, including re-running the elbow method, and there will be more feature engineering and perhaps some variable selection. I will also use these clusters to back-test actual performance against predicted performance. Since this analysis was done at the pitch level, there will be some weighting involved: a pitcher can be 70% in one cluster, 20% in another, and 10% in a third, so a batter-vs-pitcher projection has to take that weighting into account. I will be calculating mOBA: what a hitter did, and is expected to do, against a certain cluster, which hopefully predicts individual matchups. The idea is to use this as an aid in deciding when (and when not) to pinch hit, when to rest a player, and when to give a player that spot start.
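The weighting step is just a blend of cluster-level numbers by the pitcher's pitch mix. Here is a minimal sketch; the 70/20/10 split comes from the example above, while the mOBA values and the exact blending rule are my assumptions, not the final method:

```python
# Fraction of the pitcher's pitches that fall into each cluster (from above).
pitcher_cluster_share = {3: 0.70, 5: 0.20, 8: 0.10}

# Hypothetical mOBA for one hitter against each of those clusters.
hitter_moba_vs_cluster = {3: 0.320, 5: 0.280, 8: 0.410}

# Expected matchup mOBA: weight each cluster-level number by pitch share.
expected_moba = sum(
    share * hitter_moba_vs_cluster[c]
    for c, share in pitcher_cluster_share.items()
)
print(round(expected_moba, 3))  # 0.70*0.320 + 0.20*0.280 + 0.10*0.410 = 0.321
```

In a real pinch-hit decision you would compute this for every available hitter against the pitcher on the mound and compare.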

The biggest thing I wanted to do here is introduce this work and get feedback on it. I believe microanalytics is the next step in sabermetrics. It pairs with my belief that, with all of the shifting, we need to re-evaluate how certain positions are constructed: I firmly believe teams will be able to "hide" pitchers in certain positions, and that certain positions should be profiled differently. Range is not nearly as important as it once was, in my opinion. I also believe that, through the lens of AI, CNNs will let you diagnose pitcher fatigue much more effectively as mechanics start to break down. Those are just a few other ideas.

To sum up: this approach performs admirably, but there is a lot more to be done. Many times in industry my final product bears only a scant resemblance to my initial work, which will most likely be the case here, as there is still feature engineering to do, but the process is set. I will post progress updates each weekend and ideally have this ready to go by the beginning of the season.