In the multibillion dollar world of sports entertainment, we often think of injuries as being chance events. I set out to see whether the statistical richness of baseball could be mined to identify players at risk of injury. Some baseball pitchers are paid on the order of a million dollars per game, so the consequences of an injury and a subsequent trip to the Disabled List are immense. Although professional players are placed under a high level of medical scrutiny, I reasoned that the information encoded in performance statistics might add a useful leading indicator of injury risk to the medical toolbox.

Baseball is the data scientist’s dream sport, because nearly every aspect of the game is discrete and quantifiable. Even time, which in other sports goes according to the clock, in baseball is defined by innings, outs, and pitches. Even with all this quantification, it was first necessary for me to properly formulate the question. I chose a classical binary classification format: for each player in each game, I would label that game according to whether it preceded an injury for that player (1) or not (0). Then, I would aggregate the player’s statistics from preceding games and use those as features. The idea is thus that a coach, a member of the medical support staff, or even a player him- or herself could then enter the accumulated statistics on a given day (the “intervention point”) into my model and see the likelihood that playing on that day would precede an injury.

For a given game, the aggregate statistics from games preceding it are its feature values, and whether the player is injured immediately afterwards is its label.

Baseball fans will note from the statistics I have chosen in the example that I am focusing on starting pitchers. Other players have different statistics and would constitute an entirely separate machine learning problem. Pitchers are the most impactful choice for a first analysis anyway, both because they are often the most valuable players on a team and because the demanding nature of their task makes them highly susceptible to injury.

Having formulated a suitable question, the next step is data. Major League Baseball statistics are readily available, but records of injury events are harder to come by. Ultimately, I chose a list containing several thousand injury events from mlb.com’s transaction history. Each disabling injury results in a player being moved to the Disabled List, which is a transaction. Unfortunately, players being traded or moving up from the minor leagues are also transactions, so I used regex processing to generate a mostly clean list of about a thousand pitcher movements to the Disabled List. Spot checking revealed no irregularities; every event that passed through the regex filters was indeed injury-related.
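The regex filtering step can be sketched as follows. This is a minimal illustration, not the actual patterns or transaction format from mlb.com; the strings and the `15-day disabled list` pattern here are assumptions.

```python
import re

# Hypothetical pattern for Disabled List placements; the real transaction
# log would need several such patterns to catch all phrasings.
DL_PATTERN = re.compile(
    r"placed\s+.*\s+on\s+the\s+15-day\s+disabled\s+list", re.IGNORECASE
)

def is_dl_move(transaction: str) -> bool:
    """Return True if a transaction string looks like a Disabled List placement."""
    return bool(DL_PATTERN.search(transaction))

# Made-up transaction strings standing in for the scraped history.
transactions = [
    "White Sox placed RHP Jose Contreras on the 15-day disabled list.",
    "Yankees traded LHP John Doe to the Red Sox.",
    "Cubs recalled RHP Jane Roe from Triple-A Iowa.",
]
dl_moves = [t for t in transactions if is_dl_move(t)]
```

Trades and minor-league call-ups fall through the filter, leaving only the injury-related movements.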

Exploratory Data Analysis

It is usually a good idea to explore the data a bit. In my case, the well-structured nature of baseball and prior familiarity with the dataset had assured me that my data were relatively clean, so the most urgent question confronting me was whether game statistics in fact contained any predictive information at all in relation to injuries. I started with one of the simplest statistics of all: a player’s age at the time of the game preceding his injury (or non-injury). It seemed intuitive to me that older players would be more susceptible to injury; although in many careers, the early forties are a highly productive time, the extreme physical demands of baseball mean that few players can continue to perform at the professional level that long. Since injury is a failure mode associated with physical stresses, older players should have more injuries. Indeed, that is exactly what I saw (see the figure below): relative to the “not injured” events, “injured” events are right-skewed. The effect size is modest but, because of the large number of events, statistically significant at p < 0.0001.

The light blue bars are the distribution of ages in games that did not precede an injury event, the red bars are the distribution in games that did, and the dark blue fractional bars are the overlap of the light blue and the red.

Many other statistics have similar correlations with injury. One of the most predictive and interesting is “Innings Pitched”, which is related to how long a pitcher spends in a game and how many pitches they throw.

The light blue bars are the distribution of innings pitched in games that did not precede an injury event; the red bars are the distribution in games that did; the dark blue fractional bars are the overlap of the two. Note that the bins are not integer values of innings: since innings pitched is counted by the number of outs recorded when a pitcher leaves the game, there are twenty-eight possible values for innings pitched in a standard game. Plus, in five out of the 27,000 games examined, a pitcher pitched his entire nine innings and then continued pitching into extra innings!

Surprisingly, a high number of innings pitched is a predictor that a player is relatively safe from injury; I had expected that, to the contrary, a high number of innings pitched would constitute overwork and lead to eventual breakdown and injury. One possible explanation for this counterintuitive finding is that nagging undetected conditions that might eventually result in injuries impede performance in the games preceding an injury; poor performance in turn leads to the coach benching the player. It is not the case that injury causes a lower number of innings pitched in that game, because the aggregation window for the features is separated by a full played game from transfer to the Disabled List (see initial image).

Feature Engineering

To hone the predictive power of my features, first I generated new features by applying different aggregation windows: for each player, I created separate features for each performance metric for one game preceding the intervention point, for the average of seven games preceding the intervention point, and for the player’s entire career. I also created separate features for the percent deviation of each single game value and seven-game average value from the career total.
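The aggregation windows above can be sketched with pandas rolling and expanding means. The data and column names here are made up for illustration (innings pitched is written as a plain decimal rather than baseball’s outs-based notation); the one-game shift keeps the intervention-point game itself out of its own features.

```python
import pandas as pd

# Hypothetical per-game data for a single pitcher, in chronological order.
games = pd.DataFrame({
    "player": ["p1"] * 10,
    "innings_pitched": [6.0, 5.2, 7.0, 4.1, 6.2, 5.0, 6.1, 7.2, 3.0, 6.0],
})

g = games.groupby("player")["innings_pitched"]
# shift(1) excludes the current game, so each row sees only preceding games.
games["ip_last_game"] = g.transform(lambda s: s.shift(1))
games["ip_last7_avg"] = g.transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).mean()
)
games["ip_career_avg"] = g.transform(lambda s: s.shift(1).expanding().mean())
# Percent deviation of recent form from the career baseline.
games["ip_dev_pct"] = (
    100 * (games["ip_last7_avg"] - games["ip_career_avg"]) / games["ip_career_avg"]
)
```

The same pattern repeats for every performance metric, yielding the single-game, seven-game, career, and percent-deviation features.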

Second, it was necessary to decorrelate the features. It should come as no surprise that many of the statistics a player generates during a game are highly mutually correlated. For instance, a player with a high number of innings pitched will also have a high number of pitches, a high number of outs, and higher numbers of the various particular types of outs. To avoid the cardinal machine learning sin of fitting a multicollinear set of features, I normalized each feature to an appropriate reference feature. For instance, I divided the number of groundball outs by the number of outs, and the number of hits by the number of batters faced. This method not only reduced multicollinearity; it also gave me more meaningful features. “Fraction of groundball outs” contains more information about a pitcher’s style than the total number of groundball outs, which could be high either because the pitcher was in the game for a long time or because they frequently throw groundball outs.
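The normalization step is simple division of each counting stat by its reference stat; a sketch with hypothetical column names:

```python
import pandas as pd

# Made-up counting stats standing in for the real per-game features.
df = pd.DataFrame({
    "outs": [18, 21, 12],
    "groundball_outs": [9, 6, 4],
    "batters_faced": [25, 28, 17],
    "hits": [5, 9, 6],
})

# Rate stats carry style information and are far less collinear
# with time-in-game than the raw counts.
df["groundball_out_frac"] = df["groundball_outs"] / df["outs"]
df["hit_rate"] = df["hits"] / df["batters_faced"]
```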

Additionally, I had one more aspect of pitchers’ performance that I wanted to account for: pitching style. The pages of baseball literature and commentary are filled with accounts of power pitchers, knuckleballers, sinkerballers, and more. For a relatively casual baseball fan like myself, it is difficult to draw consistent, distinct categories of pitching style from expert commentary or from the statistical data that I had already collected. And so I turned to natural language processing.

The player cards at BrooksBaseball.net contain descriptions of the types of pitches that a pitcher throws and how batters react to them. The stereotyped language is ideal for a first foray into natural language processing.

I located a reasonably complete database of the pitching styles of current pitchers and used standard techniques to treat the descriptions as bags of words, lemmatize them, and vectorize them. Having no strong preconceptions about how many pitching styles there might be, and given the limited time available, I turned to an extremely simple technique: K-means clustering.
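The vectorize-and-cluster pipeline can be sketched with scikit-learn. The descriptions below are made-up stand-ins for the BrooksBaseball.net player cards, and lemmatization (which needs an external tool such as NLTK) is omitted; the n-gram range matches the bigrams and trigrams discussed later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical player-card snippets in the stereotyped style described above.
descriptions = [
    "His sinker generates a very high number of groundballs.",
    "His fourseam fastball generates an extremely high number of whiffs and swings.",
    "His curve generates groundballs and relatively few flyballs.",
    "His slider has exceptional whiff rates on swings out of the zone.",
]

# Count single words, bigrams, and trigrams; drop English stop words.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vectorizer.fit_transform(descriptions)

# Start with two means, as in the analysis above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

Raising `n_clusters` to 3 reruns the same pipeline with a third mean.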

Initial natural language processing (NLP) of text descriptions of pitching styles yields two broad categories separated based on the frequency of their use of the indicated terms.

I projected the term frequency vectors I had created, which had a dimensionality on the order of the total number of terms present, onto a two-dimensional space using multidimensional scaling, which is meant to preserve the approximate relation of each of the pitcher descriptions to all of the others. Initially, I separated the descriptions into two means to see if there was any obvious topical difference between the terms associated with one of the means compared to the other. I did not see any, so I added in a third mean.
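The projection step looks roughly like this; random vectors stand in here for the real term-frequency matrix, and the dimensions are illustrative.

```python
import numpy as np
from sklearn.manifold import MDS

# Stand-in for ~40 pitcher descriptions over a vocabulary of ~500 terms.
rng = np.random.default_rng(0)
term_vectors = rng.random((40, 500))

# Multidimensional scaling tries to preserve pairwise distances between
# the high-dimensional description vectors in the 2-D embedding.
mds = MDS(n_components=2, random_state=0)
coords = mds.fit_transform(term_vectors)
```

The resulting 2-D coordinates are what get plotted, colored by K-means cluster.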

Adding a third mean does not change the relationship between the text feature vectors; it merely changes the classification of each vector.

Now, the means started to make sense. First, the observant reader will note that the third mean, colored purple, drew descriptions almost entirely from the population of descriptions that had previously been in the second mean. This indicates that the first two means were well separated in multidimensional space: if they had been close together or overlapping, we might have expected a third mean to draw points more equally from both initial means. Thus, we can be confident that we are separating the populations in a meaningful way.

Second, there is now a somewhat intuitive meaning to the terms associated with the three populations. The first has “flyballs”, the second has “groundballs”, and the third has “whiffs/swing” (a whiff is baseball slang for a swing-and-a-miss, which usually leads to a strikeout). Thus, we have means associated with the three different types of out in baseball. These term associations were robust across several of the top terms associated with each mean: for the groundball mean in particular, the top three terms all contained the word “groundballs”. A single word can appear in more than one top term because I set up the term frequency vectors to count bigrams (pairs of adjacent words) and trigrams as well as single words.

Modeling

Given the tight timeline, I chose a random forest as a quick preliminary method to build a base model. It offers several advantages for the problem at hand: 1) it does not require labor-intensive feature scaling; 2) it is robust to outliers; 3) it is sensitive to interactions between variables.
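A base model in this spirit takes only a few lines with scikit-learn. The synthetic data below stand in for the real per-game features and injury labels; the class imbalance is chosen to mimic the rarity of injuries.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: 2,000 games, 10 features, rare positive (injury) class.
rng = np.random.default_rng(0)
X = rng.random((2000, 10))
y = (rng.random(2000) < 0.05 + 0.1 * X[:, 0]).astype(int)

# Stratified split preserves the injury rate in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)

# Score with AUC rather than accuracy, for the reasons discussed below.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```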

Model Optimization

I optimized the random forest hyperparameters to maximize the area under an ROC curve, which has two characteristics that make it better than the accuracy score for this sort of situation: 1) the metric remains meaningful with greatly imbalanced datasets (and there are many more games preceding noninjuries in baseball than games preceding injuries), and 2) how a risk-predicting application may be used is not necessarily known before deployment: avoiding false positives may matter more than avoiding false negatives, or vice versa. The area under an ROC curve does not require me to know in advance where I will set the threshold for identifying players at risk of injury.

I began by optimizing the random forest’s hyperparameters. The hyperparameters I focused on were the number of features each decision tree could choose from at each step in its creation and the maximum depth of those trees, or the total number of features that could be used in the classification of a single point. I used a grid search to explore all possible combinations of low integer values for these two hyperparameters, settling on an optimum value of three to four features for each. I also optimized the number of decision trees in my random forest; although I saw little increase in performance beyond 300 trees, I settled on 1,000 because compute time was not limiting and having redundancy within the forest would not be expected to harm model performance. This range agreed with various sources of expert advice.
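A grid search over those two hyperparameters can be sketched as below; the synthetic dataset and the exact grid values are illustrative, though the grid mirrors the “low integer values” explored above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic data standing in for the real game/injury table.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Exhaustive search over low integer values, scored by ROC AUC.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [2, 3, 4, 5], "max_depth": [2, 3, 4, 5]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```

With the real data, the winning combination would then be refit with the full 1,000-tree forest.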

My ultimate area under the ROC curve for a withheld test set was 0.72: substantially below a perfect 1.00, but also substantially above the 0.50 that would be expected from random guessing. Considering that I did not know at the outset whether athletes’ statistical performance metrics would contain any injury-predicting information at all, I was very happy with this result!

The ROC curve compares the performance of the random forest model to random guessing.

Model Validation

A final, and critical, step in any machine learning project is to prepare the model and findings for presentation or deployment in a way that is useful and meaningful to the intended audience. In this case, I wished the project to be persuasive and usable by all athletes, both those with PhDs in mathematics and those struggling to complete high school. Having provided a nice statistical metric conveying my model’s performance, I thought it would be useful to audiences of all backgrounds to have a graphical representation of the model’s performance. I chose to investigate whether the model generated an uptick in injury risk scores for individual players in the games leading up to a stint on the Disabled List. I picked four random players and calculated their injury scores for each game in the season they got injured.

José Contreras’s injury scores for his 2008 season. The precipitous decline early in the season is an artifact arising from the model’s use of aggregate statistics from prior games, which early games in the season do not have.

All four indeed displayed high injury scores leading up to the injury; above is presented the one with the sharpest uptick. Two other players had similar but mildly noisier trends, while the fourth player had a consistently high injury score.

A Web Application for Public Interaction with the Model

More than arguing for the model’s validity, I wished to offer baseball enthusiasts and professionals of all stripes an easy way to use and understand the model. I thus deployed my model on an AWS instance using Flask and Green Unicorn, making it available to the public at http://www.baseballinjurypredict.tech/.
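A minimal sketch of such a Flask service is below. The route, feature handling, and stub model are my assumptions, not the deployed application’s actual code; in production the trained forest would be loaded at startup (e.g. with joblib) and the app served by Green Unicorn (gunicorn) rather than Flask’s development server.

```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

class _StubModel:
    """Stand-in for the trained random forest (normally loaded from disk)."""
    def predict_proba(self, X):
        # Always returns a 10% injury probability, for illustration only.
        return np.tile([0.9, 0.1], (len(X), 1))

model = _StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON object mapping feature names to values.
    features = request.get_json()
    score = model.predict_proba([list(features.values())])[0, 1]
    return jsonify({"injury_score": float(score)})
```

A coach’s browser form would POST its feature values to `/predict` and display the returned score.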

In this screen capture, after a player or coach has entered values for each of the features that the model uses, the model outputs an assessment of the player’s risk of injury. The “injury score” output by the random forest model is notionally the probability that a particular set of feature values indicates an impending injury (more precisely, the average of this probability across all of the decision trees in the forest), although depending on how one handles the class imbalance in the injury prediction problem, this interpretation is not necessarily correct.

Extracting Insightful Information from the Model

The final stage of a machine learning problem is to produce a clear, useful, and interpretable result. To avoid forcing baseball players and coaches to deal with the intricacies of random forest output, the web application I designed compares the injury score for a given player’s input to all of the scores in the database used for the modeling and outputs the player’s injury score percentile, which should be readily understandable to many people.
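The percentile lookup amounts to ranking the new score against the stored ones; a sketch with synthetic scores standing in for the modeling database:

```python
import numpy as np

# Hypothetical injury scores from the modeling database, sorted once.
historical_scores = np.sort(np.random.default_rng(0).random(10000))

def score_percentile(score: float) -> float:
    """Percent of historical scores at or below the given score."""
    rank = np.searchsorted(historical_scores, score, side="right")
    return 100.0 * rank / len(historical_scores)
```

A returned value of, say, 92 tells the user his entered statistics score higher than 92% of the games in the database.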

Some users may distrust what seems like a data science black box, so to provide more persuasive analysis and explanation, I also use nearest neighbors analysis to identify games similar to the user’s entered values. The application presents an equal number of similar games that did and did not precede injuries, with the idea that the user can evaluate by eye how close his own feature values are to those in each class; moreover, the nearest neighbors analysis offers some insight into which features may be driving the random forest’s output in this particular case.
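Querying each class separately guarantees the balanced presentation; a sketch on synthetic data standing in for the real feature table:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in: 1,000 games, 5 features, ~10% preceding an injury.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = rng.random(1000) < 0.1  # True = game preceded an injury

def similar_games(query, k=3):
    """Return the k nearest injury and k nearest non-injury games."""
    out = {}
    for label in (True, False):
        subset = X[y == label]
        nn = NearestNeighbors(n_neighbors=k).fit(subset)
        _, idx = nn.kneighbors([query])
        out[label] = subset[idx[0]]
    return out

neighbors = similar_games(rng.random(5))
```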

My modeling of pitcher injury risk raises several interesting questions. In particular, the anticorrelation between innings pitched and injury risk, which persists over all aggregation windows, is an intriguing finding that the expertise of professional athletes or coaches might be able to clarify. More importantly, it offers a proof-of-concept for the possibility of rationally weighing a player’s age and his recent and long-term performance characteristics to assess injury risk. Such modeling can help not just professional players, but also youth and amateur athletes across the globe without access to the same level of medical scrutiny as the professionals.