As different as ecology and the NFL sound, they share quite similar problems. The environment is an infinitely complex system with many known and unknown variables. The NFL is a perpetually changing landscape with a revolving door of players and schemes. Predicting an athlete’s performance pre-draft is complicated by a number of contributing variables, including combine results, college production, intangibles, and how well that player fits a certain NFL scheme. Perhaps the techniques that ecologists use to discern confounding trends in nature are suitable for challenges like the NFL draft.

My Ph.D. research area is aquatic ecotoxicology, where I primarily model chemical exposure hazards to fish. Essentially, I use the best available data and methods to quantify how much danger a fish may be in, in a given habitat. Chemical exposures occur in infinitely complex mixtures across many different environments, and distinguishing trends in such dynamic situations is difficult.

Prospective draftees are actually similar (in theory) in that they are always a unique combination of their college team, inherent athleticism, history, intangibles, and even the current landscape in the NFL. The myriad of variables present in the environment and the NFL, both static and changing, make it difficult to separate the noise from actual, observable trends.

In environmental science, we sometimes use non-traditional methods to help us visualize what previously could not be observed. Likewise, the “analytics community” tries to answer questions that traditional methods cannot. Although I am still a novice in this realm, I hope to educate others about the utility of ecological tools, namely Principal Component Analysis (PCA), in assessing NFL draft prospects.

The purpose of Principal Component Analysis (PCA) is to represent a data set containing many variables with a much smaller number of composite variables, or principal components. Think of the QB Rating (QBR). It is a composite variable in that it incorporates a number of other variables (Completion %, Yards, TDs, etc.). PCA differs in that it does not decide in advance which variables go into the principal component or how heavily each one counts. Instead, it picks out the strongest co-variation among the variables, that is, the combination that explains the most variance between the sample units (i.e., players).

PCA can be useful in teasing apart what separates players in the NFL or college. By performing a PCA, you can get a sense of how similar each player or prospect is to the others. When you do this in a historical sense, you may see similarities between Pro Bowlers…or draft busts. By deconstructing the PCA, you will see which measurements are highly correlated with the principal components (the composite variables). Maybe 40 yard dash time or the vertical leap is highly correlated, and bench press is not (which is generally true).

I use the R statistical software to perform my PCA and highly recommend R for statistical work in general. I will outline an example of how I use PCA to assess the draft potential of defensive ends.
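To make the idea of a composite variable concrete, here is a minimal sketch of extracting the first principal component. The analysis in this post was run in R; this is a plain-Python illustration, not that code. The six rows of numbers are made up, the function names are hypothetical, and the sketch assumes each variable is standardized first (the post does not say whether that was done). It finds the leading component with simple power iteration, where a statistics package would use a full eigen-decomposition.

```python
import math

# Made-up values for six imaginary players: 40 time (s), vertical (in), college sacks.
players = [
    [4.60, 36.0, 28.0],
    [4.72, 34.5, 22.5],
    [4.85, 32.0, 18.0],
    [4.68, 35.0, 30.0],
    [4.95, 30.5, 12.0],
    [4.78, 33.0, 20.0],
]

def standardize(rows):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of rows that have already been standardized."""
    n, p = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(corr, iterations=500):
    """Power iteration: unit-length weights and eigenvalue of the leading component."""
    p = len(corr)
    # Uneven starting vector, so it is unlikely to be orthogonal to the answer.
    w = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        v = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    eigenvalue = sum(w[i] * sum(corr[i][j] * w[j] for j in range(p))
                     for i in range(p))
    return w, eigenvalue

z = standardize(players)
weights, eigenvalue = first_component(correlation_matrix(z))

# Each player's PC1 score is a weighted sum of his standardized measurements.
scores = [sum(w * value for w, value in zip(weights, row)) for row in z]

print("PC1 weights:", [round(w, 3) for w in weights])
print("PC1 scores:", [round(s, 3) for s in scores])
```

The weights play the role of the fixed coefficients in a hand-built rating, except that here they come from the data. The overall sign of a component is arbitrary, so only the relative signs and sizes of the weights carry meaning.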

I collected combine data from http://nflcombineresults.com/ and college and professional statistics from http://www.sports-reference.com/ . The NFL data will represent our dependent variables (Y axis), or what we hope to eventually predict. The NCAA statistics and combine results will be our independent variables, which we use to predict NFL success. I was able to gather enough quality data for 82 defensive ends. I organized these data in Microsoft Excel and saved them as a CSV file, which I imported into R.
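As a rough illustration of that data layout, the sketch below reads a small CSV and separates the predictors from the value to be predicted. It is in Python rather than the R used for the analysis, and the column names and the three rows are invented stand-ins, not the real spreadsheet.

```python
import csv
import io

# Invented stand-in for the spreadsheet export; the real file had 82 defensive ends.
csv_text = """player,forty,bench_reps,ncaa_sacks,nfl_sacks_per_game
Player A,4.60,25,28.0,0.55
Player B,4.85,30,18.0,0.20
Player C,4.72,22,22.5,0.35
"""

id_column = "player"                      # hypothetical column name
response_column = "nfl_sacks_per_game"    # dependent variable: what we hope to predict

predictors, response = [], []
for row in csv.DictReader(io.StringIO(csv_text)):
    response.append(float(row[response_column]))
    # Everything that is neither the label nor the response is a predictor.
    predictors.append({name: float(value) for name, value in row.items()
                       if name not in (id_column, response_column)})

print(predictors[0])
print(response)
```

Keeping the NFL outcome out of the predictor set matters here, because only the NCAA and combine columns should feed into the PCA.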

I then played around plotting different variables against each other. For example, I plotted 40 yard dash and Bench Press reps against Career NFL sacks per game. You can see that 40 yard dash has a slight inverse relationship to career sacks per game, whereas bench press has virtually no relation. This is not surprising to most people even remotely knowledgeable about the NFL draft. Further, the 40 yard dash data is wedge shaped (grey shading). This suggests that 40 yard dash may cap the potential or ceiling of a pass rusher, but that there are clearly other variables present that limit NFL success.
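One way to put a number on a relationship like these is a correlation coefficient. The sketch below computes Pearson's r in plain Python. The values are made up purely to show the calculation and are not the 82-player data set, so the printed numbers say nothing about real prospects.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for six imaginary defensive ends.
forty = [4.60, 4.72, 4.85, 4.68, 4.95, 4.78]           # 40 yard dash, seconds
bench = [25, 30, 22, 27, 24, 29]                       # bench press reps
sacks_per_game = [0.55, 0.20, 0.30, 0.45, 0.10, 0.35]  # career NFL sacks per game

r_forty = pearson(forty, sacks_per_game)
r_bench = pearson(bench, sacks_per_game)
print("40 time vs sacks per game:", round(r_forty, 2))
print("bench reps vs sacks per game:", round(r_bench, 2))
```

A single coefficient cannot describe a wedge-shaped cloud, where the spread of outcomes changes along the x axis, so it complements a scatter plot rather than replacing it.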

Using only one variable at a time to predict a prospect’s NFL success does not exactly work, and frankly I don’t have the expertise to create an accurate and novel predictive metric. This is where ordination techniques, namely PCA, can be beneficial. The PCA that I will run will incorporate the following NCAA statistics and combine measurements into synthetic principal components:

Recall that a Principal Component is a synthetic variable, much like the QBR. It differs from QBR in that it looks for the variables that explain the most variance between players. So the principal components used here are composites of the variables that are statistically most important. Each original NCAA statistic or combine result has a specific loading, or correlation with the Principal Component. Here is what the loadings look like for Principal Component 1 (PC1):
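For readers who want the arithmetic behind a loading, here is a sketch in standard PCA notation. It assumes every variable was standardized before the PCA was run, which this post does not state. The symbols are introduced here for illustration: z_ij is the standardized value of variable j for player i, w_j is the weight of variable j in PC1 (the weights have unit length), lambda_1 is the variance of the PC1 scores, and t_i is player i's PC1 score.

```latex
t_i = \sum_{j=1}^{p} w_j \, z_{ij},
\qquad
\ell_j = \operatorname{corr}(z_j, t) = w_j \sqrt{\lambda_1}
```

Under that assumption, the loading of variable j is simply its weight rescaled by a constant, so ranking variables by loading and ranking them by weight give the same order.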

PC1 includes all of these statistics and measurements, but at varying degrees of importance. Whether the correlation is negative or positive is irrelevant at the moment; we are only concerned with the magnitude. Variables shaded in grey have a correlation of 0.20 or greater with PC1 and are really the only relevant measurements. Those not shaded are randomly occurring big plays with no significant explanation, along with the non-normalized combine numbers. Notice that all the weight-normalized combine numbers except for bench press are relatively strongly correlated. Likewise, tackles, sacks, and game experience are the most correlated NCAA stats. Logically, this makes sense. Athleticism seems to matter only within the context of the size of the athlete. Also, fluke plays in college (returns and TDs) don’t seem to matter much for a defensive end. So PC1 seems to be composed of measurements and stats that logically seem important for predicting a defensive end’s success in the NFL. But does it actually predict anything?

To test PC1 as a predictor of NFL pass rushing success, I plotted PC1 versus NFL Career Sacks per Game for these defensive ends.

Using PC1 as a predictor, we tightened the wedge-shaped spread of the data a little and clustered the players somewhat better than when we used only the 40 yard dash. But it is far from being able to predict anything confidently, and it is hardly impressive that PC1 was only marginally better than the 40.

Here it is important to note that PC1 only accounts for approximately 20% of all the variability between these prospects. That leaves 80% of the variance between the prospects to be explained. Career sacks per game is by no means a complete metric of NFL performance, and advanced metrics like Career AV (Sports Reference) would assuredly be more appropriate. Also, merely weight-normalizing combine numbers probably isn’t the best way to quantify combine performance. But perhaps by using metrics developed by the sports analytics community, or approaches similar to the one by Chase Stuart, in conjunction with ordination techniques, we may better separate the noise from the trends. I hope this helps some of you add a new tool to your belt. If you need any help getting started with PCA, feel free to email me at casan_scott@baylor.edu or casanscott@gmail.com.
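For reference, the "approximately 20%" figure is the kind of number that comes from the variances (eigenvalues) of the components. In standard notation, with lambda_k the variance of the k-th principal component and p the number of input variables (symbols introduced here, not taken from the analysis), the share explained by PC1 is:

```latex
\text{share explained by PC1}
  = \frac{\lambda_1}{\sum_{k=1}^{p} \lambda_k}
  \approx 0.20
```

If the variables were standardized before the PCA, which is an assumption here, the denominator is simply p, because each standardized variable contributes one unit of variance.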