by Zachary Binney

Hundreds of analysts (a.k.a. football nerds) are rolling up their sleeves this fall to use NFL tracking data – which contains the locations of all 22 players plus the ball, measured 10 times per second – to try to predict how many yards a running play will gain as part of the NFL’s Second Annual Big Data Bowl.

While the tracking data allows for much more sophisticated analyses, two potentially valuable predictors for those not wanting to get their hands too dirty are the offense’s personnel and the number of defensive “men in the box” (MIB) close to the line of scrimmage. Indeed, within hours of the Big Data Bowl’s data being posted, Twitter user @deryck_cg1 had already investigated how rushing yards vary across all combinations of offensive personnel and MIB. There is a vigorous discussion in the NFL analytics community about the relative importance of these two variables in determining rushing success.

One tempting approach is to throw both into a regression model and see which one has the stronger association. While this is a fine approach for a prediction problem like the Big Data Bowl, it can cause major problems depending on the particular question you’re trying to answer. To understand why, you’re going to need a (basic) crash course in the field of causal inference. I’ll be gentle, I promise.

Note that this entire article is an oversimplification that omits a lot of important variables that can impact offensive and defensive personnel choices and rushing results and doesn’t use the best measure of rushing success. It is intended only as a tutorial that demonstrates how some seemingly correct analyses can be wrong. It is not a standalone comprehensive analysis. Also, the code used to produce all results in this article is available here, but the underlying data is proprietary. You can, however, apply this code to the Big Data Bowl data to conduct a similar analysis.

Situational Awareness

In order to help us sort out our thoughts about offensive personnel, MIB, and rushing success, we’ll begin by drawing something called a directed acyclic graph (DAG) to represent our situation. A “graph” here just means we’re drawing some points connected by lines; “directed” means those lines are arrows; and “acyclic” (no cycles) means you can’t follow the arrows to get back to where you started. The arrows mean that the thing at the tail of each arrow causes the thing at its head. If you can read a play diagram, you can read a DAG.

Here is a DAG diagramming our situation (graphic generated using dagitty):

DAG 1:

What’s going on here? Let’s start with the three individual arrows:

Offensive personnel has some direct effect on rushing success. This could be, for example, the effect of the offense having more or fewer blockers available to help on the play.

Offensive personnel impacts MIB because the defense reacts to what the offense shows, stacking the box or backing off.

MIB impacts rushing success. Fewer MIB means more rushing yards.

The three green arrows represent two (hypothesized) causal paths whereby offensive personnel impacts rushing success: a direct effect, and an indirect effect that “flows through” MIB. In other words, our diagram says offensive personnel impacts rushing success both by a.) altering the options and blockers for the offense (direct effect), and b.) altering the response of the defense in terms of MIB (indirect effect).

Another way to say the latter is that MIB mediates offensive personnel’s effect on rushing success.

Drawing a DAG is a helpful first step as it forces you to think about how all your variables relate to one another. As you’ll see below it can stop you from making some major analytical mistakes.
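To make the structure concrete, here is a minimal sketch of DAG 1 as plain Python data. The article’s figure was drawn with dagitty; the node names and helper functions below are my own shorthand, not from the original code.

```python
# DAG 1 as a set of (cause, effect) edges. Node names are my own shorthand.
edges = {
    ("personnel", "MIB"),         # the offense's personnel changes the defense's MIB
    ("personnel", "rush_yards"),  # direct effect: more or fewer blockers
    ("MIB", "rush_yards"),        # fewer men in the box -> more yards
}

def is_acyclic(edges):
    """The 'acyclic' in DAG: a topological sort must be able to consume every node."""
    nodes = {n for e in edges for n in e}
    remaining = set(edges)
    while nodes:
        sources = nodes - {child for _, child in remaining}
        if not sources:
            return False  # every remaining node has a parent: there is a cycle
        n = sources.pop()
        nodes.discard(n)
        remaining = {e for e in remaining if e[0] != n}
    return True

def causal_paths(start, end):
    """Enumerate directed paths from start to end by depth-first search."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    stack, found = [[start]], []
    while stack:
        path = stack.pop()
        if path[-1] == end:
            found.append(path)
            continue
        for nxt in children.get(path[-1], []):
            stack.append(path + [nxt])
    return found

print(is_acyclic(edges))  # True: you can't follow the arrows back to where you started
print(causal_paths("personnel", "rush_yards"))  # the direct path and the MIB-mediated path
```

Enumerating the paths recovers exactly the two causal routes described above: `personnel -> rush_yards` (direct) and `personnel -> MIB -> rush_yards` (indirect).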

Questions We Could Ask

There are three types of questions we can ask of any dataset: descriptive, predictive, and causal. Here are examples of each for NFL rushing success:

Descriptive: What is the distribution of rushing yards on plays out of 11 personnel? 22 personnel? Against 6 MIB? 8 MIB?

Predictive: How many yards am I likely to gain in the future if I run 11 personnel against 6 MIB? 11 personnel against 8 MIB? 22 personnel against 6 MIB? This is the type of question being asked in the Big Data Bowl.

DAGs don’t matter here. We just use whatever data we have on hand to make the best predictions we can.

Causal: How many more or fewer yards can I expect to gain on this play if I run it out of 11 rather than 12 or 13 personnel?

DAGs are critical here to ensure we don’t mess up our analysis.



Let’s tackle each of these questions in order.

Descriptive: Visualizing the Data

This is always a good place to start. Let’s look at the actual distribution of rushing yards in various situations. We used data from Football Outsiders, Sports Info Solutions, and ESPN on 17,622 non-scramble rushes from 2016-18: plays outside the opponent’s 30-yard line, in the first through third quarters (excluding the final 2 minutes of the first half), on first through third downs with 2+ yards to go, and with five offensive linemen.

Figure 1. Distributions of rushing yards by offensive personnel grouping.

Figure 2. Distributions of rushing yards by MIB.

There are modest differences in yards per carry (YPC) under different offensive personnel (12 and 13 personnel yielded about 4.3 YPC, 22 personnel 4.6 YPC, and 11 and 21 personnel 4.8 YPC). There is a somewhat clearer drop for each additional MIB (5.8/4.9/4.5/4.3 YPC for <=5/6/7/8+ MIB).
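For readers following along with the public Big Data Bowl data, the descriptive step is just a grouped average. Here is a toy sketch; since the article’s data is proprietary, the rows and field names (`personnel`, `mib`, `yards`) below are invented for illustration.

```python
from collections import defaultdict

# A handful of made-up plays standing in for the proprietary dataset.
plays = [
    {"personnel": "11", "mib": 6, "yards": 8},
    {"personnel": "11", "mib": 7, "yards": 3},
    {"personnel": "12", "mib": 7, "yards": 4},
    {"personnel": "12", "mib": 8, "yards": 2},
    {"personnel": "22", "mib": 8, "yards": 5},
    {"personnel": "22", "mib": 8, "yards": 4},
]

def ypc_by(plays, key):
    """Average yards per carry (YPC), grouped by one column."""
    groups = defaultdict(lambda: [0.0, 0])
    for p in plays:
        g = groups[p[key]]
        g[0] += p["yards"]
        g[1] += 1
    return {k: round(total / count, 2) for k, (total, count) in groups.items()}

print(ypc_by(plays, "personnel"))  # {'11': 5.5, '12': 3.0, '22': 4.5}
print(ypc_by(plays, "mib"))
```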

Predictive: Combined Regression Model

In predictive modeling we use whatever data we have on hand to predict what will happen in the future. So a model with offensive personnel and MIB at the snap is fine for predicting that play’s rushing success. Here are some predictions for average YPC on future first-and-10 plays using a (Bayesian) lognormal regression model (fit using the brms package in R):

Figure 3. Predicted average YPC by offensive personnel and MIB.

Figure 3 shows the distributions of predicted average YPC from our model. For example, for first-and-10 with 11 personnel vs. 6 MIB we would estimate about 4.85 YPC, with a 90% credible interval from 4.73-4.98 YPC (meaning we are 90% sure the true average YPC is in this range).
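The article’s model is a Bayesian lognormal regression fit with brms in R. As a rough, non-Bayesian analog of the same idea, here is a numpy-only sketch that simulates some data and fits ordinary least squares on log-yards; every variable name and effect size below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Simulated stand-in for the proprietary play data; all effect sizes are invented.
heavy = rng.integers(0, 2, n)                # 1 = heavier personnel, 0 = 11 personnel
mib = 6 + heavy + rng.integers(0, 2, n)      # defense stacks the box against heavy looks
log_yards = 1.5 + 0.05 * heavy - 0.10 * (mib - 6) + rng.normal(0, 0.5, n)

# OLS on log-yards: a crude frequentist analog of the article's lognormal model.
X = np.column_stack([np.ones(n), heavy, mib])
beta, *_ = np.linalg.lstsq(X, log_yards, rcond=None)

def pred_ypc(heavy_flag, mib_val, sigma=0.5):
    """Predicted average YPC: the mean of a lognormal is exp(mu + sigma^2 / 2).
    sigma is assumed known here (0.5, the simulation's true value) for simplicity."""
    mu = beta @ np.array([1.0, heavy_flag, mib_val])
    return float(np.exp(mu + 0.5 * sigma**2))

print(round(pred_ypc(0, 6), 2))  # e.g. 11 personnel vs. 6 MIB
print(round(pred_ypc(0, 8), 2))  # same personnel, two more men in the box
```

As in Figure 3, the fitted model predicts lower average YPC as MIB rises with personnel held fixed.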

Here MIB is pretty clearly the stronger predictor of rushing success, and offensive personnel has a much smaller association. But what if we tweak our question slightly to something more valuable: what would happen if we change offensive personnel?

Causal: Offensive Personnel and Rushing Success

There is a theory that switching to (lighter, pass-friendly) 11 personnel could be more successful than (heavier, run-friendly) 12, 13, 21, or 22 personnel if it tricks the defense into expecting a pass.

So our head coach comes to us and asks, “Should we run out of 11 personnel more?” Coach is asking about the causal effect of changing offensive personnel. If we do X (change offensive personnel), will it change Y (YPC)?

In our model above we found that, against 6 MIB on first down, 11 personnel runs gained on average 0.13 more yards (90% credible interval -0.05 to +0.31 more yards) than 12 personnel runs. This difference is quite small. (The results were similar for 11 vs. 13 personnel.)

It is also the wrong answer.

Why?

We “controlled” or “adjusted” for – that is, removed the effect of – MIB by including it in our regression model. In DAG terms, here’s what happened:

DAG 2:

That box around MIB indicates that we controlled for it in our regression. That box “blocked” any effect of offensive personnel on rushing success that flows through MIB, so we only see its (smaller) direct effect – the single green arrow and causal path in DAG 2.

Another way of thinking about this is that we only see the effect of offensive personnel “conditional on” some specific number of MIB. For example, is running out of 11 personnel or 12/13 personnel more successful when facing 6 MIB? Or 8 MIB?

This doesn’t answer our coach’s question! It answers what would happen if we run out of 11 personnel but the defense doesn’t change its MIB. It doesn’t account for the full effect that offensive personnel has on rushing success because it explicitly ignores that offensive personnel can alter MIB. It’s an underestimate.

What Would Be Better?

Simpler is actually better here. If we run a model without MIB we will see offensive personnel’s total effect – a combination of its direct and indirect effects, represented by all the green arrows in DAG 1. In other words, we will get the total effect of having fewer blockers but also possibly tricking the defense.

Such a model shows that, against 6 MIB on first down, 11 personnel runs gained on average 0.43 more yards (90% credible interval +0.27 to +0.59 more yards) than 12 personnel runs. Whether switching to 11 personnel nets you 0.43 or 0.13 additional YPC is a major difference! (We see similar results for 11 vs. 13 personnel.)

To help us further understand the difference between the two models take a look at the brief animation below, which shows our estimated average YPC for 11, 12, and 13 personnel in the models that do and don’t control for MIB.

Figure 4. Predicted average YPC for 11, 12, and 13 personnel, controlling or not controlling for MIB (GIF).

In the model that controls for MIB there is only a modest difference between 11 and 12/13 personnel in terms of YPC. But when we don’t control for MIB, there’s suddenly a big gap! That’s because we include offensive personnel’s effect on MIB only in the latter model.

The difference between these two models is also an example of mediation analysis. By looking at the effect of offensive personnel on rushing success with and without controlling for MIB, we can understand how much of the effect of offensive personnel on rushing success flows through MIB – in other words, how strong a mediator MIB is. The answer: the vast majority of the difference in rushing success across different offensive personnel groupings is due to changes in MIB, and offensive personnel appears to have a smaller direct effect on rushing success, if any.
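The two-model comparison can be demonstrated with a quick simulation in which personnel has a small direct effect on yards (+0.1) plus a larger indirect effect through MIB (+0.4). All numbers below are invented; the point is only that including the mediator recovers the direct effect while omitting it recovers the total effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# Toy data: lighter personnel slightly helps directly AND pulls a man out of the box.
light = rng.integers(0, 2, n)              # 1 = lighter 11 personnel, 0 = 12/13
mib = 7 - light + rng.integers(0, 2, n)    # defense backs off against a light look
yards = 4.0 + 0.1 * light - 0.4 * (mib - 6) + rng.normal(0, 3, n)

def ols(covariates, y):
    X = np.column_stack([np.ones(len(y))] + covariates)
    return np.linalg.lstsq(X, y, rcond=None)[0]

direct = ols([light, mib], yards)[1]  # controlling for MIB: direct effect only
total = ols([light], yards)[1]        # leaving MIB out: total (direct + indirect) effect
print(round(float(direct), 2), round(float(total), 2))  # roughly 0.1 vs. 0.5
print(round(float((total - direct) / total), 2))        # share mediated by MIB, roughly 0.8
```

The gap between the two coefficients is the indirect effect flowing through MIB, and their ratio is a crude measure of how strong a mediator MIB is.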

The results are even starker if we compare 11 personnel to 21 and 22 personnel. In the model that controls for MIB, 11 personnel looks worse – if we hold MIB constant, runs out of 22 have a higher YPC than runs out of 11. Duh, more big guys is better. In the model that doesn’t control for MIB, however, rushes are more successful from 11 personnel than 22 personnel!

Figure 5. Predicted average YPC for 11, 21, and 22 personnel, controlling or not controlling for MIB (GIF).

All that said, sometimes you only want an exposure’s direct effect, meaning you actually want DAG 2. Consider the example of evaluating offensive line play. “Offensive line skill” has some direct impact on rushing yards as well as an indirect effect flowing through MIB, since defenses may choose to stack the box against better run-blocking lines. Controlling for MIB would be wise if you want to estimate the true skill of an offensive line independent of its effect on MIB – that is, how one line compares to another when both face 6 MIB. But it would not be if you want to estimate that line’s total value, which should incorporate any effect it has on defenses choosing to stack the box.

Now, are you ready for one more twist?

What if I Want to Know the Effect of MIB Instead?

A defensive coach might ask “If we put another man in the box, will that help us stop the run?” Here, interestingly, you would want the first model we ran with both offensive personnel and MIB – not one with just MIB. Here’s the DAG for this situation:

DAG 3:

Notice MIB and offensive personnel switched places because MIB is now what we want to change – our exposure, as epidemiologists call it. Offensive personnel is no longer part of how MIB affects rushing success (a mediator) – it’s what’s called a confounder instead: it causes both the thing we’re interested in changing (MIB) and the outcome we want to measure (rushing yards).

Think of it this way: 7 MIB might look worse than 6 MIB just because 7 MIB is used more against heavier offensive personnel. The heavier personnel is said to confound the true effect of that extra MIB. If the 7 MIB caused the heavier offense, that would be part of its effect – there would be a causal path of forward-pointing green arrows in DAG 3 that we would want to include in our analysis. But we’re assuming it didn’t – MIB is merely a response to offensive personnel, creating a red “non-causal” path that we want to “block.” A “non-causal” path is basically one that starts by going backwards up, rather than forwards down, an arrow.

A model with just MIB would give us a combination of the green path we want and the red path we don’t. Even if MIB had no effect on rushing success (no green arrow from MIB -> rushing success), this red path would make them appear associated unless we block it. So we need to control for offensive personnel to “block” the red “non-causal path” from MIB to rushing success to get a clear picture of what would happen if we added an extra MIB, holding offensive personnel equal.
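A quick simulation makes the confounding point vivid. In the toy data below, MIB has no true effect on yards at all; heavier personnel both draws an extra man into the box and gains fewer yards (all numbers invented). The crude MIB-only model still shows a negative association – the open red path – while controlling for personnel recovers the true effect of zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
# Toy data in which MIB has NO true effect on yards.
heavy = rng.integers(0, 2, n)                    # confounder: 1 = heavier personnel
mib = 6 + heavy + rng.integers(0, 2, n)          # defense reacts to the personnel
yards = 4.5 - 0.5 * heavy + rng.normal(0, 3, n)  # note: mib does not appear here

def ols(covariates, y):
    X = np.column_stack([np.ones(len(y))] + covariates)
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols([mib], yards)[1]            # crude model: the red non-causal path is open
adjusted = ols([mib, heavy], yards)[1]  # controlling for personnel blocks that path
print(round(float(naive), 2))     # spuriously negative: an extra MIB "looks" effective
print(round(float(adjusted), 2))  # close to the true effect of zero
```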

This post is already too long, but a great resource for mediation versus confounding in DAGs is Dr. Miguel Hernan’s online course.

Are Offensive Personnel or MIB More “Important”?

MIB has the bigger effect in a model with both variables; end of story, right?

Yes and no. MIB has a stronger association with rushing success, and most of offensive personnel’s effect is due to how it changes MIB.

But.

We also must consider whether we have a predictive or causal question, our target audience (important for whom?), and – if our question is causal – how modifiable each variable is.

To the offense, offensive personnel’s effect is more important because it’s what they control – even if any effect is largely through its impact on MIB. For the defense, MIB is more important because it’s what they control (and it has the stronger association).

If you simply want to predict – but not change! – rushing success, MIB is also more important because it’s the stronger predictor. But if you want to talk about how to change rushing success – for example advocating for more 11 personnel or not – you’re back in causal territory and need to be more careful.

If you’re interested in learning more about the differences between predictive and causal (a.k.a. explanatory) modeling, a long but pretty readable paper is the classic “To Explain or to Predict?” by Galit Shmueli.

Another Example from Public Health

We all agree smoking raises your risk of lung cancer. But how? Among other things, the chemicals in cigarette smoke can cause mutations in the DNA of your lung cells, making them cancerous. The actual DNA mutation – if we could measure it – would show a stronger association with lung cancer than smoking does, simply because it’s closer in the “causal chain.”

So which is more important? The cancerous mutation is the immediate cause of the cancer and the stronger predictor of it, but we can’t snap our fingers and prevent that mutation. We can, however, stop people from smoking. In that sense smoking matters more despite “cancerous mutation” having a stronger effect.

Also consider what would happen if we looked at smoking while “adjusting” for cancerous mutations, such as by including both in a regression model. We would see no effect of smoking “conditional” on whether you did or did not have a cancerous mutation. That is, in those who did not have a cancerous mutation, whether they smoked would appear to have no effect on their risk of cancer. Same for those who did. But that does not mean that smoking is unimportant.

In this analogy offensive personnel is smoking, the lung DNA mutation is MIB, and cancer is rushing success. (You should pass more.)

Caveats

This post oversimplifies the game to focus on the main point: how to approach predictive versus causal questions. We did not consider any factors like player skill and team tendencies that would have substantial impacts on both offensive personnel and MIB choices. Nor did we consider different measures of offensive personnel, game flow, or any feedback between offensive and defensive choices.

We also know that all yards are not created equal – this is the whole idea behind DVOA. This analysis would be better with a metric, such as DVOA or EPA or WPA, that better captures the true value of each running play. But this article just uses YPC to keep things simple – again, it’s intended mainly as a tutorial, not a comprehensive analysis – and to be consistent with the Big Data Bowl, which asks participants to predict yards gained.

Finally, because rushing yards are right-skewed, average YPC isn’t the best metric to use – some form of quantile regression, or simply comparing the full predictive distributions, would be better. The latter is what the Big Data Bowl is smartly asking participants to do, but I’ve just presented YPC here for simplicity.

All analyses are adjusted for down-and-distance, though those variables are not shown in the DAGs.

Conclusions

The take-home message here is to think carefully about the exact question – descriptive, predictive, or causal – you’re trying to answer and choose a suitable approach. Write it down. If it’s a causal question, identify the thing you want to change (exposure) and draw a DAG to make sure you do your analysis correctly – in most cases, controlling for confounders and avoiding mediators.

So should you throw both offensive personnel and MIB in a model and see what happens? It depends on your question! If you simply want to predict rushing success: sure. If you want to estimate the total effect of switching offensive personnel: no.