1. Ball Possession Team: This binary feature captures whether the home or the visiting team has possession of the ball

2. Score Differential: This feature captures the current score differential (home – visiting)

3. Timeouts Remaining: This feature is represented by two independent variables – one for the home and one for the away team – and they capture the number of timeouts remaining for each of the teams

4. Time Elapsed: This feature captures the time elapsed since the beginning of the game

5. Down: This feature represents the down of the team in possession

6. Field Position: This feature captures the distance covered by the team in possession from their own yard line

7. Yards-to-go: This variable represents the number of yards needed for a first down

8. Ball Possession Time: This variable captures the time that the offensive unit of the home team is on the field

9. Ranking Differential: This variable represents the difference in win percentage between the two teams (home – visiting)
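To make the list above concrete, here is a minimal sketch of how one game state could be held as a flat numeric vector. The class name `GameState`, the field names, the units (seconds for the time fields) and the reading of field position as yards from the offense's own goal line are all assumptions made for illustration; the original does not specify an encoding. Note that the nine features yield ten variables, because timeouts are tracked separately for each team.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class GameState:
    """One snapshot of the game, taken before a play (hypothetical encoding)."""
    home_has_ball: int             # 1 if the home team has possession, 0 otherwise
    score_diff: int                # home score minus visiting score
    home_timeouts: int             # timeouts remaining, home team
    away_timeouts: int             # timeouts remaining, visiting team
    time_elapsed_s: float          # seconds since kickoff (unit is an assumption)
    down: int                      # current down of the team in possession
    field_position: int            # yards from own goal line (assumed reading)
    yards_to_go: int               # yards needed for a first down
    home_possession_time_s: float  # seconds the home offense has been on the field
    ranking_diff: float            # home win percentage minus visiting win percentage

    def to_vector(self):
        """Flatten the state into a list of floats, in field order."""
        return [float(value) for value in astuple(self)]


# Illustrative values only, not taken from any real game.
state = GameState(1, -3, 2, 3, 1800.0, 2, 35, 7, 850.0, 0.125)
vector = state.to_vector()
```

Keeping the state immutable and flattening it in one place means the same ordering is used both when training and when scoring a live game.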

The model itself is based on logistic regression. While for a game of infinite duration a linear model could be a very good approximation, the finite duration of the game creates non-linearities, especially towards the end of the game. For this reason, we use a logistic regression model for the first 57 minutes of the game, and a support vector machine model (with a radial kernel) for the last 3 minutes (the 57-3 split is arbitrary, but it can be evaluated if one wants). The raw output of the support vector machine classifier is not a probability, so we use Platt’s scaling to obtain a class probability. In a nutshell, iWinrNFL is depicted in the following:
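As a rough code companion to that description, the hand-off between the two models might be sketched as below. This is not the iWinrNFL implementation: the function names, the `svm_decision` callable standing in for a trained radial-kernel SVM, and all parameter values are hypothetical. Platt scaling is written in its usual form, P(y = 1 | f) = 1 / (1 + exp(A f + B)), where A and B are fitted on held-out data. Routing every state at or past minute 57 (including any overtime) to the SVM is also an assumption.

```python
import math

SWITCH_MINUTE = 57.0  # the 57-3 split from the text; the author calls it arbitrary


def logistic_probability(features, weights, intercept):
    """Standard logistic regression: sigmoid of a linear combination."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def platt_probability(svm_score, a, b):
    """Platt scaling: map a raw SVM decision value to a probability.

    `a` and `b` are the fitted sigmoid parameters (hypothetical here);
    `a` is normally negative so that larger scores give higher probabilities.
    """
    return 1.0 / (1.0 + math.exp(a * svm_score + b))


def home_win_probability(minutes_elapsed, features, lr_weights, lr_intercept,
                         svm_decision, platt_a, platt_b):
    """Use logistic regression early, and the calibrated SVM late in the game."""
    if minutes_elapsed < SWITCH_MINUTE:
        return logistic_probability(features, lr_weights, lr_intercept)
    return platt_probability(svm_decision(features), platt_a, platt_b)


# Toy usage with made-up parameters and a stand-in decision function.
early = home_win_probability(10.0, [1.0], [0.0], 0.0, lambda f: 5.0, -1.0, 0.0)
late = home_win_probability(58.0, [1.0], [0.0], 0.0, lambda f: 5.0, -1.0, 0.0)
```

In practice one would fit both models with a library rather than by hand; the point here is only the time-based routing and the shape of the calibration step.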

In order to train the model I collected all the play-by-play data from the past 8 regular seasons and extracted the features needed. I interacted directly with the NFL API, but an easier option is to use the nflscrapr library developed by the folks at the CMU sports analytics student club. This provided 2,048 regular season games and 338,294 snaps. To set up the training, we take every play and extract the state of the game at that point, i.e., the corresponding features. The dependent variable for that instance is then 1 if the home team eventually won the game and 0 otherwise. I have also included in the model three interaction terms between the ball possession team variable and (i) the down count, (ii) the yards-to-go, and (iii) the field position variables. This is crucial in order to capture the correlation between these variables and the probability of the home team winning: the interpretation of these variables differs depending on whether the home or the visiting team has possession of the ball, and the interaction terms allow the model to better distinguish between the two cases. The logistic regression standardized coefficients are shown in the following table.
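Before turning to the coefficients, the training-set construction described above can be sketched in a few lines. The dictionary keys, the function names and the toy data are hypothetical, and encoding each interaction as a plain product with the 0/1 possession indicator is an assumption; the text does not say how the interaction terms were coded.

```python
def build_row(play, home_won):
    """Turn one play into (features, label); the label is shared by every play of a game."""
    poss = play["home_has_ball"]  # 1 if the home team has the ball, else 0
    features = [
        poss,
        play["score_diff"],
        play["home_timeouts"],
        play["away_timeouts"],
        play["time_elapsed"],
        play["down"],
        play["field_position"],
        play["yards_to_go"],
        play["home_possession_time"],
        play["ranking_diff"],
        # Interaction terms (assumed to be simple products):
        poss * play["down"],
        poss * play["yards_to_go"],
        poss * play["field_position"],
    ]
    return features, 1 if home_won else 0


def build_training_set(games):
    """One training instance per play, labelled with the game's final outcome."""
    X, y = [], []
    for game in games:
        for play in game["plays"]:
            features, label = build_row(play, game["home_won"])
            X.append(features)
            y.append(label)
    return X, y


# Toy data: one game won by the home team, with two plays.
base = {"score_diff": 0, "home_timeouts": 3, "away_timeouts": 3,
        "time_elapsed": 120, "home_possession_time": 60, "ranking_diff": 0.1}
play_a = dict(base, home_has_ball=1, down=3, yards_to_go=4, field_position=62)
play_b = dict(base, home_has_ball=0, down=1, yards_to_go=10, field_position=25)
X, y = build_training_set([{"home_won": True, "plays": [play_a, play_b]}])
```

With this coding the three interaction columns are zero whenever the visiting team has the ball, so the model can learn separate effects of down, yards-to-go and field position for each side.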