Exploratory Data Analysis

Before we start building the model, we need to understand both the response and predictor variables. The most effective way to do this is through visualisations, and what better package to use than ggplot2!

This section contains some Exploratory Data Analysis (more commonly known as ‘EDA’), a natural first step in any data analysis project. It is an extremely important step that lets the analyst inspect the data: the shape of the target and predictor variables, whether there are correlations between variables that may be included in a predictive model, and whether there are outliers or missing data. You get the point: EDA is very important!

Well, there is no missing data in the dataset being used… that has all been cleaned in the pre-processing stage that can be found in this R code.
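A claim like "no missing data" is cheap to verify before moving on. The sketch below is only an illustration: the games data frame and its values are made up, with column names borrowed from this post, and it uses base R rather than the actual pre-processing code.

```r
# Toy stand-in for the cleaned game-level data (values are invented)
games <- data.frame(
  attendance   = c(31000, 45250, NA, 28400, 60125),
  max_temp     = c(18.2, 21.5, 16.0, NA, 19.8),
  rivalry_game = c("No", "Yes", "No", "No", "Yes")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(games))
print(missing_per_column)

# Keep only rows with no missing values, as na.omit() does further down
games_complete <- na.omit(games)
print(nrow(games_complete))
```

If every count comes back as zero, na.omit() leaves the data untouched.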

The Response Variable — Attendance

Let’s look at our response variable — or the thing we’re trying to predict — attendance at AFL premiership season games.

We can see that the attendance figures are somewhat positively skewed. We’ll see if the few higher attendance games cause any issues later on. This variable may need to undergo some form of transformation, but more on that later.

The median attendance at AFL games during the regular season is just under 32,500 for the 1,340 games played since 2013.

Some Key Predictors

library(dplyr)

# Keep the candidate features (each column listed once) and drop incomplete rows
data_for_model <- afl_premiership_season %>%
  select(
    season, attendance, team1, team2, round, venue, game_time,
    home_median_experience, home_mean_experience,
    away_median_experience, away_mean_experience,
    home_300, home_milestone, away_300, away_milestone, count_milestones,
    rainfall_clean, min_temp, max_temp, last_five_days_rain, temperature_diff,
    HomeOddsOpen, AwayOddsOpen, HomeLineOpen, AwayLineOpen, odds_diff,
    rivalry_game, HomeTeamFav, team1_last_result, team2_last_result, split_round,
    home_team_finals_last_season, away_team_finals_last_season,
    season_stage, temperature, IsHomeMilestone, IsAwayMilestone, last_results,
    home_team_state, away_team_state, is_same_state,
    home_odds_change, away_odds_change, home_line_change, away_line_change,
    second_home_game, home_season_points, home_score_for, home_percentage,
    home_ladder_pos, away_wins_last_three, away_season_points, away_score_for,
    away_percentage, away_ladder_pos, teams_in_eight, finals_last_season,
    home_wins_last_three
  ) %>%
  na.omit()

As was evident in the first post of this series, there was noticeable variation in attendances between the teams, and also whether teams won or lost their previous game. Additionally, the first post showed that the home team’s favouritism status also seemed to have a relationship with attendance.

Categorical Features:

Some of the key categorical variables created that we would expect to have an impact on crowd figures are displayed below.

Some observations:

There appears to be a relationship between the venue the game is played at and the attendance. Stadiums classified as Other include secondary home grounds and some of the ovals not used in 2019. There were 238 of the 1,340 games at these stadiums

Whether either the home or away team played finals last year seems to have a weak relationship with attendance. It may be useful to include in our model

When the game is played and attendance appear to be related. Weekday Afternoon and Friday Afternoon are high because of ANZAC day clashes between Essendon and Collingwood no doubt. Friday and Sunday night games seem to draw better crowds. Could be a useful predictor

There appears to be some variance in attendance figures between the different rounds during a season. Could be a useful predictor

Games classed as “Rivalry Games” clearly draw larger attendances than non-rivalry games. Could be a useful predictor

Whether the round was a split round (where some teams have a bye, or week off) or not doesn’t appear to be related to attendances. Might not be a useful predictor

There appear to be slightly higher crowds when the home team has a player playing a milestone game (a 100th, 200th, 250th or 300th game), but the relationship certainly doesn’t appear strong. Might not be a useful predictor

When the two competing teams are from the same state, crowds appear to be higher. Could be a useful predictor

As expected, games where the home team plays at its second home ground (Hawthorn and North in Tasmania, Melbourne, Western Bulldogs in Darwin, etc) draw lower crowds. This is probably more to do with lower stadium capacities, but may still be a useful feature
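The observations above come from plots, but the same comparisons can be made numerically by summarising attendance within each level of a categorical feature. A minimal sketch in base R, using a made-up games data frame (the values are invented; only the column names follow this post):

```r
# Toy data: attendance for rivalry and non-rivalry games (invented values)
games <- data.frame(
  rivalry_game = c("Yes", "No", "No", "Yes", "No", "No"),
  attendance   = c(62000, 28000, 31000, 55000, 24000, 35000)
)

# Median attendance for each level of the categorical feature
medians_by_group <- aggregate(attendance ~ rivalry_game, data = games, FUN = median)
print(medians_by_group)
```

The median is used here for the same reason as elsewhere in the post: it is less affected by a handful of very large crowds.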

Numerical Features:

To explore the relationship between the numerical features and attendance, Pearson’s correlation will be used.

Using dplyr to select just the numerical features and then calculating the correlations on the converted matrix yields the below correlations with attendance.

numeric_features <- data_for_model %>% select_if(is.numeric)

# Correlation of every numeric feature with attendance, largest first
numeric_features %>%
  as.matrix() %>%
  cor() %>%
  .[, "attendance"] %>%
  sort(decreasing = TRUE)

While all correlations appear to be fairly weak, the following observations can be made:

The average experience (home_mean_experience) for the home team, calculated as the average games played by each player at each game they play in, yields the highest Pearson correlation with attendance of 0.264

The home team’s percentage (scores for / scores against) also has a correlation above 0.2, at 0.236

The number of wins the home team had in their last three games had a Pearson correlation of 0.175, while the correlation of away team’s average playing experience was 0.159

Strangely, the home team’s ladder position had a negative correlation, with a Pearson correlation of -0.273. The away team’s ladder position also had a negative correlation of -0.150

Betting data shows some relationships; as HomeOddsOpen increases, attendance tends to decrease, which is somewhat expected. The odds_diff variable (the difference between the home team’s opening odds and the away team’s) also has a weak negative relationship (correlation -0.168)

Weather data shows some relationships. As expected, the maximum temperature and rainfall, both on the day and accumulated over the last five days, have a negative relationship with attendance (-0.091, -0.081 and -0.65 respectively)

Building the model

In any project that aims to build a machine learning model to predict some response variable, it’s important to hold out a subset of your data when training the model. The two resulting sets are generally referred to as the training and test sets. The training set is what the model is fitted on; predictions are then made on the test set, data the model hasn’t “seen”, to best mimic real-life use.

In this case, the model will be trained on data from the 2013 to 2018 seasons inclusive (afl_pre_2019), with predictions then made on 2019 data that the model hasn’t seen (afl_2019). This is done to check that the model doesn’t “overfit” the training data. Models that overfit generally perform worse on unseen data.

Things that were attempted but not included in the model

The following features were thought to have some validity but when run through the model were found to have no predictive power:

The number of wins the teams had in their last three games didn’t contribute

The rainfall on the day had no predictive power (however the total rainfall for the five days leading up to game day did)

Milestone games made the model perform worse

What’s missing / would be good to have

There is some information that wasn’t available publicly (or I didn’t try to get) which may prove handy in any future modelling work. These include:

Ticket pricing for games during each season to explore the effect ticket pricing has on attendance

Whether tickets were given away to fans for games and whether that had an impact on attendances

Any social media posts / newspaper text analysis regarding teams / topical matters involving players playing in games, etc

# Training set: 2013 to 2018 seasons
afl_pre_2019 <- data_for_model %>% filter(season < 2019, season >= 2013)

# Test set: 2019 season
afl_2019 <- data_for_model %>% filter(season == 2019)

Baseline Model

Before building a model to predict games, we want a baseline model to be able to compare to. The temptation would be there to use the average crowd for each game as a baseline, but the different venue capacities make this less than relevant. As a result, to calculate the baseline model, I will use the median attendance for each venue (taking the average of the venues labelled “Other” as a whole), for each home-away team combo. For example, Collingwood(H) vs Essendon(A) is different to Essendon(H) vs Collingwood(A). The median is used because of the slightly skewed nature of AFL crowds.

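# Median historical attendance for each home team, away team and venue combination
median_attendances <- afl_pre_2019 %>%
  group_by(team1, team2, venue) %>%
  summarise(baseline_attendance = median(attendance)) %>%
  ungroup()

# Attach the baseline to 2019 games; games with no matching history are dropped
baseline <- afl_2019 %>%
  select(team1, team2, venue, attendance) %>%
  left_join(median_attendances, by = c("team1", "team2", "venue")) %>%
  filter(!is.na(baseline_attendance))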

Using the median attendance for each home team, away team and venue combination as our baseline, an RMSE of 6,816 and a mean absolute error of 4,958 are achieved. Any model built from here needs to be better than this.
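For reference, both error measures are only a line of R each. The sketch below is not the code used to produce the figures above; rmse(), mae() and the baseline_toy data frame are hypothetical names with invented values, shaped like the baseline table (an actual attendance column next to a baseline_attendance column).

```r
# Root mean squared error and mean absolute error (hypothetical helper names)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Toy data in the same shape as the baseline table (invented values)
baseline_toy <- data.frame(
  attendance          = c(30000, 52000, 18000, 41000),
  baseline_attendance = c(33000, 48000, 18000, 36000)
)

baseline_rmse <- rmse(baseline_toy$attendance, baseline_toy$baseline_attendance)
baseline_mae  <- mae(baseline_toy$attendance, baseline_toy$baseline_attendance)

print(c(rmse = baseline_rmse, mae = baseline_mae))
```

Both values are in the same units as attendance (people through the gate), which makes them easy to interpret.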

Transforming Response Variable

Because of the somewhat skewed nature of our response variable, it might be wise to apply some sort of transformation to make its distribution more symmetric. A log transformation is routinely used to “un-skew” data. Below is the result of log transforming the attendance variable.

It looks less skewed than the non-transformed variable, however (spoiler alert), the models performed worse with attendance being log transformed.
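To see the effect of the transformation without the AFL data, here is a small sketch on simulated right-skewed values. The simulated attendance_sim vector and the skewness() helper are assumptions for illustration only; they are not part of the original analysis.

```r
set.seed(42)

# Simulated right-skewed "attendance" values (log-normal, made up)
attendance_sim <- rlnorm(1000, meanlog = log(32000), sdlog = 0.4)

# Simple moment-based skewness; 0 means symmetric, positive means right-skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(attendance_sim)
skew_log <- skewness(log(attendance_sim))

cat("Skewness before:", round(skew_raw, 2), " after log:", round(skew_log, 2), "\n")
```

One practical point: a model fitted to log(attendance) predicts on the log scale, so predictions need to be passed through exp() before RMSE or MAE can be compared with a model fitted on raw attendance.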

Feature Selection Using Stepwise Method

A popular method to select which features go into a model is using stepwise regression. This can either be performed:

forward, where variables are added one by one, with only those that improve the model’s explainability (R-squared) retained,

backward, which starts with all variables, then removes them if they don’t add value to the model’s explainability, or

both, which combines forward and backward selection, so variables can be added or removed at each step.

Using stepwise in both directions maximised the model’s Adjusted R-squared measure at 0.87.

When performing linear regression, we want to ensure there is no collinearity between the variables. One way to assess this is to calculate the Variance Inflation Factor (VIF). This is made easy with the vif function in the car package. A VIF above 5 is commonly considered to be too high.

While the round the game is played is just over 5 (5.7), we will keep this in, however odds_diff and home_score_for will be removed from the full model selected using stepwise and re-run.
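It can help to see what a VIF actually measures. For a numeric predictor, it is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on the built-in mtcars data; vif_manual() is a hypothetical helper, not a function from car, and it only handles numeric predictors (for factor terms, car::vif reports a generalised VIF instead).

```r
# Manual VIF for numeric predictors: regress each one on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("wt", "hp", "disp")
vifs <- vif_manual(mtcars, predictors)
print(round(vifs, 2))
```

A value of 1 means the predictor is uncorrelated with the others; the larger the value, the more of its variation is already explained by the rest of the model.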

full_model <- lm(attendance ~ ., data = afl_pre_2019)

# Set seed for reproducibility
set.seed(123)

# Stepwise selection in both directions
step.model <- MASS::stepAIC(full_model, direction = "both", trace = FALSE)

# Check the selected model for collinearity
car::vif(step.model)

# The stepwise model includes a variable with high collinearity (VIF = 12).
# Recreate the model without odds_diff and home_score_for
full_model_clean <- lm(
  attendance ~ team1 + team2 + round + venue + game_time +
    home_milestone + rainfall_clean + min_temp + max_temp + last_five_days_rain +
    HomeOddsOpen + AwayOddsOpen + HomeLineOpen +
    rivalry_game + split_round + last_results + is_same_state +
    home_season_points + away_season_points +
    away_ladder_pos + teams_in_eight + home_team_finals_last_season,
  data = afl_pre_2019
)

car::vif(full_model_clean)

The cleaned model’s Adjusted R-squared is 0.8652, while the RMSE was 6,474 and MAE 4,855. These results better the baseline model, but only just.

Best model

Determining the “best” model is an interesting exercise. If the explanatory power of the model is what we’re after, then we want to maximise the Adjusted R-squared metric. This analysis is aimed at maximising the success of predicting crowds, and as such, the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are the metrics we’re going to focus on. Both measures are in the same units as the response variable, attendance. The difference is that MAE is the average absolute error of the model (the difference between prediction and actual), while RMSE squares the errors before averaging and so penalises large errors more heavily.

The below model results in the lowest RMSE and MAE when applied to the 2019 season games.
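A tiny worked example makes the difference between the two metrics concrete. The error vectors below are made up: both have the same total absolute error, but in the second it is concentrated in a single badly missed game.

```r
# Errors = predicted minus actual attendance (invented values)
even_errors  <- c(2000, 2000, 2000, 2000)  # every game missed by 2,000
one_big_miss <- c(0, 0, 0, 8000)           # three perfect, one missed by 8,000

mae_even  <- mean(abs(even_errors))
mae_big   <- mean(abs(one_big_miss))
rmse_even <- sqrt(mean(even_errors^2))
rmse_big  <- sqrt(mean(one_big_miss^2))

print(c(mae_even = mae_even, mae_big = mae_big))
print(c(rmse_even = rmse_even, rmse_big = rmse_big))
```

MAE is 2,000 in both cases, while RMSE doubles from 2,000 to 4,000 for the single big miss. If badly misjudging one blockbuster game matters more than being slightly off everywhere, RMSE is the metric to watch.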

set.seed(123)

fit_lm <- lm(
  attendance ~ team1 + team2 + round + venue + game_time +
    I(HomeLineOpen * home_ladder_pos) + rivalry_game + last_results +
    last_five_days_rain + max_temp + min_temp +
    home_mean_experience + away_mean_experience + is_same_state +
    home_score_for + I(home_percentage * home_ladder_pos) +
    I(home_wins_last_three * round) + round * finals_last_season,
  data = afl_pre_2019
)

summary(fit_lm)$adj.r.squared

The Adjusted R-Squared of the model is 0.858, indicating that almost 86% of the variability in AFL attendances in the training set (2013–2018) is explained by the model.

Interaction Terms

Features were not only used on their own in this model. Interaction terms were also employed. The following features were combined to form a feature and gave the model a boost in predictive power:

I(HomeLineOpen * home_ladder_pos)

I(home_percentage * home_ladder_pos)

I(home_wins_last_three * round)

round * finals_last_season
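The two notations in the list above behave differently in an R formula, which is worth spelling out. Wrapping a product in I() adds a single new column (the product itself), whereas a * b expands to both main effects plus their interaction. The sketch below shows this on the built-in mtcars data, which stands in for the AFL data purely for illustration.

```r
# I(wt * hp): only the product enters the model
fit_product <- lm(mpg ~ I(wt * hp), data = mtcars)

# wt * hp: shorthand for wt + hp + wt:hp
fit_full <- lm(mpg ~ wt * hp, data = mtcars)

product_terms <- names(coef(fit_product))
full_terms    <- names(coef(fit_full))

print(product_terms)
print(full_terms)
```

So in the model above, round * finals_last_season brings in both variables on their own as well as their interaction, while the three I() terms each contribute just one combined feature.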

Analysing the best model

The first thing we want to do is ensure the residuals are normally distributed to satisfy the condition of normality for linear models.

The below histogram suggests that the errors are fairly normally distributed and centred around zero. A good sign.

library(ggplot2)

# Histogram of the model residuals
ggplot(data = data.frame(residual = fit_lm$residuals), aes(x = residual)) +
  geom_histogram() +
  ggtitle("ERRORS ARE CENTRED AROUND ZERO") +
  scale_x_continuous(labels = scales::comma, name = "Model Residuals")

Then we can look at the coefficients of each of the predictor variables.

To make inspecting these easier, the broom package was used. Using the tidy() function, the results of the summary lm output are nicely displayed in a data frame.

The coefficients for each of the predictor variables (other than home and away teams) are displayed below.
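As a rough sketch of that step, the example below fits a small model on the built-in mtcars data, tidies it with broom, and filters out one group of terms by name, much like dropping the home and away team coefficients here. The model and the filter pattern are assumptions for illustration; only tidy() itself comes from the post.

```r
library(broom)

# Small stand-in model; factor(cyl) plays the role of the team variables
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# One row per coefficient: term, estimate, std.error, statistic, p.value
coefs <- tidy(fit)

# Drop the factor(cyl) terms, keeping the remaining coefficients
coefs_no_cyl <- coefs[!grepl("^factor\\(cyl\\)", coefs$term), ]
print(coefs_no_cyl)
```

Because the result is an ordinary data frame, it can be sorted, filtered or passed straight to ggplot2.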