Missing Data, XGBoost, and R — Part 2

An adventure in “know your tools”

Intro

In the previous post, we introduced some ways that R handles missing values in a dataset, and set up an example dataset using the mtcars dataset.

In this post, we explore training XGBoost models on this data. We represent the data in three different formats:

Dense matrix, missing values left as NA, passing missing = NA to the xgboost training function.

Sparse matrix.

Dense matrix, missing values filled in with 0.

We’ll see that XGBoost returns different predictions for the same data, depending on whether we pass in NAs, sparse matrices, or regular dense matrices. This is a problem: these representations should not change our predictions.

XGBoost Round 1: Dense Matrix, Missing=NA

We split our XGBoost matrix up into training and testing sets, along with our label/target vector, and train our first XGBoost model on a dense matrix.
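A runnable sketch of this round (the feature matrix from Part 1 is reconstructed here for illustration; the NA positions, the split, and the hyperparameters are assumptions, not Part 1’s exact values):

```r
library(xgboost)

# Rebuild a Part-1-style feature matrix: one-hot encode cyl/gear/carb from
# mtcars, then blank out some entries as NA (illustrative NA pattern).
one_hot <- function(df, col) {
  f <- factor(df[[col]])
  m <- model.matrix(~ 0 + f)
  colnames(m) <- paste0(col, levels(f))
  m
}
X <- cbind(one_hot(mtcars, "cyl"), one_hot(mtcars, "gear"), one_hot(mtcars, "carb"))
y <- mtcars$mpg
set.seed(42)
X[sample(length(X), 40)] <- NA

# Train/test split, then the first model on the dense matrix with missing = NA.
train_idx <- sample(nrow(X), 24)
test_v1   <- X[-train_idx, ]

dtrain <- xgb.DMatrix(X[train_idx, ], label = y[train_idx], missing = NA)
dtest  <- xgb.DMatrix(test_v1, missing = NA)

model_v1 <- xgb.train(list(objective = "reg:squarederror", max_depth = 2),
                      data = dtrain, nrounds = 30)
preds_v1 <- predict(model_v1, dtest)
```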

XGBoost Round 2: Sparse Matrix

This time we turn the dense matrix into a sparse matrix. We’ll use the Matrix package to do this.
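A sketch of the sparse round, assuming the X, y, and train_idx objects from the Round 1 sketch. Note that converting to a dgCMatrix drops the zero entries from storage, while the NAs remain as explicitly stored values:

```r
library(Matrix)
library(xgboost)

# Same data as Round 1, but stored sparsely: zeros are no longer stored,
# NAs still are.
X_sparse <- Matrix(X, sparse = TRUE)

dtrain_sp <- xgb.DMatrix(X_sparse[train_idx, ], label = y[train_idx], missing = NA)
model_v2  <- xgb.train(list(objective = "reg:squarederror", max_depth = 2),
                       data = dtrain_sp, nrounds = 30)
preds_v2  <- predict(model_v2, xgb.DMatrix(X_sparse[-train_idx, ], missing = NA))
```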

Let’s compare the predictions from our first model — built on a dense matrix — to the predictions from our second model, which was built on the same data, but sparse-ified.
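The comparison plot can be drawn like so (assuming preds_v1 and preds_v2 from the two rounds):

```r
# If the two models agreed, every point would sit on the dotted y = x line.
plot(preds_v1, preds_v2,
     xlab = "Dense matrix predictions",
     ylab = "Sparse matrix predictions")
abline(0, 1, lty = 3)
```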

Err…they should be exactly the same (all should lie exactly on the dotted line). What’s going on here?

After some diving through the xgboost source code, we find the culprit:

What this tells us is that XGBoost calls two different C++ functions depending on the input type of the data. Ultimately, xgboost needs an object of class xgb.DMatrix to build the boosted trees from, and these two methods, XGDMatrixCreateFromMat_R and XGDMatrixCreateFromCSC_R, form the xgb.DMatrix differently depending on whether we’ve passed in a dense or sparse matrix.
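You can see the dispatch from R without reading the C++ side. A minimal sketch, where the class of the input decides which path xgb.DMatrix takes:

```r
library(Matrix)
library(xgboost)

m_dense  <- matrix(c(1, 0, NA, 0, 1, 0), nrow = 2)
m_sparse <- Matrix(m_dense, sparse = TRUE)

# Same logical data, two different construction paths under the hood:
# a base matrix goes through XGDMatrixCreateFromMat_R, while a dgCMatrix
# goes through XGDMatrixCreateFromCSC_R.
d_dense  <- xgb.DMatrix(m_dense,  missing = NA)
d_sparse <- xgb.DMatrix(m_sparse, missing = NA)

class(m_dense)   # base matrix
class(m_sparse)  # dgCMatrix
```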

XGBoost Round 3: Dense Matrix, 0-Filled

Previously, we passed in the argument missing = NA. Now, we’ll fill in these missing values with a 0 — this is a legitimate way to clean this data, since a 0 says the row does not belong to that factor level. Again, we plot the results against those from the first XGBoost model.
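A sketch of the 0-filled round, again assuming the X, y, train_idx, and preds_v1 objects from the Round 1 sketch:

```r
library(xgboost)

# Fill the NAs with 0: "not this factor level."
X_zero <- X
X_zero[is.na(X_zero)] <- 0

dtrain_v3 <- xgb.DMatrix(X_zero[train_idx, ], label = y[train_idx])
model_v3  <- xgb.train(list(objective = "reg:squarederror", max_depth = 2),
                       data = dtrain_v3, nrounds = 30)
preds_v3  <- predict(model_v3, xgb.DMatrix(X_zero[-train_idx, ]))

# Compare against the Round 1 (dense, missing = NA) predictions.
plot(preds_v1, preds_v3,
     xlab = "Dense, missing = NA", ylab = "Dense, 0-filled")
abline(0, 1, lty = 3)
```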

What’s this one outlier that doesn’t agree between the two models?

test_v1[abs(preds_v1 - preds_v3) > 0.5,]

# cyl4 cyl6 cyl8 gear3 gear4 gear5 carb1 carb2 carb3 carb4 carb6 carb8
#    0    0    1     0     0     1    NA    NA    NA    NA    NA    NA

We see that the two models’ predictions differ precisely on a row with many NAs. This tells us that the missing = NA argument is not doing what we think it’s doing.

Wrap-up

What we’ve seen here is the following: say we have a numeric matrix of features X and a label vector y. If X has missing values (represented by NA), then the following three representations of X will return different predictions:

X as-is.

X with the categorical NAs filled in with 0.

X as a sparse matrix.

It was a complicated journey to understand exactly how these predictions differ, but it shows the importance of truly understanding your tools.

If you’re looking for a cool place to dig into R internals and hang with a great team, check us out, at RedVentures.com.