## Intro

Yesterday I gave a workshop on applied predictive modelling with `caret` at the 1st LSE Computational Social Science hackathon. Organiser privileges. I put together some introductory code and started a simple GitHub repo for the participants, so I thought I’d share it here as well. This is not supposed to cover all aspects of `caret` (plus there is already this), but is more of a starter pack for those who might be migrating from Python or another machine learning library like `mlr`. I have also saved the environment as `caret.rdata` so that the participants can load it up during the workshop (insert harrowing experience about live coding) and follow along; that’s included in the repo too if you’d rather have a test run first.

## The Data

Let’s start by creating some synthetic data using `caret`. The `twoClassSim` function generates a dataset suitable for binary outcomes:

```r
dat <- twoClassSim(n = 1000,        # number of rows
                   linearVars = 2,  # linearly important variables
                   noiseVars = 5,   # uncorrelated irrelevant variables
                   corrVars = 2,    # correlated irrelevant variables
                   mislabel = .01)  # proportion of possibly mislabelled cases
colnames(dat)
##  [1] "TwoFactor1" "TwoFactor2" "Linear1"    "Linear2"    "Nonlinear1"
##  [6] "Nonlinear2" "Nonlinear3" "Noise1"     "Noise2"     "Noise3"
## [11] "Noise4"     "Noise5"     "Corr1"      "Corr2"      "Class"
```

The above chunk simulates a data frame with 1000 rows containing 15 variables:

- Class: Binary outcome (Class)

- TwoFactor: Correlated multivariate normal predictors (TwoFactor1, TwoFactor2)

- Nonlinear: Uncorrelated random uniform predictors (Nonlinear1, …, Nonlinear3)

- Linear: (Optional) uncorrelated standard normal predictors (Linear1, Linear2)

- Noise: (Optional) uncorrelated standard normal predictors (Noise1, …, Noise5)

- Correlated: (Optional) correlated multivariate normal predictors (Corr1, Corr2)

We can take a closer look at the variables using two packages: `skimr` and `xray`. Both have functions that provide a snapshot of your covariates in an easy-to-understand output:

```r
skim(dat) #unintended
## Skim summary statistics
##  n obs: 1000
##  n variables: 15
##
## Variable type: factor
##  variable missing complete    n n_unique                top_counts ordered
##     Class       0     1000 1000        2 Cla: 567, Cla: 433, NA: 0   FALSE
##
## Variable type: numeric
##    variable missing complete    n    mean   sd      p0    p25 median
##       Corr1       0     1000 1000 -0.078  1    -3.19   -0.76 -0.11
##       Corr2       0     1000 1000 -0.016  1    -2.91   -0.71 -0.014
##     Linear1       0     1000 1000  0.016  1    -3.29   -0.68  0.054
##     Linear2       0     1000 1000  0.023  1.02 -3.31   -0.66  0.053
##      Noise1       0     1000 1000 -0.022  1    -3.46   -0.71 -0.023
##      Noise2       0     1000 1000 -0.048  0.99 -3.52   -0.74 -0.042
##      Noise3       0     1000 1000  0.016  0.97 -3.11   -0.64  0.014
##      Noise4       0     1000 1000 -0.015  1.02 -3.48   -0.71 -0.028
##      Noise5       0     1000 1000  0.036  0.94 -3.03   -0.59  0.072
##  Nonlinear1       0     1000 1000  0.0086 0.58 -1      -0.49  0.038
##  Nonlinear2       0     1000 1000  0.49   0.29  7.8e-05  0.25  0.5
##  Nonlinear3       0     1000 1000  0.51   0.29  5e-04    0.26  0.52
##  TwoFactor1       0     1000 1000 -0.086  1.36 -4.28   -0.96 -0.085
##  TwoFactor2       0     1000 1000 -0.042  1.39 -4.63   -1.03  0.0095
##   p75 p100     hist
##  0.58 3.58 ▁▂▆▇▇▂▁▁
##  0.67 3.57 ▁▂▅▇▆▂▁▁
##  0.71 3.02 ▁▁▃▇▇▅▂▁
##  0.69 3.88 ▁▂▅▇▇▂▁▁
##  0.64 3.22 ▁▁▃▇▇▅▁▁
##  0.61 3.39 ▁▁▅▇▇▅▁▁
##  0.64 3.77 ▁▁▅▇▅▂▁▁
##  0.67 3.17 ▁▁▃▇▇▅▁▁
##  0.71 3.1  ▁▁▅▇▇▅▁▁
##  0.5  1    ▇▇▇▇▇▇▇▇
##  0.73 1    ▇▇▇▇▇▇▇▆
##  0.76 1    ▇▇▇▆▇▇▇▇
##  0.81 3.83 ▁▂▃▇▇▅▂▁
##  0.89 4.79 ▁▁▅▇▇▃▁▁
```

You should also try out `xray::anomalies(dat)` and see which output you prefer. Because our data is synthetic, we get these nice bell curves and normal distributions that are much harder to come by in the wild.

Let’s split the data into train/test sets using an index of row numbers:

```r
index <- createDataPartition(y = dat$Class, p = .7, list = FALSE)
training <- dat[index, ]
test <- dat[-index, ]
```

First, we supply the outcome variable `y` so that `caret` can take class balance into account when creating the split. We use 70% of the data for training and hold out the remaining 30% for testing later. We want a vector instead of a list, so we override the default behaviour with `list = FALSE`. The actual splitting happens when we subset using the index we just created: the selected row numbers form the training data, whereas the rest goes to the test set (via negative indexing).
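Because we stratified on the outcome, the class balance should be roughly preserved in both splits. A quick sanity check, assuming the objects created above (not part of the original workshop code):

```r
# Compare class proportions in the full data and in each split;
# they should all be close to the original 57/43 split
prop.table(table(dat$Class))
prop.table(table(training$Class))
prop.table(table(test$Class))
```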

## trainControl

The magic of `caret` happens in the control arguments. Default arguments tend to cater to regression problems; given our focus on classification, I only briefly mention the former here:

```r
reg.ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 5,
                         allowParallel = TRUE)
```

We now have a `trainControl` object that will signal ten-fold cross-validation (repeated 5 times, so 50 resamples in total) to the `train` function. Classification controls require several more arguments:

```r
cls.ctrl <- trainControl(method = "repeatedcv", #boot, cv, LOOCV, timeslice OR adaptive etc.
                         number = 10,
                         repeats = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         savePredictions = "final",
                         allowParallel = TRUE)
```

There is a good variety of cross-validation methods to choose from in `caret`, which I will not cover here. `classProbs` computes class probabilities for each resample. We need to set the summary function to `twoClassSummary` for binary classification. Finally, we set `savePredictions` to `"final"` so that the hold-out predictions of the optimal model are kept. Note that this is not a classification-specific argument; we simply left it out of the regression controls because we won’t be covering them in detail.

For future reference, there are several other useful arguments you can set within `trainControl`: you can invoke subsampling via `sampling` if you have class imbalance, supply `seeds` for each resample for perfect reproducibility, and define your own resampling indices via `index`. A sketch combining the first two follows below.
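For instance, a down-sampled and fully seeded control object might look like this. This is an illustration rather than workshop code: `seeds` expects one integer vector per resample (sized to at least the number of tuning combinations) plus a single final seed, and the seed values below are arbitrary:

```r
# 50 resamples (10 folds x 5 repeats) + 1 slot for the final model
set.seed(1895)
seeds <- vector(mode = "list", length = 51)
for (i in 1:50) seeds[[i]] <- sample.int(10000, 100) # >= number of tuning combinations
seeds[[51]] <- sample.int(10000, 1)                  # seed for the final model fit

imb.ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         sampling = "down", # also "up", "smote", "rose"
                         seeds = seeds)
```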

## Model Fitting

We’ll start with a place-holder regression example for completeness. You should always set the seed before calling `train`. `caret` accepts the formula interface if you supply the data later. Below, we arbitrarily select one of the linear variables as the outcome and fit the rest of the variables as predictors using the dot notation:

```r
set.seed(1895)
lm.fit <- train(Linear1 ~ ., data = training,
                trControl = reg.ctrl, method = "lm")
lm.fit
## Linear Regression
##
## 701 samples
##  14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 630, 632, 632, 630, 631, 630, ...
## Resampling results:
##
##   RMSE       Rsquared    MAE
##   0.9995116  0.01280926  0.8039185
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

Probably not the most amazing \(R^2\) value you have ever seen, but that’s alright. Note that calling the model fit displays the most crucial information in a succinct way.

Let’s move on to a classification algorithm. It’s good practice to start with a logistic regression and take it from there. In R, logistic regression lives in the `glm` framework and can be specified by calling `family = "binomial"`. We set the performance metric to ROC because the default metric, accuracy, tends to be unreliable when you have class imbalance. A classic example is having one positive outcome and 99 negative outcomes: any lazy algorithm predicting all negatives would be 99% accurate, but uninformative as a result. The Receiver Operating Characteristic (a proud member of the good ol’ WWII school of naming things) provides a better performance metric by taking the true positive and true negative rates into account. Finally, we apply some pre-processing to our data by passing a set of strings: we drop near-zero-variance variables, and centre and scale all covariates:

```r
set.seed(1895)
glm.fit <- train(Class ~ ., data = training,
                 trControl = cls.ctrl,
                 method = "glm", family = "binomial",
                 metric = "ROC",
                 preProcess = c("nzv", "center", "scale"))
```

For reference, you can also supply your x and y separately if you find it easier to read:

```r
y <- training$Class
predictors <- training[, which(colnames(training) != "Class")]

#Same logit fit
set.seed(1895)
glm.fit <- train(x = predictors, y = y,
                 trControl = cls.ctrl,
                 method = "glm", family = "binomial",
                 metric = "ROC",
                 preProcess = c("nzv", "center", "scale"))
glm.fit
## Generalized Linear Model
##
## 701 samples
##  14 predictor
##   2 classes: 'Class1', 'Class2'
##
## Pre-processing: centered (14), scaled (14)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 630, 630, 630, 631, 631, 631, ...
## Resampling results:
##
##   ROC        Sens       Spec
##   0.8639053  0.8565385  0.7206237
```

You can quickly find out which variables contribute to predictive accuracy:

```r
varImp(glm.fit)
## glm variable importance
##
##            Overall
## TwoFactor1 100.000
## TwoFactor2  97.267
## Linear2     73.712
## Nonlinear1  25.964
## Linear1     12.900
## Nonlinear2  12.261
## Corr1        8.758
## Noise2       7.424
## Corr2        6.116
## Nonlinear3   5.798
## Noise1       5.371
## Noise3       4.306
## Noise5       2.668
## Noise4       0.000

plot(varImp(glm.fit))
```

Let’s fit a couple of other models before moving on. One common choice is the elastic net, which relies on L1 and L2 regularisation and is basically a mix of both: the former shrinks some variable coefficients to zero (so that they drop out; i.e. feature selection/dimensionality reduction), whereas the latter penalises coefficient size.
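In symbols, the elastic net solves a penalised loss of the form (shown for reference, in the parameterisation the `glmnet` documentation uses, with \(\ell\) denoting the model’s loss function):

\[
\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, x_i^\top \beta\right) + \lambda \left[ (1 - \alpha)\, \tfrac{1}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
\]

so \(\alpha\) mixes the two penalties and \(\lambda\) scales the overall amount of shrinkage.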
In R, these correspond to two tunable hyper-parameters: alpha and lambda. Alpha controls the type of regularisation, with 0 representing ridge and 1 denoting the LASSO (Least Absolute Shrinkage and Selection Operator). Lambda, on the other hand, determines the amount of penalty. Note that the `expand.grid` function just creates a data frame with two columns called alpha and lambda; the model is then fit once for the value-pair in each row:

```r
set.seed(1895)
glmnet.fit <- train(x = predictors, y = y,
                    trControl = cls.ctrl,
                    method = "glmnet",
                    metric = "ROC",
                    preProcess = c("nzv", "center", "scale"),
                    tuneGrid = expand.grid(alpha = 0:1,
                                           lambda = seq(0.0001, 1, length = 20)))
```

Because it has tunable parameters, we can visualise their performance by calling `plot` on the model fit:

```r
plot(glmnet.fit)
```

where the two colours denote the alpha level and the dots are the specified lambda values.

Finally, let’s fit a random forest using the `ranger` package, a fast C++ implementation of the original algorithm for R:

```r
set.seed(1895)
rf.fit <- train(Class ~ ., data = training,
                trControl = cls.ctrl,
                method = "ranger",
                metric = "ROC",
                preProcess = c("nzv", "center", "scale"))

confusionMatrix(rf.fit)
## Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
##           Reference
## Prediction Class1 Class2
##     Class1   51.2    7.2
##     Class2    5.4   36.1
##
##  Accuracy (average) : 0.8739
```

This is all good: we are fitting different algorithms through a simple, uniform interface without needing to memorise the idiosyncrasies of each package. However, we are still writing lots of redundant code. The `caretEnsemble` package removes that redundancy via `caretList`:

```r
set.seed(1895)
models <- caretList(Class ~ ., data = training,
                    trControl = cls.ctrl,
                    metric = "ROC",
                    tuneList = list(logit = caretModelSpec(method = "glm",
                                                           family = "binomial"),
                                    elasticnet = caretModelSpec(method = "glmnet",
                                                                tuneGrid = expand.grid(alpha = 0:1,
                                                                                       lambda = seq(0.0001, 1, length = 20))),
                                    rf = caretModelSpec(method = "ranger")),
                    preProcess = c("nzv", "center", "scale"))
```

We basically just merged the first three model fits into a single call using `tuneList`, which requires a list of model specifications. If we want to predict on unseen data, we can now get predictions from all three models at once:

```r
models.preds <- lapply(models, predict, newdata = test) #add type = "prob" for class probabilities
models.preds <- data.frame(models.preds)
head(models.preds, 10)
##     logit elasticnet     rf
## 1  Class1     Class1 Class1
## 2  Class2     Class2 Class2
## 3  Class1     Class1 Class2
## 4  Class2     Class2 Class1
## 5  Class1     Class1 Class2
## 6  Class2     Class2 Class2
## 7  Class2     Class2 Class2
## 8  Class1     Class1 Class2
## 9  Class1     Class1 Class1
## 10 Class1     Class1 Class1
```

The `resamples` function collects the resampling data from all models and allows you to assess in-sample performance metrics easily:

```r
bwplot(resamples(models)) #try dotplot as well
```

Averaged over all resamples, the random forest has the highest ROC value; however, the whiskers overlap in all three categories, so perhaps a larger number of resamples is needed for clear separation. It also outperforms the other two algorithms when it comes to detecting true positives and true negatives. Note that the results will often not be this clear-cut; it’s common for an algorithm to do really well in one area and perform terribly in another. We could also create a simple linear ensemble using the three model fits.
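Before doing so, it can help to look at the numbers behind the plot. Calling `summary` on the same `resamples` object prints them (a convenience step, not from the original workshop code):

```r
res <- resamples(models)
summary(res)                 # quantiles of ROC, Sens, Spec for each model
dotplot(res, metric = "ROC") # point estimates with confidence intervals
```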
You can check whether the model predictions are linearly correlated:

```r
modelCor(resamples(models))
##                logit elasticnet        rf
## logit      1.0000000  0.9996123 0.1135033
## elasticnet 0.9996123  1.0000000 0.1141322
## rf         0.1135033  0.1141322 1.0000000
```

It seems the logit and elastic net predictions are more or less identical, meaning one of them is redundant:

```r
xyplot(resamples(models))
```

And now for the ensemble:

```r
set.seed(1895)
greedy_ensemble <- caretEnsemble(models,
                                 metric = "ROC",
                                 trControl = cls.ctrl)
summary(greedy_ensemble)
## The following models were ensembled: logit, elasticnet, rf
## They were weighted:
## -4.9237 44.2633 -42.2176 8.0329
## The resulting ROC is: 0.9517
## The fit for each individual model on the ROC is:
##      method       ROC      ROCSD
##       logit 0.8618236 0.03919733
##  elasticnet 0.8614454 0.03923875
##          rf 0.9469568 0.02237265
```

The ROC of the ensemble (0.9517) is higher than that of any individual model; however, the random forest by itself provides a similar level of performance (0.9469).
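Since these ROC values come from the resamples, the natural last step is to compare the ensemble against its strongest member on the held-out test set. A minimal sketch, assuming the objects created above and that `predict.caretEnsemble` returns class labels by default for classification:

```r
# Hold-out performance: ensemble vs. the random forest alone
ens.preds <- predict(greedy_ensemble, newdata = test)
confusionMatrix(ens.preds, test$Class)

rf.preds <- predict(rf.fit, newdata = test)
confusionMatrix(rf.preds, test$Class)
```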