by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or tuning parameters that cannot be directly estimated from the data. For example, when using a K-nearest neighbor model, there is no analytical estimator for K (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of candidate values of K, and the value associated with the best results is used. This is essentially a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014), where the predictors (i.e., the x’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20
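Written out as a base R function, the mean of the simulated outcome is a direct transcription of the equation above (the function name `slc14_eq` is mine, for illustration; caret’s `SLC14_1` should produce the same system, with the Gaussian noise added):

```r
## Direct base-R transcription of the prediction equation above.
## The name `slc14_eq` is hypothetical; logical comparisons such as
## (x[10] > 0) are coerced to 0/1 when used in arithmetic.
slc14_eq <- function(x) {
  x[1] + sin(x[2]) + log(abs(x[3])) + x[4]^2 + x[5] * x[6] +
    (x[7] * x[8] * x[9] < 0) + (x[10] > 0) + x[11] * (x[11] > 0) +
    sqrt(abs(x[12])) + cos(x[13]) + 2 * x[14] + abs(x[15]) +
    (x[16] < -1) + x[17] * (x[17] < -1) - 2 * x[18] - x[19] * x[20]
}

## Evaluate the mean function at one random point
set.seed(1)
slc14_eq(rnorm(20, sd = 3))
```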

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the caret package via a function called SLC14_1. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denoted as sigma. Since we are simulating the data, we can work out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) of the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

[Figure: RMSE surface over a grid of cost and sigma values]
There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search, where a set of random tuning parameter values are generated across a “wide range”. For an RBF SVM, caret’s train function defines wide as cost values between 2^c(-5, 10) and sigma values inside the range produced by the sigest function in the kernlab package. This code will evaluate 20 random sub-models in this range:
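To make the “wide range” concrete, here is a rough sketch of how such a random grid could be generated by hand. This approximates caret’s behavior rather than reproducing its exact code: cost is sampled as 2 raised to a uniform exponent, and sigma is sampled from the range suggested by kernlab’s sigest.

```r
## Sketch of a random tuning grid for an RBF SVM (an approximation of
## what caret does internally for search = "random", not its exact code)
library(kernlab)

set.seed(308)
x <- matrix(rnorm(250 * 20), ncol = 20)   # stand-in predictor matrix
sig_rng <- sigest(x, frac = 1)            # low/mid/high sigma estimates

rand_grid <- data.frame(
  sigma = runif(20, min = sig_rng[1], max = sig_rng[3]),
  C     = 2^runif(20, min = -5, max = 10)
)
head(rand_grid)
```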

> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+                           search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+                      method = "svmRadial",
+                      ## Create 20 random parameter values
+                      tuneLength = 20,
+                      metric = "RMSE",
+                      preProc = c("center", "scale"),
+                      trControl = rand_ctrl)

> rand_search

Support Vector Machines with Radial Basis Function Kernel

250 samples
 20 predictor

Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:

  sigma       C             RMSE      Rsquared
  0.01161955   42.75789360  10.50838  0.7299837
  0.01357777   67.97672171  10.71276  0.7212605
  0.01392676  828.08072944  10.75235  0.7195869
  0.01394119    0.18386619  18.56921  0.2109284
  0.01538656    0.05224914  19.33310  0.1890599
  0.01711920  228.59215128  11.09522  0.7047713
  0.01790202    0.78835920  16.78597  0.3217203
  0.01936110    0.91401289  16.45485  0.3492278
  0.02023763    0.07658831  19.03987  0.2081059
  0.02690269    0.04128731  19.33974  0.2126950
  0.02780880    0.64865483  16.52497  0.3545042
  0.02920113  974.08943821  12.22906  0.6508754
  0.02963586    1.19350198  15.46690  0.4407725
  0.03370625   31.45179445  12.60653  0.6314384
  0.03561750    0.04970422  19.23564  0.2306298
  0.03752561    0.06592800  19.07130  0.2375616
  0.03783570  398.44599747  12.92958  0.6143790
  0.04534046    3.91017571  13.56612  0.5798001
  0.05171719  296.65916049  13.88865  0.5622445
  0.06482201   47.31716568  14.66904  0.5192667

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C = 42.75789.

> ggplot(rand_search) + scale_x_log10() + scale_y_log10()

[Figure: resampled RMSE values from the random search, plotted against cost and sigma on log scales]
> getTrainPerf(rand_search)

  TrainRMSE TrainRsquared    method
1  10.50838     0.7299837 svmRadial

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and sigma . Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian process (GP) regression is used to facilitate the Bayesian analysis. It creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption of normally distributed residuals is used and, the model being Bayesian, the regression parameters gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior, and maximum likelihood is used to estimate the kernel parameters of the GP.
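As a minimal sketch of the GP idea (not the exact model used here), we can fit kernlab’s gausspr with an RBF kernel to a few (sigma, C, RMSE) triples taken from the random-search results above, and then predict the mean RMSE over a grid of candidate parameter values:

```r
## Sketch: GP regression of resampled RMSE on the SVM tuning parameters.
## The kernel sigma is fixed at 1 just to keep the sketch deterministic.
library(kernlab)

## A handful of rows from the random-search results table
gp_dat <- data.frame(
  sigma = c(0.0116, 0.0136, 0.0139, 0.0171, 0.0292, 0.0648),
  C     = c(42.76, 67.98, 0.184, 228.59, 974.09, 47.32),
  RMSE  = c(10.51, 10.71, 18.57, 11.10, 12.23, 14.67)
)

## Model RMSE in log units of the tuning parameters, since both
## parameters live on multiplicative scales
gp_fit <- gausspr(RMSE ~ log10(sigma) + log10(C), data = gp_dat,
                  kernel = "rbfdot", kpar = list(sigma = 1))

## Predicted mean RMSE over a grid of candidate parameter values
pred_grid <- expand.grid(sigma = 10^seq(-3, 0, length = 25),
                         C     = 2^seq(-5, 10, length = 25))
pred_grid$RMSE_hat <- as.vector(predict(gp_fit, pred_grid))
```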

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and sigma parameters. The Bayesian machinery allows these predictions to have a distribution; for a given set of tuning parameters, we can obtain the estimated mean RMSE as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

[Figure: GP-predicted mean RMSE surface over cost and sigma]
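As a sketch of how the search itself might be run, here is one way to drive the process with the rBayesianOptimization package (an assumed choice of tooling, not necessarily what was used here). To keep the sketch light, it uses 5-fold CV rather than the repeated 10-fold scheme above and only a few iterations; the scoring function returns the negative RMSE because `BayesianOptimization()` maximizes its objective.

```r
## Sketch: Bayesian optimization of the SVM tuning parameters using
## rBayesianOptimization (an assumption about tooling; settings are
## reduced for illustration)
library(caret)
library(rBayesianOptimization)

set.seed(7210)
train_dat <- SLC14_1(250)
ctrl <- trainControl(method = "cv", number = 5)

## Score a single (C, sigma) pair by resampled RMSE; parameters are
## searched on log scales, matching the ranges used earlier
svm_score <- function(logC, logSigma) {
  mod <- train(y ~ ., data = train_dat,
               method = "svmRadial",
               metric = "RMSE",
               preProc = c("center", "scale"),
               trControl = ctrl,
               tuneGrid = data.frame(C = 2^logC, sigma = 10^logSigma))
  list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
}

set.seed(8606)
bo_res <- BayesianOptimization(svm_score,
                               bounds = list(logC = c(-5, 10),
                                             logSigma = c(-3, 0)),
                               init_points = 5, n_iter = 10,
                               acq = "ucb")
bo_res$Best_Par
```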